Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5657
Koen Bertels Nikitas Dimopoulos Cristina Silvano Stephan Wong (Eds.)
Embedded Computer Systems: Architectures, Modeling, and Simulation 9th International Workshop, SAMOS 2009 Samos, Greece, July 20-23, 2009 Proceedings
Volume Editors Koen Bertels Stephan Wong Delft University of Technology Mekelweg 4, 2628 CD Delft, The Netherlands E-mail: {k.l.m.bertels,j.s.s.m.wong}@tudelft.nl Nikitas Dimopoulos University of Victoria Department of Electrical and Computer Engineering P.O. Box 3055, Victoria, BC, V8W 3P6, Canada E-mail:
[email protected] Cristina Silvano Politecnico di Milano Dipartimento di Elettronica e Informazione P.za Leonardo Da Vinci 32, 20133 Milan, Italy E-mail:
[email protected]

Library of Congress Control Number: 2009930367
CR Subject Classification (1998): C, B
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-03137-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03137-3 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12718269 06/3180 543210
Preface
The SAMOS workshop is an international gathering of highly qualified researchers from academia and industry, sharing ideas in a 3-day lively discussion on the quiet and inspiring northern mountainside of the Mediterranean island of Samos. The workshop meeting is one of two co-located events (the other event being the IC-SAMOS). As a tradition, the workshop features presentations in the morning, while after lunch all kinds of informal discussions and nut-cracking gatherings take place. The workshop is unique in the sense that not only solved research problems are presented and discussed, but also (partly) unsolved problems and in-depth topical reviews can be unleashed in the scientific arena. Consequently, the workshop provides the participants with an environment where collaboration rather than competition is fostered.

The SAMOS conference and workshop were established in 2001 by Stamatis Vassiliadis with the goals outlined above in mind, and located on Samos, one of the most beautiful islands of the Aegean. The rich historical and cultural environment of the island, coupled with the intimate atmosphere and the slow pace of a small village by the sea in the middle of the Greek summer, provides a very conducive environment where ideas can be exchanged and shared freely.

SAMOS IX followed the series of workshops started in 2001 with a new, expanded program including three special sessions to discuss challenging research trends. This year, the workshop celebrated its ninth anniversary, and 18 papers were presented, carefully selected out of 52 submissions, resulting in an acceptance rate of 34.6%. Each submission was thoroughly reviewed by at least three reviewers and considered by the international Program Committee during its meeting at Delft in March 2009. Indicative of the wide appeal of the workshop is the fact that the submitted works originated from a wide international community. More specifically, the regular papers came from 19 countries: Austria (2), Belgium (2), Brazil (1), Canada (1), Finland (8), France (6), Germany (5), Greece (3), India (1), Italy (3), Japan (1), Norway (2), Russia (1), Spain (2), Sweden (2), Switzerland (2), The Netherlands (7), UK (1) and USA (2).

Additionally, three special sessions were organized on topics of current interest: (1) "Instruction-Set Customization", (2) "Reconfigurable Computing and Processor Architectures", and (3) "Mastering Cell BE and GPU Execution Platforms". Each special session used its own review procedures, and was given the opportunity to include some relevant works selected from the regular papers submitted to the workshop in addition to some invited papers. In total, 14 papers were included in the three special sessions. The workshop program also included a keynote speech by Yale Patt from the University of Texas at Austin.
A workshop like this cannot be organized without the help of many people. First of all, we would like to thank the members of the Steering and Program Committees and the external referees for their dedication and diligence in selecting the technical papers. The investment of their time and insight was very much appreciated. We would also like to express our sincere gratitude to Karin Vassiliadis for her continuous dedication in organizing the workshop. We also thank Carlo Galuzzi for managing the financial issues, Sebastian Isaza for maintaining the website and publicizing the event, Zubair Nawaz for managing the submission system, and Dimitris Theodoropoulos and Carlo Galuzzi (again) for preparing the workshop proceedings. We also thank Lidwina Tromp for her continuous effort in the workshop organization.

We hope that the attendees enjoyed the SAMOS IX workshop in all its aspects, including the many informal discussions and gatherings. We trust that you will find this year's SAMOS workshop proceedings enriching and interesting.

July 2009
Koen Bertels Cristina Silvano Nikitas Dimopoulos Stephan Wong
Organization
General Co-chairs
N. Dimopoulos, University of Victoria, Canada
S. Wong, TU Delft, The Netherlands

Program Co-chairs
K. Bertels, TU Delft, The Netherlands
C. Silvano, Politecnico di Milano, Italy

Special Session Co-chairs
L. Carro, UFRGS, Brazil
E. Deprettere, Leiden University, The Netherlands
C. Galuzzi, TU Delft, The Netherlands
A. Varbanescu, TU Delft, The Netherlands
S. Wong, TU Delft, The Netherlands

Proceedings Co-chairs
C. Galuzzi, TU Delft, The Netherlands
D. Theodoropoulos, TU Delft, The Netherlands

Web and Publicity Chair
S. Isaza, TU Delft, The Netherlands

Submissions Chair
Z. Nawaz, TU Delft, The Netherlands

Finance Chair
C. Galuzzi, TU Delft, The Netherlands

Symposium Board
S. Bhattacharyya, University of Maryland, USA
G.N. Gaydadijev, TU Delft, The Netherlands
J. Glossner, Sandbridge Technologies, USA
A.D. Pimentel, University of Amsterdam, The Netherlands
J. Takala, Tampere University of Technology, Finland (Chairperson)
Steering Committee
L. Carro, UFRGS, Brazil
E. Deprettere, Leiden University, The Netherlands
N. Dimopoulos, University of Victoria, Canada
T.D. Hämäläinen, Tampere University of Technology, Finland
S. Wong, TU Delft, The Netherlands
Program Committee
C. Basto, NXP, USA
J. Becker, Karlsruhe University, Germany
M. Berekovic, TU Braunschweig, Germany
S. Chakraborty, University of Singapore, Singapore
F. Ferrandi, Politecnico di Milano, Italy
G. Fettweis, TU Dresden, Germany
J. Flich, Technical University of Valencia, Spain
W. Fornaciari, Politecnico di Milano, Italy
P. French, TU Delft, The Netherlands
K. Goossens, NXP, The Netherlands
D. Guevorkian, Nokia, Finland
R. Gupta, University of California Riverside, USA
C. Haubelt, University of Erlangen-Nuremberg, Germany
M. Hännikäinen, Tampere University of Technology, Finland
D. Iancu, Sandbridge Technologies, USA
V. Iordanov, Philips, The Netherlands
H. Jeschke, University of Hannover, Germany
C. Jesshope, University of Amsterdam, The Netherlands
W. Karl, University of Karlsruhe, Germany
M. Katevenis, FORTH-ICS and University of Crete, Greece
A. Koch, TU Darmstadt, Germany
K. Kuchcinski, Lund University, Sweden
D. Liu, Linköping University, Sweden
W. Luk, Imperial College, UK
J. McAllister, Queen's University of Belfast, UK
D. Milojevic, Université Libre de Bruxelles, Belgium
A. Moshovos, University of Toronto, Canada
T. Mudge, University of Michigan, USA
N. Navarro, Technical University of Catalonia, Spain
A. Orailoglu, University of California San Diego, USA
B. Pottier, Université de Bretagne Occidentale, France
K. Rudd, Intel, USA
T. Sauter, Austrian Academy of Sciences, Austria
P-M. Seidel, SMU University, USA
H. Schröder, University of Dortmund, Germany
F. Silla, Technical University of Valencia, Spain
M. Sima, University of Victoria, Canada
G. Theodoridis, Aristotle University of Thessaloniki, Greece
L. Vintan, University of Sibiu, Romania
Reviewers
Aaltonen, Timo; Agosta, Giovanni; Ali, Zeyshan; Alvarez, Mauricio; Arnold, Oliver; Arpinen, Tero; Azevedo, Arnaldo; Basto, Carlos; Becker, Juergen; Becker, Tobias; Berekovic, Mladen; Blume, Steffen; Bournoutian, Garo; Buchty, Rainer; Chakraborty, Samarjit; Chen, MingJing; Ciobanu, Catalin; Deprettere, Ed; Dimitrakopoulos, Giorgos; Dimopoulos, Nikitas; Ehliar, Andreas; Feng, Min; Ferrandi, Fabrizio; Fettweis, Gerhard; Flatt, Holger; Flich, José; Flynn, Michael; Fornaciari, William; French, Paddy; Galuzzi, Carlo; Gelado, Isaac; Glossner, John; Goossens, Kees; Guevorkian, David; Gupta, Rajiv; Hanke, Mathias; Hännikäinen, Marko; Haubelt, Christian; Iancu, Daniel; Isaza, Sebastian; Jeschke, Hartwig; Jesshope, Chris; Jin, Qiwei; Kakarountas, Athanasios; Karl, Wolfgang; Karlström, Per; Katevenis, Manolis; Kellomäki, Pertti; Klussmann, Heiko; Kuchcinski, Krzysztof; Lee, Kwangyoon; Limberg, Torsten; Liu, Dake; Luk, Wayne; Mamidi, Suman; Martin-Langerwerf, Javier; Martorell, Xavier; McAllister, John; Merino, Julio; Milojevic, Dragomir; Moshovos, Andreas; Mudge, Trevor; Nagarajan, Vijay; Najjar, Walid; Navarro, Nacho; Nikolopoulos, Dimitrios; Nolte, Norman; Norkin, Andrey; Orailoglu, Alex; Pilato, Christian; Pottier, Bernard; Rudd, Kevin; Sauter, Thilo; Sazeides, Yiannakis; Schröder, Hartmut; Schulte, Michael; Seo, Sangwon; Silla, Federico; Silvano, Cristina; Sima, Mihai; Sima, Vlad-Mihai; Spinean, Bogdan; Takala, Jarmo; Theodoridis, George; Thomas, David; Tian, Chen; Tsoi, Brittle; Vintan, Lucian; Westermann, Peter; Woh, Mark; Wong, Stephan; Wu, Di; Yang, Chengmo; Zaccaria, Vittorio
Table of Contents
Beachnote

What Else Is Broken? Can We Fix It?
   Yale Patt

Architectures for Multimedia

Programmable and Scalable Architecture for Graphics Processing Units
   Carlos S. de La Lama, Pekka Jääskeläinen, and Jarmo Takala

The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors
   Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade

CABAC Accelerator Architectures for Video Compression in Future Multimedia: A Survey
   Yahya Jan and Lech Jozwiak

Programmable Accelerators for Reconfigurable Video Decoder
   Tero Rintaluoma, Timo Reinikka, Joona Rouvinen, Jani Boutellier, Pekka Jääskeläinen, and Olli Silvén

Scenario Based Mapping of Dynamic Applications on MPSoC: A 3D Graphics Case Study
   Narasinga Rao Miniskar, Elena Hammari, Satyakiran Munaga, Stylianos Mamagkakis, Per Gunnar Kjeldsberg, and Francky Catthoor

Multiple Description Scalable Coding for Video Transmission over Unreliable Networks
   Roya Choupani, Stephan Wong, and Mehmet R. Tolun

Multi/Many Cores Architectures

Evaluation of Different Multithreaded and Multicore Processor Configurations for SoPC
   Sascha Uhrig

Implementing Fine/Medium Grained TLP Support in a Many-Core Architecture
   Roberto Giorgi, Zdravko Popovic, and Nikola Puzovic

Implementation of W-CDMA Cell Search on a FPGA Based Multi-Processor System-on-Chip with Power Management
   Roberto Airoldi, Fabio Garzia, Tapani Ahonen, Dragomir Milojevic, and Jari Nurmi

A Multiprocessor Architecture with an Omega Network for the Massively Parallel Model GCA
   Christian Schäck, Wolfgang Heenes, and Rolf Hoffmann

VLSI Architectures Design

Towards Automated FSMD Partitioning for Low Power Using Simulated Annealing
   Nainesh Agarwal and Nikitas J. Dimopoulos

Radix-4 Recoded Multiplier on Quantum-Dot Cellular Automata
   Ismo Hänninen and Jarmo Takala

Prediction in Dynamic SDRAM Controller Policies
   Ying Xu, Aabhas S. Agarwal, and Brian T. Davis

Inversion/Non-inversion Implementation for an 11,424 Gate-Count Dynamic Optically Reconfigurable Gate Array VLSI
   Shinichi Kato and Minoru Watanabe

Architecture Modeling and Exploration Tools

Visualization of Computer Architecture Simulation Data for System-Level Design Space Exploration
   Toktam Taghavi, Mark Thompson, and Andy D. Pimentel

Modeling Scalable SIMD DSPs in LISA
   Peter Westermann and Hartmut Schröder

NoGAP: A Micro Architecture Construction Framework
   Per Karlström and Dake Liu

A Comparison of NoTA and GENESYS
   Bernhard Huber and Roman Obermaisser

Special Session 1: Instruction-Set Customization

Introduction to Instruction-Set Customization
   Carlo Galuzzi

Constraint-Driven Identification of Application Specific Instructions in the DURASE System
   Kevin Martin, Christophe Wolinski, Krzysztof Kuchcinski, Antoine Floch, and François Charot

A Generic Design Flow for Application Specific Processor Customization through Instruction-Set Extensions (ISEs)
   Kingshuk Karuri, Rainer Leupers, Gerd Ascheid, and Heinrich Meyr

Runtime Adaptive Extensible Embedded Processors — A Survey
   Huynh Phung Huynh and Tulika Mitra

Special Session 2: The Future of Reconfigurable Computing and Processor Architectures

Introduction to the Future of Reconfigurable Computing and Processor Architectures
   Luigi Carro and Stephan Wong

An Embrace-and-Extend Approach to Managing the Complexity of Future Heterogeneous Systems
   Rainer Buchty, Mario Kicherer, David Kramer, and Wolfgang Karl

Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study
   Frederico Pratas and Leonel Sousa

Reconfigurable Multicore Server Processors for Low Power Operation
   Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, and Trevor Mudge

Reconfigurable Computing in the New Age of Parallelism
   Walid Najjar and Jason Villarreal

Reconfigurable Multithreading Architectures: A Survey
   Pavel G. Zaykov, Georgi K. Kuzmanov, and Georgi N. Gaydadjiev

Special Session 3: Mastering Cell BE and GPU Execution Platforms

Introduction to Mastering Cell BE and GPU Execution Platforms
   Ed Deprettere and Ana L. Varbanescu

Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors
   Richard Membarth, Frank Hannig, Hritam Dutta, and Jürgen Teich

Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs
   Alexander Monakov and Arutyun Avetisyan

Experiences with Cell-BE and GPU for Tomography
   Sander van der Maar, Kees Joost Batenburg, and Jan Sijbers

Realizing FIFO Communication When Mapping Kahn Process Networks onto the Cell
   Dmitry Nadezhkin, Sjoerd Meijer, Todor Stefanov, and Ed Deprettere

Exploiting Locality on the Cell/B.E. through Bypassing
   Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta

Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System
   Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Maik Nijhuis

Author Index
What Else Is Broken? Can We Fix It?

Yale Patt
The University of Texas at Austin
Abstract. The founder and soul of this conference, Professor Stamatis Vassiliadis, always wanted a Keynote on the beach: a keynote without PowerPoint, air conditioning, and all the other usual comforts of keynotes, comforts both for the speaker and for the audience. After all, the great thinkers of this ancient land did their thinking, teaching, and arguing without PowerPoint and without air conditioning. But they were they and we are we, and no sane SAMOS keynote speaker would put himself in the same league with those masters. Nonetheless, Stamatis wanted it, and I never found it easy to say no to Stamatis, so last year at SAMOS VIII, I agreed to give a Keynote on the Beach. It has subsequently been relabeled The Beachnote, and I have been asked to do it again. The question, of course, is what subject to explore in this setting, where the sound of the speaker's voice competes with the sounds of the waves banging against the shore, and where the image of the speaker's gestures competes with the image of the blue sky, bright sun, and hills of Samos. I decided last summer to choose a meta-topic, rather than a hard-core technical subject: "Is it broken?", with particular emphasis on professors (are they ready to teach, are they ready to do research?) and students (are they learning, is their education preparing them for what is needed after they graduate?). My sense is that for this environment, a meta-topic is the right model, and so I propose to visit it again. For example: our conferences and journals. Are they broken? Can we fix them? Somewhat more technical: the interface between the software that people write to solve problems and the hardware that has to run that software. Is it broken? Can we fix it? These are just examples of some of the things we might explore in this year's Beachnote. As I said last year, I will welcome other suggestions from the audience as to what they think is broken. My hope is to have us all engaged in identifying and discussing some of the fundamental problems that plague our community.
Programmable and Scalable Architecture for Graphics Processing Units

Carlos S. de La Lama (1), Pekka Jääskeläinen (2), and Jarmo Takala (2)

(1) Universidad Rey Juan Carlos, Department of Computer Architecture, Computer Science and Artificial Intelligence, C/ Tulipán s/n, 28933 Móstoles, Madrid, Spain. [email protected]
(2) Tampere University of Technology, Department of Computer Systems, Korkeakoulunkatu 10, 33720 Tampere, Finland. [email protected], [email protected]

Abstract. Graphics processing is an application area with a high level of parallelism at the data level and at the task level. Therefore, graphics processing units (GPU) are often implemented as multiprocessing systems with high-performance floating point processing and application-specific hardware stages for maximizing the graphics throughput. In this paper we evaluate the suitability of Transport Triggered Architectures (TTA) as a basis for implementing GPUs. TTA improves scalability over the traditional VLIW-style architectures, making it interesting for computationally intensive applications. We show that TTA provides high floating point processing performance while allowing more programming freedom than vector processors. Finally, one of the main features of the presented TTA-based GPU design is its fully programmable architecture, making it a suitable target for general purpose computing on GPU APIs, which have become popular in recent years.

Keywords: GPU, GPGPU, TTA, VLIW, LLVM, GLSL, OpenGL.
1 Introduction
3D graphics processing can be seen as a compound of sequential stages applied to a set of input data. Commonly, graphics processing systems are abstracted as so-called graphics pipelines, with only minor differences between the various existing APIs and implementations. Therefore, stream processing [1], where a number of kernels (user-defined or fixed) are applied to a stream of data of the same type, is often thought of as the computing paradigm of graphics processing units.

Early 3D-accelerating GPUs were essentially designed to perform a fixed set of operations in an effective manner, with no capabilities to customize this process [2]. Later, some vendors started to add programmability to their GPU
products, leading to the standardization of "shading languages". Both of the major graphics APIs (OpenGL and DirectX) proposed their own implementation of such languages. DirectX introduced the High Level Shading Language [3], while OpenGL defined the OpenGL Shading Language (GLSL) [4], first supported as an optional extension to OpenGL 1.4 and later becoming part of the standard in OpenGL 2.0. GLSL is similar to the standard C language, but includes additional data types for vectors and matrices, and library functions to perform the common operations on those data types.

Programs written in GLSL (called shaders) can customize the behavior of two specific stages of the OpenGL graphics pipeline (dashed boxes in Figure 1) [5]. Vertex shaders are applied to the input points defining the vertices of the graphics primitives (such as points, lines or polygons) in a 3D coordinate system called model space. Depending on the type of primitive being drawn, the rasterizer then generates a number of visible points between the transformed vertices. These new points are called fragments. Each drawn primitive usually produces as many fragments as there are covered pixels on the screen. The rasterizer interpolates several attributes, such as color or texture coordinates, between vertices to find the corresponding value (called a varying) for each fragment, and the programmable fragment shader can postprocess and modify those values.

[Figure 1. Simplified view of the customizable OpenGL pipeline: vertices pass through the vertex shader, the transformed vertices are rasterized into fragments, the fragment shader colors the fragments, and the results are written to the framebuffer and sent to the screen.]

The movement to allow programming of parts of the graphics pipeline led to GPU vendors providing custom APIs for using their GPUs for more general purpose computing (GPGPU) [6], extending the application domain of GPUs to a wide range of programs with highly parallelizable computation. Finally, at the end of 2008, a vendor-neutral API for programming heterogeneous platforms (which can also include GPU-like resources) was standardized. The OpenCL standard [7] was welcomed by the GPGPU community as a generic alternative to platform-specific GPGPU APIs such as NVIDIA's CUDA [8].

This paper presents work in progress on the design of a programmable and scalable GPU architecture based on the Transport Triggered Architecture (TTA), a class of VLIW architectures. The proposed architecture, which we call TTAGPU, is fully programmable and implements all of the graphics pipeline in software. TTAGPU can be scaled at the instruction and task level to produce GPUs with varying size/performance ratios, enabling its use in both embedded and desktop systems. Furthermore, the full programmability allows it to be adapted for the GPGPU style of computation and, for example, to support the OpenCL API. While common practice in GPU design relies on the intensive use of data-parallel models,
our approach tries to exploit parallelism at the instruction level, thus avoiding the programmability penalty caused by SIMD operations. The rest of the paper is organized as follows. Section 2 briefly discusses related work, Section 3 describes the main points of the TTAGPU design, Section 4 provides some preliminary results on the floating point scalability of the architecture, and Section 5 concludes the paper and discusses future directions.
2 Related Work
The first generation of programmable GPUs included specialized hardware for vertex processing and fragment processing as separate components, together with texture mapping units and rasterizers, set up in a multi-way stream configuration to exploit the inherent parallelism present in 3D graphics algorithms. As modern applications needed to customize the graphics processing to a higher degree, it became obvious that such heterogeneous architectures were not the ideal choice. Therefore, with the appearance of the unified shader model in 2007 [9], the differences between vertex and fragment shaders began to disappear. Newer devices have a number of unified shaders that can perform the same arithmetic operations and access the same buffers (although some differences in the instruction sets are still present). This provides better programmability for the graphics pipeline, while the fixed hardware on critical parts (like the rasterizer) ensures high performance. However, the stream-like connectivity between computing resources still limits the customization of the processing algorithm. The major GPU vendors (NVIDIA and ATI) follow this approach in their latest products [10,11].

The performance of the unified shader is evaluated in [12] by implementing and simulating a generic GPU microarchitecture. The main conclusion of that paper is that although graphical performance improves only marginally with respect to non-unified shader architectures, the unified shader has real benefits in terms of efficiency per area. The shader performance analysis in that paper uses shaders implemented in the OpenGL ARB assembly-like low-level language. Although already approved by the Architecture Review Board, this is still an extension to the OpenGL standard, while GLSL is already part of it, which is why we have used GLSL as our shader program input language. Furthermore, new trends in parallel non-graphical computation on GPUs are geared towards using high-level languages.

A different approach to achieving GPU flexibility is being proposed by Intel with its Larrabee processor [13]. Instead of starting from a traditional GPU architecture, they propose an x86-compatible device with additional floating point units for enhanced arithmetic performance. Larrabee includes very little specific hardware, the most notable exception being the texture mapping unit. Instead, the graphics pipeline is implemented in software, making it easier to modify and customize. Larrabee is to be deployed as a "many-core" solution, with the number of cores at 64 and more. Each core comprises a 512-bit vector FPU capable of 16 simultaneous single-precision floating point operations.
3 TTAGPU Architecture
The goal of the TTAGPU design is to implement an OpenGL-compliant graphics API which is accelerated with a customized TTA processor, supports programming of the graphics pipeline as described in the OpenGL 2.1 specification [14] (GLSL-coded shaders), and allows high-level language programmability, especially with support for the OpenCL API in mind. Therefore, the design follows a software-based approach, similar to Larrabee, with additional flexibility provided through programmability. However, as it is not tied to the x86 architecture, the datapath resource set can be customized more freely to accelerate the GPU application domain.

3.1 Transport Triggered Architectures
VLIWs are considered interesting processor alternatives for applications with high requirements for data processing performance [15] and with limited control flow, such as graphics processing. The Transport Triggered Architecture (TTA) is a modular processor architecture template with a high resemblance to VLIW architectures. The main difference between TTAs and VLIWs can be seen in how they are programmed: instead of defining which operations are started in which function units (FU) at which instruction cycles, TTA programs are defined as data transports between the register files (RF) and FUs of the datapath. The operations are started as side effects of writing operand data to the "triggering port" of the FU. Figure 2 presents a simple example TTA processor [16].

[Figure 2. Example of a TTA processor.]

The programming model of VLIW imposes limitations on scaling the number of FUs in the datapath. Upscaling the number of FUs has been problematic in VLIWs due to the need to include as many write and read ports in the RFs as there are FU operations potentially completed and started at the same time. Additional ports increase the RF complexity, resulting in larger area and critical path delay. Also, adding an FU to the VLIW datapath potentially requires new bypassing paths to be added from the FU's output ports to the input ports of the other FUs in the datapath, which increases the interconnection network complexity. Thanks to its programmer-visible interconnection network, the TTA datapath
can support more FUs with simpler RFs [17]. Because the scheduling of data transports between datapath units is programmer-defined, there is no obligation to scale the number of RF ports according to the number of FUs [18]. In addition, the datapath connectivity can be tailored according to the application at hand, adding only the bypassing paths that benefit the application the most.

In order to support fast automated design of TTA processors, a toolset project called the TTA-based Codesign Environment (TCE) was started in 2003 at Tampere University of Technology [19]. TCE provides a full design flow from software written in C code down to a parallel TTA program image and a VHDL implementation of the processor. However, as TTAGPU was evaluated only at the architectural level for this paper, the most important tools used in the design were its cycle-accurate instruction set simulator and the compiler, both of which automatically adapt to the set of machine resources in the designed processors.

Because TTA is a statically scheduled architecture with a high level of detail exposed to the programmer, the runtime efficiency of the end results produced with the design toolset depends heavily on the quality of the compiler. TCE uses the LLVM Compiler Infrastructure [20] as the backbone for its compiler tool chain (later referred to as 'tcecc'), and thus benefits from its global optimizations such as aggressive dead code elimination and link-time inlining. In addition, the TCE code generator includes an efficient instruction scheduler with TTA-specific optimizations, and a register allocator optimized to produce better instruction-level parallelism for the post-pass scheduler.

3.2 Scaling on the Instruction Level
The TTAGPU OpenGL implementation is structured into two clearly separated parts. The first part is the API layer, which is meant to run on the main CPU in a real scenario. It communicates with the GPU through a command FIFO, each command having a maximum of four floating-point arguments. The second part is the software implementation of the OpenGL graphics pipeline running on the TTA. We have tried to minimize the number of buffers to make the pipeline stages as long as possible, as this gives the compiler more optimization opportunities.

The OpenGL graphics pipeline code includes both the software implementation of the pipeline routines themselves and the user-defined shader programs written in GLSL. For the graphics pipeline code, we have so far implemented a limited version capable of doing simple rendering, allowing us to link against real OpenGL demos with no application code modification. Because tcecc already supports compilation of C and C++, it is possible to compile the user-defined GLSL code with little additional effort by using C++ operator overloading and a simple preprocessor, and to merge the shader code with the C implementation of the graphics pipeline. Compiling GLSL code together with the C-based implementation of the graphics pipeline allows user-provided shaders to override the programmable parts, while providing the additional advantage of global optimizations and code specialization performed after the final program linking. For example, if a custom shader program does not use a result produced by some of the fixed functionality of the
graphics pipeline code, that pipeline code will be removed by the dead code elimination optimization. That is, certain types of fragment shader programs compiled with the pipeline code can lead to higher rasterizer performance.

Preliminary profiling of the current software graphics pipeline implementation showed that the bottleneck so far is the rasterizer and, depending on its complexity, the user-defined fragment shader. This makes sense, as the data density in the pipeline explodes after rasterizing, because a high number of fragments is usually generated by each primitive. For example, a line can be defined using two vertices, from which the rasterizer produces enough fragments to represent all the visible pixels between the two vertices. Thus, in TTAGPU we concentrated on optimizing the rasterizer stage by creating a specialized rasterizer loop which processes 16 fragments at a time. The combined rasterizer/custom fragment shader loop (pseudocode shown in Fig. 3) is fully unrolled by the compiler, effectively implementing a combined 16-way rasterizer and fragment processor in software.

    for i = 1...16 do
        f = produce_fragment()          // the rasterizer code
        f = glsl_fragment_processor(f)
        write_to_framebuffer_fifo(f)

Fig. 3. Pseudocode of the combined rasterizer/fragment shader loop body.

The aggressive procedure inlining converts the fully unrolled loop into a single big basic block, with the actual rasterizer code producing a fragment and the user-defined fragment shader processing it, without the need for large buffers between the stages. In addition, the unrolled loop bodies can often be made completely independent from each other, improving the potential for a high level of ILP exposed to the instruction scheduler. In order to avoid extra control flow in the loop, which makes it harder to extract instruction-level parallelism (ILP) statically, we always process 16 fragments at a time "speculatively" and discard the possible extra fragments at the end of the computation.

3.3 Scaling on the Task Level
In order to achieve scalability on the task level, we placed hardware-based FIFO buffers at certain points in the software graphics pipeline. The idea is to add “frontiers” at suitable positions of the pipeline allowing multiple processors to produce and process the FIFO items arbitrarily. It should be noted, however, that it is completely possible in this configuration that the same processor produces and processes the items in the FIFOs. In this type of single core setting, the hardware FIFO merely reduces memory accesses required to pass data between the graphics pipeline stages. The guidelines followed when placing these buffers were: 1) separate stages with different data densities, 2) place the FIFOs in such positions that the potential for ILP at each stage is as high as possible, and 3) compile the user-defined shader code and related graphics pipeline code together to maximize code specialization and ILP.
[Figure 4. High-level software structure: the OpenGL API and OpenGL state form the TTAGPU driver tasks and feed a command FIFO to the GPU; on the GPU side, vertex processing and clipping, rasterization/fragment processing, and framebuffer writing are connected through the vertex FIFO and the fragment FIFO.]
These three points are met by placing two hardware FIFOs in the pipeline. One is placed after the vertex processing, as the number of processed vertices needed for primitive rasterizing changes with the different rendering modes (points, lines or polygons), resulting in varying data density. This FIFO allows vertex processing to proceed until enough vertices for primitive processing are available. It also serves as an entry point for new vertices generated during clipping. The second FIFO is placed after fragment processing, before the framebuffer writing stage. Framebuffer writing has some additional processing to perform (ownership test, blending, etc.) that cannot be performed completely on a per-fragment basis, as it depends on the results of previous framebuffer writes. This FIFO allows us to create the highly parallelizable basic block performing rasterization and fragment processing with no memory writes, as the framebuffer writing is done with a custom operation accessing the FIFO.

The hardware-supported FIFOs have a set of status registers that can be used to poll for FIFO emptiness and fullness. This enables us to use lightweight cooperative multithreading to hide the FIFO waiting time with the processing of elements from the other FIFOs. The software implementation structure is shown in Figure 4. The clean isolation between stages allows the system to connect sets of processors that access the FIFO elements as producers and/or consumers, making the system flexible and scalable at the task level. Scaling at the task level can be done simply by adding either identical TTAs or even processors with completely different architectures to the system. The only requirement placed on the added processors is access to the hardware FIFOs.
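As an illustration of this cooperative scheme, the sketch below polls hypothetical FIFO status queries and runs whichever pipeline stage can make progress; the function names are ours, for illustration only, and are not part of the actual TTAGPU code.

    #include <stdbool.h>

    /* Hypothetical FIFO status queries backed by the hardware status registers. */
    extern bool vertex_fifo_has_data(void);
    extern bool fragment_fifo_has_data(void);
    extern bool fragment_fifo_has_space(void);

    /* Hypothetical stage functions operating on the FIFO contents. */
    extern void rasterize_and_shade_batch(void);   /* vertex FIFO -> fragment FIFO  */
    extern void write_framebuffer_entry(void);     /* fragment FIFO -> framebuffer  */
    extern bool frame_finished(void);

    /* Cooperative scheduler: instead of blocking on one FIFO, poll the status
     * registers and run whichever stage can currently make progress. */
    void gpu_core_main_loop(void)
    {
        while (!frame_finished()) {
            if (vertex_fifo_has_data() && fragment_fifo_has_space())
                rasterize_and_shade_batch();
            else if (fragment_fifo_has_data())
                write_framebuffer_entry();
            /* otherwise nothing is ready; loop and poll again */
        }
    }

In a multi-core mapping, several processors could run the same loop concurrently, since the hardware FIFOs provide the only synchronization between the stages.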
4 Results
In order to evaluate the ILP scalability of the TTAGPU in the combined rasterizer/fragment processor loop, we implemented a simple example OpenGL
application that renders a number of lines randomly to the screen and colors them with a simple fragment shader. The goal of this experiment was to see how well a single TTAGPU core scales at the instruction level simply by adding multiples of resource sets to the architecture and recompiling the software using tcecc. The resource set we used for scaling included a single FPU, three transport buses, and a register file with 32 general-purpose 32-bit registers. The resources in the benchmarked TTAGPU variations are listed in Table 1.

Table 1. Resources in the TTAGPU variations

    resource                     1 FPU   2 FPU   4 FPU   8 FPU   16 FPU
    floating point units           1       2       4       8       16
    32 bit x 32 register files     1       2       4       8       16
    1 bit boolean registers        2       4       8      16       32
    transport buses                3       6      12      24       48
    integer ALUs                   1       1       1       1        1
    32 bit load-store units        1       1       1       1        1
    32 bit shifters                1       1       1       1        1

In order to produce realistic cycle counts for floating point code, we used the pipeline model of the MIPS R4000 floating point unit, whose description is available in the literature [21]. The unit includes eight floating-point operations that share eleven different pipeline resources. However, our benchmark used only addition, division, multiplication and comparison of floating point values. The benchmark was executed using the TCE cycle-accurate processor architecture simulator for TTAGPUs with the different numbers of resource sets.

Figure 5 shows the speedup improvements in the unrolled rasterizer loop obtained by just adding multiples of the "scaling resource sets" to the machine and recompiling the code.

[Figure 5. Scalability of the rasterizer loop with different numbers of floating point resources: speedups of 1.0x, 1.8x, 3.8x, 7.2x and 11.5x with 1, 2, 4, 8 and 16 FPU resource sets, respectively.]

This figure indicates that the ILP scalability of the heavily utilized rasterizer loop is almost linear, thanks to the aggressive global optimizations and a register allocator that avoids the reuse of registers as much as possible, reducing the number of false dependencies limiting the parallelization between the
loop iterations. The scaling gets worse when approaching the 16-FPU version because of a hard limit of about 500 general-purpose registers in our compiler, and because the loop was implemented with only 16 iterations. With a larger iteration count there would be more operations with which to hide the latencies of the previous iterations.
5 Conclusions
In this paper we have proposed a mainly software-based implementation of a graphics processing unit based on the scalable TTA architecture. We have shown that TTA is an interesting alternative for applications where high data processing performance is required, as is the case with GPUs. TTA provides improved scalability at the instruction level in comparison to VLIWs, due to its programmer-visible interconnection network. The scalability of the proposed TTAGPU at both the task and the instruction level makes the system an interesting platform to be considered also for other data-parallel applications designed to be executed on GPU-type platforms.

Evaluation of the proposed TTAGPU platform for supporting applications written using the OpenCL 1.0 standard [7] is ongoing work. Additional future work includes completing the OpenGL API implementation, evaluating the multi-core performance of TTAGPU and implementing an actual hardware prototype.

Acknowledgments. This research was partially funded by the Academy of Finland, the Nokia Foundation and the Finnish Center for International Mobility (CIMO).
References

1. Stephens, R.: A survey of stream processing. Acta Informatica 34(7), 491–541 (1997)
2. Crow, T.S.: Evolution of the Graphical Processing Unit. Master's thesis, University of Nevada, Reno, NV (December 2004)
3. St-Laurent, S.: The Complete Effect and HLSL Guide. Paradoxal Press (2005)
4. Kessenich, J.: The OpenGL Shading Language. 3DLabs, Inc. (2006)
5. Luebke, D., Humphreys, G.: How GPUs work. Computer 40(2), 96–100 (2007)
6. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.J.: A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum 26(1), 80–113 (2007)
7. Khronos Group: OpenCL 1.0 Specification (February 2009), http://www.khronos.org/registry/cl/
8. Halfhill, T.R.: Parallel Processing with CUDA. Microprocessor Report (January 2008)
9. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 28(2), 39–55 (2008)
10. Wasson, S.: NVIDIA's GeForce 8800 graphics processor. Tech Report (November 2007)
11. Wasson, S.: AMD Radeon HD 2900 XT graphics processor: R600 revealed. Tech Report (May 2007)
12. Moya, V., González, C., Roca, J., Fernández, A., Espasa, R.: Shader Performance Analysis on a Modern GPU Architecture. In: 38th IEEE/ACM Int. Symp. Microarchitecture, Barcelona, Spain, November 12-16. IEEE Computer Society, Los Alamitos (2005)
13. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics 27(18) (August 2008)
14. Segal, M., Akeley, K.: The OpenGL Graphics System: A Specification. Silicon Graphics, Inc. (2006)
15. Colwell, R.P., Nix, R.P., O'Donnell, J.J., Papworth, D.B., Rodman, P.K.: A VLIW architecture for a trace scheduling compiler. In: ASPLOS-II: Proc. 2nd Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 180–192. IEEE Computer Society Press, Los Alamitos (1987)
16. Corporaal, H.: Microprocessor Architectures: from VLIW to TTA. John Wiley & Sons, Chichester (1997)
17. Corporaal, H.: TTAs: missing the ILP complexity wall. Journal of Systems Architecture 45(12-13), 949–973 (1999)
18. Hoogerbrugge, J., Corporaal, H.: Register file port requirements of Transport Triggered Architectures. In: MICRO 27: Proc. 27th Int. Symp. Microarchitecture, pp. 191–195. ACM Press, New York (1994)
19. Jääskeläinen, P., Guzma, V., Cilio, A., Takala, J.: Codesign toolset for application-specific instruction-set processors. In: Proc. Multimedia on Mobile Devices 2007, pp. 65070X-1 – 65070X-11 (2007), http://tce.cs.tut.fi/
20. Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. In: Proc. Int. Symp. Code Generation and Optimization, Palo Alto, CA, March 20-24, p. 75 (2004)
21. Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco (2003)
The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors

Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade

Barcelona Supercomputing Center, C/ Jordi Girona 31, 08034 Barcelona, Spain
{paul.carpenter,alex.ramirez,eduard.ayguade}@bsc.es
Abstract. Stream programming offers a portable way for regular applications such as digital video, software radio, multimedia and 3D graphics to exploit a multiprocessor machine. The compiler maps a portable stream program onto the target, automatically sizing communications buffers and applying optimizing transformations such as task fission or fusion, unrolling loops and aggregating communication. We present a machine description and performance model for an iterative stream compilation flow, which represents the stream program running on a heterogeneous multiprocessor system with distributed or shared memory. The model is a key component of the ACOTES open-source stream compiler currently under development. Our experiments on the Cell Broadband Engine show that the predicted throughput has a maximum relative error of 15% across our benchmarks.
1 Introduction
Many people [1] have recognized the need to change the way software is written to take advantage of multi-core systems [2] and distributed memory [3,4,5]. This paper is concerned with applications such as digital video, software radio, signal processing and 3D graphics, all of which may be represented as block diagrams, in which independent blocks communicate and synchronize only via regular streams of data. Such applications have high task and data parallelism, which is hidden when the program is written in C or a similar sequential programming language, requiring the programmer to apply high-level optimizations such as task fusion, fission and blocking transformations by hand. Recent work on stream programming languages, most notably StreamIt [6] and Synchronous Data Flow (SDF) [7], has demonstrated how a compiler may potentially match the performance of hand-tuned sequential or multi-threaded code [8].

This work is part of the ACOTES project [9], which is developing a complete open-source stream compiler for embedded systems. This compiler will automatically partition a stream program to use task-level parallelism, size communications buffers and aggregate communications through blocking. This paper describes the Abstract Streaming Machine (ASM), which represents the target system to this compiler. Figure 1 shows the iterative compilation flow, with a
search algorithm determining the candidate mapping, which is compiled using Mercurium [10] and GCC.

[Figure 1. The ACOTES iterative stream compiler: source with SPM pragmas is translated by Mercurium (task fusion, allocation) into source using acolib, which gcc with an ICI plugin (blocking) compiles into an executable; traces from execution, or from the ASM simulator, feed the search algorithm that chooses the next candidate mapping.]

The Mercurium source-to-source converter translates from the SPM source language [11,12], and performs task fusion and allocation. The resulting multi-threaded program is compiled using GCC, which we are extending within the project to perform blocking to aggregate computation and communication. Running the executable program generates a trace, which is analysed by the search algorithm to resolve bottlenecks. An alternative feedback path generates a trace using the ASM simulator, which is a coarse-grain model of the ASM. This path does not require recompilation, and is used when resizing buffers or to approximate the effect of fission or blocking.
2 Stream Programming
There are several definitions of stream programming, differing mostly in the handling of control flow and restrictions on the program graph topology [13]. All stream programming models, however, represent the program as a set of kernels, communicating only via unidirectional streams. The producer has a blocking push primitive and the consumer has a blocking pop primitive. This programming model is deterministic provided that the kernels themselves are deterministic, there is no other means of communication between kernels, each stream has one producer and one consumer, and the kernels cannot check whether a push or pop would block at a particular time [14]. When the stream program is compiled, one or more kernels are mapped to each task, which is executed in its own thread. The communications primitives
are provided by the ACOTES run-time system, acolib, which also creates and initializes threads at the start of the computation, and waits for their completion at the end. The run-time system supports two-phase communication, and can be implemented for shared memory, distributed memory with DMA, and hardware FIFOs. On the producer side, pushAcquire returns a pointer to an empty array of np elements; the np parameter is equal to the producer’s blocking factor, and is supplied during stream initialization. When the task has filled this buffer with new data, it calls pushSend to request that acolib delivers the data to the consumer. On the consumer side, popAcquire returns a pointer to the next full block of nc elements. When the consumer has finished with the data in this block, it calls popDiscard to mark the block as empty.
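As a concrete illustration of this two-phase protocol, the sketch below shows a producer and a consumer kernel written against acolib-style primitives. The exact acolib signatures and the stream handle type are not given here, so the declarations below are assumptions for illustration only.

    #include <stddef.h>

    /* Assumed acolib-style declarations (signatures are illustrative, not the
     * actual acolib API). */
    typedef struct stream stream_t;

    extern float *pushAcquire(stream_t *s);   /* empty block of np elements       */
    extern void   pushSend(stream_t *s);      /* ask acolib to deliver the block  */
    extern float *popAcquire(stream_t *s);    /* next full block of nc elements   */
    extern void   popDiscard(stream_t *s);    /* mark the consumed block as empty */

    /* Producer kernel: fills blocks of np elements and sends them downstream. */
    void producer(stream_t *out, size_t np, size_t blocks)
    {
        for (size_t i = 0; i < blocks; ++i) {
            float *buf = pushAcquire(out);        /* may block until a buffer is free */
            for (size_t j = 0; j < np; ++j)
                buf[j] = (float)(i * np + j);     /* fill the block */
            pushSend(out);                        /* second phase of the push */
        }
    }

    /* Consumer kernel: pops blocks of nc elements and accumulates them. */
    void consumer(stream_t *in, size_t nc, size_t blocks, float *sum)
    {
        for (size_t i = 0; i < blocks; ++i) {
            float *buf = popAcquire(in);          /* may block until data arrives */
            for (size_t j = 0; j < nc; ++j)
                *sum += buf[j];                   /* use the block */
            popDiscard(in);                       /* second phase of the pop */
        }
    }

Note that the blocking factors np and nc are fixed at stream initialization and need not be equal.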
3 ASM Machine Description
The target is represented as a bipartite graph of processors and memories in one partition, and interconnects in the other. Figure 2 shows the topology of two example targets. Each processor and interconnect is defined using the parameters summarized in Figures 3 and 4, and described below. The machine description defines the machine visible to software, which may not exactly match the physical hardware. For example, the OS in a Playstation 3 makes six of the eight SPEs available to software. We assume that the processors used by the stream program are not time-shared with other applications while the program is running.

[Figure 2. Topology of two example targets: (a) a Cell-based system, in which the PPE and SPE0-SPE7, each SPE with its local store LS0-LS7, are connected by the EIB bus to main memory; (b) a shared-memory system, in which processors P1-P3, each with its own cache, share a bus to memory.]

(a) Definition of a processor

    Parameter           Description                                                       Value
    name                Unique name in platform namespace                                 'SPEn'
    clockRate           Clock rate, in GHz                                                3.2
    hasIO               True if the processor can perform IO                              False
    addressSpace        List of the physical memories addressable by this processor
                        and their virtual address                                         [(LSn, 0)]
    pushAcqCost         Cost, in cycles, to acquire a producer buffer (before waiting)    448
    pushSendFixedCost   Fixed cost, in cycles, to push a block (before waiting)           1104
    pushSendUnit        Number of bytes per push transfer unit                            16384
    pushSendUnitCost    Incremental cost, in cycles, to push pushUnit bytes               352
    popAcqFixedCost     Fixed cost, in cycles, to pop a block (before waiting)            317
    popAcqUnit          Number of bytes per pop transfer unit                             16384
    popAcqUnitCost      Incremental cost, in cycles, to pop popUnit bytes                 0
    popDiscCost         Cost, in cycles, to discard a consumer buffer (before waiting)    189

(b) Definition of an interconnect

    Parameter           Description                                                       Value
    name                Unique name in platform namespace                                 'EIB'
    clockRate           Clock rate, in GHz                                                1.6
    elements            List of the names of the elements (processors and memories)
                        on the bus                                                        ['PPE', 'SPE0', ..., 'SPE7']
    interfaceDuplex     If the bus has more than one channel, then define for each
                        processor whether it can transmit and receive simultaneously
                        on different channels                                             [True, ..., True]
    interfaceRouting    Define for each processor the type of routing from this bus:
                        storeAndForward, cutThrough, or None                              [None, ..., None]
    startLatency        Start latency, L, in cycles                                       80
    startCost           Start cost on the channel, S, in cycles                           0
    bandwidthPerCh      Bandwidth per channel, B, in bytes per cycle                      16
    finishCost          Finish cost, F, in cycles                                         0
    numChannels         Number of channels on the bus                                     3
    multiplexable       False for a hardware FIFO that can only support one stream       True

Fig. 3. Processor and interconnect parameters of the Abstract Streaming Machine and values for the Cell Broadband Engine.

Each processor is defined using the parameters shown in Figure 3(a). The details of the processor's ISA and micro-architecture are described internally to the back-end compiler, so they are not duplicated in the ASM. The processor description includes the costs of the acolib library calls. The costs of the pushSend and popAcquire primitives are given by a staircase function; i.e. a fixed cost, a
block size, and an incremental cost for each complete or partial block after the first. This variable cost is necessary both for FIFOs and for distributed memory with DMA. For distributed memory, the size of a single DMA transfer is often limited by hardware, so that larger transfers require additional processor time in pushSend to program multiple DMA transfers. The discontinuity at 16K in Figure 5 is due to this effect. The addressSpace and hasIO parameters provide constraints on the compiler mapping, but are not required to evaluate the performance of a valid mapping. The former defines the local address space of the processor; i.e. which memories are directly accessible and where they appear in local virtual memory, and is used to place stream buffers. The model assumes that the dominant bus traffic is communication via streams, so either the listed memories are private local stores, or they are shared memories accessed via a private L1 cache. In the latter case, the cache should be sufficiently effective that the cache miss traffic on the interconnect is insignificant. The hasIO parameter defines which processors can perform system IO, and is a simple way to ensure that tasks that need system IO are mapped to a capable processor. Each interconnect is defined using the parameters shown in Figure 3(b). The system topology is given by the elements parameter, which for a given interconnect lists the adjacent processors and memories. Each interconnect is modelled as a bus with multiple channels, which has been shown to be a good approximation to the performance observed in practice when all processors and memories on a single link are equidistant [15]. Each bus has a single unbounded queue to hold the messages ready to be transmitted, and one or more channels on which to transmit them. The compiler statically allocates streams onto buses, but the choice of channel is made at runtime. The interfaceDuplex parameter defines for each resource; i.e. processor or memory, whether it can simultaneously read and write on different channels. The bandwidth and latency of each channel is controlled using four parameters: the start latency (L), start cost (S), bandwidth (B), and finish cost (F ). In transferring a message of size n bytes, the latency of the link is given by n n L + S + B and the cost incurred on the link by S + B + F . This model is natural for distributed memory machines, and amounts to the assumption of cache-to-cache transfers on shared memory machines. Hardware routing is controlled using the interfaceRouting parameter, which defines for each processor whether it can route messages from this interconnect: each entry can take the value storeAndForward, cutThrough or None. Each memory is defined using the parameters shown in Figure 4. The latency and bandwidth figures are currently unused in the model, but may be used by the compiler to refine the estimate of the run time of each task. The memory definitions are used to determine where to place communications buffers, and provide constraints on blocking factors.
Parameter | Description | Value
name | Unique name in platform namespace | 'LSn'
size | Size, in bytes | 262144
clockRate | Clock rate, in GHz | 3.2
latency | Access latency, in cycles | 2
bandwidth | Bandwidth, in bytes per cycle | 128
Fig. 4. Memory parameters of the Abstract Streaming Machine and values for the Cell Broadband Engine
4 ASM Program Description
The compiled stream program is a connected directed graph of tasks and point-to-point streams, as described in Section 2. All synchronization between tasks happens in the blocking acolib communications primitives described above. A task may have complex data-dependent or irregular behaviour. The basic unit of sequencing inside a task is the subtask, which pops a fixed number of elements from each input stream and pushes a fixed number of elements on each output stream. In detail, the work function for a subtask is divided into three consecutive phases. First, the acquire phase obtains the next set of full input buffers and empty output buffers. Second, the processing phase works locally on these buffers, and is modelled using a fixed processing time, determined from a trace. Finally, the release phase discards the input buffers, and sends the output buffers, releasing the buffers in the same order they were acquired. This three-stage model is not a deep requirement of the ASM, and was introduced as a convenience in the implementation of the simulator, since our compiler will naturally generate subtasks of this form.

A stream is defined by the size of each element, and the location and length of either the separate producer and consumer buffers (distributed memory) or the single shared buffer (shared memory). These buffers do not have to be of the same length. If the producer or consumer task uses the peek primitive, then the buffer length should be reduced to model the effective size of the buffer, excluding the elements of history that share the buffer. The Finite Impulse Response (FIR) filters in the GNU radio benchmark of Section 6 are described in this way. It is possible to specify a number of elements to prequeue on the stream before execution begins.
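To make the three-phase subtask model concrete, a simulator-style skeleton might look as follows. This is a sketch under our own naming assumptions (acquire_full, acquire_empty, release, send and processing_time are illustrative names, not the actual ASM simulator interface).

    # One firing of a subtask in the acquire/process/release model described above.
    def fire_subtask(task, time_now):
        in_bufs  = [s.acquire_full()  for s in task.inputs]     # acquire phase (blocking)
        out_bufs = [s.acquire_empty() for s in task.outputs]
        time_now += task.processing_time                        # processing phase: fixed time from a trace
        for stream, buf in zip(task.inputs, in_bufs):           # release phase, in acquisition order
            stream.release(buf)                                 # discard consumed input buffers
        for stream, buf in zip(task.outputs, out_bufs):
            stream.send(buf)                                    # send produced output buffers
        return time_now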
5 Implementation and Methodology
We use a small suite of benchmarks and target platforms, which have been translated by hand into the description files. The benchmarks were evaluated on an IBM QS20 blade, which has two Cell processors. The producer-consumer benchmark is used to determine basic parameters, and has two actors: a producer, and
consumer, with two buffers at each end. The chain benchmark is a linear pipeline of n tasks, and is used to characterize bus contention. The chain2 benchmark is used to model latency and queue contention, and is a linear pipeline, similar to chain, but with an extra cut stream between the first and last tasks. The number of blocks in the consumer-side buffer on the cut stream is a parameter, c. For all benchmarks, the number of bytes per iteration is denoted b.

Fig. 5. Results for the producer-consumer benchmark: (a) time per iteration, in usecs per firing; (b) throughput, in gigabytes per second. Measured and simulated results are compared.

Fig. 6. Time per iteration for the chain and chain2 benchmarks: (a) chain, real results; (b) chain, averaged real results; (c) chain, simulated results; (d) chain2, time per iteration.

Figure 5 shows the time per iteration for producer-consumer, as a function of b. The discontinuity at b = 16K is due to the overhead of programming two DMA transfers. For b < 20.5K, the bottleneck is the computation time of the producer task, as can be seen in Figure 7(a) and (b), which compares real and simulated traces for b = 8K. For b > 20.5K, the bottleneck is the interconnect, and the
slope of the line is the reciprocal of the bandwidth: 25.6GB/s. Figure 7(c) and (d) compare real and simulated traces for b = 24K. The maximum relative error for 0 < b < 32K is 3.1%.

Figure 6 shows the time per iteration for chain, as a function of n, the number of tasks, and b, the block size. Figure 6(a) shows the measured performance on the IBM QS20 blade, when tasks are allocated to SPEs in increasing numerical order. The EIB on the Cell processor consists of two clockwise and two anticlockwise rings, each supporting up to three simultaneous transfers provided that they do not overlap. The drop in real, measured performance from n = 4 to n = 5 and from n = 7 to n = 8 is due to contention on particular hops of the EIB, which the ASM does not attempt to model. As described in Section 3, the ASM models an interconnect as a set of parallel buses connecting an (unordered) set of processors. Figure 6(b) shows the average of the measured performance of three random permutations of the SPEs. The simulated results in Figure 6(c) are hence close to the expected results, in a probabilistic sense, when the physical ordering of the SPEs is not known.

Figure 6(d) shows the time per iteration for chain2, as a function of the number of tasks, n, and the size of the consumer-side buffer of the shortcut stream between the first and last tasks, denoted c. The bottleneck is either the computation time of the first task (1.27us per iteration) or is due to the latency of the chain being exposed by the finite length of the queue on the shortcut
stream. Figure 7(e) and (f) show real and simulated traces for the latter case, with n = 7 and c = 2.

Fig. 7. Comparison of real and simulated traces: (a) compute bound (real); (b) compute bound (simulated); (c) communication bound (real); (d) communication bound (simulated); (e) queuing bound (real); (f) queuing bound (simulated). The trace activities distinguished are processing, pop wait, pop work, push local wait, push remote wait, and push work.
6 Validation
This section describes the validation work using our GNU radio benchmark, which is based on the FM stereo demodulator in GNU Radio [16]. Table 1(a) shows the computation time and multiplicity per kernel, the latter being the number of times it is executed per pair of l and r output elements. Four of the kernels, being FIR filters, peek backwards in the input stream, requiring history as indicated in the table. Other than this, all kernels are stateless.

Table 1(b) and (c) show two possible mappings of the GNU radio benchmark onto the Cell processor, i.e. the mapping of kernels to tasks and the blocking factors. The first allocates one task per kernel, using a total of seven of the eight available SPEs. Based on the resource utilization, the Carrier kernel was split into two worker tasks and the remaining kernels were partitioned onto two other SPEs. This gives 79% utilization of four processors, and approximately twice the throughput of the unoptimized mapping, at 7.71ms per iteration, rather than 14.73ms per iteration. The throughput and latency from the simulator are within 0.5% and 2%, respectively.

Table 1. Kernels and mappings of the GNU radio benchmark

(a) Kernels

Kernel | Multiplicity | History buffer | Time per firing (us) | % of total load
Demodulation | 8 | n/a | 398 | 1.7%
Lowpass (middle) | 1 | 1.6K | 7,220 | 3.8%
Bandpass | 8 | 1.6K | 7,246 | 30.4%
Carrier | 8 | 3.2K | 14,351 | 60.2%
Frequency shift | 8 | n/a | 12 | 0.1%
Lowpass (side) | 1 | 1.6K | 7,361 | 3.9%
Sum | 1 | n/a | 13 | 0.0%

(b) Naive mapping

Task | Kernel | Blocking factor
1 | Demodulation | 512
2 | Lowpass (middle) | 128
3 | Bandpass | 1024
4 | Carrier | 1024
5 | Frequency shift | 1024
6 | Lowpass (side) | 128
7 | Sum | 128

(c) Optimized mapping (four tasks)

Kernel | Blocking factor
Demodulation | 1024
Bandpass | 1024
Carrier (even) | 1024
Carrier (odd) | 1024
Lowpass (middle) | 128
Frequency shift | 1024
Lowpass (side) | 128
Sum | 128
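The per-kernel load figures of Table 1(a) follow directly from the multiplicity and the time per firing; the short check below (with values copied from the table) reproduces them.

    # % of total load = multiplicity * time_per_firing, normalised over all kernels.
    kernels = {
        "Demodulation":     (8, 398),
        "Lowpass (middle)": (1, 7220),
        "Bandpass":         (8, 7246),
        "Carrier":          (8, 14351),
        "Frequency shift":  (8, 12),
        "Lowpass (side)":   (1, 7361),
        "Sum":              (1, 13),
    }
    total = sum(m * t for m, t in kernels.values())
    for name, (m, t) in kernels.items():
        print("%-16s %.1f%%" % (name, 100.0 * m * t / total))
    # Prints e.g. Carrier 60.2% and Bandpass 30.4%, matching Table 1(a).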
7 Related Work
Most work on machine description languages for retargetable compilers has focused on describing the ISA and micro-architecture of a single processor. Among others, the languages ISP, LISA, and ADL may be used for simulation, and CODEGEN, BEG, BURG, nML, EXPRESSION, Maril and GCC's .md machine description are intended for code generation (see, e.g., [17]). The ASM describes the behaviour of the system in terms of that of its parts, and is designed to co-exist with these lower-level models.

The Stream Virtual Machine (SVM) is an intermediate representation of a stream program, which forms a common language between a high-level and a low-level compiler [18,19]. Each kernel is given a linear computation cost function, comprised of a fixed overhead and a cost per stream element consumed. There is no model of irregular dataflow. The SVM architecture model is specific to graphics processors (GPUs), and characterizes the platform using a few parameters such as the bandwidth between local and global memory.

The PCA Machine Model [20], by the Morphware Forum, is an XML definition of a reconfigurable computing device, in terms of resources, which may be processors, DMA engines, memories and network links. The reconfigurable behaviour of a target is described using ingredients and morphs. Unlike the ASM, the PCA Machine Model describes the entire target, including low-level information about each processor's functional units and number of registers.

ORAS is a retargetable simulator for design-space exploration of stream-based dataflow architectures [21]. The target is specified by the architecture instance, which defines the hardware as a graph of architecture elements, similar to the resources of the ASM. Since the purpose is performance analysis rather than compilation, the system is specified to a greater level of detail than the ASM.

Gordon et al. present a compiler for the StreamIt language targeting the Raw Architecture Workstation, applying similar transformations to those discussed in this paper [22]. As the target is Raw, there is no general machine model similar to the ASM. The compiler uses simulated annealing to minimize the length, in cycles, of the critical path. Our approach has higher computational complexity in the compiler's cost model, but provides retargetability and greater flexibility in the program model.

Gedae is a proprietary stream-based graphical programming environment for signal processing applications in the defense industry. The developer specifies the mapping of the stream program onto the target, and the compiler generates the executable implementation [23]. There is no compiler search algorithm or cost model. A version of Gedae has been released for the Cell processor.
Acknowledgements

The researchers at BSC-UPC were supported by the Spanish Ministry of Science and Innovation (contract no. TIN2007-60625), the European Commission in the
context of the ACOTES project (contract no. IST-34869) and the HiPEAC Network of Excellence (contract no. IST-004408). We would also like to acknowledge our partners in the ACOTES project for the insightful discussions on the topics presented in this paper.
References

1. Sutter, H., Larus, J.: Software and the concurrency revolution. Queue 3(7), 54–62 (2005)
2. Parkhurst, J., Darringer, J., Grundmann, B.: From single core to multi-core: preparing for a new exponential. In: Proc. ICCAD 2006, pp. 67–72. ACM Press, New York (2006)
3. Chaoui, J., Cyr, K., Giacalone, J., Gregorio, S., Masse, Y., Muthusamy, Y., Spits, T., Budagavi, M., Webb, J.: OMAP: Enabling Multimedia Applications in Third Generation (3G) Wireless Terminals. SWPA001 (2000)
4. Chen, T., Raghavan, R., Dale, J., Iwata, E.: Cell Broadband Engine Architecture and its first implementation. IBM developerWorks (2005)
5. ClearSpeed: CSX Processor Architecture Whitepaper (2005), http://www.clearspeed.com/docs/resources/ClearSpeed_Architecture_Whitepaper_Feb07v2.pdf
6. Thies, W., Karczmarek, M., Amarasinghe, S.: StreamIt: A Language for Streaming Applications. ICCC 4 (2002)
7. Lee, E., Messerschmitt, D.: Synchronous data flow. Proceedings of the IEEE 75(9), 1235–1245 (1987)
8. Gummaraju, J., Rosenblum, M.: Stream Programming on General-Purpose Processors. In: Proc. MICRO 38, Barcelona, Spain (November 2005)
9. ACOTES IST-034869: Advanced Compiler Technologies for Embedded Streaming, http://www.hitech-projects.com/euprojects/ACOTES/
10. Balart, J., Duran, A., Gonzalez, M., Martorell, X., Ayguade, E., Labarta, J.: Nanos Mercurium: a Research Compiler for OpenMP. In: Proceedings of the European Workshop on OpenMP (2004)
11. Carpenter, P., Rodenas, D., Martorell, X., Ramirez, A., Ayguadé, E.: A streaming machine description and programming model. In: Vassiliadis, S., Bereković, M., Hämäläinen, T.D. (eds.) SAMOS 2007. LNCS, vol. 4599, pp. 107–116. Springer, Heidelberg (2007)
12. ACOTES: IST ACOTES Project Deliverable D2.2, Report on Streaming Programming Model and Abstract Streaming Machine Description, Final Version (2008)
13. Stephens, R.: A survey of stream processing. Acta Informatica 34(7), 491–541 (1997)
14. Kahn, G.: The semantics of a simple language for parallel processing. Information Processing 74, 471–475 (1974)
15. Girona, S., Labarta, J., Badia, R.: Validation of Dimemas communication model for MPI collective operations. In: Dongarra, J., Kacsuk, P., Podhorszki, N. (eds.) PVM/MPI 2000. LNCS, vol. 1908, pp. 39–46. Springer, Heidelberg (2000)
16. GNU Radio, http://www.gnu.org/software/gnuradio/
17. Ramsey, N., Davidson, J., Fernandez, M.: Design principles for machine-description languages. ACM Trans. Programming Languages and Systems (1998)
18. Labonte, F., Mattson, P., Thies, W., Buck, I., Kozyrakis, C., Horowitz, M.: The stream virtual machine. In: Proc. PACT 2004, pp. 267–277 (2004)
19. Mattson, P., Thies, W., Hammond, L., Vahey, M.: Streaming virtual machine specification 1.0. Technical report (2004), http://www.morphware.org
20. Mattson, P.: PCA Machine Model, 1.0. Technical report (2004)
21. Kienhuis, B.: Design Space Exploration of Stream-based Dataflow Architectures: Methods and Tools. Delft University of Technology, Amsterdam, The Netherlands (1999)
22. Gordon, M., Thies, W., Amarasinghe, S.: Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: Proc. ASPLOS 2006, pp. 151–162 (2006)
23. Lundgren, W., Barnes, K., Steed, J.: Gedae: Auto Coding to a Virtual Machine. In: Proc. HPEC (2004)
CABAC Accelerator Architectures for Video Compression in Future Multimedia: A Survey

Yahya Jan and Lech Jozwiak

Faculty of Electrical Engineering, Eindhoven University of Technology, The Netherlands
{Y.Jan,L.Jozwiak}@tue.nl
Abstract. The demands for high quality, real-time performance and multi-format video support in consumer multimedia products are ever increasing. In particular, future multimedia systems require efficient video coding algorithms and corresponding adaptive high-performance computational platforms. The H.264/AVC video coding algorithms provide high enough compression efficiency to be utilized in these systems, and multimedia processors are able to provide the required adaptability, but the complexity of the algorithms demands more efficient computing platforms. Heterogeneous (re-)configurable systems composed of multimedia processors and hardware accelerators constitute the main part of such platforms. In this paper, we survey the hardware accelerator architectures for Context-based Adaptive Binary Arithmetic Coding (CABAC) of the Main and High profiles of H.264/AVC. The purpose of the survey is to deliver a critical insight into the proposed solutions, and in this way facilitate further research on accelerator architectures, architecture development methods and supporting EDA tools. The architectures are analyzed, classified and compared based on the core hardware acceleration concepts, algorithmic characteristics, video resolution support and performance parameters, and some promising design directions are discussed. The comparative analysis shows that the parallel pipeline accelerator architecture seems to be the most promising.

Keywords: RC hardware architectures, accelerators, multimedia processing, UHDTV, video compression, H.264/AVC, CABAC.
1 Introduction
The real-time performance requirements of modern multimedia applications, like video conferencing, video telephony, camcorders, surveillance, medical imaging, and especially High Definition Television (HDTV) and the new emerging Ultra HDTV (UHDTV) in the video broadcasting domain, demand highly efficient computational platforms. The problem is amplified by the quickly growing requirements of higher and higher quality, especially in the video broadcast domain, which result in a huge amount of data processing for the new standards of digital TV, like UHDTV, which requires a resolution of 7680x4320 (~33 Megapixels)
with a data rate of 24Gbps. Additionally, the video coding algorithms of the latest standards are much more complex, due to the digital multimedia convergence, and specifically the access to multimedia through a variety of networks and the different coding formats used by a single device, as well as the slow vanishing of the old video coding standards (e.g. MPEG-2) and the widespread adoption of the new standards (e.g. H.264/AVC, VC1, etc.). The computational platforms for multimedia are also required to be (re-)configurable, to enable their adaptation to the various domains, accessing networks, standards and work modes. Hardware accelerators constitute the kernel of such (re-)configurable high-performance platforms.

Despite spectacular advances in the microelectronic industry, future multimedia systems cannot be realized using the conventional processor architectures or the existing multimedia processors. They require highly efficient specialized hardware architectures to satisfy the stringent functional and non-functional requirements, have to be flexible enough to support multiple domains, standards and modes, and have to be implemented with SoC platforms involving embedded (re-)configurable hardware. In particular, (re-)configurable hardware accelerators are indispensable for the development of these specialized and demanding systems, as well as new design and design automation methodologies to support the development of such accelerators.

H.264/AVC [1] is the latest multi-domain video coding standard that provides a compression efficiency almost 50% higher than former standards (e.g. MPEG-2) due to its advanced coding tools. However, its computational complexity is about four times higher compared to its predecessors, and this induces the necessity of realizing real-time video coding through a sophisticated dedicated hardware design. H.264/AVC supports two entropy coding modes: Context Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC) [2]. CAVLC covers the Baseline profile of H.264/AVC for low-end applications, like video telephony, while CABAC targets the Main and High profiles for high-end applications, like HDTV. CABAC improves the compression efficiency by 9%-14% as compared to CAVLC, at the cost of an increase in complexity of 25-30% and 12% for encoding and decoding, respectively, in terms of access frequency [2][3][4]. Its purely software based implementation results in an unsatisfactory performance even for low quality and resolution video (e.g. 30-40 cycles are required on average for a single bin decoding on a DSP [3]). The situation is much worse for High Definition (HD) video, as the maximum bin rate requirement of HD (level 3.1 to 4.2) in H.264/AVC, averaged across a coded picture, ranges from 121 Mbins/s to 1.12 Gbins/s [5]. This makes the software based implementation inadequate to achieve real-time performance for HD video, as a multi-gigahertz RISC processor would be required for HD encoding in real time [6]. Moreover, the serial nature of CABAC paralyzes the other processes in the video codec that could be performed in parallel, making CABAC a bottleneck in the overall codec performance. Consequently, to achieve the required performance, flexibility, low cost and low energy consumption, a sophisticated (re-)configurable hardware accelerator for CABAC is an absolute necessity.
However, the bitwise serial processing nature of CABAC, the strong dependencies among the different partial computations, a substantial number of memory accesses, and the variable number of cycles per bin processed pose a huge challenge for the design of such an effective and efficient hardware accelerator. Numerous research groups from academia and industry all over the world have proposed different hardware architectures for CABAC using different hardware acceleration concepts and schemes.

Our work reported in this paper is performed in the framework of a research project that aims to develop an adequate design methodology and propose supporting EDA tools for the development of demanding (re-)configurable hardware accelerators. This paper surveys several of the most interesting recently proposed hardware accelerator architectures for CABAC. Its main purpose is to deliver a critical insight into the proposed hardware accelerator solutions, and in this way facilitate our own and other researchers' further work on (re-)configurable accelerator architectures for future complex multimedia applications, architecture development methods and supporting EDA tools. The architectures are analyzed, classified and compared based on the core hardware acceleration concepts, algorithmic characteristics, video resolution support and performance parameters in the hardware accelerator domain, like throughput, frequency, resource utilization and power consumption. Based on the critical architecture comparisons, some promising design directions are discussed in view of the requirements of current and future digital multimedia applications.

The rest of the paper is organized as follows. Section 2 introduces CABAC. Section 3 covers the main hardware accelerator concepts and classification. Using them, Section 4 presents a critical review of hardware accelerator architectures for CABAC, compares the various architectures and discusses some promising design directions. Section 5 concludes the paper.
2 Introduction to CABAC
CABAC utilizes three elementary processes to encode a syntax element (SE), i.e. an element of data (motion data, quantized transform coefficient data, control data) represented in the bitstream to be encoded. The processes are binarization, context modeling and binary arithmetic coding, as shown in Figure 1.

Fig. 1. Block Diagram of CABAC Encoder

The binarization maps a non-binary valued SE to a unique binary representation referred to as a bin string. Each bit of this binary representation is called a bin. The reduction of the SE alphabet size to binary in binarization not only minimizes the complexity of the arithmetic coder, but also enables the subsequent context modeling stage to more efficiently model the statistical behavior of the syntax elements (SEs). Four basic binarization schemes are used in CABAC [2].

The context modeling process determines the probabilities of the bins using pre-defined context (probability) models, before they are encoded arithmetically. The context models are selected taking into account the neighboring information of the bins/SEs, referred to as the context. CABAC defines 460 unique context models, each of which corresponds to a certain bin or several bins of an SE, and these are
updated after bin encoding in order to adapt the models to the varying statistics of the video data. Each context model comprises the 6-bit probability state index (pStateIdx) and the most probable symbol (MPS) value of the bin [2].

CABAC utilizes the table-based binary arithmetic coder [7] to avoid the costly multiplication process in the probability calculation. The binary arithmetic coding engine consists of two sub-engines: regular and bypass, as shown in Figure 1. The regular coding engine utilizes adaptive probability models, while the bypass coding engine assumes a uniform probability model to speed up the encoding process. To encode a bin, the regular coding engine requires the probability model (pStateIdx, MPS) and the corresponding interval range (width) R and base (lower bound) L of the current code interval. The interval is then divided into two subintervals according to the probability estimate ρ_LPS of the least probable symbol (LPS). Then, one of the subintervals is chosen as the new interval based on whether the bin is equal to the MPS or the LPS, as given in the following equations [2]:

    R_new = R − R_LPS   and   L_new = L,               if bin = MPS    (1)
    R_new = R_LPS       and   L_new = L + R − R_LPS,   if bin = LPS    (2)

where R_LPS = R · ρ_LPS represents the size of the subinterval associated with the LPS. The probability model is then updated, and renormalization takes place to keep R and L within their legal ranges. The process repeats for the next bin. In bypass encoding, the probability estimation and update processes are bypassed, because a uniform probability is assumed for a bypass bin.
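A direct transcription of the interval update of equations (1) and (2) is given below. This is only an arithmetic sketch: the derivation of R_LPS from the look-up table indexed by pStateIdx, the probability state update and the exact carry handling of the standard are omitted, so it is not a conformant H.264/AVC coder.

    # Regular-bin interval update following Eqs. (1) and (2).
    # r and low are the current interval range and base; r_lps is the LPS subinterval,
    # which the standard obtains from a table indexed by pStateIdx and bits of r.
    def encode_regular_bin(bin_val, mps, r, low, r_lps):
        if bin_val == mps:                     # Eq. (1): keep the lower subinterval
            r, low = r - r_lps, low
        else:                                  # Eq. (2): take the upper (LPS) subinterval
            r, low = r_lps, low + (r - r_lps)
        while r < 0x100:                       # renormalization keeps r in its legal range
            r <<= 1
            low <<= 1                          # bit output / carry propagation omitted here
        return r, low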
3 Main Concepts of Hardware Acceleration
A hardware accelerator is an application-specific hardware sub-system that can implement a given function more effectively and efficiently than software running on a conventional processor. A good example is the graphics accelerator. The main concepts of hardware acceleration can be summarized as follows:
– Parallelism exploitation for the execution of a particular computation instance due to the availability of multiple application-specific operational resources working in parallel;
– Parallelism exploitation for the execution of several different computation instances at the same time due to pipelining;
– Application-specific processing units with tailored processing and data granularity.

More specifically, these concepts can be oriented towards data parallelism, functional parallelism or their mixture. In data parallelism, multiple data instances of the same type are processed in parallel, provided the application allows it and the resources are available. Functional parallelism simultaneously performs different operations on (possibly) different data instances. Also, speculative execution can be used to enable more parallelism. To design a high quality hardware accelerator, it is necessary to perform a thorough analysis of the application algorithms and exploit the specific computational characteristics inherent to these algorithms. The different characteristics discovered and accounted for result in different approaches to the design of hardware accelerators, and therefore, in the past, a number of different basic architecture types were proposed:

– Straightforward datapath/controller hardware architecture
– Parallel hardware architecture
– Pipeline hardware architecture
– Parallel pipeline hardware architecture
– General purpose processor (GPP) augmented by a loosely coupled hardware accelerator, resulting from the HW/SW co-design approach
– Extensible/Customizable Application Specific Instruction Set Processor (ASIP) with basic accelerators in the form of instruction set extensions (ISE)

These basic architectures will be used to categorize the CABAC accelerators.
4 Overview of Hardware Accelerators for CABAC
The accelerator architectures are analyzed here in a systematic conceptual way, based on the computational characteristics of CABAC, and thus differently than in the sporadic fragmentary comparisons that can be found in the literature. Moreover, we focus on the main problems and solutions that drastically affect the achieved results. We will not actually consider the mixed HW/SW solution for CABAC, i.e. a GPP augmented by a loosely coupled hardware accelerator, because this option is not promising regarding real-time requirements satisfaction, due to the strong dependencies in the computations and the resultant high communication overhead. The performance of the different approaches is analyzed and the results are compared, focusing on the throughput, maximum frequency and area. In almost all of the reviewed papers
no systematic analysis is provided or methods proposed on how to integrate the CABAC accelerator in a complete H.264/AVC en-/decoder.

Before considering the accelerator architectural approaches, we have to give a brief overview of the main implementation issues in CABAC. Five memory operations are involved in the en-/decoding of a single bin, and there are two blocking dependencies that hamper the parallel and pipeline approaches. The first dependency is related to the context model update: until the context model is updated for the current bin, the processing of the next bin cannot be started, because the same context model may be used to en-/decode the next bin. The other dependency involves the interval range (R) and base (L) update: until both are renormalized in the renormalization stage, which involves multiple branches and a variable number of cycles, the processing of the next bin cannot be initiated, because the probability estimation of the next bin depends on the current interval range. These strong dependencies are some of the main challenges in the accelerator design, and a number of solutions have been proposed to tackle these problems.

4.1 Straightforward Datapath/Controller Accelerators
The straightforward datapath/controller approach follows the data flows in the algorithm of the software based implementation. This accelerates the computations to some degree, but does not exploit the true (parallel) nature of the application algorithm or the improvement achievable using the hardware acceleration approach. This approach is followed in some CABAC accelerators, in the sense that processing is performed sequentially on a per bin basis, and the possibilities for multi-bin parallel processing are not explored. This always limits the performance to at most 1 bin/cycle, and a simple serial hardware implementation without any optimizations takes as many as 14 cycles to encode a single bin [8]. Some optimization techniques, like pre-fetching and simple parallelism [8], were proposed that enable one bin to be processed in 5 cycles. Chen et al. [9] proposed an FSM and a memory scheme for neighboring SEs, which results in a decoding throughput of 0.33∼0.50 bin/cycle. However, it decodes only CIF video at 30fps.

4.2 Parallel Hardware Accelerators
The inefficiency of the straightforward acceleration approaches in en-/decoding HD video in real time motivated the research community to explore alternative approaches to the design of CABAC accelerators. The most promising approach to achieve real-time performance for high resolution video is to process more than one bin/cycle, i.e. to utilize a parallel approach. However, in the en-/decoding of even a single bin, complex interdependencies have to be resolved, as discussed before, and consequently the algorithm cannot be parallelized in its true basic nature. Utilizing the static and dynamic characteristics of the SEs, which can be discovered through an analysis of the CABAC algorithm for real video sequences, parallelism can be achieved up to some level for some specific SEs, which can result in the processing of more than one bin/cycle. However, in parallel
en-/decoding of two or more regular bins, the context models have to be supplied to the coding engines. Due to the blocking dependencies, this cannot be performed in parallel. Also, the context model fetching takes a substantial time. The details of these characteristics of the SEs and the corresponding parallel schemes are discussed below.

Yu et al. [3] proposed the first parallel architecture for CABAC decoding. Unlike the conventional approaches [8][9] that take a number of cycles to decode a single bin, this architecture decodes 1∼3 bins/cycle. The parallelism in this architecture is achieved through a cascade of arithmetic decoding engines: two regular ones and two bypass. This enables the decoding of 1 Regular Bin (1RB), 1RB with 1 Bypass Bin (1BB), 2RB with 1BB, and 2BB bins in parallel for frequently occurring SEs, like residual data. To reduce the context memory accesses, the relevant context models of an SE or group of SEs are accessed in blocks and stored in a high speed register bank. However, this comes at the extra cost of the register bank. The architectures [10][11][12][13][14] are based on the same concept, but after some specific extensions they are capable of en-/decoding HD video. In [15], sixteen cascaded regular decoding units are used to achieve more speed-up for frequent SEs. However, due to dependencies the throughput remains less than 1 bin/cycle, and it causes an increase in the critical path latency and circuit area. In [16], five different architectures for the CABAC encoder are designed and analyzed for the area/performance trade-off. The results show that the architectures combining two regular bins with bypass bins perform better for high quality video than the others.

A predictive approach is employed by Kim et al. [17]. Unlike the architectures [3][10][11][12][14][15], in which there is a latency due to the cascaded arithmetic coding engine, this architecture initiates the decoding of two bins simultaneously by prediction. However, due to mis-prediction only 0.41 bin/cycle is achieved, although with a high frequency of 303MHz.

Algorithmic optimizations can expose far more parallelism than is available in the original application algorithm. A novel algorithm is proposed by Sze et al. [5] which is fundamentally parallel in nature and deterministically en-/decodes several (N) bins with different contexts at the same time. The context models for the different bins are determined simultaneously using conditional probabilities, which is different from the predictive strategy [17] and the cascaded approaches [3][10][11][12][14][15]. The two possible context models that could be used for the second bin are determined by taking into account the two possible values of the first bin (0 or 1). Its software implementation (N=2) enables 2 bins/cycle at the cost of a 0.76% increase in bit rate compared to the original CABAC algorithm. However, such optimizations require 3 to 4 multiplications for the en-/decoding of two bins, as well as comparators, which could make a hardware implementation costly in resources.

4.3 Pipeline Hardware Accelerators
Although the parallelism in the form of multi-bin processing in CABAC outperforms the conventional approach, it increases the complexity of the architecture, especially in the renormalization and context management. The cascaded multiple
processing engines also increase the critical path delay. Moreover, the hardware resources are much increased without much gain from acceleration [4]. In addition, multi-bin processing only accelerates the decoding of certain frequent SEs, is not equally effective for all SEs, and the number of cycles per bin processed varies. Therefore, the pipeline concept of hardware acceleration is also utilized in CABAC, with the prime goal of achieving real-time performance for HD video. A number of pipeline schemes have been proposed to effectively overcome the problems and complexities of the other schemes discussed earlier. However, pipeline hazards appear as a byproduct of pipelining, due to the tight dependencies in the CABAC algorithm. There are two pipeline hazards in CABAC: data and structural. A data hazard occurs when the same context model is used for the next bin as for the current bin, which is a read after write (RAW) data hazard. A structural hazard occurs when the context memory is accessed at the same time by the context model write for the current bin and the context model read for the next bin. These hazards cause pipeline stalls that decrease the throughput of a purely pipelined architecture from the maximum of 1 bin/cycle to a lower value. Below, the details of the pipeline schemes, the solutions for pipeline hazards and the performance of the proposed pipeline accelerators are discussed.

Zheng et al. [18] proposed a two stage pipeline decoding architecture for residual SEs only. The stalls in the pipeline are eliminated using the standard look ahead (SLA) technique, which determines the context model for the next bin using both possible values of the current bin. The proposed architecture supports HD1080i video. This SLA approach is also used in the pipeline architectures [19][20][21]. Yi et al. [22] proposed a two stage pipeline decoding architecture, instead of the usual 4 stages, to reduce the pipeline latency and to increase the throughput. The data hazards are removed using a forwarding approach, and the structural hazards by using a context model reservoir (CMR) with the context memory. However, the SE switching causes stalls due to the CMR update, and this limits the throughput to an average of 0.25 bin/cycle. This problem is solved in [23] by using an SE predictor that increases the throughput to 0.82 bin/cycle. Li et al. [4] proposed a three stage dynamic pipeline codec architecture. The pipeline is dynamic in the sense that the pipeline latency varies between one and two cycles depending on the bin type. No pipeline stalls occur for the BB and for the RB of value MPS with the interval range (R) within its limit. For data hazard removal a pipeline bypass scheme is used, and for structural hazards a dual-port SRAM. The bin processing rate of [18][23] is higher than that of [4], because of the coarse pipeline stages with efficient context management. Tian et al. [24] proposed a three stage pipeline encoding architecture. Two pipeline buffers are introduced to resolve the pipeline hazards and the latency issue of [4], which results in a throughput of exactly 1 bin/cycle. Chang [25] proposed a three stage pipeline architecture that combines the different speed-up methods proposed earlier, like pipeline stall reduction for SE switching, context model clustering to decrease context memory accesses, and a two-bin arithmetic decoding engine. The architecture achieves an average throughput of 0.63 bin/cycle at a comparatively high frequency, as shown in Table 1.
Table 1. Comparison of Different Hardware Accelerator Architectures

Approach | Design | Freq. (MHz) | Throughput (Bins/Cycle) | VLSI Tech. TSMC (µm) | Circuit Area (gates) | Resolution Support
Datapath/Control | [8] Codec | 30 | 0.2 | Virtex-II | 80,000 (Inc.)* | SD480i@30fps
Datapath/Control | [9] Decoder | 200 | 0.33∼0.5 | 0.13 | 138,226 (Inc.) | CIF@30fps
Parallel | [3] Decoder | 149 | 1∼3 | | |
Parallel | [14] Encoder | 186 | 1.9∼2.3 | | |
Parallel | [15] Decoder | 45 | | | |
Parallel | [17] Decoder | 303 | | | |
Pipeline | [4] Codec | | | | |
Pipeline | [18] Decoder | | | | |
Pipeline | [22] Decoder | | | | |
Pipeline | [24] Encoder | | | | |
Pipeline | [25] Decoder | | | | |
Parallel pipeline | [21] Decoder | | | | |
ASIP/ISE | [31] Decoder | | | | |
Fig. 6. The H.264 CABAC decoding process (flow chart covering the MPS/LPS decision based on the decoded value and rLPS, the next-state look-ups in the MPSnextState and LPSnextState tables, the change of MPS meaning when the probability state reaches zero, and the renormalization loop that shifts the range left and reads new bits into the value while range < 0x100)
4.1 Application Specific Processor Implementation
Our variable length decoding processor design, shown in Figure 7, is based on the Transport Triggered Architecture template, and could ultimately include a critical path function unit for each variable length code. However, we decided to employ a more general design that includes a look-up table for Huffman codes and CABAC style renormalizations, while a multiplier is provided to enable coping with range coding schemes such as the Boolean coder used with the On2 VP6 codec [17]. The cost of this flexibility is a longer critical path, which was acceptable due to the availability of earlier code-specific hardwired designs. Table 5 illustrates the performance and silicon area of hardwired, software and application specific TTA processor implementations of the CABAC decoder.
Fig. 7. VLC processor; LSU (Load and Store Unit), Renormalize unit for CABAC, Mul (16 bit multiplier for range codes), RLPS (table look-up unit), GCU (Global Control Unit), ALU (Arithmetic Logic Unit)
Table 5. The performances and silicon areas of CABAC arithmetic decoder implementations in 90 nm CMOS technology

 | General purpose processor (ARM926EJ-S) | Our design | Hardwired accelerator
Cycles/bit | 172-248 | 51-66 | 1-4
Area (mm2) | 1.40 [18] | 0.085 (est.) | 0.093
The hardwired decoder is a commercial footprint-optimized (33k gates) design used with a monolithic HD resolution video decoder. It performs best by a wide margin, but is only intended for CABAC. The code for the programmable processors has been written in the C language. The application specific processor runs CABAC about three times faster than an embedded microcontroller, but at realistic clock rates they both fall short of decoding HD resolution bitstreams, which in bursts can go up to 20 Mbit/s. However, even with the current design, the TTA processor, when clocked at 166 MHz, decodes 3 Mbit/s, which is satisfactory for mobile applications. Assembly coding could speed up both implementations by approximately a factor of 2-3.

In mass market devices the silicon area of the individual designs can be an important cost issue. The CABAC software implementation on a general purpose RISC processor, such as the ARM9, uses most of its resources and consumes a relatively large silicon area. The silicon needs of the hardwired CABAC accelerator and the TTA processor are close to each other, and the inclusion of the CABAC critical path as a special function unit in the TTA processor would not result in a significant increase in area. In fact, the area of a separate CABAC functional unit would be at the same level as that of the dedicated hardwired accelerator. Unfortunately, power figures are currently not available.
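The quoted bit rate can be cross-checked with simple arithmetic from Table 5 (our own back-of-the-envelope calculation, not taken from the measurements):

    # Decoded bit rate = clock frequency / cycles per bit.
    clock_hz = 166e6                        # TTA processor clocked at 166 MHz
    for cycles_per_bit in (51, 66):         # range reported for our design in Table 5
        print("%.1f Mbit/s" % (clock_hz / cycles_per_bit / 1e6))
    # Gives about 2.5-3.3 Mbit/s, consistent with the 3 Mbit/s figure above and
    # clearly below the ~20 Mbit/s bursts of HD bitstreams.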
5 Conclusions
Programmable accelerators based on application specific processors make it possible to implement multi-standard video decoders that have sufficient performance to process high definition video without losing power efficiency. At the same time it is possible to maintain the flexibility needed to support reconfigurability for several video coding standards. These qualities are hard to obtain with plain hardware or software solutions, even though the performance improvements of application and DSP processors targeted at mobile multimedia devices have been substantial in the last few years.

Based on our simulations, the designed accelerator for motion compensation offers almost the same level of performance as pure hardware solutions, and it is capable of supporting other video coding standards in addition to H.264. Furthermore, our accelerator for arithmetic code decoding can offer 3-4 times better performance in comparison to a software implementation on an ARM9 processor. However, there are still opportunities for improvement. To achieve performance
close to monolithic hardware would require including the hardwired CABAC critical path in the programmable solution.
Acknowledgments

This research has been funded by the Finnish Technology Development Agency (TEKES). We also wish to thank Professor Jarmo Takala for his contributions.
References

1. ISO/IEC 14496-2:2004: Information technology - Coding of audio-visual objects - Part 2: Visual. ISO/IEC, Third edn. (June 2004)
2. ISO/IEC 14496-10:2005; Recommendation ITU-T H.264: Series H: Audiovisual and Multimedia Systems, Infrastructure of audiovisual services - Coding of moving video: Advanced video coding for generic audiovisual services. ITU-T (November 2005)
3. Fitzek, F.H.P., Reichert, F.: Mobile Phone Programming: and its Application to Wireless Networking. Springer, Heidelberg (2007)
4. SMPTE 421M-2006: VC-1 Compressed Video Bitstream Format and Decoding Process. SMPTE (February 2006)
5. Jer-Min, H., Chun-Jen, T.: Analysis of an SoC architecture for MPEG reconfigurable video coding framework. In: IEEE International Symposium on Circuits and Systems, ISCAS 2007, May 27-30, pp. 761–764 (2007)
6. Eeckhaut, H., Christiaens, M., Stroobandt, D., Nollet, V.: Optimizing the critical loop in the H.264/AVC CABAC decoder. In: IEEE International Conference on Field Programmable Technology, FPT 2006, December 2006, pp. 113–118 (2006)
7. Jääskeläinen, P., Guzma, V., Cilio, A., Pitkänen, T., Takala, J.: Codesign toolset for application-specific instruction-set processors, vol. 6507, 65070X. SPIE (2007)
8. On2 Technologies (2008), http://www.on2.com
9. ISO/IEC JTC1/SC29/WG11 N8069: Reconfigurable Video Coding Requirements v.2.0. ISO/IEC (November 2006)
10. Richardson, I., Bystrom, M., Kannangara, S., Frutos, D.: Dynamic configuration: Beyond video coding standards. In: IEEE System on Chip Conference. IEEE, Los Alamitos (2008)
11. Lucarz, C., Mattavelli, M., Thomas-Kerr, J., Janneck, J.: Reconfigurable media coding: A new specification model for multimedia coders. In: IEEE Workshop on Signal Processing Systems, October 2007, pp. 481–486 (2007)
12. Wang, S.Z., Lin, T.A., Liu, T.M., Lee, C.Y.: A new motion compensation design for H.264/AVC decoder. In: IEEE International Symposium on Circuits and Systems, ISCAS 2005, May 2005, vol. 5, pp. 4558–4561 (2005)
13. Tsai, C.Y., Chen, T.C., Chen, T.W., Chen, L.G.: Bandwidth optimized motion compensation hardware design for H.264/AVC HDTV decoder. In: 48th Midwest Symposium on Circuits and Systems, August 2005, vol. 2, pp. 1199–1202 (2005)
14. Zatt, B., Ferreira, V., Agostini, L.V., Wagner, F.R., Susin, A.A., Bampi, S.: Motion compensation hardware accelerator architecture for H.264/AVC. In: Mery, D., Rueda, L. (eds.) PSIVT 2007. LNCS, vol. 4872, pp. 24–35. Springer, Heidelberg (2007)
15. Li, Y., He, Y.: Bandwidth optimized and high performance interpolation architecture in motion compensation for H.264/AVC HDTV decoder. J. Signal Process. Syst. 52(2), 111–126 (2008)
16. Marpe, D., Schwarz, H., Wiegand, T.: Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology 13(7), 620–636 (2003)
17. On2 Technologies (2008), http://www.on2.com
18. ARM (2008), http://www.arm.com/products/cpus/arm926ej-s.html
Scenario Based Mapping of Dynamic Applications on MPSoC: A 3D Graphics Case Study

Narasinga Rao Miniskar(1,3), Elena Hammari(2), Satyakiran Munaga(1,3), Stylianos Mamagkakis(1), Per Gunnar Kjeldsberg(2), and Francky Catthoor(1,3)

(1) IMEC, Kapeldreef 75, Leuven 3001
{miniskar,satyaki,mamagka,catthoor}@imec.be
(2) NTNU, 7491 Trondheim, Norway
{hammari,per.gunnar.kjeldsberg}@iet.ntnu.no
(3) K.U. Leuven, ESAT Dept., Leuven 3001
Abstract. Modern multimedia applications are becoming increasingly dynamic. The state-of-the-art scalable 3D graphics algorithms are able to adapt their hardware resource allocation requests at run-time according to the input, resource availability and a number of quality metrics. Additionally, the resource management mechanisms are becoming more dynamic themselves and are able to cope efficiently at run-time with these varying resource requests, the available hardware resources and competing requests from other applications. In this paper, we study the dynamic resource requests of the Wavelet Subdivision Surfaces (WSS) based scalable 3D graphics application. We also show how to schedule its computational resources at run-time with the use of the Task Concurrency Management (TCM) methodology and the System Scenario based approach on an MPSoC platform with very heterogeneous Processing Elements (including RISC, VLIW and FPGA accelerator resources).
1 Introduction
It is common for embedded hardware platforms, nowadays, to feature multiple Processing Elements (PEs) and thus to be able to execute highly complex multimedia algorithms with big computational resource requirements. Scalable 3D graphics algorithms can demonstrate multimedia content that is created once and then scaled each time to match the characteristics of the embedded system where it is deployed (e.g., according to different consumer device displays) [7]. Therefore, such software applications have very dynamic computational resource requests, because the timing and size of each request is not known at design-time and is only determined at run-time based on the user actions and the dynamic scalability response of the algorithm itself according to the quality targeted [13]. Additionally, it is very likely that other software applications will be executing concurrently on the embedded platform, sharing the available resources dynamically as they are loaded and unloaded at run-time (e.g., you receive an email as you play a 3D game on your mobile phone). Therefore, scheduling the
tasks of these software applications is very challenging, because the following are not known at design-time: (i) the computational resource requests of each task, (ii) the number of tasks executed by each software application, and (iii) the combination of software applications that will be executed concurrently. The solution used today for this problem is to calculate the worst case resource request of any combination of tasks and software applications and to allocate resources accordingly at design-time. This solution requires a huge amount of computational resources and is very energy inefficient, as it cannot adapt the scheduling according to the actual run-time real-time demands of the software applications. Therefore the scheduler not only pre-allocates but also uses the maximum amount of processing resources.

In this paper, we characterize the run-time computational resource needs of a very demanding and dynamic scalable 3D graphics application [13], which executes on a heterogeneous Multiple Processor System on Chip (MPSoC) concurrently with other applications. For this case study, we implement the Task Concurrency Management (TCM) scheduling methodology [9] to schedule its tasks both at design-time and run-time, and we also implement the System Scenario approach [5] to avoid using one worst-case resource requirement solution. It is the first time a hardware accelerator is included in an MPSoC platform used for run-time scheduling algorithms like TCM. The computation demanding software of Wavelet Subdivision Surfaces (WSS) for 3D scalable graphics applications, as well as three more task graphs generated with TGFF [2], will be scheduled on the aforementioned platform.

The structure of this paper is as follows. The next section describes the related work. In Section 3, we describe the Task Concurrency Management (TCM) methodology and the System Scenario approach. In Section 4, we describe the Wavelet Subdivision Surfaces (WSS) case study, implement the methodology and show some experimental results. Finally, in Section 5, we draw our conclusions.
2 Related Work
In the context of scenarios, scenario-based design [1] has been used for some time in both hardware [10] and software design [3] of embedded systems. They use case diagrams [4] which enumerate, from a functional and timing point of view, all possible user actions and the system reactions that are required to meet a proposed system function. These scenarios are called use-case scenarios and do not concentrate on the resources required by a system to meet its constraints. In this paper, we concentrate on a different and complementary type of scenarios, which we call system scenarios [5]. These are derived from the combination of the behavior of the application and its mapping on the system platform. These scenarios are used to reduce the system cost by exploiting information about what can happen at run-time to make better design decisions at design-time, and to exploit the time-varying behavior at run-time. While use-case scenarios classify the application's behavior based on the different ways it can be used, system scenarios classify the behavior based on the multi-dimensional cost
trade-off during the implementation trajectory. In the context of this paper, whenever scenarios are mentioned they imply System Scenarios.

In the context of task scheduling, a good overview of early scheduling algorithms can be found in [11]. This paper uses the terminology task scheduling for both the ordering and the assignment. Scheduling algorithms can be roughly divided into dynamic and static scheduling. In a multiprocessor context, when the application has a large amount of non-deterministic behavior, dynamic scheduling has the flexibility to balance the computation load of processors at run-time and make use of the extra slack time coming from the variation from the Worst Case Execution Time (WCET). In the context of this paper, we have selected the Task Concurrency Management methodology [9] to implement a combination of design-time and run-time scheduling, which can balance performance versus energy consumption trade-offs.
3 Task Concurrency Management and System Scenarios Overview
As can be seen in Fig. 1, we use the Task Concurrency Management (TCM) methodology and the System Scenario approach to do the task scheduling of WSS and of any other applications that might be executing concurrently on the MPSoC platform. The two phases shown on the left are carried out at design-time and the two phases on the right side are carried out at run-time. All four phases are part of the System Scenario approach [5] and are instantiated for the TCM scheduling [9].

Fig. 1. Four phase System Scenario based TCM methodology

– At the Scenario identification phase, we define a number of scenarios for each application executed on the selected MPSoC platform. These scenarios are defined based on typical application inputs and their impact on the control flow and data flow of the software application. For each one of these scenarios, we evaluate the execution time and energy consumption of each task if it were mapped on any of the Processing Elements (PEs) of the selected MPSoC platform. We extract these energy and timing values at design-time via simulation and profiling and insert them, in addition to the application's task graph, in the Grey box model for TCM scheduling [9].
– At the Scenario exploitation phase, we produce at design-time a set of Pareto-optimal schedules using the Grey Box model in the TCM methodology. Each one of these schedules is a specific ordering of one scenario's tasks and their assignment to specific PEs. Each schedule is represented by a Pareto point on an energy vs performance Pareto curve, and each scenario is represented by the Pareto curve itself.
– At the Scenario detection phase, we monitor and detect at run-time the pre-calculated scenarios. Each time that a scenario is detected, the Scenario switching mechanism is initiated. Additionally, the energy and real-time constraints are monitored in order to select the optimal performance vs energy consumption trade-off.
– At the Scenario switching phase, when a scenario is detected, the TCM run-time scheduler selects and implements one of the pre-calculated Pareto-optimal schedules, thus switching from one Pareto curve to another. If the TCM run-time scheduler evaluates that a constraint (e.g., real time) will not be met, then it switches from one Pareto point to another (within the same Pareto curve) in order to meet this constraint at the expense of another resource (e.g., energy). The final result at every moment at run-time is the execution of one Pareto-optimal, pre-selected schedule.
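At run-time, picking a point on the detected scenario's Pareto curve amounts to choosing the lowest-energy schedule that still meets the deadline. The sketch below illustrates the idea; the data structures and values are ours and do not correspond to the actual TCM implementation.

    # Each pre-computed schedule (Pareto point) is a pair (execution_time, energy).
    def select_schedule(pareto_curve, deadline):
        feasible = [p for p in pareto_curve if p[0] <= deadline]
        if not feasible:
            return min(pareto_curve, key=lambda p: p[0])   # no point meets the deadline: take the fastest
        return min(feasible, key=lambda p: p[1])           # lowest energy among the feasible points

    # Illustrative curve for one scenario (times in ms, energies in mJ):
    curve = [(5.0, 9.0), (7.0, 6.5), (10.0, 5.0)]
    print(select_schedule(curve, deadline=8.0))            # -> (7.0, 6.5)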
4 WSS Characterization and Task Scheduling

4.1 Software Application Description
3D content made available on the Internet, such as X3D/VRML and MPEG-4 content, is transported over and consumed on many different networks (wireless and wired) and terminals (mobile phones, PDAs, PCs, etc.). A WSS-based scalable 3D graphics framework is proposed in [13], where each object's quality dynamically adapts to the viewing conditions while respecting constraints such as platform resources. Based on the user commands (e.g., move forward, move
backward, rotate, pitch, yaw, etc.), the best triangle budget and the related Level Of Detail (LOD) settings for each visible object are decided online at run-time. Once the triangles are obtained for each visible object, they are passed to the renderer. Depending on the user commands, the number of visible objects can change abruptly, and the object-level LOD settings vary accordingly. In the implemented framework, the overall scene-level quality is fixed. The software modules responsible for the processing of each 3D scene content are given below.
– Initialization Module
  • Process Scene Description: Reads the 3D scene information and all 3D objects' base meshes, textures, pre-processed quality error details, and scenario-based visibility and area data, and holds them in its internal data structure, shown as the Mesh Database (DB). This is a one-time initialization process of the 3D content.
  • WSS-Decoder: WSS-based decoding [13] is a very computation-heavy task (∼1.8 s on the TI-C64X+ and ∼33 s on the StrongARM processor), thus this task is mapped to a hardware accelerator and all the objects are decoded to the maximum LOD in the initialization phase itself. The triangles for all LOD settings are held in storage memory.
– Frame-based invocation modules: These modules are called for each frame when the user issues a command such as rotate, yaw, or pitch.
  • Check Visibility: This module checks the visibility of each object based on the bounding-box concept explained in [13] and outputs the visibility information to the Mesh DB.
  • Build Global Pareto curve: This module generates the global Pareto curve (trade-off curve) based on the error contributed by each visible object and its corresponding triangle budget. It uses a gradient descent algorithm to build this curve [13] (a simplified sketch is given after this list). After building the global Pareto plot, this module calls the decoder based on the changes in quality parameters.
  • Converter (Prepare for Renderer): Finally, this module converts the triangles into vertices of the 3D scene, which are later passed to the renderer [13].
For each frame, the application calls the above-mentioned modules in the sequence shown. In this application there is object-level parallelism: the objects could potentially be handled in parallel, which would enable multiple parallelized versions of this application. However, in the context of this paper, we want to show the benefits of TCM scheduling even without exploiting this parallelism (it is listed as future work).
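As an aside, the following is a rough sketch of how such a global trade-off curve could be built greedily (a gradient-descent-style refinement of the object offering the best error reduction per extra triangle). The object names, LOD tables, and triangle budget are invented for illustration; this is not the actual algorithm of [13].

```python
# Hypothetical greedy construction of a global error-vs-triangles trade-off
# curve from per-object LOD tables; values are made up for illustration.

# Per-object LOD table: list of (triangles, error), ordered from coarse to fine.
objects = {
    "obj1": [(100, 9.0), (400, 4.0), (1600, 1.0)],
    "obj2": [(200, 6.0), (800, 2.5), (3200, 0.5)],
}

def build_global_curve(objects, triangle_budget):
    """Greedily refine the object whose next LOD step gives the largest
    error reduction per extra triangle, recording the global trade-off."""
    level = {name: 0 for name in objects}            # current LOD index per object
    tris = sum(lods[0][0] for lods in objects.values())
    err = sum(lods[0][1] for lods in objects.values())
    curve = [(tris, err)]
    while True:
        best, best_gain = None, 0.0
        for name, lods in objects.items():
            i = level[name]
            if i + 1 < len(lods):
                d_tri = lods[i + 1][0] - lods[i][0]
                d_err = lods[i][1] - lods[i + 1][1]
                if tris + d_tri <= triangle_budget and d_err / d_tri > best_gain:
                    best, best_gain = name, d_err / d_tri
        if best is None:
            return curve
        i, lods = level[best], objects[best]
        tris += lods[i + 1][0] - lods[i][0]
        err -= lods[i][1] - lods[i + 1][1]
        level[best] += 1
        curve.append((tris, err))

print(build_global_curve(objects, triangle_budget=4000))
```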
4.2 Target Platform
The generic MPSoC platform that we decided to use in this paper is shown in Fig. 2. More specifically, we have considered a heterogeneous platform with two RISC processors (StrongARM 1100x), two VLIWs with six FUs each
(TI-C64X+), and one RISC processor (StrongARM 1100x) with an FPGA hardware accelerator (Virtex-5 XC5VSX95T, speed grade -3) as a co-processor, for handling control functions, data-level parallelism, and instruction-level parallelism, respectively. The StrongARM 1100x processors run at 1.48 V and 0.1632 A with a clock frequency of 133 MHz, the TI-C64X+ processors run at 1.2 V with a clock frequency of 500 MHz, and the FPGA runs at 100 MHz and 1.0 V. We assume a cache memory (C) for each processor and a single shared global memory (SM). The hardware accelerator is implemented with a Virtex-5 SXT FPGA from Xilinx for the purpose of accelerating the GlobalParetoBuild software module of the WSS task. The communication topology considered is a cross-bar switch. Any potential implications of communication or memory latency bottlenecks are not calculated in the context of this paper and are subject to future study.
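As a side note, a first-order, back-of-envelope energy estimate can already be derived from these operating points (E = V x I x t); the actual energy values used in this paper come from JouleTrack and a functional-level power model, as described below, so this sketch is only illustrative and the 45 ms task time is an assumed example.

```python
# Back-of-envelope energy estimate from the published operating points; this
# is only a first-order approximation (E = V * I * t), not the JouleTrack or
# functional-level power models actually used for the reported numbers.

def energy_mj(voltage_v, current_a, exec_time_ms):
    """Energy in millijoules for a task running exec_time_ms at fixed V and I."""
    power_w = voltage_v * current_a
    return power_w * (exec_time_ms / 1000.0) * 1000.0

# StrongARM 1100 operating point from the platform description: 1.48 V, 0.1632 A.
# 45 ms is an illustrative task execution time, roughly a CheckVisibility WCET.
print(round(energy_mj(1.48, 0.1632, 45.0), 3), "mJ")   # ~10.9 mJ
```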
Fig. 2. Target platform
Fig. 3. Gray-box-model of WSS
In order to apply the TCM methodology we require energy and timing information for each task (see Thread Frames and Thread Nodes in [9]) for the Gray-box model. For these experiments we have used profiling information from the StrongARM and TI-C64X+ processor simulators (the SimItARM simulator and TI CCStudio v3.3, respectively). To obtain energy information for tasks running on the StrongARM processor we have used JouleTrack [12]. For the TI-C64X+ energy profiling, we have used the functional-level power analysis model of the TI-C6X [8], modified for the functional units, instruction set, and memory hierarchy of the TI-C64X+. Parts of the heterogeneous platform have been used for experiments in earlier papers [9]. It is, however, the first time an FPGA-based hardware accelerator is included. We have explored the performance and energy consumption of parts of the GlobalParetoBuild task when implemented on a Virtex-5 SXT FPGA from Xilinx. The parts selected for hardware acceleration are the division operation and its combinations with square root, multiplication, and subtraction operations, which appear in the application code but do not have dedicated instructions in the StrongARM and TI-C64X+ processors. Ready-to-use division, square root, multiplication, and subtraction IP blocks from the library of the high-level
hardware development tool Xilinx System Generator [14] were utilized. The low-level hardware description for downloading to the FPGA was generated using the Xilinx ISE Design Suite; its Timing Analyzer and XPower Analyzer tools, along with the ModelSim simulator, were employed to find the execution time and energy consumption of the obtained hardware accelerator.
4.3 Inputs to 3D-WSS and Scenarios
In the experiments below, we consider WSS in a scalable 3D graphics engine for 3D games. In the game example there is an input representing 3 rooms (R1, R2 and R3), with 13 objects (fruits) in R1, 17 objects in R2 and 22 objects in R3. There are 4 cameras (C1, C2, C3 and C4) fixed in 4 positions in each room to view the objects from different angles. As an in-game character moves from one room to another, the best camera position can be chosen by the application controller according to the in-game cinematic effects. In each room of the game we have applied a sequence of user commands that defines the in-game character's movement and perspective. We have applied the systematic scenario identification methodology proposed in [5], based on an in-depth analysis of the application. We have identified a few parameters that cause variation in the software modules of this application (the room number and the camera position selected by the application). Based on the room in which the in-game character is present and the camera position, we can derive 12 scenarios (i.e., 3 rooms times 4 camera positions). For each scenario we obtain one Gray-box model for the TCM scheduling. For each scenario, we have considered the worst-case estimates of the number of objects, the number of Pareto points over all objects, and the number of triangles, and computed the Worst Case Execution Time (WCET) for all three software modules of this application. There is still dynamism in the execution time within each scenario, because of the varying number of visible objects (e.g., 8 to 13 in R1), the varying number of Pareto points (e.g., 200 to 350 in R1), and the varying number of triangles (e.g., 47000 to 50000 in R1) in each room. However, we do not yet have a methodology to deal with the dynamism due to these kinds of variables (data variables), which take a large range of values during program execution. In this paper we focus only on the parameters which can be handled through the systematic scenario identification methodology. As we have computed the WCETs using profiling, we cannot give any hard guarantees based on these execution times.
4.4 Gray Box Modelling
The task graph of the WSS application is shown in Fig. 3; it has three tasks: CheckVisibility (T1), Build-Pareto & Selection (T2), and Prepare Triangles (T3). The Gray-box model for the 3D-WSS is constructed based on this task graph and the captured profiling information (energy, execution time). For this we need to consider the WCET for each scenario and for each module of WSS. The WCETs of the three modules for the StrongARM and the TI processor are shown in Table 1. However, not all scenarios are required, due to the similarity of their WCETs. We have clustered the WCETs manually according to the systematic clustering approach [5,6] (a small clustering sketch is given after Table 1) and they
are represented by colors for each module in Table 1. For the GlobalParetoBuild module the clusters are (C1, C2, C4) and (C3) for each room, hence 6 scenarios in total instead of 12. Likewise, the scenarios derived after clustering are 3 for the CheckVisibility module over all rooms and 8 for the PrepareRender module over all rooms. We have repeated the clustering for the TI processor WCETs as well; however, we obtained the same clusters as for the StrongARM processor. Each cluster is treated as one scenario for the module. The scenarios obtained at this point are at the sub-task level; however, we need the scenarios at the complete task-graph level, for which we need to optimize Pareto-optimal mappings at run-time. From the clusters obtained for the sub-tasks, we selected the sub-task whose clustering has the maximum number of clusters and checked whether all other clusters fall under it; if not, the set can be clustered further. In the case of WSS, the clusters obtained from the PrepareRender module are sufficient for the whole task-graph level as well. In the end, we obtained 8 scenarios in total for the task graph.

Table 1. Scenario clustering of WCETs (ms) for StrongARM & TI

Ri-Ci   CheckVisibility     GlobalParetoBuild   PrepareRender
        ARM      TI         ARM      TI         ARM      TI
R3-C1   45.036   0.97       106.91   4.72       386.08   25.23
R3-C2   44.772   0.96       101.97   4.6        426.97   29.08
R3-C3   45.587   1.01        35.63   1.72       174.44   10.85
R3-C4   45.656   1.00       102.76   4.62       389.35   24.71
R2-C1   34.832   0.74        79.43   3.49       362.65   23.13
R2-C2   34.626   0.74        73.08   3.31       374.22   25.10
R2-C3   35.198   0.78        30.71   1.45       154.33    9.43
R2-C4   35.266   0.77        73.57   3.32       359.25   22.33
R1-C1   26.624   0.57        52.11   2.31       268.52   17.09
R1-C2   26.509   0.57        50.23   2.27       274.12   18.02
R1-C3   26.892   0.59        18.47   0.88        99.51    5.93
R1-C4   26.905   0.58        45.36   2.07       210.06   14.39
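To illustrate the idea (this is not the systematic clustering method of [5,6], which is applied manually in this paper), a simple relative-tolerance grouping of the StrongARM GlobalParetoBuild WCETs from Table 1 already separates the (C1, C2, C4) and (C3) cases per room; the tolerance value is our own assumption and the result depends on it.

```python
# Simple greedy grouping of the (room, camera) WCET points of one module;
# a value joins a cluster if it lies within rel_tol of the cluster's maximum.
# This heuristic is only illustrative of the clustering step.

wcet_gpb = {  # StrongARM GlobalParetoBuild WCETs (ms) from Table 1
    ("R3", "C1"): 106.91, ("R3", "C2"): 101.97, ("R3", "C3"): 35.63, ("R3", "C4"): 102.76,
    ("R2", "C1"): 79.43,  ("R2", "C2"): 73.08,  ("R2", "C3"): 30.71, ("R2", "C4"): 73.57,
    ("R1", "C1"): 52.11,  ("R1", "C2"): 50.23,  ("R1", "C3"): 18.47, ("R1", "C4"): 45.36,
}

def cluster(values, rel_tol=0.1):
    clusters = []
    for key, w in sorted(values.items(), key=lambda kv: kv[1]):
        for c in clusters:
            if abs(w - max(values[k] for k in c)) <= rel_tol * w:
                c.append(key)
                break
        else:
            clusters.append([key])
    return clusters

for c in cluster(wcet_gpb):
    print(sorted(c))   # with this tolerance: six clusters, (C1,C2,C4) and (C3) per room
```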
After the Scenario identification phase, the Gray-box model is constructed based on the obtained WCET profiling information for the 8 scenarios. An example of the Gray-box model generated per scenario is shown in Fig. 3.
4.5 TCM Design Time Scheduling
At the Scenario exploitation phase, the TCM design-time scheduler [15] produces 8 Pareto curves (i.e., one per scenario). In Fig. 4, we show 4 of these Pareto curves as an example. Each Pareto curve has a number of Pareto points which represent a Pareto-optimal ordering and assignment of the WSS tasks to the PEs. In Fig. 4, we show 4 of these schedules (P1-P4) belonging to two Pareto curves (i.e., scenarios). Additionally, we have applied the same flow to TGFF [2] generated task graphs to demonstrate the schedule that would be produced for a random
Fig. 4. Four Pareto points representing four TCM design-time schedule examples (P1-P4) of the WSS application. The four Pareto curves represent four scenarios (C1-C4 in R3). Q1 represents the schedule of another application competing for resources (generated automatically with TGFF).
application sharing the MPSoC resources. The TGFF application schedule is represented by Q1. Each Pareto curve occupies approximately 1.3 KB of memory and there are approximately 8 to 15 points in each scenario's Pareto curve.
4.6 TCM Run Time Scheduling
At the Scenario switching phase, the TCM run-time scheduler implements a Pareto point selection algorithm [15] to select at run-time the optimal schedule for each application, while meeting the real-time constraints and minimizing the energy consumption. The deadline for the WSS application is set initially to 0.5 s (i.e., the frame is updated by input user commands coming at a rate of one per 0.5 s). The TGFF application has the same frame rate. If the deadline is changed to 0.115 s (e.g., because user commands arrive faster) and both applications are sharing resources, then the TCM run-time scheduler switches to the faster schedule point P1. However, if the deadline is increased to 0.22 s, then the TCM run-time scheduler chooses schedule point P2 for WSS, consuming 10% less energy, as in State B, without missing any real-time constraints. At the Scenario detection phase, if the camera position switches from C2 to C3, the TCM run-time scheduler is activated again and selects a schedule from the corresponding Pareto curve. In this particular case it moves from schedule P2 to schedule P4, thus saving an additional 55% of energy consumption without missing any real-time constraints. From the heterogeneity experiments, we have identified a 10% performance and energy gain just from a small hardware accelerator for two kinds of instructions. This gain will be larger if the acceleration is extended to the whole GlobalParetoBuild module. The slack gained from the hardware accelerator is used by the TCM run-time scheduler, providing in the end a 16% more energy-efficient solution for the WSS application.
5 Conclusions and Future Work
In this paper, we have characterized the computational and energy resource requirements of the Wavelet Subdivision Surfaces (WSS) algorithm for scalable 3D graphics on an MPSoC platform. Moreover, this paper demonstrates the implementation of a number of experimental design-time and run-time techniques at various design abstraction levels and shows how they can be combined in a single case study. More specifically, we have demonstrated WSS at the application level, System Scenarios at the system level, and Task Concurrency Management at the middleware (resource management) level. In our future work, we plan to parallelize certain tasks of the WSS algorithm, perform a more thorough exploration of the available WSS scenarios and MPSoC platform options, and calculate the switching overhead between the TCM schedules at run-time. We have also explored the TCM methodology for the heterogeneous platform that includes a hardware accelerator and shown additional gains from the TCM methodology.
References
1. Carroll, J.M.: Scenario-based design: envisioning work and technology in system development. John Wiley and Sons, Chichester (1995)
2. Dick, R.P.: TGFF: task graphs for free. In: CODES/CASHE 1998 (1998)
3. Douglass, B.P.: Real Time UML: Advances in the UML for Real-Time Systems, 3rd edn. Addison-Wesley Professional, Reading (2004)
4. Fowler, M., Scott, K.: UML Distilled: A Brief Guide to the Standard Object Modeling Language, 3rd edn. Addison-Wesley Professional, Reading (2003)
5. Gheorghita, S.V., Palkovic, M., Hamers, J., Vandecappelle, A., Mamagkakis, S., Basten, T., Eeckhout, L., Corporaal, H., Catthoor, F., Vandeputte, F., Bosschere, K.D.: System-scenario-based design of dynamic embedded systems, vol. 14, pp. 1–45. ACM, New York (2009)
6. Hamers, J.: Resource prediction for media stream decoding. In: DATE (2007)
7. ISO/IEC: ISO/IEC 14496 Information technology - Coding of audio-visual objects - Part 10: Advanced Video Coding (2004)
8. Laurent, J.: Functional level power analysis: An efficient approach for modeling the power consumption of complex processors. In: DATE (2004)
9. Ma, Z., Scarpazza, D.P., Catthoor, F.: Run-time task overlapping on multiprocessor platforms. In: ESTImedia, pp. 47–52 (2007)
10. Paul, J.M.: Scenario-oriented design for single-chip heterogeneous multiprocessors. IEEE Transactions on VLSI Systems (2006)
11. Ramamritham, K., Fohler, G., Adan, J.M.: Issues in the static allocation and scheduling of complex periodic tasks. IEEE Computer Society, Los Alamitos (1993)
12. Sinha, A., Chandrakasan, A.: JouleTrack - a web based tool for software energy profiling. In: DAC (2001)
13. Tack, N., Lafruit, G., Catthoor, F., Lauwereins, R.: Pareto based optimization of multi-resolution geometry for real time rendering. In: Web3D. ACM, New York (2005)
14. Xilinx Inc.: System Generator for DSP User Guide. Version 10.1 (March 2008)
15. Yang, P., Catthoor, F.: Pareto-optimization-based run-time task scheduling for embedded systems. In: ISSS, California, USA (2003)
Multiple Description Scalable Coding for Video Transmission over Unreliable Networks

Roya Choupani (1,2), Stephan Wong (1), and Mehmet R. Tolun (2)

1 Computer Engineering Department, TU Delft, Delft, The Netherlands
2 Computer Engineering Department, Çankaya University, Ankara, Turkey
Abstract. Developing real-time multimedia applications for best-effort networks such as the Internet requires coping with jitter delay and frame loss. This problem is further complicated in wireless networks, as the rate of frame corruption or loss is higher while wireless networks generally have lower data rates than wired networks. On the other hand, variations of the bandwidth and of the receiving-device characteristics require data-rate adaptation capability in the coding method. Multiple Description Coding (MDC) methods are used to solve the jitter delay and frame loss problems by making the transmitted data more error resilient; however, this reduces the effective data rate because of the added overhead. Moreover, MDC methods do not address bandwidth variation and receiver characteristics differences. In this paper a new method based on integrating MDC and the scalable video coding extension of the H.264 standard is proposed. Our method can handle both the jitter delay and frame loss problem and the data-rate adaptation problem. Our method utilizes a motion-compensation scheme and, therefore, is compatible with current video coding standards such as MPEG-4 and H.264. Based on the simulated network conditions, our method shows promising results and we have achieved up to 36 dB average Y-PSNR.
Keywords: Scalable Video Coding, Multiple Description Coding, Multimedia Transmission.
1 Introduction
Communication networks, both wireless and wired, offer variable-bandwidth channels for video transmission [1], [3]. Display devices have a variety of characteristics, ranging from low-resolution screens in small mobile terminals to high-resolution projectors. The data transmitted for this diverse range of devices and bandwidths have different sizes and should be stored on media with different capacities. Hence, an encoding which makes use of a single encoded data stream for all types of bandwidth channels and display-device capacities could be of remarkable significance in multimedia applications. Scalable video coding (SVC) schemes are intended to be a solution for the Internet heterogeneity and receiver
display diversity problem by encoding the data at the highest quality but enabling the transmitter or receiver to utilize it partially, depending on the desired quality or the available bandwidth and display capacities. The main drawback of the available scalable video coding methods is that they are not suitable for non-reliable environments with a high rate of frame loss or corruption, such as wireless networks. This problem stems from the fact that the methods are based on the motion-compensated temporal filtering scheme and frames are coded as differences with a (generally prior) reference frame. In case a reference frame is lost or corrupted, the whole chain of difference frames depending on it becomes unrecoverable. To increase the error resilience of video coding methods, Multiple Description Coding (MDC) methods have been introduced [4], [5], [7]. These methods improve the error resilience of the video at the cost of adding redundancy to the code. In case a frame is lost or corrupted, the redundancy is used to replace it with an estimated frame. Franchi et al. proposed a method to send a video by utilizing independent multiple descriptions. Their method, however, does not combine scalability features with multiple description coding and therefore only addresses frame loss or corruption; variations of bandwidth are not dealt with [16]. The combination of scalable video coding methods and multiple description coding has recently attracted the interest of researchers [2], [3], [13]. The recent introduction of the scalable extension of the H.264 standard, which relaxes some of the restrictions of other video coding schemes such as using the immediately prior frame as reference frame, provides a suitable framework for combining the scalability of H.264 with the error resilience of MDC schemes. This paper describes a new method which combines the SVC extension of the H.264 standard with MDC schemes in a way that no redundancy in the form of extra bits is introduced during the video coding. The remainder of this paper is organized as follows. Section 2 introduces the main multiple description coding methods. Section 3 explores the scalability features of the H.264 standard which are used in our proposed method. Section 4 describes the details of our proposed method. In Section 5, we introduce the theoretical base of our performance evaluation method and provide the experimental results, and finally, in Section 6, we draw the conclusions.
2 Multiple Description Coding
As a way of encoding and communicating visual information over lossy packet networks, multiple descriptions have attracted a lot of attention. A multiple description coder divides the video data into several bit-streams called descriptions which are then transmitted separately over the network. All descriptions are equally important and each description can be decoded independently from other descriptions which means that the loss of some of them does not affect the decoding of the rest. The accuracy of the decoded video depends on the number of received descriptions. Descriptions are defined by constructing P non-empty sets summing up to the original signal f. Each set in this definition corresponds to a description. The sets however, are not necessarily disjoint. A signal sample
may appear in more than one set to increase the error resilience property of the video. Repeating a signal sample in multiple descriptions is also a way of assigning higher importance to some parts/signals of the video: the more a signal sample is repeated, the more reliably it is transmitted over the network. The duplicated signal values increase the redundancy and hence the data size, which results in reduced efficiency. Designing the descriptions as a partition does not necessarily mean that there is no redundancy in the data. In fact, designing the descriptions as a partition prevents extra bits from being added to the original data for error resilience, but the correlation between spatially or temporally close data can still be used for estimating the lost bits. The estimation process is commonly referred to as error concealment and relies on the correlation preserved when constructing the descriptions. Fine Granular Scalability (FGS)-based MDC schemes partition the video into one base layer and one or several enhancement layers [8]. The base layer can be decoded independently from the enhancement layers, but it provides only the minimum spatial, temporal, or signal-to-noise-ratio quality. The enhancement layers are not independently decodable; an enhancement layer improves the decoded video obtained from the base layer. MDC schemes based on FGS put the base layer together with one of the enhancement layers in each description. This helps to partially recover the video when data from one or some of the descriptions are lost or corrupted. Repeating the base layer bits in each description is the overhead added for better error resilience. In Forward Error Correction (FEC)-based MDC methods, it is assumed that the video is originally defined in a multi-resolution manner [6], [9]. This means that if we have M levels of quality, each one adds to the fidelity of the video with respect to the original. This concept is very similar to the multi-layer video coding method used by the FGS scheme. The main difference, however, is that there exists a mandatory order in applying the enhancements. In other words, it is sensitive to the position of the losses in the bitstream, e.g., a loss early in the bitstream can render the rest of the bitstream useless to the decoder. FEC-based MDCs aim to achieve the desired feature that the delivered quality depends only on the fraction of packets delivered reliably. One method to achieve this is Reed-Solomon block codes. Mohr et al. [15] used Unequal Loss Protection (ULP) to protect video data against packet loss. ULP is a system that combines a progressive source coder with a cascade of Reed-Solomon codes to generate an encoding that is progressive in the number of descriptions received, regardless of their identity or order of arrival. The main disadvantage of the FEC-based methods is the overhead added by the insertion of error correction codes. Discrete Wavelet Transform (DWT)-based video coding methods are also amenable to multiple description coding. In the most basic method, wavelet coefficients are partitioned into maximally separated sets and packetized so that simple error concealment methods can produce good estimates of the lost data [2], [10], [11]. More efficient methods utilize Motion Compensated Temporal Filtering (MCTF), which is aimed at removing the temporal redundancies of video sequences. If a video signal f is defined over a domain D, then the domain can be expressed as a collection of sub-domains {S1, ..., Sn} where the union of these
sub-domains is a cover of D. Besides, a corrupt sample can be replaced by an estimated value using the correlation between the neighboring signal samples. Therefore, the sub-domains should be designed in a way that the correlation between the samples is preserved. Domain-based multiple description schemes are based on partitioning the signal domain. Each partition, which is a subsampled version of the signal, defines a description. Chang [8] utilizes the even-odd splitting of the coded speech samples. For images, Tillo et al. [11] propose splitting the image into four subsampled versions prior to JPEG encoding. There, domain partitioning is performed first, followed by discrete cosine transform, quantization, and entropy coding. The main challenge in domain-based multiple description methods is designing the sub-domains so that the minimum distance between values inside a domain (inter-domain distance) is maximized while preserving the auto-correlation of the signal.
3 Scalable Video Coding Extension of H.264
As a solution to the problems of unpredictable traffic loads and varying delays on the client side, the video data is encoded in a rate-scalable form which enables adaptation to the receiver or network capacities. This adaptation can be in the number of frames per second (temporal scalability), the frame resolution (spatial scalability), or the number of bits allocated to each pixel value (signal-to-noise-ratio scalability). In this section, we briefly review the scalability support features of the H.264 standard which are used in our proposed method. The scalability support features of the H.264 standard were introduced based on an evaluation of proposals carried out by the MPEG and ITU-T groups. Scalable video coding (SVC) features were added as an amendment to the H.264/MPEG-4 AVC standard [14].
3.1 Temporal Scalability
Temporal scalability is achieved by dropping some of the frames in a video to reach the desired (lower) frame rate. As the motion-compensated coding used in video coding standards encodes the difference of the blocks of a frame with respect to its reference frame (the frame coming immediately before it), dropping frames for temporal scalability can cause some frames to become unrecoverable. The H.264 standard relaxes the restriction of choosing the previous frame as the reference frame for the current frame. This makes it possible to design hierarchical prediction structures to avoid the reference-frame-loss problem when adjusting the frame rate.
3.2 Spatial Scalability
In supporting spatially scalable coding, H.264 utilizes the conventional approach of multilayer coding; however, additional inter-layer prediction mechanisms are incorporated. In inter-layer prediction the information in one layer is used in the other layers. The layer that is employed for inter-layer prediction is called
reference layer, and its layer identifier number is sent in the slice header of the enhancement layer slices [12]. Inter-layer coding mode is applied when the macroblock in the base layer is inter-coded. To simplify encoding and decoding macro-blocks in this mode, a new block type named base mode block was introduced. This block does not include any motion vector or reference frame index number and only the residual data is transmitted in the block. The motion vector and reference frame index information are copied from those of the corresponding block in the reference layer.
4 Our Proposed Method
Our proposed method uses the scalability features of the H.264 standard. To make the video resilient against frame loss or corruption errors we define multiple descriptions. However, to achieve a high performance comparable to single-stream coding, we do not include any error correction code in the descriptions. The error concealment in our proposed method is based on the autocorrelation of the pixel values, which is a decreasing function of spatial proximity. Generally, the differences among the pixel values around a given point are expected to be low. Based on this idea we have considered four descriptions D1 to D4 representing four spatial subsets of the pixels in a frame, as depicted in Figure 1. Each description corresponds to a subset $S_i$ for $i = 1,\dots,4$. The subsets define a partition, as no overlap exists between the subsets and they sum up to the initial set:

$$S_i \cap S_j = \emptyset \quad \text{for } i, j = 1,\dots,4 \text{ and } i \neq j, \qquad \bigcup_{i=1}^{4} S_i = D$$
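As an illustration of this splitting, the following is a minimal sketch under our own assumption that the four subsets are the four 2x2 polyphase components of the frame (the exact pixel arrangement used by the authors is the one shown in Fig. 1):

```python
import numpy as np

# Sketch of splitting a frame into four spatial descriptions and re-assembling
# them at the decoder, assuming a 2x2 polyphase decomposition.

def split(frame):
    """Return the four sub-sampled descriptions D1..D4 of a frame."""
    return [frame[r::2, c::2] for r in (0, 1) for c in (0, 1)]

def merge(descriptions, shape):
    """Re-interleave the four descriptions into a full frame."""
    frame = np.zeros(shape, dtype=descriptions[0].dtype)
    idx = 0
    for r in (0, 1):
        for c in (0, 1):
            frame[r::2, c::2] = descriptions[idx]
            idx += 1
    return frame

frame = np.arange(64, dtype=np.uint8).reshape(8, 8)
descs = split(frame)
assert np.array_equal(merge(descs, frame.shape), frame)   # lossless when all arrive
```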
Each description is divided into macro-blocks, motion compensated, and coded independently. The decoder extracts frames and combines them as depicted in Figure 1. When a description is lost or is corrupted, the remaining three
Fig. 1. Organization of the pixels in the descriptions
Fig. 2. Pixels used (blue) for interpolating the value of a missing pixel (red)
Fig. 3. Multiple Description Schemes with a) 9 Descriptions, b) 16 Descriptions
descriptions provide nine pixel values around each pixel of the lost description for interpolation during error concealment. Figure 2 depicts the pixel values utilized for interpolating a pixel value from a lost description. For the interpolation, we use a weighted average where the weights are normalized by the Euclidean distance of each pixel from the center, as given below:

$$\frac{1}{6.828} \times \begin{bmatrix} \frac{\sqrt{2}}{2} & 1 & \frac{\sqrt{2}}{2} \\ 1 & 0 & 1 \\ \frac{\sqrt{2}}{2} & 1 & \frac{\sqrt{2}}{2} \end{bmatrix}$$
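The following minimal sketch applies this kernel to conceal a single lost pixel; it only illustrates the interpolation rule (interior pixels, all eight neighbours available) and is not the authors' decoder implementation:

```python
import numpy as np

# Distance-weighted concealment of one pixel of a lost description, using the
# 3x3 kernel above: orthogonal neighbours weighted 1, diagonal neighbours
# 1/sqrt(2) ~ 0.707, normalised by their sum (about 6.828).

W = np.array([[np.sqrt(2) / 2, 1.0, np.sqrt(2) / 2],
              [1.0,            0.0, 1.0],
              [np.sqrt(2) / 2, 1.0, np.sqrt(2) / 2]])
W /= W.sum()   # W.sum() is approximately 6.828

def conceal(frame, r, c):
    """Estimate pixel (r, c) from its eight neighbours (interior pixels only)."""
    patch = frame[r - 1:r + 2, c - 1:c + 2].astype(float)
    return float((patch * W).sum())

frame = np.full((5, 5), 100.0)
frame[2, 2] = 0.0                      # pretend this pixel was lost
print(round(conceal(frame, 2, 2)))     # -> 100, recovered from its neighbours
```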
We have assumed that the residue values, motion vectors, and other meta-data of a macroblock are transmitted as one data transmission unit and hence are not available when the data packet is lost. The succeeding frames which utilize the estimated frame as their reference frame will suffer from the difference between the reconstructed frame and the original one. The error generated in this way is propagated till
the end of the GOP. However, if no other frame from the same GOP is lost, the error is not accumulated. The multilayer hierarchical frame structure of H.264 reduces the impact of frame loss to at most log2(n) succeeding frames, where n is the number of frames in a GOP.

Fig. 4. Coding efficiency comparison between single layer and our proposed method using the City video segment (CIF, 30 Hz; average Y-PSNR [dB] versus bit rate [kbit/s])

Fig. 5. Coding efficiency comparison between single layer and our proposed method using the Foreman video segment (CIF, 30 Hz; average Y-PSNR [dB] versus bit rate [kbit/s])

Our proposed method has the following features.
– Multiple description coding is combined with scalable video coding methods with no redundant bits added.
– Each description is independent from the rest and no base/enhancement relationship exists between them. This feature comes without the extra cost of forward error correction bits added to the descriptions. Any lost or corrupted description can be concealed regardless of its position or order with respect to the other descriptions.
– The proposed method is compatible with the definition of the multi-layer spatial scalability of the H.264 standard. This compatibility is due to the possibility of having the same resolution in two different layers in H.264 and using inter coding at each layer independently. We have not set the motion prediction flag and let each description have its own motion vectors, because of the independent coding of each description. Setting the motion prediction flag can speed up the encoder, but it reduces the coding efficiency slightly, as the most similar regions do not always occur at the same place in different descriptions.
– The proposed method can be extended to a larger number of descriptions if the error rate of the network is high, if a higher level of fidelity to the original video is required, or if higher levels of scalability are desired.
5 Experimental Results
For evaluating the performance of our proposed method, we have considered measuring the Peak Signal to Noise Ratio of the Y component of the macroblocks (Y-PSNR). Equations (1) and (2) describe the Y-PSNR used in our implementation:

$$PSNR = 20 \log_{10} \frac{Max_I}{\sqrt{MSE}} \qquad (1)$$

$$MSE = \frac{1}{3mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left\| I(i,j) - I'(i,j) \right\|^2 \qquad (2)$$
where Max_I indicates the largest possible pixel value, I is the original frame, and I' is the decoded frame at the receiver side. Y-PSNR is applied to all frames of the video segments listed in Table 1, by comparing the corresponding frames of the original video segment and of the video after using our multiple description coding method. We have considered the case where one of the descriptions is lost and interpolated; the erroneous description is selected randomly. We put 32 frames in each GOP and a dyadic hierarchical temporal structure has been used for motion-compensated coding. We have furthermore imposed the same reference frame for all macroblocks of a frame for simplicity, although H.264 supports utilizing different reference frames for the macroblocks of a frame. In addition, we have restricted the number of descriptions lost to one per GOP. This means at most one fourth of a frame is estimated during the error concealment step. The location of the lost description in the GOP is selected randomly and the Y-PSNR is obtained as the average over each video segment.
Table 1. Average Y-PSNR values when the loss is in only one frame of each GOP

Sequence Name     Resolution   Frame rate   Average Y-PSNR (dB)
Foreman           352 × 288    30           36.345
Stefan & Martin   768 × 576    30           33.110
City              704 × 576    60           34.712
The average Y-PSNR values are reported in Table 1. The second set of evaluation tests considers how the average Y-PSNR value of each video segment changes with respect to the number of frames affected by the lost description. Still, we assume only one description is lost each time and the GOP length is 32. Figure 3 depicts the result of multiple-frame reconstruction for three video segments. Despite having multiple frames affected by the loss or corruption, the results indicate that the peak signal-to-noise ratio remains relatively high. As a benchmark to evaluate the efficiency of our algorithm, we have compared the average Y-PSNR values of the Foreman and City video segments with single-layer video coding. Figures 4 and 5 depict the comparison results.
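For illustration, a simplified per-plane version of Equations (1) and (2) can be computed as follows; the frames below are random stand-ins, not the actual test sequences.

```python
import numpy as np

# Minimal sketch of a Y-PSNR measurement: per-plane MSE of a decoded luma
# plane against the original, with MaxI = 255 for 8-bit samples.

def y_psnr(original_y, decoded_y, max_i=255.0):
    mse = np.mean((original_y.astype(float) - decoded_y.astype(float)) ** 2)
    return float("inf") if mse == 0 else 20.0 * np.log10(max_i / np.sqrt(mse))

rng = np.random.default_rng(0)
orig = rng.integers(0, 256, size=(288, 352)).astype(np.uint8)         # CIF luma plane
noisy = np.clip(orig.astype(int) + rng.integers(-3, 4, orig.shape), 0, 255)
print(round(y_psnr(orig, noisy), 2), "dB")
```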
6 Conclusion
A new method for handling data loss during the transmission of video streams has been proposed. Our proposed method is based on multiple description coding; however, coding efficiency is not sacrificed, as no extra-bit redundancy is introduced to increase the resilience of the video. The proposed method has the capability of being used as a scalable coding method, and any data loss or corruption is reflected only as a slight reduction in the quality of the video. Except for the case when all descriptions are lost, the video streams do not experience jitter at playback. The compatibility of the proposed method with the H.264 standard simplifies the implementation process. Our proposed method is based on the spatial scalability features of H.264; a reasonable extension of the work is the inclusion of SNR scalability.
References
1. Conklin, G., Greenbaum, G., Lillevold, K., Lippman, A., Reznik, Y.: Video Coding for Streaming Media Delivery on the Internet. IEEE Transactions on Circuits and Systems for Video Technology (March 2001)
2. Andreopoulos, Y., van der Schaar, M., Munteanu, A., Barbarien, J., Schelkens, P., Cornelis, J.: Fully-scalable Wavelet Video Coding using in-band Motion-compensated Temporal Filtering. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 417–420 (2003)
3. Ohm, J.: Advances in Scalable Video Coding. Proceedings of the IEEE 93(1) (January 2005)
4. Goyal, V.K.: Multiple Description Coding: Compression Meets the Network. Signal Processing Magazine 18(5), 74–93 (2001)
5. Wang, Y., Reibman, A.R., Shunan, L.: Multiple Description Coding for Video Delivery. Proceedings of the IEEE 93(1) (January 2005)
6. Puri, R., Ramchandran, K.: Multiple Description Source Coding using Forward Error Correction Codes. Signals, Systems, and Computers 1, 342–346 (1999)
7. Venkataramani, R., Kramer, G., Goyal, V.K.: Multiple Description Coding with many Channels. IEEE Transactions on Information Theory 49(9), 2106–2114 (2003)
8. Chang, S.K., Sang, L.: Multiple Description Coding of Motion Fields for Robust Video Transmission. IEEE Transactions on Circuits and Systems for Video Technology 11(9), 999–1010 (2001)
9. Wang, Y., Lin, S.: Error-resilient Video Coding using Multiple Description Motion Compensation. IEEE Transactions on Circuits and Systems for Video Technology 12(6), 438–452 (2002)
10. Xuguang, Y., Ramchandran, K.: Optimal Subband Filter Banks for Multiple Description Coding. IEEE Transactions on Information Theory 46(7), 2477–2490 (2000)
11. Tillo, T., Olmo, G.: A Novel Multiple Description Coding Scheme Compatible with the JPEG 2000 Decoder. IEEE Signal Processing Letters 11(11), 908–911 (2004)
12. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology 13(7) (July 2003)
13. Schwarz, H., Marpe, D., Wiegand, T.: Overview of the Scalable Video Coding Extension of the H.264/AVC Standard. IEEE Transactions on Circuits and Systems for Video Technology (2007)
14. Hewage, C., Karim, H., Worrall, S., Dogan, S., Kondoz, A.: Comparison of Stereo Video Coding Support in MPEG-4 MAC, H.264/AVC and H.264/SVC. In: Proceedings of the 4th Visual Information Engineering Conference, London (July 2007)
15. Mohr, A.E., Riskin, E.A., Ladner, R.E.: Unequal Loss Protection: Graceful Degradation of Image Quality over Packet Erasure Channels through Forward Error Correction. IEEE Journal on Selected Areas in Communications 18(6), 819–828 (2000)
16. Franchi, N., Fumagalli, M., Lancini, R., Tubaro, S.: A Space Domain Approach for Multiple Description Video Coding. In: ICIP 2003, vol. 2, pp. 253–256 (2003)
Evaluation of Different Multithreaded and Multicore Processor Configurations for SoPC

Sascha Uhrig

Institute of Computer Science, University of Augsburg, 86159 Augsburg, Germany
Abstract. Multicore processors are becoming more and more popular, even in embedded systems. Unfortunately, these types of processors require a special kind of programming technique to deliver their full performance, i.e., they require a high degree of thread-level parallelism. In this paper we evaluate the performance of different configurations of the same processor core within an SoPC: a single-threaded single core, a multithreaded single core, a single-threaded multicore, and a multithreaded multicore. The core used is the jamuth core, a multithreaded Java processor able to execute Java bytecode directly in hardware. The advantage of Java in a multicore environment is that it brings the threading concept for free, i.e., software developers are already familiar with it. Our evaluations show that the cores within a multicore processor should be at least two-threaded, to bridge the higher memory delays caused by contention on the shared bus.
1 Introduction
In recent years, multicore processors have gained more and more importance in high-performance applications as well as in embedded systems. We concentrate especially on processor architectures for embedded systems. Unfortunately, it is not trivial to design an application that benefits optimally from multithreaded and multicore processor architectures, because both architectures require a certain amount of thread-level parallelism to utilize the offered functional units best. In this paper we present a multicore-enabled version of the jamuth core, a multithreaded Java processor core for Systems-on-Programmable-Chip (SoPC). A Java Virtual Machine (JVM) dealing with the special multicore topics is also available. The great advantage of Java in multicore systems is the thread concept deeply embedded within Java, which enables ordinary Java applications to fit well into the programming paradigms of multicores. The jamuth processor is a multithreaded core which executes most Java bytecodes directly within the processor, in a single clock cycle or as sequences of micro-operations in several clock cycles (see [1]). More complex instructions are executed with the help of trap routines. Additionally, all required device drivers are also implemented in Java. Hence, the overhead of an operating system layer and the need for a (JIT) compiler or
interpreter is eliminated. These circumstances lead to a Java system which is able to work with very limited hardware resources. We evaluated different parameters of the multithreading and multicore design space: a single-threaded single core serves as the baseline processor, which is extended to a multithreaded processor with different numbers of thread slots, to a multicore with different numbers of cores, and to a combination of both. We measured the performance of the different architectures in terms of overall and single-thread performance, together with the processor utilization. This paper is organized as follows: In Section 2 we discuss several related multithreaded and multicore environments. Section 3 presents our single-core baseline jamuth Java processor core, followed by several extensions for multicore support and the design of a multicore SoPC in Section 4. The evaluations are shown in Section 5, before Section 6 concludes the paper.
2 Related Work
In recent years, several multithreaded and multicore processor architectures have been established on the market of server, desktop, and portable computers. A comparison of single and multicore desktop architectures based on Intel's Smithfield and Prescott architectures is given by Sibai [2]. A simulator for simultaneous multithreaded single-core processors was developed by D.M. Tullsen [3]. He and Seng reported in several publications [4][5] a formidable speedup of multithreaded architectures. Donald et al. [6] presented a new methodology to extend single-core processor simulators to multicore simulators by using one software thread per simulated core. Besides this methodology, the more important point of that work is that the new multicore simulators profit well from execution on multicore and multithreaded processors. Gerdes et al. [7] described the MERASA project, whose focus is the development of a multithreaded multicore processor for embedded real-time systems. The multicore processor will be modelled with SystemC and an FPGA prototype should be the outcome of the project. Pitter et al. [8] presented the design of a single-threaded Java multicore processor for FPGAs. They evaluated a multicore containing up to 8 Java cores. The cores are interconnected by the so-called SimpCon bus [9] and synchronized by a special synchronization logic. Except for the latter work, no other work evaluates multicore architectures for SoPC. Most work has been done concerning desktop and server applications, but embedded systems are rarely the focus of multicore research. The contribution of this paper is the simple design and the evaluation of a multithreaded multicore environment especially for embedded systems based on FPGAs.
3 Architecture of the Multithreaded jamuth Single Core Processor
The processor core (see Figure 1) contains a multithreaded five-stage pipeline (instruction fetch, instruction decode, operand fetch, execute, and the stack cache). The
Fig. 1. Block diagram of the jamuth core
integrated real-time scheduler shown in the figure is responsible for a real-time capable thread schedule if the core is used in real-time applications. Most of the Java integer as well as several long, float, and double bytecodes are executed directly in hardware, mostly within a single execution cycle. Instructions with medium complexity are realized by microcodes and the complex operations, such as new, athrow, monitorenter, and most floating point commands, call trap routines. As operand stack, a 2k-entry stack cache is integrated within the pipeline which is shared between the hardware thread slots. During the initialization phase, different portions of the stack cache can be assigned to the thread slots so that one thread entirely works on the stack cache while the data of other threads has to be swapped in and out. As an example, the garbage collection should run without swapping due to a continuous progress without any additional and unnecessary memory accesses for swapping. Because of predictability, real-time threads should also run without swapping. Each thread slot possesses an instruction window (IW) into which up to six bytecodes can be prefetched. These instruction windows decouple fetching and decoding of the bytecodes. Instructions can be fetched from three different sources: external memory, instruction cache, and scratch RAM. The instruction cache and the scratch RAM are integrated within each processor core whereas the memory has to be connected via appropriate interfaces. The integrated real-time scheduler supports two scheduling schemes: a simple fixed priority preemptive (FPP) scheduling scheme and the so-called guaranteed percentage (GP [10]) scheduling scheme. Each hardware thread slot is assigned to one of
these scheduling schemes. For FPP scheduling a priority in the range from 0 to #threadslots − 1 is required, and for GP a percentage between 0 and 100. A special jamuth Java class supports the handling of the hardware thread slots and the scheduling policies. In our case, the scheduler performs the simple fixed-priority scheduling for the thread slots inside a single core. The IP core contains three additional units: the Debug unit, the Java timer, and an IRQ controller. The first unit is responsible for debugging and observation functionality, the timer is required by the JVM (System.currentTimeMillis() and the sleep methods), and the IRQ controller translates interrupt requests from the peripheral components into wake-up signals for the thread slots or, respectively, interrupt signals for already running threads.
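As an illustration only, the following is a conceptual software model, under our own assumptions, of how one execution cycle could be granted to a thread slot under the combined GP/FPP policy; it is not the hardware scheduler of the jamuth core.

```python
# Conceptual model of per-cycle thread-slot selection: guaranteed-percentage
# (GP) slots are served until they have received their share of cycles,
# otherwise the fixed-priority (FPP) slot with the highest priority wins.

class Slot:
    def __init__(self, name, policy, value):
        self.name, self.policy, self.value = name, policy, value   # value: priority or %
        self.granted = 0

def pick(slots, elapsed_cycles):
    # GP slots that are behind their guaranteed share go first.
    for s in slots:
        if s.policy == "GP" and s.granted < s.value / 100.0 * elapsed_cycles:
            return s
    # Otherwise the highest-priority FPP slot.
    fpp = [s for s in slots if s.policy == "FPP"]
    return max(fpp, key=lambda s: s.value) if fpp else None

slots = [Slot("gc", "GP", 20), Slot("app0", "FPP", 3), Slot("app1", "FPP", 1)]
for cycle in range(1, 11):
    pick(slots, cycle).granted += 1
print({s.name: s.granted for s in slots})   # "gc" receives roughly 20% of the cycles
```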
4 Building a Multicore
In this section we first describe the extensions of the single-core processor that make it suitable for a multicore environment. Second, we show how anyone can assemble a multicore for FPGAs using the Altera SoPC Builder [11] together with our processor core.
4.1 Extensions to the Original Core
One of the most interesting topics of multicore designs is the memory connection of the processor cores. Because our system targets a small number of cores (i.e., 2 to 4 cores), a simple bus connection suits this architecture well. A big advantage of the bus structure is that the standard Avalon switch fabric for Altera FPGAs can be used for the core connections. Besides the connection structure, two further topics must be handled in multicore designs: the software synchronization between the cores and the cooperation of the cores. Because of the real-time capability of the original multithreaded single-core processor, the used core does not support bus locking. Instead of atomic read-modify-write memory accesses, the core offers a pair of atomlock/atomunlock instructions for the synchronization of the integrated hardware thread slots. This technique is extended to synchronize multiple cores by an external lock token ring. The lock token is sent from one core to the next in a ring. If a core requires the lock, it takes the token from the ring until it releases the atomic lock. This solution is simple to implement and works well for a small number of cores (about 2-4), because the maximum latency to get the token, if it is on the ring, grows linearly with the number of cores. Unfortunately, this solution is not suitable for real-time requirements because no priorities are taken into account. The second topic, the cooperation of the cores, is required for the system software (i.e., the JVM). Because of the design of the processor core, it is not possible to access the control registers and the Java runtime stacks from outside the core. Hence, one core can neither initialize a thread slot within any other core, nor resume a thread slot or change its scheduling parameters. To deal with these restrictions, we introduced a central event interrupt. The JVM contains a todo list for each core which is handled at the occurrence of the central event. Possible items in the list are: create thread, resume thread slot, and set scheduling parameters.
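The following is a simplified software model of this lock token ring; the real mechanism is implemented in hardware, and the names and step granularity below are our own assumptions.

```python
# Simplified model of the lock token ring: the token circulates between the
# cores; a core that needs the atomic lock keeps the token until it unlocks.

class TokenRing:
    def __init__(self, n_cores):
        self.n = n_cores
        self.holder = 0          # core currently holding the token
        self.locked = False      # token taken off the ring for an atomic section

    def advance(self):
        """One ring step: the token moves on unless a core holds the lock."""
        if not self.locked:
            self.holder = (self.holder + 1) % self.n

    def atomlock(self, core):
        """Core acquires the lock; it may have to wait for the token."""
        steps = 0
        while self.holder != core:
            self.advance()
            steps += 1           # worst-case wait grows linearly with n_cores
        self.locked = True
        return steps

    def atomunlock(self, core):
        assert self.locked and self.holder == core
        self.locked = False

ring = TokenRing(3)
print(ring.atomlock(2))          # core 2 waits for the token (2 ring steps here)
ring.atomunlock(2)
```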
Fig. 2. Architecture of a Multicore SoPC (Example)
4.2 Multicore SoPC Design
The processor core is designed as an IP core (a component) for Altera's System-on-Programmable-Chip (SoPC) Builder with two standard Avalon bus master interfaces. Hence, it can easily be combined with other Altera, third-party, or customer-made components into an SoPC. As external system components, only a memory device like an SDRAM is required. For the design of a multicore, several of our processor cores have to be connected to the memory bus of the SoPC. The peripheral components like a MAC, a UART, or a TFT controller should be combined using a peripheral bus which is connected to all processor cores by a Pipelined Avalon Bridge (a standard component of the SoPC Builder) to reduce the size of the switch fabric. Figure 2 illustrates an example system architecture with several peripheral components and two processor cores.
5 Evaluation
We evaluated different combinations of multicore and multithreaded configurations. For each configuration, we measured the overall performance, the performance of a single highest-priority thread, and the processor's overall utilization.
5.1 Methodology
As benchmarks, we used modified versions of three benchmarks from the JOP embedded benchmark suite [12]: the Sieve benchmark, the Kfl benchmark, and the UdpIp benchmark. The modifications were necessary because the original benchmarks cannot be executed in parallel. Hence, we had to change all static variables into object fields. Additionally, the central loops within the benchmarks were modified into unbounded loops
which are executed as long as a special run flag is set. The benchmark results are obtained by counting the loop iterations until the run flag is cleared. Because of these changes, the values presented here must not be compared to any JOP benchmark results from other publications. Depending on the processor configuration, we started several identical benchmark threads in parallel within fixed thread slots of the core(s). In contrast to normal Java thread execution, we bypassed the thread scheduler and pinned the benchmark threads to predefined thread slots. Additionally, we assigned fixed priorities to the thread slots, enabling us to pick one thread to serve as the highest-priority, i.e., most important, thread. Its performance is considered the single-thread performance in our multithreaded and/or multicore environment. After the initialization phase, we executed the benchmark threads for 100 million clock cycles and afterwards cleared the run flag. The number of loop iterations is measured for each thread independently. Hence, we are able to present the performance of a single thread and the overall performance of the system, as well as the performance of a selected core. We executed all three benchmarks sequentially with different configurations for 100 million clock cycles each and recorded their performance values as well as the pipeline utilization. The values presented here are the averages of all three benchmarks. As configurations, we used all possible combinations of one to three cores and one to three threads per processor. The fourth thread in each core was required for starting and stopping the benchmark threads and for the measurements. Former studies with only one core showed that the fourth thread brings no mentionable performance gain. As test platform we used an FPGA prototype board (DBC3C40 from Devboards [13]) with an Altera Cyclone III EP3C40F484C7 FPGA, within which it is possible to implement three processor cores. The test results concerning the pipeline utilization are measured by the Debug module of the processor core. It counts all clock cycles in which a new instruction is decoded within the decode unit, the number of latency cycles of the threads, and the overall number of cycles. For the multicore configurations, all cores are connected to a single standard SDRAM controller from Altera. The cores all have the same priority, i.e., requests from the cores are serviced round-robin. If a core does not request a memory access in its cycle, a possible request of the next core is serviced.
5.2 Evaluation Results
The most important characteristic of a processor is its performance. So first, we present the overall performance as the average of the three benchmarks in Figure 3. The values shown are normalized to the performance of a single-threaded single-core processor. The number of active threads per core is shown on the x-axis and the number of active cores on the y-axis. A big performance improvement can be observed at the change from one to two in both directions, i.e., from a single processor to a dual core as well as from a single thread to two threads per core. Towards the third core, a mentionable performance gain is also available, but the third thread leads only to a marginal increase. A second, very important factor is the performance of a single highest-priority thread. It is important because a large number of applications can only partly be parallelized
Fig. 3. Speedup of different multithreaded and multicore configurations, normalized to the single-threaded single-core version (axes: threads per core, number of processors)
Fig. 4. Performance impact of multithreaded and multicore execution on a single highest-priority thread (relative performance for one to three threads per core versus number of processors)
(if at all), i.e., several parts of the applications must be executed sequentially. Hence, the performance of these sequential parts still has a high impact. Figure 4 shows the single-thread performance of the highest-priority thread of one core. Because all cores
[Figure: chart; y-axis: Performance Factor; x-axis: Processors; series: Single Thread Performance Loss, Overall Performance Gain.]
Fig. 5. Performance factor between two and three threads per core
[Figure: chart; y-axis: Pipeline utilization; x-axis: Processors; series: 1, 2 and 3 Threads/Core.]
Fig. 6. Pipeline utilization of the first core
have the same priority at the SDRAM interface, the performance of the highest-priority threads of all cores is identical (not shown here), but the performance of one core in a dual-core environment is reduced compared to a single core. Because of conflicts at the memory interface, it is easy to understand why the performance drops in the case of additional threads, independent of the core which executes the other thread. The fact that two threads on a single core harm the first thread's performance more than a thread on another core can be explained in two different ways: 1. Threads on one core use the same shared instruction cache whereas different cores use different caches. Hence, cache thrashing can occur if two threads are executed on the same core, even if they execute the same program, because in general the program counters point to different areas of code.
2. The conflicts on memory accesses at the SDRAM interface are handled in a pipelined way. In contrast, a conflict concerning a memory access within a core cannot be pipelined because of the real-time capability of the original core and the unpredictable period of the accesses. Concerning the single-thread performance, a similar behavior can be observed as for the overall performance: the impact of the third thread per core is marginal. With the overall performance in mind, it is even worse: figure 5 shows the reduction of single-thread performance compared to the gain of overall performance when the number of active threads per core is increased from two to three. Using a single or a dual core, the overall performance gain is still higher than the performance loss of the single highest-priority thread. But if we use a triple-core system, we reduce the single-thread performance more than we can profit from the third thread on each core. Another interesting characteristic of a processor is its efficiency. We measured the efficiency of a processor core in terms of its pipeline utilization. Figure 6 presents the utilization of the first core's pipeline depending on the number of active threads within one core and on the number of active cores. Because the processor pipeline used is a single-issue pipeline without any data cache, the IPC (instructions per cycle) of the single-threaded execution is relatively low (about 0.35). If we increase the number of processor cores and, hence, the memory access latencies increase too, the utilization of a single core is further reduced (down to about 0.31). As multithreading is a general approach to bridge memory latencies, we executed two and three threads per core. The measured pipeline utilization shows the same characteristic as the overall performance presented in figure 3: introducing the second thread leads to a strong increase of the utilization (factor between 1.23 and 1.45), whereas the third thread increases the efficiency only by a factor in the range of 1.01 to 1.04.
6 Conclusions We presented a simple method to build a multicore System on Programmable Chip (SoPC) using Altera's SoPC builder together with our multithreaded Java processor core. A suitable Java Virtual Machine (JVM) with additional functionalities to support multithreading as well as a multicore environment is also available. We evaluated different configurations of multithreading and multicore processors. The focus of our analysis is the number of threads per core and the number of cores. An Altera Cyclone III EP3C40 FPGA is used to implement up to three cores, each with three threads. A standard Altera Avalon bus serves as interconnection network and an SDRAM as main memory. Our results show that, on the one hand, a higher performance gain can be reached by increasing the number of cores rather than the number of threads per core. On the other hand, the resource consumption increases linearly but the utilization of these resources is disappointing. A better utilization can be reached by introducing multithreading into the cores. Adding a second thread increases pipeline utilization in the range from 23% to 45% and performance in the range from 26% to 45%. The integration of a third thread causes only a performance gain of up to 6%. As an overall result, we conclude that multicore
processors should support two thread slots per core: just one thread suffers too much from conflicts at the common memory interface, whereas the gain of a third thread is too small compared to the effort it requires.
References 1. Kreuzinger, J., Brinkschulte, U., Pfeffer, M., Uhrig, S., Ungerer, T.: Real-time Eventhandling and Scheduling on a Multithreaded Java Microcontroller. Microprocessors and Microsystems 27, 19–31 (2003) 2. Sibai, F.N.: Evaluating the performance of single and multiple core processors with pcmark 2005 and benchmark analysis. In: ACM SIGMETRICS Performance Evaluation Review archive, pp. 62–71 (2008) 3. Tullsen, D.M.: Simulation and modeling of a simultaneous multithreading processor. In: The 22nd Annual Computer Measurement Group Conference (1996) 4. Tullsen, D.M., Eggers, S.J., Emer, J.S., Levy, H.M., Lo, J.L., Stamm, R.L.: Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In: 23rd International Symposium on Computer Architecture (ISCA 1996), Philadelphia, PA, USA, pp. 191–202 (1996) 5. Seng, J., Tullsen, D., Cai, G.: Power-sensitive multithreaded architecture. In: 2000 IEEE International Conference on Computer Design: VLSI in Computers and Processors, Austin, TX, USA, pp. 199–206 (2000) 6. Donald, J., Martonosi, M.: An efficient, practical parallelization methodology for multicore architecture simulation. Computer Architecture Letters 5 (2006) 7. Gerdes, M., Wolf, J., Zhang, J., Uhrig, S., Ungerer, T.: Multi-Core Architectures for Hard Real-Time Applications. In: ACACES 2008 Poster Abstracts, L’Aquila, Italy (2008) 8. Pitter, C., Schoeberl, M.: Performance evaluation of a java chip-multiprocessor. In: 3rd IEEE Symposium on Industrial Embedded Systems, Montpellier, France (2008) 9. Schoeberl, M.: Simpcon - a simple and efficient soc interconnect. In: 15th Austrian Workhop on Microelectronics, Graz, Austria (2007) 10. Kreuzinger, J., Schulz, A., Pfeffer, M., Ungerer, T., Brinkschulte, U., Krakowski, C.: Realtime Scheduling on Multithreaded Processors. In: 7th International Conference on Real-Time Computing Systems and Applications (RTCSA 2000), Cheju Island, South Korea, pp. 155– 159 (2000) 11. Altera: (Quartus II Handbook Volume 4: SOPC Builder (version 8.0) (June 2008) 12. Schoeberl M.: (JavaBenchEmbedded V1.0), http://www.jopdesign.com/perf.jsp 13. Devboards: (Datasheet, DBC3C40 Cyclone III Development Board), http://www.devboards.de/pdf/DBC3C40_Vs.1.04.pdf
Implementing Fine/Medium Grained TLP Support in a Many-Core Architecture Roberto Giorgi, Zdravko Popovic, and Nikola Puzovic Department of Information Engineering, University of Siena, Italy http://www.dii.unisi.it/~{giorgi,popovic,puzovic}
Abstract. We believe that future many-core architectures should support a simple and scalable way to execute many threads that are generated by parallel programs. A good candidate to implement an efficient and scalable execution of threads is the DTA (Decoupled Threaded Architecture), which is designed to exploit fine/medium grained Thread Level Parallelism (TLP) by using a hardware scheduling unit and relying on existing simple cores. In this paper, we present an initial implementation of the DTA concept in a many-core architecture where it interacts with other architectural components designed from scratch in order to address the problem of scalability. We present initial results, obtained using the SARCSim many-core simulator (a variant of UNISIM) extended with DTA support, that show the scalability of the solution. Keywords: many-core architectures, DTA.
1 Introduction Many-core architectures offer an interesting possibility for efficiently utilizing the increasing number of transistors that are available on a single chip. Several many-core architectures have been developed in industry [1-3] and have been proposed in academic research projects [4, 5]. Although many-core architectures offer hundreds of computational cores, they have to be properly programmed in order to utilize their computing power potential [6]. Decoupled Threaded Architecture (DTA) is a proposal for exploiting fine/medium grained TLP that is available in programs [7]. Even though other types of parallelism are typically present in programs, like Instruction Level Parallelism (ILP) and Data Level Parallelism (DLP), they are not the focus of this paper: we assume that the overall architecture will be offloading parts of the computation with TLP potential on small “TLP-accelerators”, e.g., simple in-order cores, and that other types of accelerators could take care of ILP and DLP. DTA also provides distributed hardware mechanisms for efficient and scalable thread scheduling, synchronization, and decoupling of memory accesses. Previous research experimented with DTA using a simplified framework in order to prove the concept [7]. In this paper, we present an initial implementation of DTA support in a heterogeneous many-core architecture that is compatible with the SARC project [8] architecture, and we describe the hardware extensions that are needed for DTA support.
2 DTA Concept The key features of the DTA concept are: i) communication and ii) non-blocking synchronization among threads, iii) decoupling of memory accesses that is based on the Scheduled Data-Flow (SDF) concept [9], iv) clusterization of resources in nodes (differently from SDF) and v) the use of a distributed hardware scheduler (which was centralized in SDF). Data is communicated via frames, which are portions of local memory assigned to each thread. A per-thread synchronization counter (SC) is used to represent the number of input data items that the thread needs. This counter is decremented each time a datum is stored in a thread's frame, and when it reaches zero (when all input data have arrived) that thread is ready to execute. In this way, DTA provides a dataflow-like communication between threads - dataflow at thread level, and a non-blocking synchronization (threads can be synchronized using the SC and, while one thread is waiting for data, processors are available to execute other ready threads).
[Figure: thread th0 executes STORE th1,a; STORE th1,b; STORE th2,c. Thread th1 (SC = 2) executes LOAD a; LOAD b and thread th2 (SC = 2) executes LOAD c; each of them then stores one datum (STORE th3,d / STORE th3,e) into the frame of thread th3 (SC = 2), which executes LOAD d; LOAD e. Thread th0 sends data to threads th1 and th2 by writing into their frames (STORE instructions) and threads th1 and th2 read them from their frames (LOAD instructions); th1 and th2 can run in parallel. Thread th3 is synchronized with threads th1 and th2 since its execution will not start before all data (2 in this case) has been stored into its frame.]
Fig. 1. An example of communication and synchronization among threads in DTA
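In software terms, the synchronization mechanism of Fig. 1 can be modelled in a few lines; the sketch below only illustrates how a STORE into a frame decrements the target thread's SC and makes the thread ready when the counter reaches zero. The frame layout, the SC value chosen for th2 and all names are our own simplifications, not the actual DTA implementation:

```c
#include <stdio.h>

#define FRAME_WORDS 4

/* Simplified software model of a DTA thread: its frame plus its SC. */
typedef struct {
    int frame[FRAME_WORDS];  /* frame memory holding the thread's input data  */
    int sc;                  /* synchronization counter: inputs still missing */
    const char *name;
} thread_t;

/* Model of "STORE th, slot": write into another thread's frame, decrement its SC. */
static void store(thread_t *th, int slot, int value)
{
    th->frame[slot] = value;
    if (--th->sc == 0)
        printf("%s is ready to execute (all inputs arrived)\n", th->name);
}

int main(void)
{
    thread_t th1 = { {0}, 2, "th1" };   /* waits for a and b        */
    thread_t th2 = { {0}, 1, "th2" };   /* waits for c (simplified) */
    thread_t th3 = { {0}, 2, "th3" };   /* waits for d and e        */

    /* th0 post-stores into th1 and th2 (cf. Fig. 1). */
    store(&th1, 0, 10);                 /* a                        */
    store(&th1, 1, 20);                 /* b: th1 becomes ready     */
    store(&th2, 0, 30);                 /* c: th2 becomes ready     */

    /* th1 and th2 may run in parallel; each post-stores one value into th3. */
    store(&th3, 0, th1.frame[0] + th1.frame[1]);   /* d                      */
    store(&th3, 1, th2.frame[0]);                  /* e: th3 becomes ready   */
    return 0;
}
```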
Threads in DTA are logically divided into smaller phases, which are called code blocks. At the beginning of a thread, the Pre-Load (PL) code block reads the data from the frame and stores them into registers. Once the PL phase completes, the Execution (EX) code block starts, and it reads data from the registers and performs calculations. At the end of a thread, the Post-Store (PS) code block writes data to the frames of other threads. Another possibility, like in the SDF architecture [9], is to use two different types of pipelines: one to handle PL and PS code blocks – named Synchronization Pipeline (SP) – and the other to execute EX code blocks – named Execution Pipeline (XP); in this work, however, we do not want to lose the flexibility of using existing, smaller cores. Communication of data via frames is preferable, but it is not always
possible to replace accesses to global data in main memory with accesses to frame memory, so the threads can access main memory at any point during execution; in this case, a DMA-assisted prefetching mechanism can be used to completely decouple memory accesses [10]. In order to overcome the effects of wire delay, Processing Elements (PEs) in DTA are grouped into nodes. The nodes are dimensioned so that all PEs in one node can be synchronized using the same clock [7], and so that fast communication can be achieved among them using a simple interconnection network inside a node. On the other hand, communication between nodes is slower, and the interconnection network is more complex, but this is necessary to achieve scalability as the available number of transistors increases. The first DTA-specific hardware structure is the Frame Memory (FM). This is a local memory located near each PE, used for storing a thread's data. Access to a frame memory is usually fast and should not cause any stalls during execution. Another DTA-specific hardware structure is the Distributed Scheduler (DS) that consists of Local Scheduler Elements (LSEs) and Distributed Scheduler Elements (DSEs). Each PE contains one LSE that manages local frames and forwards requests for resources to the DSE. Each node contains one DSE that is responsible for distributing the workload between processors in the node, and for forwarding it to other nodes in order to balance the workload among them. The DSE together with all LSEs provides dynamic distribution of the workload between processors. Schedulers communicate among themselves by sending messages. These messages can signal the allocation of a new frame (FALLOC request and response messages), the release of a frame (FFREE message) and the storing of data in remote frames [7]. Besides these structures, DTA requires minimal support in the ISA of the processing element for the creation and management of DTA threads. This support includes new instructions for assigning (FALLOC) and releasing (FFREE) frames, instructions for storing data into other threads' frames (STORE) and for loading data from the frame (LOAD). In the case when PEs cannot access the main memory directly, instructions for reading and writing data to and from main memory are also needed. Further details on DTA, as well as one possible implementation, are given in Section 4.
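Using the ISA extensions listed above, the life cycle of a DTA thread (frame allocation, pre-load, execution, post-store, frame release) can be sketched as follows. The C functions stand in for the FALLOC, STORE, LOAD and FFREE instructions; their signatures and the whole structure are our own abstraction, not the actual SPU encoding:

```c
#include <stdio.h>

/* Minimal software stand-ins for the DTA ISA extensions. The real FALLOC, STORE,
   LOAD and FFREE are instructions handled by the LSE/DSE; the C functions below
   only mimic their visible effect on frame memory.                               */
#define N_FRAMES  4
#define FRAME_LEN 8

static int frames[N_FRAMES][FRAME_LEN];
static int next_frame = 0;

static int  falloc(void)                  { return next_frame++ % N_FRAMES; }
static void ffree(int f)                  { (void)f; }
static void store(int f, int slot, int v) { frames[f][slot] = v; }  /* real STORE also decrements the target SC */
static int  load(int f, int slot)         { return frames[f][slot]; }

/* Parent thread: PS code block that creates a child thread with two inputs. */
static int spawn_child(int a, int b)
{
    int child = falloc();       /* allocate the child's frame (FALLOC, SC would be set to 2) */
    store(child, 0, a);         /* post-store the inputs; each STORE decrements the SC       */
    store(child, 1, b);         /* SC reaches 0 here: the child becomes ready                */
    return child;
}

/* Child thread: PL, EX and PS code blocks executed in sequence on the single SPU pipeline. */
static int child_thread(int my_frame)
{
    int a = load(my_frame, 0);  /* PL: pre-load the inputs from the frame into registers */
    int b = load(my_frame, 1);
    int result = a + b;         /* EX: compute on registers only                         */
    ffree(my_frame);            /* PS: results would be post-stored to consumer threads, */
    return result;              /*     then the own frame is released (FFREE)            */
}

int main(void)
{
    int f = spawn_child(2, 3);
    printf("child result: %d\n", child_thread(f));
    return 0;
}
```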
3 Heterogeneous Many-Core Architectures Future chip multiprocessor architectures aim to address scalability, power efficiency and programmability issues. In order to achieve the goal of creating a scalable chip, the architecture is typically subdivided into nodes (Fig. 2) that are connected via a Network on Chip (NoC), where each node contains several processors (or domain-specific accelerators). The communication inside a node is done by using a faster network that connects all the elements that belong to one node (a crossbar, for example). Since the network inside a node is very efficient, and adding more elements to the node would degrade its performance, the architecture scales up by adding more nodes to the configuration, and not by increasing the node size. Efficient many-core architectures should also be designed to target diverse application domains, from single-threaded programs to scientific workloads. In order to address these domains, the architecture can contain heterogeneous nodes (Fig. 2).
[Figure: heterogeneous nodes (Node 1 with processors P11–P14 and an L2 cache, Node 2 with accelerators A21–A24 and a Local Store, ..., Node n with accelerators An1–An8) connected via a NoC to an L3 cache and a Memory Interface. Legend: P – general-purpose processor containing an L1 cache, A – domain-specific accelerator, MI – Memory Interface, L2/L3 – caches, LS – Local Store.]
Fig. 2. An instance of many-core architecture with heterogeneous nodes
Each node contains a mix of general purpose processors and/or accelerators together with local memory (such as shared L2 cache or a Local Store). Typically a general purpose processor performs the control of the nodes and provides operating system services, and may address aggressive Instruction Level Parallelism (ILP) needs. On the other hand, domain specific accelerators will speed-up applications that have specific processing needs (such as vector or multimedia). We use a shared memory model as it simplifies the programmability of the machine. Several programming models for shared memory multiprocessors have been considered recently, such as OpenMP [11] and Capsule [12].
4 Implementing DTA in Many-Core Processor A possible instance of a many-core processor with DTA support should contain a memory controller, and multiple DTA nodes that contain DTA accelerators (Fig. 3). The system needs to contain a general purpose processor (P) that is responsible for sending the code to the DTA nodes, and for initiating the DTA activity. A crossbar is used for providing a fast communication for elements inside a node, which can be a part of the more complex Network on Chip (NoC) that is used to connect the nodes. The Distributed Scheduler is located in each node and since it will communicate mostly with the LSEs inside the same node it is attached directly to the crossbar. In this study, the DTA accelerators are based on the Synergistic Processing Unit (SPU) [13] from the Cell processor. SPU is an in-order SIMD processor, which can issue two instructions in each cycle (one memory and one calculation). In order to keep the processor simple, the designers didn’t implement any branch prediction and SPU relies on the compiler to give hints on branches. It also doesn’t have any caches, but uses the local store to store data and instructions. For the purpose of running DTA
programs, the SPU is extended with the Local Scheduling Element, and frames for threads that execute on one SPU are stored in the Local Store (LS). The SPU's ISA is extended with DTA-specific instructions, and communication with the rest of the system is handled by the LSE. An SPU with the DTA-specific extension is called a DTA-PE (Fig. 3). Since the SPU contains only one pipeline, all code blocks (PL, EX and PS) will execute on it in sequence. However, the SPU's pipeline is able to issue one memory and one execution instruction at the same time and can, for instance, overlap a load from the LS with subsequent instructions.
[Figure: a DTA node containing several DTA-PEs (each an SPU with Local Store and LSE), a DSE and an L2 cache with directory, connected by a crossbar; the node is attached via the NoC to further DTA nodes, a general-purpose processor P and the Memory Controller. Data, control and instruction-fetch paths are shown.]
Fig. 3. An instance of a many core architecture with DTA support
The LSE manages threads that execute on one DTA-PE, and it contains structures with information about the current state of the DTA-PE and frame memory: the Pre-Load Queue (PLQ) and the Waiting Table (WT). The Pre-Load Queue contains information about threads that are ready to run (SC == 0). It is implemented as a circular queue and each entry contains the Instruction Pointer (IP) of the first instruction of the thread and the address of the thread's frame (Frame Pointer – FP). The Waiting Table contains information about threads that are still waiting for data (SC != 0). The number of entries in the WT is equal to the maximal number of frames that are available in the DTA-PE, and it is indexed by the frame number. Each entry in the WT contains the IP and FP of the thread and the synchronization count, which is decremented on each write to the thread's frame. Once the SC reaches zero, the IP and FP are transferred to the PLQ. In order to be able to distribute the workload optimally, the DSE must know the number of free frames in each DTA-PE. This information is contained in the Free Frame Table (FFT), which contains one entry with the number of free frames for each DTA-PE. When a FALLOC request is forwarded to a DTA-PE, the corresponding
number of free frames is decremented, and when a FFREE message arrives it is incremented. Since it may happen that a FALLOC request cannot be served immediately, a Pending FALLOC Queue (PFQ) stores pending frame requests. Each entry in this queue contains the parameters of the FALLOC request and the ID of the DTA-PE that sent the request. When a free frame is found, the corresponding entry is removed from this queue. Most of the additional hardware cost that DTA support introduces comes from the structures needed for storing information about threads. These costs are expressed in Table 1 for an implementation with one DTA node, using the rbe (register bit equivalent) [14] as the unit of measure. The register bit equivalent equals the area of a bit storage cell – a six-transistor static cell with high bandwidth that is isolated from its input/output circuits [14]. In the remainder of this section we will give an estimate of the hardware cost that DTA introduces for the case of one node. Table 1. Storage cost of DTA components expressed in register bit equivalent units: nDTA-PEs – number of DTA-PEs in the node, nF – number of frames per DTA-PE, sizeFP – size of FP in bits, nPFQ – number of PFQ entries, sizeIP – size of IP in bits, sizeSC – size of SC in bits
Component   Structure   Size [rbe]
DSE         FFT         sizeFFT-entry * nDTA-PEs
DSE         PFQ         nPFQ * (sizeIP + sizeSC + sizeID)
LSE         PLQ         nF * (sizeIP + sizeFP)
LSE         WT          nF * (sizeIP + sizeFP + sizeSC)
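The DSE behaviour described above, i.e. consulting the FFT on a FALLOC request and parking requests that cannot be served in the PFQ, can be sketched as follows. The data layout and the selection policy are our own simplifications of the mechanism, not the hardware implementation:

```c
#include <stdbool.h>

#define N_DTA_PES 8
#define N_PFQ     8

typedef struct { int ip, sc, sender_id; } falloc_req_t;

static int          fft[N_DTA_PES];      /* Free Frame Table: free frames per DTA-PE (set to nF at start-up) */
static falloc_req_t pfq[N_PFQ];          /* Pending FALLOC Queue                                             */
static int          pfq_len = 0;

/* Handle a FALLOC request: grant a frame on some DTA-PE or park the request. */
bool dse_falloc(falloc_req_t req, int *granted_pe)
{
    /* Simplified: pick the first DTA-PE with a free frame; the real DSE also
       balances the workload between the processors of the node.              */
    for (int pe = 0; pe < N_DTA_PES; pe++) {
        if (fft[pe] > 0) {
            fft[pe]--;                   /* decremented when the request is forwarded to the DTA-PE */
            *granted_pe = pe;
            return true;
        }
    }
    if (pfq_len < N_PFQ)                 /* no free frame anywhere: store the request in the PFQ     */
        pfq[pfq_len++] = req;
    return false;
}

/* Handle a FFREE message: the freed frame may serve a pending request. */
void dse_ffree(int pe)
{
    fft[pe]++;
    if (pfq_len > 0) {
        int granted;
        falloc_req_t req = pfq[--pfq_len];   /* simplified LIFO; the hardware queue would be a FIFO  */
        dse_falloc(req, &granted);
    }
}
```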
The parameters that influence the hardware cost of DTA support are the number of DTA-PEs in one node (nDTA-PEs), the number of frames in each DTA-PE (nF), the number of bits needed to store the Synchronization Counter (sizeSC), the Instruction Pointer size (sizeIP, in bits) and the number of entries in the PFQ (nPFQ). The required storage size for keeping an FP entry (sizeFP) in the LSE is log2 nF bits, since instead of keeping the entire address it is enough to keep the frame number, from which the address can be reconstructed using simple translation. The Pre-Load Queue contains nF entries (FP and IP), and the Waiting Table contains nF entries (FP, IP and SC). The frames are kept in the Local Store and no additional storage is needed for them. The Free Frame Table in the DSE contains nDTA-PEs entries, where each entry can have a value from zero to nF since nF is the maximum number of free frames in one DTA-PE (hence, the size of an entry is sizeFFT-entry = log2 (nF+1) bits). The Pending FALLOC Queue contains nPFQ entries, where each entry contains the IP (sizeIP bits), SC (sizeSC bits) and the ID of the sender (sizeID = log2 nDTA-PEs). The total size of the structures needed for hardware support is the sum of the costs for the LSEs and the cost of the DSE, and it is a function of nDTA-PEs and nF. Take for example a DTA node with nDTA-PEs = 8 DTA-PEs, where each DTA-PE has a 256kB Local Store, an Instruction Pointer of sizeIP = 32 bits, a maximal value for the SC of 256 (hence, sizeSC = 8 bits), and nF = 128 frames per DTA-PE (each frame with 64 4-byte entries). The DSE can store 8 pending FALLOC requests (one from each DTA-PE). The frames occupy 32kB in each Local Store (256kB total) and the rest of the LS can be used for storing the code and other data that cannot be
communicated using frame memory and needs to be fetched from the main memory (using a DMA unit, for example). In this case, the Pre-Load Queue has 4992 bits and the Waiting Table 6016 bits of storage, which gives a total of 1.3 kB in the LSE. The Pending FALLOC Queue takes 392 bits in the DSE, and the Free Frame Table has 64 bits of storage, which yields a total of 49 B in the DSE. Hence, all needed structures in one node occupy 10.8 kB of storage space, which is 0.5% of the total LS size. If we double the number of frames in each DTA-PE to nF = 256 (taking ½ of the LS), the required storage space increases to 22 kB, which is 1.07% of the LS size. Increasing the number of frames even more, to 384 (taking ¾ of the LS), the total size is 33.05 kB, which represents 1.6% of the LS size. Based on these values, and neglecting small contributions to the total size, we arrive at the formula: size(nDTA-PEs, nF) = K1 * nDTA-PEs * nF * (2 * log2(nF) + K2), where K1 = 11/85, K2 = 70 and the size is expressed in bytes.
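The expressions of Table 1 and the example above can be evaluated with a short program; the sketch below uses our own variable names, reproduces the reported per-LSE and node totals only approximately, and is meant solely to make the cost model concrete:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int n_pes   = 8;      /* DTA-PEs in the node           */
    const int n_f     = 128;    /* frames per DTA-PE             */
    const int size_ip = 32;     /* instruction pointer, bits     */
    const int size_sc = 8;      /* synchronization counter, bits */
    const int n_pfq   = 8;      /* pending FALLOC queue entries  */

    const int size_fp  = (int)ceil(log2(n_f));       /* frame number, bits  */
    const int size_id  = (int)ceil(log2(n_pes));     /* sender id, bits     */
    const int size_fft = (int)ceil(log2(n_f + 1));   /* one FFT entry, bits */

    double plq = n_f * (size_ip + size_fp);              /* Pre-Load Queue, per LSE */
    double wt  = n_f * (size_ip + size_fp + size_sc);    /* Waiting Table,  per LSE */
    double dse = n_pes * size_fft                        /* Free Frame Table        */
               + n_pfq * (size_ip + size_sc + size_id);  /* Pending FALLOC Queue    */

    double node_bits = n_pes * (plq + wt) + dse;
    printf("per LSE: %.0f bits, DSE: %.0f bits, node total: %.1f kB\n",
           plq + wt, dse, node_bits / 8.0 / 1024.0);
    return 0;
}
```

For the example parameters this prints a node total of roughly 10.8 kB, in line with the figure quoted above.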
5 Experimental Results 5.1 Methodology In order to validate the DTA support, we extended the SARCSim simulator and tested it with several simple kernel benchmarks. SARCSim is a many-core simulator that is based on the UNISIM framework [15] and developed to simulate the SARC architecture [8]. SARCSim/UNISIM allowed us to use already existing modules (such as processors, network and memory) and to implement only the DTA-specific extensions. The configuration used for performing simulations is the one described in the previous section, and we varied the memory latencies throughout the experiments in order to determine whether the memory is the limiting factor for scalability. The size of the LS used in the experiments is 256kB per DTA-PE. The tested configuration did not have caches implemented, and all requests went directly to the memory. However, we performed the tests with the memory latency set to one cycle in order to simulate the situation in which caches are present and requests always hit. The benchmarks used for performing the tests are:
─ The bitcount (bitcnt) from the MiBench [16] suite is a program that counts bits in various ways for a certain number of iterations (an input parameter). Its parallelization has been performed by unrolling both the main loop and the loops inside each function.
─ Fibonacci (fib) is a program that recursively calculates Fibonacci numbers. Each function call is a new DTA thread. The main purpose of this benchmark is to create a vast number of small threads in order to stress the DTA scheduler.
─ Matrix multiply (mmul) is a program that just does what the name implies. Calculations are performed in threads that work in parallel. The number of working threads is always a power of two. Inputs are two n by n matrices.
─ Zoom is an image-processing kernel for zooming. It is parallelized by sending different parts of the picture to different processors. Input is an n by n picture.
[Figure: four panels of execution time (bars, millions of cycles) and speedup (lines) versus the number of DTA-PEs: a) bitcnt (10000), b) fibonacci (10), c) mmul (32), d) zoom (32); results are given for memory latencies of 1 cycle and 150 cycles.]
Fig. 4. Execution times and speedup when varying memory latency. Execution time is shown using bars, and speedup using lines. The X axis shows number of DTA-PEs.
All these benchmarks were first hand-coded for the original DTA architecture, and then translated in order to use the SPU ISA with DTA extensions. 5.2 Preliminary Results The first set of experiments shows the scalability of the DTA TLP support when the number of DTA-PEs is increased from 1 to 8 in one node (Fig. 4). All benchmarks scale well except for Fibonacci, as the number of requests for new threads exceeds the DSE's capabilities. We encountered this situation in a previous study [7] and overcame the problem by using virtual frame pointers, as described in the same study. In this work, we have not yet considered the use of virtual frame pointers. As expected, the configuration with memory latency set to 1 cycle has a lower execution time than the configuration with memory latency set to 150 cycles. However, the scalability is the same in both cases, and the speedup is close to ideal.
6 Related Work Most of the leading hardware producers have introduced their many-core architectures recently. Examples are Cyclops-64 [1], which is a multi-core multithreaded chip currently under development by IBM, UltraSPARC T2 [2] from Sun Microsystems, and Plurality [3], which uses a pool of RISC processors with uniform memory, a hardware scheduler, synchronizer and load balancer. DTA mainly differs from these architectures in the execution model, which is based on Scheduled DataFlow in the case of DTA. Academic research projects are also focusing on many-core architectures. Speculative Data-Driven Multithreading (DDMT) [5] exploits the concept of dataflow at thread level like DTA. The main difference is that DDMT uses static scheduling, while in DTA scheduling is done dynamically at run-time in hardware. TRIPS [4] uses a tiling paradigm with different types of tiles. These tiles are reconfigurable in order to exploit different types of parallelism. TRIPS uses the dataflow concept inside a thread and control flow at the thread level, which is the opposite of what DTA does. TAM [17] defines a self-scheduled machine language with parallel threads that communicate among themselves in a dataflow manner, and that can be compiled to run on any multiprocessor system without any hardware support (unlike DTA, which has HW support).
7 Conclusions In this paper, we have presented one possible implementation of TLP support for a many-core architecture, which targets fine/medium grained threads via the hardware scheduling mechanism of the DTA. The initial tests show that the scalability of the architecture is promising in all cases up to 8 processors per node. The overall conclusion is that, since this implementation of TLP support scales well, it suits the many-core environment well. As future work, we want to test different configurations with more nodes and to implement some techniques present in native DTA (e.g. virtual frame pointers). We will also focus on a tool that would allow us to automatically extract DTA code from high-level programming languages by using methods like OpenMP, which would allow us to perform tests with more benchmarks. Acknowledgements. This work was supported by the European Commission in the context of the SARC integrated project #27648 (FP6) and by the HiPEAC Network of Excellence (FP6 contract IST-004408, FP7 contract ICT-217068).
References 1. Almási, G., et al.: Dissecting Cyclops: a detailed analysis of a multithreaded architecture. SIGARCH Comput. Archit. News 31(1), 26–38 (2003) 2. Shah, M., et al.: UltraSPARC T2: A highly-treaded, power-efficient, SPARC SOC. In: IEEE Asian Solid-State Circuits Conference, ASSCC 2007, Jeju (2007)
3. Plurality architecture, http://www.plurality.com/architecture.html 4. Sankaralingam, K., et al.: Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In: Proceedings of the 30th annual international symposium on Computer architecture, pp. 422–433. ACM Press, San Diego (2003) 5. Kyriacou, C., Evripidou, P., Trancoso, P.: Data-Driven Multithreading Using Conventional Microprocessors. IEEE Trans. Parallel Distrib. Syst. 17(10), 1176–1188 (2006) 6. Harris, T., et al.: Transactional Memory: An Overview. IEEE Micro. 27(3), 8–29 (2007) 7. Giorgi, R., Popovic, Z., Puzovic, N.: DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems. In: 19th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2007, Gramado, Brasil, pp. 263–270 (2007) 8. SARC Integrated Project, http://www.sarc-ip.org 9. Kavi, K.M., Giorgi, R., Arul, J.: Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation. IEEE Transaction on Computers 50(8), 834–846 (2001) 10. Giorgi, R., Popovic, Z., Puzovic, N.: Exploiting DMA mechanisms to enable non-blocking execution in Decoupled Threaded Architecture. In: Proceedings of the Workshop on Multithreaded Architectures and Applications (MTAAP 2009), held in conjunction with the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), Rome, Italy, May 25-29, 2009, pp. 1–8 (2009) ISBN 978-1-4244-3750-4 11. The OpenMP API specification for parallel programming, http://openmp.org 12. Pierre, P., Yves, L., Olivier, T.: CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs. In: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, Los Alamitos (2006) 13. Kahle, J.A., et al.: Introduction to the cell multiprocessor. IBM J. Res. Dev. 49, 589–604 (2005) 14. Flynn, M.J.: Computer Architecture. Jones and Bartlett Publishers, Sudbury (1995) 15. August, D., et al.: UNISIM: An Open Simulation Environment and Library for Complex Architecture Design and Collaborative Development. IEEE Comput. Archit. Lett. 6(2), 45–48 (2007) 16. Guthaus, M.R., et al.: MiBench: A free, commercially representative embedded benchmark suite. In: Proceedings of the Workload Characterization, WWC-4, 2001. IEEE International Workshop, pp. 3–14. IEEE Computer Society, Los Alamitos (2001) 17. Culler, D.E., et al.: TAM - a compiler controlled threaded abstract machine. J. Parallel Distrib. Comput. 18(3), 347–370 (1993)
Implementation of W-CDMA Cell Search on a FPGA Based Multi-Processor System-on-Chip with Power Management
Roberto Airoldi 1, Fabio Garzia 1, Tapani Ahonen 1, Dragomir Milojevic 2, and Jari Nurmi 1
1 Tampere University of Technology, Department of Computer Systems, P.O. BOX 553, FIN-33101, Tampere, Finland {firstname.lastname}@tut.fi
2 Université Libre de Bruxelles, Bio, Electro and Mechanical Systems, CP165/56, av. F. Roosevelt 50, B-1050 Brussels, Belgium {Dragomir.Milojevic}@ulb.ac.be
Abstract. In this paper we describe a general-purpose, homogeneous Multi-Processor System-on-Chip (MPSoC) based on 9 processing clusters using COFFEE RISC processors and a hierarchical Network-on-Chip, implemented on an FPGA device. The MPSoC platform integrates a cluster clock gating technique, enabling independent core and memory sleep modes. Low cluster turn-on delay allows frequent use of this technique, resulting in power savings. In order to quantify the performance of the proposed platform and the reduction of power consumption, we implement the Target Cell Search part of WCDMA, a well-known SDR application. We show that the proposed MPSoC platform achieves a significant speed-up (7.3X) when compared to a comparable single-processor platform. We also show that a significant reduction in dynamic power consumption can be achieved (50% for the complete application) using the proposed cluster clock-gating technique.
1 Introduction and Related Work
Embedded applications, such as Software Defined Radio (SDR), require more and more processing power that has to be delivered under the constraints of low cost and low power consumption. For the most demanding embedded applications, standard DSP cores alone cannot be used any more, because they do not have enough processing power. This is one of the reasons why in the past years dedicated solutions based on specific hardware platforms have been proposed in the literature. One can mention a multithreaded, low-power processor for SDR applications combining classical integer and SIMD units and embedded into a more complex SoC, proposed in [1]. The typical power consumption of this implementation is 150mW per core, running at 600MHz, with a 0.9V supply. In the context of MPSoC platforms for SDR applications, a fully programmable architecture with 4 SIMD cores has been proposed in [2], achieving 2Mbps Wide-band Code-Division Multiple-Access (WCDMA) at a power consumption of 270mW,
and 24Mbps 802.11a at about 370mW for 90nm technology. More specifically, a low-power cell search in WCDMA has been described in [3], with a power consumption of 36mW for a circuit built in 350nm. The increased performance of new-generation FPGA circuits, together with low-power strategies, allows the implementation of complex SoCs on these devices, some of them dedicated to SDR applications [4]. An implementation of OFDM on an FPGA has been described in [5], and a more complex SoC using a Network-on-Chip as communication infrastructure has been described in [6]. In this paper we present a medium-size, general-purpose, homogeneous Multi-Processor System-on-Programmable-Chip (MPSoPC), called Coffee Machine. The platform is based on 9 Coffee RISC processors described in [7] and incorporates a hierarchical Network-on-Chip (NoC) for inter-processor communication, introduced in [8,9,10]. The proposed platform runs the Target Cell Search algorithm, part of WCDMA [11], a widely adopted air interface for 3rd Generation (3G) mobile communication systems. This paper is organized as follows. In Section 2 we will give an overview of the proposed MPSoC platform architecture and will provide a more detailed description of the different building blocks. We will further concentrate on the utilized software management of clock gating islands for the processing cores and local memories. In Section 3 we will briefly discuss the WCDMA Target Cell Search application mapped on the MPSoC platform and focus on some of the implementation details. Section 4 gives more detailed information on the experimental set-up and system performance: processing capability, area, power consumption with and without cluster clock-gating. Finally, Section 5 concludes the paper and discusses future directions.
2 MPSoC Platform with Dynamic Power Management
Coffee Machine is an MPSoC platform derived from the Silicon Cafe template architecture developed at Tampere University of Technology. Coffee Machine is composed of nine computational clusters (CCs) built around the Coffee RISC [7] processor core. Transactions between CCs take place through a mesh network that taps directly into the local, switched communication infrastructure. A simplified view of the Coffee Machine is provided in figure 1. The nine CCs of the Coffee Machine enable efficient parallelization of the target application. Operation is conducted hierarchically by means of centralized task distribution and data flow management. These run-time management functions include allocating resources for processing tasks as well as parallelizing and scheduling data streams. The management functions are mapped to the central CC (CC4 in figure 1) making it in essence the system master. The master CC is directly connected to off-chip peripherals and main memory. Acting as slaves to the master CC, the remaining eight CCs are utilized for parallel processing. The master CC’s central position in the network topology ensures balanced communication latency across the slaves as the maximum number of hops between any slave and the master is two. This enables wide utility
Fig. 1. Simplified view of the Coffee Machine MPSoC platform
Fig. 2. Computational cluster built around the Coffee RISC processor core
of the slaves for varying subtasks regardless of the physical location. Workload distribution between the eight slave CCs is highly uniform. There are power-of-two elements in most of the target application's data streams. Consequently, a significant amount of work can be divided into eight approximately equivalent and independent threads. Figure 2 illustrates the CC structure in more detail. Each CC contains a Coffee RISC processor with local data and instruction scratchpads, and a Bridge Interface (BIF). The BIF is connected with the global network and is composed of an initiator side, a target side and a local arbiter. The master CC contains off-chip I/O peripherals in addition. Data and instruction scratchpad capacities are 256kB and 32kB for the master, 32kB and 16kB for a slave CC. System-wide access to the local resources (memories and peripherals) is provided through non-blocking switching interconnections.
2.1 Interconnection Architecture
CCs have two contending initiators (masters): the processor and the initiator side of the bridge (B/I). Targets (slaves) are accessed in parallel through the local switches unless both request the exact same target at the exact same time. Arbitration both here and in the global switches is based on programmable priority sequences. In case of a negative arbitration result, a stop signal is asserted, forcing the request to be held until a go is issued. The processor's requests to access remote peripherals, that is, peripherals of another cluster, are directed by the local switches to the target side of the bridge (B/T). The B/T contains a run-time reconfigurable lookup table where routing request sequences are assigned to memory pages. These sequences steer the request to its destination cluster where it is absorbed by the respective B/I. The B/I executes remote read and write operations on the local cluster as well as responds to remote reads by passing the data to the B/T for return delivery.
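The page-based routing performed by the B/T can be pictured as a simple table lookup; the address split, the page size and the table format below are illustrative assumptions of ours, not the actual Coffee Machine registers:

```c
#include <stdint.h>

#define PAGE_SHIFT 16                 /* assumed page size of 64 kB            */
#define N_PAGES    64                 /* assumed number of remote memory pages */

/* Run-time reconfigurable table: one routing request sequence per memory page
   (e.g. the sequence of output ports to take through the global switches).    */
static uint8_t route_table[N_PAGES];

/* Look up the routing sequence that steers a remote access to its destination cluster. */
uint8_t bt_lookup(uint32_t address)
{
    uint32_t page = (address >> PAGE_SHIFT) % N_PAGES;
    return route_table[page];
}
```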
Multicasting and broadcasting can be achieved in two ways. The data and its remaining routing request sequence can be switched to multiple output ports, thereby supporting regular casting patterns. For non-regular casting patterns, data can be targeted to a remote B/T in addition to the remote peripheral. The remote B/T will pass a copy to another CC, where the same procedure can be repeated if necessary. As indicated in Table 2, the speedup achieved with the Target Cell Search application was 7.3X in comparison to a single-processor system. The broadcast mechanism significantly reduces the communication overhead due to the distribution of samples to the slave cores. In fact, the overall speedup would have been less than 5.2X without broadcasting support. Details of the architecture are described in [10] and [9].
2.2 Power Management Employed
In an MPSoC, it is likely that a data transfer is ongoing while some part of the chip has nothing to process. This is the case in the Coffee Machine, for example, during retrieval of results from a slave CC's data scratchpad. It is thus advantageous to manage the local memory and processing clocks independently from each other. Based on initial experiments with the power management techniques available on Stratix II FPGAs, a software-controlled clock gating scheme was chosen for the Coffee Machine. The scheme allows disabling idle processors and/or memories instantly for arbitrary periods of time. There is no adverse effect on the maximum operating frequency and the associated resource utilization overhead is relatively low. The clock gating state is modifiable through a control register with individual enable/disable bits for the memory and processing clocks of the slave CCs.
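In software, such a scheme amounts to setting or clearing bits of a memory-mapped control register; the register address and bit layout in the sketch below are illustrative assumptions, not the actual Coffee Machine memory map:

```c
#include <stdint.h>

/* Assumed memory-mapped clock-gating control register (the address is illustrative). */
#define CLKGATE_REG  (*(volatile uint32_t *)0x80000000u)

/* Assumed bit layout: bit i enables the processing clock of slave CC i,
   bit (i + 8) enables its memory clock.                                  */
#define CPU_CLK(cc)  (1u << (cc))
#define MEM_CLK(cc)  (1u << ((cc) + 8))

/* Gate a slave cluster's processor but keep its scratchpads clocked,
   e.g. while the master retrieves results from the cluster's data scratchpad. */
void sleep_cpu_keep_memory(int cc)
{
    CLKGATE_REG &= ~CPU_CLK(cc);     /* processing clock off               */
    CLKGATE_REG |=  MEM_CLK(cc);     /* memory clock stays (or becomes) on */
}

/* Wake the whole cluster before dispatching the next task to it. */
void wake_cluster(int cc)
{
    CLKGATE_REG |= CPU_CLK(cc) | MEM_CLK(cc);
}
```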
3 Mapping the Cell Search on the Proposed MPSoC
In this work we will concentrate on one particular part of the W-CDMA system, the target cell search, an operation that can be divided into three steps:
1. Slot Synchronization
2. Frame Synchronization
3. Scrambling Code Identification
These functional blocks will be mapped on the proposed MPSoC platform using the power management technique described in the previous section.
3.1 Slot Synchronization
During the Slot Synchronization phase (task graph in Fig. 3), the receiver uses the sequence obtained from the Primary-Synchronization Channel (P-SCH) to synchronize with a cell. The samples coming from one slot (2560 samples) are analyzed in this phase. The operation is performed in two steps.
Fig. 3. Task graph for the slot synchronization. Tasks in black executed by the main core, tasks in white executed in parallel by all the slaves.
In the first step we identify the possible slot boundary. This is achieved by correlating 256 samples from the input sequence with 256 coefficients of the known start code. A value higher than a fixed threshold in the correlation output indicates that the slot boundary has been detected. The task graph in Fig. 3 shows how the workload is distributed. Considering this first part, the main core distributes in broadcast the coefficients for the correlation and a part of the input sequence (256/8 values to each slave core). The slave cores perform the correlations (tasks in white), and the main core evaluates whether the peak has been found. If not, samples are sent to the slaves and the same operations are repeated. In the second step, the system tries to synchronize with the strongest path, since it is possible to have different translated versions of the synchronization signal in the air. This is achieved by calculating correlations on a fixed time frame (1024 samples) for the next 4 slots, and returning the average. Also in this case the main core takes care of data distribution and calculation of the average, while the slaves calculate the correlations. In the context of the power analysis, this last step will not be taken into account, because the simulation cannot be performed using the current environment due to computational and memory resource limitations. For the idle time estimation of this operation we will simply assume that all cycles are busy. The number of idle cycles decreases significantly with the peak distance. For example, we calculated that 39% of total platform cycles are idle if a peak is found after 50 samples, while it is 20% for a peak detected after 470 samples.
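The slot boundary detection is essentially a sliding correlation against the 256-coefficient start code followed by a threshold test. A plain C sketch of this computation (ignoring complex-valued arithmetic, fixed-point scaling and the distribution of the correlation lags over the eight slave cores) could look as follows:

```c
#include <stdlib.h>

#define CODE_LEN 256    /* P-SCH correlation length */
#define SLOT_LEN 2560   /* samples per slot         */

/* Correlate the known start code with the input sequence at offset 'lag'.
   'samples' must hold at least SLOT_LEN + CODE_LEN values.                */
static long correlate(const int *samples, const int *code, int lag)
{
    long acc = 0;
    for (int i = 0; i < CODE_LEN; i++)
        acc += (long)samples[lag + i] * code[i];
    return labs(acc);
}

/* Return the candidate slot boundary, or -1 if no correlation peak exceeds the threshold. */
int find_slot_boundary(const int *samples, const int *code, long threshold)
{
    for (int lag = 0; lag < SLOT_LEN; lag++)            /* one correlation point per lag  */
        if (correlate(samples, code, lag) > threshold)  /* peak -> possible slot boundary */
            return lag;
    return -1;
}
```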
3.2 Frame Synchronization
The receiver uses the Secondary-Synchronization Channel (S-SCH) to perform Frame Synchronization, identifying the code group of the cell found in the Slot Synchronization step of the algorithm. Frame Synchronization consists of 16 parallel correlations. The correlations are computed over all the slots that compose one frame, that is 15 slots. Correlations are executed between the received signal and all possible Secondary Synchronization Code Sequences (SSCS), of which there are 16. Considering the task graph in Fig. 4, the main core transfers the data of two of the 16 SSCSs to each slave cluster and then broadcasts 4 samples of the incoming data stream.
Fig. 4. Task graph for the frame synchronization
Fig. 5. Task graph for the scrambling code identification
Once the 16 correlations are computed, the master core builds the received codeword. Each core then compares the obtained codeword with a local subset of all possible codewords to identify the code group. Each comparison yields a weight, and the master core finds the maximum among these weights. In this second phase, the share of idle cycles is 37.5%.
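The code-group identification thus amounts to comparing the received codeword against the candidate codewords held by each slave and keeping the best match. A possible per-slave sketch, using a simple match count as the weight (our simplification of the comparison), is:

```c
#define FRAME_SLOTS 15   /* one codeword symbol per slot, 15 slots per frame */

/* Weight of a candidate: number of slot positions matching the received codeword. */
static int codeword_weight(const int *received, const int *candidate)
{
    int weight = 0;
    for (int s = 0; s < FRAME_SLOTS; s++)
        if (received[s] == candidate[s])
            weight++;
    return weight;
}

/* Each slave scans its local subset of candidate codewords; the master later takes
   the maximum over all slaves to identify the code group.                           */
int best_local_candidate(const int *received, const int candidates[][FRAME_SLOTS],
                         int n_local, int *best_weight)
{
    int best = 0;
    *best_weight = -1;
    for (int c = 0; c < n_local; c++) {
        int w = codeword_weight(received, candidates[c]);
        if (w > *best_weight) { *best_weight = w; best = c; }
    }
    return best;
}
```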
3.3 Scrambling Code Identification
During Scrambling Code Identification, the receiver determines the exact primary scrambling code used by the cell. The primary scrambling code is identified through symbol-by-symbol correlation between the received data stream and all possible codes within the code group, identified during Frame Synchronization. This phase is characterized by only 4 tasks (Fig. 5). There is first a sequential transfer of the possible scrambling codes to different slave cores, and then a broadcast transfer of the input samples. At this point slave cores perform the correlations and the master searches for the maximum. In this case the amount of data to be sent sequentially is much higher than the amount of data sent by broadcast. Since most of the time is spent in the transfer, the estimated share of idle cycles for this step is 78.5%.
4 Experimental Set-Up and Results
The MPSoC platform is described in VHDL. The target FPGA device is Altera Stratix II EP2S180. The application has been compiled and run in the simulation framework of the proposed MPSoC platform. Switching activity information is passed to a power analysis tool using Value Change Dump (VCD) files created during simulation.
4.1 Synthesis Results
The Coffee Machine utilizes 72% of the programmable resources on an Altera EP2S180 FPGA device and works at a maximum frequency of 75 MHz. The area breakdown of the individual functional blocks of one processing cluster, the global communication and the whole system is given in Table 1.
Table 1. Area breakdown of the Coffee Machine

Component              Adapt. LUT   Registers   Utilization %
CPU                    7779         4944        7.4
Init NI                12           114         0.1
Target NI              65           110         0.1
Total/Cluster          8085         6887        7.7
Global Communication   2751         3440        3.1
Total                  76648        50218       72
Table 2. Cycle count of the Target Cell Search on a single processing cluster and Coffee Machine

Cell Search Step                    Single [cycles]   MPSoC [cycles]   Speed-Up
Single Correlation Point            12381             2546             5X
Slot Synchronization (Fixed Part)   52890764          7147387          7.5X
Frame Synchronization               3750593           471458           8X
Scrambling Code ID                  149973            56203            2.7X
Entire Application                  57410380          7802348          7.3X
4.2 Processing Performance of the Platform
For comparison purposes we also implemented the same application on a system containing a single processing cluster only (using the same processor instance as in the MPSoC version). Results of the cycle count comparisons for different steps of the application are given in Table 2. The first step (Slot Synchronization) is characterized by an undefined number of correlations followed by a fixed part of the code. The number of correlations depends on how long it takes to detect the first peak. For benchmarking purposes, we consider the speedup in the calculation of a single correlation point and the speedup related to the fixed part. We observe a speedup of 5X for the proposed MPSoC platform over the single processing cluster. The less than ideal speedup is mainly attributed to the sequential transfer of the samples to all slave cores. The fixed part, which requires only an initial transfer, gives a 7.5X speedup. The speedup for Frame Synchronization is 8X. Such a high value can be explained by the fact that all the transfers use the broadcast mechanism, thus minimizing the communication overhead. The Scrambling Code Identification step gives the worst speedup (2.7X). In this case the slave cores use different coefficients, known only after Frame Synchronization. Therefore the transfer of these coefficients cannot benefit from the broadcast mechanism, introducing a significant transfer time overhead. We also evaluated the speedup related to the execution of the entire application, supposing that the first peak is found after 50 correlations. The result is 7.3X. Even though the application requires a large number of data transfers,
its MPSoC implementation benefits from the code parallelization and hardware mechanisms that reduce the communication overhead. Considering the FPGA implementation, the entire cell search requires 104ms. The largest fraction of time is spent in slot synchronization. Since it cannot be done in real-time (one slot is 0.67ms wide), the buffering of 5 slots is required (25 KB altogether). After that, the receiver should keep synchronization with the incoming samples, for example buffering only one slot and overwriting the values when the buffer is full. The frame synchronization is performed in 6.2ms, and since it is faster than the acquisition of a frame of samples, it does not require additional buffering. The same considerations apply to the scrambling code identification, which requires 0.75ms. We are assuming here that the three steps are executed sequentially.
4.3 Power Consumption
Power consumption figures of the initial design and the design including the processing-cluster clock-gating technique are shown in Figure 6 (respectively indicated on the abscissa as NCLKG and CLKG) for different steps of the Target Cell Search. We present the dynamic power consumption components (logic and routing power, respectively shown as light and dark gray boxes) of the individual clusters (from Cluster 0 to Cluster 8) and of the complete COFFEE Machine platform. Static power dissipation is not included, because it remains constant with and without clock gating and has been reported at 1.68mW. The power consumption
Fig. 6. Power consumption of the Coffee Machine without and with proposed cluster clock gating
reduction for the different steps is 18%, 44% and 46%, respectively. The total reduction in power consumption for the complete Target Cell Search application is 33%. While the absolute power dissipation is high in comparison to ASIC implementations proposed in the literature, one can argue that this is because of the low power efficiency of FPGA circuits. A power scaling factor for FPGA to ASIC technology migration can be used for comparison purposes, such as the one proposed in [12], where 90nm ASIC and FPGA designs were compared, suggesting a factor of 12 reduction in dynamic power consumption.
5 Conclusions
We presented a homogeneous MPSoC platform with 9 processing clusters. The architecture is based on computational clusters built around the COFFEE RISC processor and a hierarchical NoC with broadcasting support. For a representative SDR application, the Target Cell Search of W-CDMA, we show that such a platform can provide enough processing power running on an FPGA circuit (Altera Stratix II EP2S180 device, with an operating frequency of 75MHz). It was noted that the broadcasting capability of the proposed MPSoC platform contributed significantly to the speedup achieved with the parallelized application. The additional overall speedup thanks to broadcasting was 1.41X, while some parts of the application enjoyed more than twice the performance of a system without broadcasting support. Furthermore, we described a simple power management technique based on cluster clock gating, enabling individual processor and memory sleep modes. We show that, depending on the subtask of the application, the power savings vary from 18% to 46%, for a total of 33% for the complete Target Cell Search application.
References 1. Mamidi, S., Blem, E.R., Schulte, M.J., Glossner, J., Iancu, D., Iancu, A., Moudgill, M., Jinturkar, S.: Instruction set extensions for software defined radio on a multithreaded processor. In: CASES 2005: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, pp. 266–273. ACM, New York (2005) 2. Lin, Y., Lee, H., Harel, Y., Woh, M., Mahlke, S., Mudge, T., Flautner, K.: A System Solution for High-Performance, Low Power SDR. In: Proceeding of the SDR 2005 Technical Conference and Product Exposition (2005) 3. Li, C.-F., Chu, Y.-S., Ho, J.-S., Sheen, W.-H.: Cell Search in WCDMA Under Large-Frequency and Clock Errors: Algorithms to Hardware Implementation. IEEE Transactions on Circuits and Systems I: Regular Papers 55(2), 659–671 (2008) 4. Jenkins, C., Ekas, P.: Low-power Software-Defined Radio Design Using FPGAs. In: Software Defined Radio Technical Conference and Product Exposition, Orlando, Florida, November 13-16 (2006) 5. Dick, C., Harris, F.: FPGA implementation of an OFDM PHY. In: Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 9-12 November, vol. 1, pp. 905–909 (2003)
6. Delorme, J., Martin, J., Nafkha, A., Moy, C., Clermidy, F., Leray, P., Palicot, J.: A FPGA partial reconfiguration design approach for cognitive radio based on NoC architecture. In: Circuits and Systems and TAISA Conference, NEWCAS-TAISA 2008. 2008 Joint 6th International IEEE Northeast Workshop, June 2008, pp. 355– 358 (2008) 7. Kylli¨ ainen, J., Ahonen, T., Nurmi, J.: General-purpose embedded processor cores the COFFEE RISC example. In: Nurmi, J. (ed.) Processor Design: System-on-Chip Computing for ASICs and FPGAs, ch.5, pp. 83–100. Kluwer Academic Publishers / Springer Publishers (June 2007) ISBN-10: 1402055293, ISBN-13: 978-1-4020-5529-4 8. Ahonen, T., Nurmi, J.: Synthesizable switching logic for network-on-chip designs on 90nm technologies. In: Proceedings of the 2006 International Conference on IP Based SoC Design (IP-SOC 2006), December 6-7, pp. 299–304. Design and Reuse S.A (2006) 9. Ahonen, T., Nurmi, J.: Programmable switch for shared bus replacement. In: Proceedings of the 2006 International Conference on Ph.D. Research in Microelectronics and Electronics (PRIME 2006), June 11-15, pp. 241–244. IEEE, Los Alamitos (2006) 10. Ahonen, T., Nurmi, J.: Hierarchically Heterogeneous Network-on-Chip. In: The International Conference on Computer as a Tool, EUROCON, 2007, September 9-12, pp. 2580–2586 (2007) 11. Wang, Y.-P.E., Ottosson, T.: Cell search in W-CDMA 18(8), 1470–1482 (2000) 12. Kuon, I., Rose, J.: Measuring the Gap Between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26(2), 203–215 (2007)
A Multiprocessor Architecture with an Omega Network for the Massively Parallel Model GCA
Christian Schäck, Wolfgang Heenes, and Rolf Hoffmann
Technische Universität Darmstadt, FB Informatik, FG Rechnerarchitektur, Hochschulstraße 10, D-64289 Darmstadt, Germany
{schaeck,heenes,hoffmann}@ra.informatik.tu-darmstadt.de
Abstract. The GCA (Global Cellular Automata) model consists of a collection of cells which change their states synchronously depending on the states of their neighbors, as in the classical CA (Cellular Automata) model. In contrast to the CA model, the neighbors are not fixed and local; they are variable and global. The GCA model is applicable to a wide range of parallel algorithms. In this paper a general-purpose multiprocessor architecture for the massively parallel GCA model is presented. In contrast to a special-purpose implementation of a GCA algorithm, the multiprocessor system allows a flexible implementation through programming. The architecture mainly consists of a set of processors (Nios II) and a network. The Nios II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The network is a well-known omega network. Only read accesses through the network are necessary in the GCA model, leading to a simplified structure. A system with up to 32 processors was implemented as a prototype on an FPGA. The analysis and implementation results show that the performance of the system scales with the number of processors. Keywords: Global Cellular Automata, multiprocessor architecture, omega network, FPGA.
1 Introduction
The GCA (Global Cellular Automata) model [1] is an extension of the classical CA (Cellular Automata) model [2]. In the CA model the cells are arranged in a fixed grid with fixed connections to their local neighbors. Each cell computes its next state by the application of a local rule depending on its own state and the states of its neighbors. The data accesses to the neighbors' states are read-only, and therefore no write conflicts can occur. The rule can be applied to all cells in parallel, so the model is inherently massively parallel. The CA model is suited to all kinds of applications with local communication, like physical fields, lattice-gas models, models of growth, moving particles, fluid flow, routing problems, picture processing, genetic algorithms, and cellular neural networks.
The GCA model is a generalization of the CA model which is also massively parallel. It is not restricted to local communication because any cell can be a neighbor. Furthermore, the links to the neighbors are not fixed; they can be changed by the local rule from generation to generation. Thereby the range of parallel applications is much wider for the GCA model. Typical applications besides the CA applications are graph algorithms [3], hypercube algorithms, logic simulation [4], numerical algorithms, communication networks, neural networks, and graphics. The state of a GCA cell consists of a data part and one or more pointers (Fig. 1). The pointers are used to dynamically establish links to global neighbors. We call the GCA model one-handed if only one neighbor can be addressed, two-handed if two neighbors can be addressed, and so on. In our investigations of GCA algorithms we found that most of them can be described with only one link.
Fig. 1. The operation principle of the GCA
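To make the operation principle concrete, a minimal sequential model of a one-handed GCA generation might look as follows; the struct layout, the helper names, and the rule signature are illustrative assumptions, not the paper's implementation.

#include <stddef.h>

// Illustrative one-handed GCA cell: a data part d and a single pointer p
// holding the index of the current global neighbor (cf. Fig. 1).
typedef struct {
    int d;   // data part
    int p;   // pointer part: index of the global neighbor
} gca_cell;

// One synchronous generation: every cell reads its own old state and its
// neighbor's old state and computes a new state (data and pointer); results
// are written to a separate buffer, so only read accesses to other cells occur.
static void gca_generation(const gca_cell *old_gen, gca_cell *new_gen, size_t n,
                           gca_cell (*rule)(gca_cell self, gca_cell neighbor))
{
    for (size_t i = 0; i < n; i++) {
        gca_cell self     = old_gen[i];
        gca_cell neighbor = old_gen[self.p];  // global, dynamically chosen link
        new_gen[i] = rule(self, neighbor);    // the rule may also change the pointer
    }
}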
The aim of our research is the hardware and software support of this model. There are mainly three possibilities for an implementation.
1. Fully Parallel Architecture. A specific GCA algorithm is directly mapped into the hardware using registers, operators and hardwired links which may also be switched if necessary. The advantage of such an implementation is a very high performance [5], but the problem size is limited by the hardware resources and the flexibility to apply different rules is low.
2. Partially Parallel Architecture with Memory Banks. This architecture [6,7] also offers a high performance, is scalable and can cope with a large number of cells. The flexibility to cope with different rules is restricted.
3. Multiprocessor Architecture. This architecture is not as powerful as the two above, but it has the advantage that it can be tailored to any GCA problem by programming. It also allows integrating standard or other computational models.
In [8] we described a 16-bit multiprocessor architecture with a multiplexer network, used for dedicated algorithms like the fast Fourier transform and bitonic merge. In this contribution we present a multiprocessor architecture for the GCA model with an omega network [9] based on Nios II which was also implemented in
FPGA logic. The performance of multiprocessor SoC architectures depends on the communication network and the processor architecture. Several contributions have described and compared different bus architectures [10]. Multiprocessor architectures with the Nios II softcore were presented in [11].
2 Multiprocessor Architecture
2.1 Overview
– The system consists of p processors with internal memories (program and data) and an omega interconnection network (Fig. 2).
– The data memories are implemented as dual-port memories. One port is directly connected to its associated processor; the second port is connected to the network, enabling all other processors to access the data in the memory.
– Each data memory holds a part of the GCA cell field of the application; a small sketch of this partitioning is given after Fig. 2.
– A processor can modify only its own cells stored in its internal data memory.
– Each processor has only read access to the other data memories; write accesses need not be implemented due to the GCA model.
– The local GCA rule is programmable by processor instructions.
– The processor instructions support the accesses to the cells in the internal memory and the read accesses to external cells stored in the other memories.
One processor with its associated dual-port data memory is called a Processing Unit (PU). The omega network interconnects the processing units. The network complexity therefore depends on the complexity of the communication patterns needed by the class of GCA algorithms to be implemented. For many GCA algorithms [12] the communication pattern is rather simple, which simplifies the design of the network. Simple or specialized networks can be implemented with multiplexers or fixed connections [8,9]; complex networks have to be able to manage concurrent read accesses to arbitrary external memory locations.
Fig. 2. General system architecture with datapath
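As a small illustration of how the cell field could be distributed over the p data memories, the following sketch assumes a block distribution with n = N/p cells per PU (as used later in Section 4); the names and the mapping itself are illustrative, not taken from the paper.

// Illustrative only: block distribution of N cells over p processing units,
// n = N/p cells per PU; a global cell index maps to (owning PU, local address).
typedef struct {
    int pu;    // which PU's dual-port data memory holds the cell
    int addr;  // local address inside that memory
} cell_location;

static cell_location locate_cell(int global_index, int cells_per_pu)
{
    cell_location loc;
    loc.pu   = global_index / cells_per_pu;
    loc.addr = global_index % cells_per_pu;
    return loc;
}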
As in the GCA model a cell is not allowed to modify the contents of another cell, the network design is simplified because write accesses need not be implemented. Memory arbitration of concurrent read accesses is handled in a decentralized way within the switching elements of the omega network. Thereby concurrent read accesses need not be handled by each PU, which reduces the overall control logic. It therefore makes no difference whether a PU has to wait due to a blocked network or due to a concurrent read access.
2.2 The Nios II Softcore Processor
The Nios II processor [13] is a general-purpose RISC processor core, providing:
– Full 32-bit instruction set, data path, and address space
– 32 general-purpose registers
– Dedicated instructions for computing 64-bit and 128-bit products of multiplication
– Floating-point instructions for single-precision floating-point operations
– Custom instructions
For the implementation of the multiprocessor architecture, the custom instructions [13] turned out to be very useful (see Section 2.4).
2.3 The Omega Network
The omega network is commonly used in parallel computer systems. It belongs to the category of blocking networks. Figure 3 shows the possible switching states of one switching element. The switching elements are able to process a broadcast request: if two processing units want to access the very same memory location, the concurrent access can be resolved by broadcasting. Figure 4 shows the control signals of a switching element. Input 0 (ain, req0) has a higher priority than input 1. The switching elements consist of combinatorial logic only. Figure 5 shows the multiprocessor architecture with eight processing units. For a better understanding of the architecture, the processing units are shown twice. This example also shows that each processor can access its own memory either internally or by using the network. Different accesses to the same memory location are sequentialized by using different priorities.
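The self-routing property that makes such networks attractive can be sketched with a tiny software model. This is a generic illustration of omega-network routing, not the paper's switching-element logic; blocking, priorities and broadcast are deliberately not modeled.

#include <stdio.h>

// Toy model of self-routing in an omega network with P inputs/outputs and
// log2(P) stages. Before each stage the links are permuted by a perfect
// shuffle; each 2x2 switch then forwards the request to its upper or lower
// output according to one bit of the destination address (MSB first).
#define P      8
#define STAGES 3   // log2(P)

static int route_step(int pos, int dest_bit)
{
    int shuffled = ((pos << 1) | (pos >> (STAGES - 1))) & (P - 1); // perfect shuffle
    return (shuffled & ~1) | dest_bit;                             // pick switch output
}

int main(void)
{
    int src = 3, dst = 6;   // e.g., PU 3 issues a read to the memory of PU 6
    int pos = src;
    for (int s = 0; s < STAGES; s++) {
        int bit = (dst >> (STAGES - 1 - s)) & 1;
        pos = route_step(pos, bit);
        printf("after stage %d: link %d\n", s, pos);
    }
    printf("request arrives at output %d (expected %d)\n", pos, dst);
    return 0;
}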
2.4 Custom Instructions
We expanded the Nios II instruction set with one extended custom instruction. The custom instruction can be specified with one of the following function values.
Fig. 3. Switching elements
Fig. 4. Control signals of a switching element
Fig. 5. Multiprocessor architecture with eight processing units
– LOCAL_GET_P: internal read of the pointer p (Fig. 1) from the internal memory
– LOCAL_GET_D: internal read of the data d from the internal memory
– SET_PS: write pointer p to the internal memory and shift right by one
– SET_P: write pointer p to the internal memory
– GET_P: (synchronized) read pointer p by using the network
– SET_D: write data d to the internal memory
– GET_D: (synchronized) read data d by using the network
– NEXTGEN: processor synchronization command (see Section 2.5)
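The following fragment sketches how these function values might be wrapped for use in application code. The ALT_CI_CUSTOM_NETWORK_EXT_INST macro, the function codes, and their #defines are those of Listing 1 (Section 4) and are assumed to be in scope via system.h; the helper names, the packing of the global address (processor ID plus local memory address), and the assumption that the generated macro returns the read value are illustrative, not taken from the paper.

// Illustrative wrappers around the extended custom instruction of Listing 1.
// The (processor ID, memory address) packing below is an assumption; the real
// encoding depends on the generated hardware.
#define GLOBAL_ADDR(pu, addr)  (((pu) << 16) | (addr))   // hypothetical packing

static inline int remote_get_d(int pu, int addr)
{
    // Synchronized read of the data word d of a cell owned by another PU,
    // routed over the omega network.
    return ALT_CI_CUSTOM_NETWORK_EXT_INST(GET_D, GLOBAL_ADDR(pu, addr), 0);
}

static inline int local_get_d(int addr)
{
    // Read the data word d of an own cell directly from the internal memory.
    return ALT_CI_CUSTOM_NETWORK_EXT_INST(LOCAL_GET_D, addr, 0);
}

static inline void local_set_d(int addr, int value)
{
    // Write the data word d of an own cell into the internal memory.
    ALT_CI_CUSTOM_NETWORK_EXT_INST(SET_D, addr, value);
}

static inline void next_generation(void)
{
    // Barrier: wait until all PUs have reached the generation transition.
    ALT_CI_CUSTOM_NETWORK_EXT_INST(NEXTGEN, 0, 0);
}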
The usage of the custom instructions will be discussed in Section 4.
2.5 Blocking and Synchronization
The omega network is a blocking network. Using eight processors (Fig. 5), it is possible that up to four of them will be blocked during a concurrent read access to eight different memories. This leads to unsynchronized processor execution over the cellular field. As a consequence of the partially sequential calculation, two memories (old/new generation) are needed; therefore, data consistency has to be
considered. Due to the blocking read accesses, the processors might calculate different generations at a time. To avoid such behavior, the processors need to be synchronized at the generation transition, which is done by using a special synchronization command at this point. A barrier-synchronization technique [14, p. 165] is implemented by AND gating. The implementation of the omega network also requires a synchronization for each read access over the network.
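The following loop skeleton illustrates how a PU could use the NEXTGEN command to keep all processors on the same generation. It is a sketch only: compute_cell() is a hypothetical helper, the old/new buffer handling is only indicated in comments, and GENS and LOCL_BLKS are the constants from Listing 1.

// Sketch of a PU's main loop with barrier synchronization at every
// generation transition (assumptions as stated above).
extern void compute_cell(int cell, int generation);   // hypothetical helper

void run_generations(void)
{
    for (int gen = 0; gen < GENS; gen++) {
        for (int i = 0; i < LOCL_BLKS; i++) {
            // read own cell i and its (possibly external) neighbor from the
            // old-generation memory, write the result to the new-generation memory
            compute_cell(i, gen);
        }
        // barrier: no PU starts the next generation before all PUs arrive here,
        // so all reads of generation g finish before g+1 overwrites it
        ALT_CI_CUSTOM_NETWORK_EXT_INST(NEXTGEN, 0, 0);
    }
}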
3 FPGA Prototype Implementation
3.1 FPGA Implementation - Synthesis Results
The first prototyping platform is a Cyclone II FPGA with the Quartus II 8.1 synthesis software from Altera. The Cyclone II FPGA contains 68,416 logic elements (LE) and 1,152,000 RAM bits [15]. The implementation language is Verilog HDL. The Nios II processor is built with the SOPC Builder. We used the Nios II/f core, which has the highest execution performance and uses a six-stage pipeline. The DMIPS rate is 218 at the maximum clock frequency of 185 MHz. The synthesis settings are optimized for speed. Table 1 shows the number of logic elements (LE), the logic elements for the network, the memory usage, the maximum clock frequency and the estimated max. DMIPS [16] rate for all processor counts [13, Chapter 5]. With the DMIPS rate the speed-up can be calculated as the ratio of the max. DMIPS of the p-processor system to that of the single-processor system, e.g., 1020/168 ≈ 6.0 for 16 processors.

Table 1. Resources and clock frequency for the Cyclone II FPGA

processing units   total LEs   LEs for network   memory bits   register bits   max. clock (MHz)   max. DMIPS   DMIPS speed-up
 1                   2,824             0           195,296          1,585            145              168            1.0
 4                  10,165           154           289,568          4,910            105              487            2.8
 8                  19,985           866           415,264          9,351             75              696            4.1
16                  40,878         2,145           666,656         18,244             55             1020            6.0
The number of logic elements scales quite well with the number of processing units. The cost of the network in terms of logic elements was not of significant relevance for p = 16 processors. The clock frequency, however, drops significantly due to the omega network: only half of the clock frequency can be used for 16 processors compared to 4 (Table 1). Although the clock frequency decreases, the DMIPS rate increases. Comparing the one-processor against the 16-processor implementation shows that a DMIPS speed-up of only 6 is achieved, owing to the decreasing clock frequency. Synthesis of 32 processors is not possible on the Cyclone II due to limited resources. The second prototyping platform was a Stratix II FPGA with 143,520 ALUTs, 71,760 ALMs and 9,383,040 RAM bits [17]. Table 2 shows the required resources. With 32 processors, about 30% of the ALUT resources are used. Synthesis for 64 processors generally seems to be possible but could not be accomplished due to problems with the synthesis software.
Table 2. Resources and clock frequency for the Stratix II FPGA

processing units   total ALMs   total ALUTs   ALUTs for network   memory bits   register bits   max. clock (MHz)   max. DMIPS   DMIPS speed-up
 1                    1,203         1,592              0             195,296         1,520            200              232           1.0
 4                    4,147         5,354            279             289,568         4,813            133              617           2.6
 8                    8,397        10,634            818             415,264         9,200            100              928           4.0
16                   17,448        21,439          2,344             666,656        17,994             75             1392           6.0
32                   36,131        43,815          6,092           1,169,440        35,591             55             2041           8.7
Comparing the one-processor against the 32-processor implementation shows that a DMIPS speed-up of 8.7 was achieved.
4 An Application: Merging Bitonic Sequences
The principle of operation of the multiprocessor system will be demonstrated by parallel merging. In the GCA model each cell holds its own value and compares it with the value of a neighboring cell at a changing distance. The distance is a power of two. If its own value does not satisfy the desired relation (e.g., ascending order), the cell replaces its value with the value of the neighbor. If the pointer points to a higher-indexed cell the minimum is computed, otherwise the maximum. The number of generations is log2(N), where N is the number of cells. Listing 1 shows the C implementation of one processing unit. At the beginning all processors are synchronized (lines 22-23). Synchronization is done by a call of the custom instruction. This function has three parameters: the first parameter defines the action (NEXTGEN) to be processed, the second parameter defines the global address (global address = processor ID, memory address) if applicable, and the third parameter defines the data to be written, valid only for a write action. The C implementation consists of one part which handles the external operations (reading external memory, line 38) and one part which handles the internal operations (reading, line 36, and writing, lines 42-47, the internal memory). The usage of the barrier-synchronization command (see Section 2.5) is shown in line 51. If the GCA model is executed sequentially on one processor, n steps are necessary in each generation, leading to a total number of n · log2(N) steps. The execution time is the number of clock cycles divided by the clock frequency. Table 3 shows a linear speed-up in the number of clock cycles needed to execute the program for N = 2048. The bitonic merge algorithm causes no blocking accesses to the network, and therefore the cycle counts show a linear speed-up. The system was implemented for p = 1, 4, 8, 16, 32 processors and N = 2048 cells. Each processor holds n = N/p cells. In the first processing part only external cells (stored in external memories) are compared (Fig. 6); we call these generations external generations. In the second processing part only internal cells are compared; we call these generations internal generations.
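Before turning to Listing 1, the cell rule just described can be summarized by a small sequential reference sketch. This is not the paper's Listing 1, which runs on the multiprocessor via the custom instructions; the function name and buffer handling are illustrative.

#include <stdlib.h>
#include <string.h>

// Sequential reference sketch of the GCA merge rule described above.
// d[] holds a bitonic sequence of length N (a power of two); after
// log2(N) generations it is sorted in ascending order.
static void bitonic_merge_gca(int *d, int N)
{
    int *old = malloc(N * sizeof(int));
    for (int dist = N / 2; dist >= 1; dist /= 2) {        // one GCA generation
        memcpy(old, d, N * sizeof(int));                   // previous generation, read-only
        for (int i = 0; i < N; i++) {
            int p = (i & dist) ? i - dist : i + dist;      // neighbor at distance dist
            if (p > i)                                     // pointer to higher-indexed cell
                d[i] = old[i] < old[p] ? old[i] : old[p];  // keep the minimum
            else
                d[i] = old[i] > old[p] ? old[i] : old[p];  // keep the maximum
        }
    }
    free(old);
}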
#include <stdio.h>
#include "system.h"

#define LOCAL_GET_P 8
#define LOCAL_GET_D 7
#define SET_PS      6
#define GET_P       5
#define SET_P       4
#define TIME        3
#define GET_D       2
#define NEXTGEN     1
#define SET_D       0

#define SC        128*1  // SC = start cell. Defines memory address offset
#define DAT_BLKS  2048   // total amount of data blocks
#define LOCL_BLKS 128    // amount of data blocks processed by this processor
#define GENS      11     // amount of generations. 2^GENS = DAT_BLKS

int main() {
    int i, j, neighbor, ii, N, I, k;

    // Synchronization
    ALT_CI_CUSTOM_NETWORK_EXT_INST(NEXTGEN, 0, 0);
    ALT_CI_CUSTOM_NETWORK_EXT_INST(NEXTGEN, 0, 0);
    for (j = 0; j