PREFACE

This book aims to give an impression of the way current research in algorithms, architectures and compilation for parallel systems is evolving. It focuses especially on domains where embedded systems are required, oriented to either application-specific or programmable realisations. These are crucial in domains such as audio, telecom, instrumentation, speech, robotics, medical and automotive processing, image and video processing, TV, multimedia, radar and sonar. The domain of scientific, numerical computing is also covered.

The material in the book is based on the author contributions presented at the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, held in Leuven, August 29-31, 1994. This workshop was partly sponsored by EURASIP and the Belgian NFWO (National Fund for Scientific Research), and organized in cooperation with the IEEE Benelux Signal Processing Chapter, the IEEE Benelux Circuits and Systems Chapter, and INRIA, France. It was a continuation of two previous workshops of the same name, held in Pont-à-Mousson, France, June 1990 [1], and Bonas, France, June 1991 [2]. All of these workshops have been organized within the framework of the EC Basic Research Actions NANA and NANA2, Novel Parallel Algorithms for New Real-time Architectures, sponsored by the ESPRIT program of Directorate XIII of the European Commission. The NANA contractors are IMEC, Leuven, Belgium (F. Catthoor), K.U. Leuven, Leuven, Belgium (J. Vandewalle), ENSL, Lyon, France (Y. Robert), TU Delft, Delft, The Netherlands (P. Dewilde and E. Deprettere), and IRISA, Rennes, France (P. Quinton). The goal within these projects has been to contribute algorithms suited for parallel architecture realisation on the one hand, and on the other hand design methodologies and synthesis techniques which address the design trajectory from real behaviour down to the parallel architecture realisation of the system.
As such, this is clearly overlapping with the scope of the workshop and the book. An overview of the main results presented in the different chapters, combined with an attempt to structure all this information, is available in the introductory chapter. We expect this book to be of interest in academia, both for the detailed descriptions of research results and for the overview of the field given here, with many important but less widely known issues which must be addressed to arrive at practically relevant results. In addition, many authors have considered applications, and the book is intended to reflect this fact. The real-life applications that have driven the research are described in several
contributions, and the impact of their characteristics on the methodologies is assessed. We therefore believe that the book will also be of interest to senior design engineers and CAD managers in industry, who wish either to anticipate the evolution of commercially available design tools over the next few years, or to make use of the concepts in their own research and development.

It has been a pleasure for us to organize the workshop and to work together with the authors to assemble this book. We feel amply rewarded by the result of this cooperation, and we want to thank all the authors here for their effort. We have spent significant effort on careful editing, to deliver material that is as consistent as possible. The international aspect has allowed us to group the results of many research groups with different backgrounds and "research cultures," which is felt to be particularly enriching. We would be remiss not to thank Prof. L. Thiele of Universität des Saarlandes, Saarbrücken, Germany, who was an additional member of the workshop's organizing committee, and F. Vanpoucke, who was a perfect workshop managing director and also did a great job in collecting and processing the contributions to this book. We hope that the reader will find the book useful and enjoyable, and that the results presented will contribute to the continued progress of the field of parallel algorithms, architectures and compilation.
Leuven, October 1994, the editors
References

[1] E. Deprettere and A. van der Veen (eds.), "Algorithms and Parallel VLSI Architectures", Elsevier, Amsterdam, 1991.
[2] P. Quinton and Y. Robert (eds.), "Algorithms and Parallel VLSI Architectures II", Elsevier, Amsterdam, 1992.
Algorithms and Parallel VLSI Architectures III
M. Moonen and F. Catthoor (Editors)
© 1995 Elsevier Science B.V. All rights reserved.
ALGORITHMS AND PARALLEL VLSI ARCHITECTURES
F. CATTHOOR
IMEC, Kapeldreef 75
3001 Leuven, Belgium
[email protected]

M. MOONEN
ESAT, Katholieke Universiteit Leuven
3001 Leuven, Belgium
[email protected]
ABSTRACT. In this introductory chapter, we will summarize the main contributions of the chapters collected in this book. Moreover, the topics addressed in these chapters will be linked to the major research trends in the domain of parallel algorithms, architectures and compilation.
1 STRUCTURE OF THE BOOK
The contributions to the workshop and the book can be classified into three categories:

1. Parallel Algorithms: The emphasis lies on the search for more efficient and inherently parallelisable algorithms for particular computational kernels, mainly from linear algebra. The demand for fast matrix computations has arisen in a variety of fields, such as speech and image processing, telecommunication, radar and sonar, biomedical signal processing, and so on. The work is motivated by the belief that preliminary algorithmic manipulations largely determine the success of, e.g., a dedicated hardware design, because radical algorithmic manipulations and engineering techniques are not easily captured, e.g., in automatic synthesis tools. Most of the contributions here deal with real-time signal processing applications, and in many cases the research on these algorithms is already tightly linked to the potential parallel realisation options to be exploited in the architecture phase.
2. Parallel Architectures: Starting from an already parallelized algorithm or a group of algorithms (a target application domain), the key issue here is to derive a particular architecture which efficiently realizes the intended behaviour for a specific technology. In this book, the target technology will be CMOS electronic circuitry. In order to achieve this architecture realisation, the detailed implementation characteristics of the building blocks (registers/memories, arithmetic components, logic gates and connection networks) have to be incorporated. The end result is an optimized netlist/layout of either primitive custom components or of programmable building blocks. The trend of recent years is to mix both styles. Thus, more custom features are embedded in the massively parallel programmable machines, especially in the storage hierarchy and the network topologies. In addition, (much) more flexibility is built into the custom architectures, sometimes leading to highly flexible, weakly parallel processors. The path followed to arrive at such architectures is the starting point for the formalisation into reusable compilation methodologies.

3. Parallel Compilation: Most designs in industry suffer from increasing time pressure. As a result, the methods to derive efficient architectures and implementations have to become more efficient and less error-prone. For this purpose, an increasing amount of research is spent on formalized methodologies to map specific classes of algorithms (application domain) to selected architectural templates (target style). In addition, some steps in these methodologies are becoming supported by interactive or automated design techniques (architectural synthesis or compilation). In this book, the emphasis will be on modular algorithms with much inherent parallelism to be mapped on (regular) parallel array styles. Both custom (application-specific) and programmable (general-purpose) target styles will be considered.
These categories correspond to the different parts of the book. An outline of the main contributions in each part is given next, along with an attempt to capture the key features of the presented research.

2 PARALLEL ALGORITHMS
In recent years, it has become clear that for many advanced real-time signal processing, adaptive systems and control applications, the required level of computing power is well beyond that available on present-day programmable signal processors. Linear algebra and matrix computations play an increasingly prominent role here, and the demand for fast matrix computations has arisen in a variety of fields, such as speech and image processing, telecommunication, radar and sonar, biomedical signal processing, and so on. Dedicated architectures then provide a means of achieving orders of magnitude improvement in performance, consistent with the requirements. However, past experience has shown that preliminary algorithmic manipulations largely determine the success of such a design. This has led to a new research activity, aimed at tailoring algorithmic design to architectural design and vice versa, or in other words deriving numerically stable algorithms which are suitable for parallel computation. At this stage, there is already interaction with the parallel architecture designers, who
have to evaluate the mapping possibilities onto parallel processing architectures, capable of performing the computation efficiently at the required throughput rate.

In the first keynote contribution, CHAPTER 1 (Regalia), a tutorial overview is given of so-called subspace methods, which have received increasing attention in signal processing and control in recent years. Common features are extracted for two particular applications, namely multivariable system identification and source localization. Although these application areas have different physical origins, the mathematical structure of the problems they aim to solve is laced with parallels, so that, e.g., parallel and adaptive algorithms in one area find an immediate range of applications in neighbouring areas. In particular, both problems are based on finding spanning vectors for the null space of a spectral density matrix characterizing the available data, which is usually expressed numerically in terms of extremal singular vectors of a data matrix. Algorithmic aspects of such computations are treated in subsequent chapters, namely CHAPTER 6 (Götze et al.) and CHAPTER 7 (Saxena et al.), see below.

Linear least squares minimisation is no doubt one of the most widely used techniques in digital signal processing. It finds applications in channel equalisation as well as system identification and adaptive antenna array beamforming. At the same time, it is one of the most intensively studied linear algebra techniques when it comes to parallel implementation. CHAPTERS 2 through 5 all deal with various aspects of this. Of the many alternative algorithms that have been proposed over the years, one of the most attractive is the algorithm based on QR decomposition. To circumvent pipelining problems with this algorithm, several alternative algorithms have been developed, of which the covariance-type algorithm with inverse updating is now receiving a lot of attention.
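As a hedged sketch of the computation such arrays parallelise (the code below is an illustrative textbook reconstruction, not taken from Chapters 2 through 5), the QR-decomposition-based recursive least-squares update folds each new data row into a triangular factor R with a sequence of Givens rotations:

```python
import numpy as np

# Illustrative QR-based recursive least-squares update (names are ours).
# Each incoming row x with desired output d is annihilated against the
# triangular factor R by Givens rotations; lam is an optional forgetting factor.
def qr_update(R, z, x, d, lam=1.0):
    """Fold one regression row x (with desired output d) into (R, z)."""
    R = np.sqrt(lam) * R.copy()
    z = np.sqrt(lam) * z.copy()
    x = np.asarray(x, dtype=float).copy()
    d = float(d)
    for i in range(len(x)):
        r = np.hypot(R[i, i], x[i])
        if r == 0.0:
            continue
        c, s = R[i, i] / r, x[i] / r
        # rotate row i of R against the incoming row (zeroes x[i])
        R[i, i:], x[i:] = c * R[i, i:] + s * x[i:], -s * R[i, i:] + c * x[i:]
        z[i], d = c * z[i] + s * d, -s * z[i] + c * d
    return R, z

# Build the least-squares solution recursively and compare with a batch solve.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
R, z = np.zeros((3, 3)), np.zeros(3)
for row, rhs in zip(A, b):
    R, z = qr_update(R, z, row, rhs)
w = np.linalg.solve(R, z)  # back-substitution on the triangular factor
print(np.allclose(w, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```

A systolic array pipelines exactly these rotation steps over a triangular grid of cells, one cell per element of R; the inverse-updating variants discussed above rework the recursion so that the solution is available without the final back-substitution.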
In CHAPTER 2 (McWhirter et al.), a formal derivation is given of two earlier developed systolized versions of this algorithm. The derivation of these arrays is highly nontrivial due to the presence of data contraflow in the underlying signal flow graph, which would normally prohibit pipelined processing. Algorithmic engineering techniques are applied to overcome these problems. Similar algorithmic techniques are used in CHAPTER 3 (Brown et al.), which is focused on covariance-type algorithms for the more general Kalman filtering problem. Here also, algorithmic engineering techniques are used to generate two systolic architectures, put forward in earlier publications, from an initial three-dimensional hierarchical signal flow graph (or dependence graph). In CHAPTER 4 (Schier), it is shown how the inverse updates algorithm and systolic array treated in CHAPTER 2 may be equipped with a block-regularized exponential forgetting scheme. This makes it possible to overcome numerical problems if the input data is not sufficiently informative. Finally, in CHAPTER 5 (Kadlec) the information-type RLS algorithm based on QR decomposition is reconsidered. A normalized version of this algorithm is presented which has potential for efficient fixed-point implementation. The main contribution here is a global probability analysis which gives an understanding of the algorithm's numerical properties and makes it possible to formulate probability statements about the number of bits actually used in the fixed-point representation.

A second popular linear algebra tool is the singular value decomposition (and the related symmetric eigenvalue decomposition), which, e.g., finds applications in subspace techniques as outlined in CHAPTER 1. The next two chapters deal with the parallel implementation of
such orthogonal decompositions. In CHAPTER 6 (Götze et al.), it is explained how Jacobi-type methods may be sped up through the use of so-called orthonormal μ-rotations. Such CORDIC-like rotations require a minimal number of shift-add operations, and can be executed on a floating-point CORDIC architecture. Various methods for the construction of such orthonormal μ-rotations of increasing complexity are presented and analysed. An alternative approach to developing parallel algorithms for the computation of eigenvalues and eigenvectors is presented in CHAPTER 7 (Saxena et al.). It is based on isospectral flows, that is, matrix flows in which the eigenvalues of the matrix are preserved. Very few researchers in the past have used the isospectral flow approach to implement the eigenvalue problem in VLSI, even though, as explained in this chapter, it has several advantages from the VLSI point of view, such as simplicity and scalability.

CHAPTER 8 (Arioli et al.) deals with block iterative methods for solving linear systems of equations in heterogeneous computing environments. Three different strategies are proposed for parallel distributed implementation of the Block Conjugate Gradient method, differing in the amount of computation performed in parallel, the communication scheme, and the distribution of tasks among processors. The best performing scheme is then used to accelerate the convergence of the Block Cimmino method.

Finally, CHAPTER 9 (Cardarilli et al.) deals with RNS-to-binary conversion. RNS (Residue Number System) arithmetic is based on the decomposition of a number, represented by a large number of bits, into reduced-wordlength residual numbers. It is a very useful technique to reduce carry propagation delays and hence speed up signal processing implementations. Here, a conversion method is presented which is based on a novel class of coprime moduli and which is easily extended to a large number of moduli.
In this way the proposed method allows the implementation of very fast and low-complexity architectures. This paper already bridges the gap with the detailed architecture realisation treated in the second category of contributions.
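As a hedged aside, the textbook RNS decomposition and RNS-to-binary conversion via the Chinese Remainder Theorem can be sketched as follows; the moduli below are an arbitrary illustrative choice, not the novel coprime class of CHAPTER 9:

```python
# Generic RNS sketch (not Chapter 9's scheme): a number is decomposed into
# residues modulo pairwise coprime moduli; addition is carry-free per channel,
# and CRT reconstruction converts the residues back to binary.
from math import prod

moduli = (7, 11, 13, 15)           # pairwise coprime; illustrative choice only
M = prod(moduli)                    # dynamic range: 7 * 11 * 13 * 15 = 15015

def to_rns(x):
    return tuple(x % m for m in moduli)

def rns_to_binary(residues):
    # CRT: x = sum(r_i * M_i * inv(M_i) mod m_i) mod M
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return x % M

x, y = 1234, 5678
# Carry-free addition: residues add channel-wise, no carries cross channels.
s = tuple((a + b) % m for a, b, m in zip(to_rns(x), to_rns(y), moduli))
print(rns_to_binary(s) == (x + y) % M)   # True
```

The speed advantage comes from the channel-wise operations; the conversion back to binary is the expensive step that dedicated architectures such as Chapter 9's aim to accelerate.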
3 PARALLEL ARCHITECTURES FOR HIGH-SPEED NUMERICAL AND SIGNAL PROCESSING
Within this research topic, we have contributions on both customized and programmable architectures. For the application-specific array architectures, the main trend is towards more flexibility. This is visible, for instance, in the high degree of scalability and the different modes/options offered by the various architectures. We can make a further subdivision between the more "conventional" regular arrays with only local communication, and the arrays which are combined with other communication support, like tree networks, to increase the speed of non-local dependencies.

In the first class, two representative designs are reported in this book. In CHAPTER 10 (Riem et al.), a custom array architecture for long integer arithmetic computations is presented. It makes use of redundant arithmetic for high speed and is very scalable in wordlength. Moreover, several modes are available to perform various types of multiplication and division. The emphasis in this paper lies on the interaction with the algorithmic transformations which are needed to derive an optimized architecture, and also on the methodology which is used
throughout the design trajectory. Similarly, in CHAPTER 11 (Rosseel et al.), a regular array architecture for an image diffusion algorithm is derived. The resulting design is easily cascadable and scalable, and the datapath supports many different interpolation functions. The extended formal methodology used to arrive at the end result, oriented to fixed-throughput applications, forms a common thread throughout the paper.

Within the class of arrays extended with non-local communication, two representative designs are also reported, again including a high degree of scalability. The topic of CHAPTER 12 (Duboux et al.) is a parallel array augmented with a tree network for fast and efficient dictionary manipulations. The memory and network organisation for handling the key-record data are heavily tuned to obtain the final efficiency. Also in CHAPTER 13 (Archambaud et al.), a basic systolic array is extended with an arbitration tree to speed up the realisation of the application. In this case, it is oriented to genetic sequence comparison, including the presence of "holes". In order to achieve even higher speed, a set-associative memory is included too.

For the class of programmable architectures, both massively and weakly parallel machines are available. Their use depends on the application domain which is targeted. For high-throughput real-time signal processing, e.g. in image and video processing, the main trend nowadays is towards lower degrees of parallelism (4 to 16 processor elements) and more customisation to support particular, frequently occurring operations and constructs. The latter is especially apparent in the storage and communication organisation.
The reduced parallelism is motivated by the fact that the amount of available algorithmic parallelism is not necessarily that large, and that the speed of the basic processors has become high enough to reduce the parallelisation factor required to obtain the desired throughput. Within the programmable class, the main emphasis in the book lies on the evolution of these novel, weakly parallel processor architectures for video and image processing type applications.

In CHAPTER 14 (Vissers et al.), an overview is provided of the VSP2 architecture, which is mainly intended for video processing as in HDTV, video compression and the like. It supports a highly flexible connection network (crossbar) and a very distributed memory organisation with dedicated register banks and FIFOs. In CHAPTER 15 (Roenner et al.), the emphasis lies on a programmable processor mainly targeted to image processing algorithms. Here, the communication network is more restricted but the storage organisation is more diversified, efficiently supporting in hardware both regular and data-dependent, and both local and neighbourhood operations. The two processor architectures are, however, partly overlapping in target domain, and the future will have to show which of the options is best suited for a particular application.

Using such video or image signal processors, it is possible to construct flexible higher-level templates which are tuned to a particular class of applications. This has for instance been achieved in CHAPTER 16 (De Greef et al.), where motion-estimation-like algorithms are considered. A highly efficient communication and storage organisation is proposed which makes it possible to reduce these overheads considerably for the targeted applications. Real-time
execution with limited board space is obtained in this way for emulation and prototyping purposes.

In addition, higher efficiency in the parallel execution within the datapath can potentially be obtained by giving up the fully synchronous operation. This is demonstrated in CHAPTER 17 (Arvind et al.), where the interesting option of asynchronously communicating microagents is explored. It is shown that several alternative mechanisms to handle dependencies and to distribute the control of the instruction ordering are feasible. Some of these lead to a significant speedup.

Finally, there is also a trend to simplify the processor datapath and to keep the instruction set as small as possible (RISC processor style). Within the class of weakly parallel processors for image and video processing, this was already reflected in the previously mentioned architectures. In CHAPTER 18 (Hall et al.), however, this is taken even further by considering bit-serial processing elements which communicate in an SIMD array. The use of special instructions and a custom memory organisation nevertheless makes global data-dependent operations possible. This parallel programmable image processor is mainly oriented to wood inspection applications.

Within the class of massively parallel machines, the main evolution is also towards more customisation. The majority of the applications targeted to such machines appears to come mainly from the scientific and numerical computing fields. In CHAPTER 19 (Vankats), a new shared-memory multiprocessor based on hypercube connections is proposed. The dedicated memory organisation with a directory-based cache coherence scheme is the key to improved speed. An application of a fast DCT scheme mapped to such parallel machines is studied in CHAPTER 20 (Christopoulos et al.). Here, the emphasis lies on the influence of the algorithmic parameters and the load balancing on the efficiency of the parallel mapping.
Efficient massive parallelism is only achievable for large system parameters. The power of a "general-purpose" array of processors realized on customizable field-programmable gate arrays (FPGAs) is demonstrated in CHAPTER 21 (Champeau et al.). This combination makes it possible to extend the customisation further without overly limiting the functionality. An efficient realisation of parallel text matching is used as a test case to show the advantages of the approach.

Compiler support is a key issue for all of these parallel programmable machines, so all the novel architectures have been developed with this in mind. Hence, each of the contributions in CHAPTER 14 (Vissers et al.), CHAPTER 15 (Roenner et al.), CHAPTER 18 (Hall et al.), CHAPTER 21 (Champeau et al.) and CHAPTER 19 (Vankats) devotes a section to the compilation issues. Most of these compilers can, however, benefit from the novel insights and techniques which are emerging in the compilation field, as addressed in section 4.
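To make the notion of redundant arithmetic used by several of the custom arrays above (e.g. CHAPTER 10) concrete, the following sketch shows generic carry-save addition; it is a textbook illustration, not the chapter's actual scheme:

```python
# Carry-save (redundant) addition: each step is a bitwise 3:2 compression with
# no carry propagation across bit positions, which is what lets array hardware
# run at high speed. One ordinary carry-propagating addition resolves the
# redundant (sum, carry) pair only once, at the very end.
def carry_save_add(a, b, c):
    """Compress three operands into (sum, carry) with a + b + c == sum + carry."""
    s = a ^ b ^ c                                 # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # majority bits become carries
    return s, carry

nums = [123456789, 987654321, 555555555, 42]
s, carry = nums[0], 0
for n in nums[1:]:
    s, carry = carry_save_add(s, carry, n)
print(s + carry == sum(nums))   # True: one carry-propagating add at the end
```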
4 PARALLEL COMPILATION FOR APPLICATION-SPECIFIC AND GENERAL-PURPOSE ARCHITECTURES
As already mentioned, the key drive for more automated and more effective methodologies
comes from the reduced design time available to system designers. In order to obtain these characteristics, methodologies generally have to be targeted towards specific application domains and target architecture styles. This is also true for the domain of parallel architectures. Still, a number of basic steps recur in the methodologies, and an overview of the major compilation steps in such a targeted methodology is provided in CHAPTER 22 (Feautrier). In that contribution, the emphasis lies on array dataflow analysis, scheduling of the parallel operations on the time axis, allocation to processors, and processor code generation including communication synthesis. Even though this survey is mainly oriented to compilation on programmable machines, most of the concepts recur in the field of custom array synthesis (see also CHAPTER 10 (Riem et al.) and CHAPTER 11 (Rosseel et al.)). Still, the detailed realisation of the algorithmic techniques used for the design automation typically differs depending on the specific characteristics of the domain (see also below).

The other papers in the compilation category address specific tasks in the global methodology. Representative work in each of the different stages is collected in this book. The order in which these tasks are addressed here is not fully fixed, but most researchers converge on a methodology which is close to what is presented here. The first step is of course the representation of the algorithm to be mapped in a formal model, suitable for manipulation by the design automation techniques. The limitations of this model to affine, manifest index functions have been partly removed in the past few years. Important in this process is that the resulting models should still be amenable to the vast body of compilation/synthesis techniques which operate on the affine model. This also means that array dataflow analysis should remain feasible.
Interesting extensions to this "conventional" model which meet these requirements are proposed in CHAPTER 23 (Held et al.) and CHAPTER 24 (Rapanotti et al.). The restriction to linear or affine index functions can be extended to piecewise regular affine cases by a normal form decomposition process. This makes it possible to convert integer division, modulo, ceiling and floor functions to the existing models, as illustrated in CHAPTER 23 (Held et al.). Moreover, so-called linearly bounded lattices can then also be handled. The restrictions can be removed even further by considering the class of "integral" index functions, as studied in CHAPTER 24 (Rapanotti et al.). This makes it possible to handle more complicated cases as well, such as those occurring, e.g., in the knapsack algorithm. By extending especially the so-called uniformisation step in the design trajectory, it is still possible to arrive at synthesizable descriptions. There is also hope of dealing with part of the data-dependent cases in this way. Finally, it is also possible to consider the problem of modelling from another point of view, namely as a matching between primitive operations for which efficient parallel implementations are known, and the algorithm to be mapped. This approach is taken in CHAPTER 25 (Rangaswami), where a functional programming style with recursion is advocated. By providing a library of mappable functions, it is then possible to derive different options for compiling higher-level functions and to characterize each of the alternatives in terms of cost.

Once the initial algorithm has been brought into this manipulable form, it is usually necessary to apply a number of high-level algorithmic transformations to improve the efficiency of the eventual architecture realisations (see also CHAPTER 10 (Riem et al.) and CHAPTER 11 (Rosseel et al.)). Support for these is considered in CHAPTER 26 (Durrieu et al.), where provably correct small transformations allow the designer to interactively modify the original algorithm into the desired form. The uniformisation transformation addressed in CHAPTER 24 (Rapanotti et al.) also falls in principle under this stage, but for that purpose more automated techniques have become available lately.

Now that the algorithm has a suitable form for the final mapping stages, it is usually assumed that all index functions are uniform and manifest, and that the algorithm has been broken up into several pure loop nests. For each of these, the scheduling, allocation and code generation/communication synthesis steps then have to be performed. Within the target domain of massively parallel machines (either custom or programmable), the notion of affine mapping functions has been heavily exploited up to now (see also CHAPTER 22). For instance, the work in CHAPTER 27 (Bouchitté et al.) considers the mapping of evaluation trees onto a parallel machine where communication and computation can coincide. This assumption complicates the process considerably, and heuristics are needed and proposed to handle several practical cases within fine- and coarse-grain architectures. It is, however, clear from several practical designs that purely affine mappings do not always lead to optimal designs. This is clearly illustrated in CHAPTER 28 (Werth et al.) for both scheduling and communication synthesis, for the test case of the so-called Lamport loop. Therefore, several researchers have started looking at extensions to the conventional methods.
A non-unimodular mapping technique, including extended scheduling/allocation and especially communication synthesis, is proposed in CHAPTER 29 (Reffay et al.). For the Cholesky factorisation kernel, it is shown that significantly increased efficiency can be obtained, while still providing automatable methods.

Up to now, however, we have restricted ourselves to mapping onto homogeneous, locally connected parallel machines. As already demonstrated in section 3, the use of weakly parallel and not necessarily homogeneous architectures is finding a large market in high-throughput signal processing, as in video and image applications. As a result, much recent research has been devoted to improved compilation techniques for these architectures too. Most of this work originates from the vast amount of know-how which has been collected in the high-level synthesis community on mapping irregular algorithms onto heterogeneous single-processor architectures. Several representative contributions in this area are taken up in this book. In CHAPTER 30 (Schwiegershausen et al.), the scheduling problem of coarse-grain tasks onto a heterogeneous multiprocessor is considered. The assumption is that several processor styles are available and that the mapping of the tasks onto these styles has already been characterized. Given that information, it is possible to formulate an integer programming problem which makes it possible to solve several practical applications in the image and video processing domain.
When the granularity of the tasks is reduced, an ILP approach is no longer feasible, and other scheduling/allocation techniques have to be considered. This is the case, for instance, in CHAPTER 31, where a list scheduling technique is presented.
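The affine scheduling and allocation steps that recur throughout this section can be illustrated on a toy uniform loop nest. The dependences, schedule and legality test below are textbook material, not taken from any particular chapter:

```python
import itertools

# Legality test for an affine schedule on a loop nest with uniform dependences,
# the basic step behind the mapping methods surveyed above (illustrative only).
deps = [(1, 0), (0, 1)]   # e.g. x[i][j] depends on x[i-1][j] and x[i][j-1]

def legal(tau, deps):
    # an affine schedule t = tau . (i, j) is legal if every dependence
    # advances time by at least one step
    return all(sum(t * d for t, d in zip(tau, dep)) >= 1 for dep in deps)

print(legal((1, 1), deps))    # True: the classic hyperplane schedule t = i + j
print(legal((1, -1), deps))   # False: would violate the (0, 1) dependence

# With allocation p(i, j) = i, all iterations on one hyperplane t = i + j
# can run concurrently on different processors:
N = 4
front = [(i, j) for i, j in itertools.product(range(N), repeat=2) if i + j == 3]
print(front)                  # [(0, 3), (1, 2), (2, 1), (3, 0)]
```

The non-unimodular and heterogeneous extensions discussed above generalize exactly these two mapping functions (the schedule tau and the allocation) beyond the purely affine, homogeneous case.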
SUBSPACE METHODS IN SYSTEM IDENTIFICATION AND SOURCE LOCALIZATION
P.A. REGALIA
Département Signal et Image
Institut National des Télécommunications
9, rue Charles Fourier
91011 Evry cedex, France
[email protected]

ABSTRACT. Subspace methods have received increasing attention in signal processing and control in recent years, due to their successful application to the problems of multivariable system identification and source localization. This paper gives a tutorial overview of these two applications, in order to draw out features common to both problems. In particular, both problems are based on finding spanning vectors for the null space of a spectral density matrix characterizing the available data. This is expressed numerically in various formulations, usually in terms of extremal singular vectors of a data matrix, or in terms of orthogonal filters which achieve decorrelation properties of filtered data sequences. In view of this algebraic similarity, algorithms designed for one problem may be adapted to the other. In both cases, though, successful application of subspace methods depends on some knowledge of the required filter order of spanning vectors for the desired null space. Data encountered in real applications rarely give rise to finite-order filters if theoretically "exact" subspace fits are desired. Accordingly, some observations on the performance of subspace methods in "reduced order" cases are developed.

KEY WORDS. Subspace estimation, autonomous model, system identification, source localization.

1 INTRODUCTION
Subspace methods have become an attractive numerical approach to practical problems of modern signal processing and control. The framework of subspace methods has evolved simultaneously in source localization [1], [2], [6], [8], and in system identification [4], [10]. Although these application areas have different physical origins, the mathematical structure
of the problems they aim to solve is laced with parallels. The intent of this paper is to provide a brief overview of the structural similarities between system identification and source localization. To the extent that a common objective may be established for seemingly different application areas, numerical algorithms in one area find an immediate range of applications in neighboring areas. Our presentation is not oriented at the numerical algorithm level, but rather abstracted one level to the common algebraic framework which underlies subspace methods. Section 2 reviews the underlying signal structure suited for subspace methods, in terms of an autonomous model plus white noise. Section 3 interprets the underlying signal structure in the context of multivariable system identification. Section 4 then shows how this same signal structure intervenes in the broadband source localization problem, and stresses similarities in objectives with the system identification problem. Section 5 then examines the approximation obtained in a particular system identification problem when the order chosen for the identifier is too small, as generically occurs in practice where real data may not admit a finite dimensional model. We shall see that subspace methods decompose the available data into an autonomous part plus white noise, even though this may not be the "true" signal structure.

2 BACKGROUND
Most subspace methods are designed for observed vector-valued signals (denoted by {y(.)}) consisting of a usable signal {s(.)} and an additive disturbance term {b(.)}, as in

    y(n) = s(n) + b(n).

We assume that these (column) vectors consist of p elements each. In most subspace applications, one assumes that the disturbance term is statistically independent of the usable signal, and that it is white:

    E[b(n) b*(m)] = sigma^2 I_p  if m = n,  and  0  if m != n.

(Here and in what follows, the superscript * will denote (conjugate) transposition.) The usable signal is often assumed to satisfy an autonomous model of the form

    B_0 s(n) + B_1 s(n-1) + ... + B_M s(n-M) = 0,   for all n,   (1)
for some integer M. Here the matrices B_k are "row matrices," i.e., a few row vectors stacked atop one another. Examples of this relation will be brought out in Sections 3 and 4. If we consider the covariance matrix of the usable signal, viz.

    E { [ s(n) ; s(n-1) ; ... ; s(n-M) ] [.]* }  =  [ R_0    R_1    ...   R_M ]
                                                    [ R_1*   R_0    ...   ... ]
                                                    [ ...    ...    ...   R_1 ]
                                                    [ R_M*   ...    R_1*  R_0 ]   =: ℛ_M,

where R_k = E[s(n) s*(n-k)] = R*_{-k},
then the assumption that {s(.)} satisfies an autonomous model implies that

    ℛ_M [ B_0* ; B_1* ; ... ; B_M* ] = 0.   (2)

This suggests that the matrix coefficients B_k of the autonomous model could be found by identifying the null space of the matrix ℛ_M.
linear system concatenated into a single time-series vector:

    s(n) = [ s_1(n) ]   } p - r inputs
           [ s_2(n) ]   } r outputs

Suppose the inputs and outputs are related by a linear system with unknown transfer function H(z):

    s_2(n) = H(z) s_1(n).   (3)
(This notation means that s_2(n) is the output sample at time n from a linear system with transfer matrix H(z) when driven by the sequence {s_1(.)}, with s_1(n) the most recent input sample.) Suppose that H(z) is a rational function. This means that H(z) can be written in terms of a matrix fraction description [3]

    H(z) = [D(z)]^{-1} N(z)   (4)

for two (left coprime) matrix polynomials

    N(z) = N_0 + N_1 z^{-1} + ... + N_M z^{-M},   [r x (p-r)]
    D(z) = D_0 + D_1 z^{-1} + ... + D_M z^{-M}.   (r x r)

The relations (3) and (4) combine as

    D(z) s_2(n) = N(z) s_1(n),

which is to say

    D_0 s_2(n) + D_1 s_2(n-1) + ... + D_M s_2(n-M) = N_0 s_1(n) + N_1 s_1(n-1) + ... + N_M s_1(n-M),   for all n.

This in turn may be rearranged as

    [ N_0 | -D_0 ] s(n) + [ N_1 | -D_1 ] s(n-1) + ... + [ N_M | -D_M ] s(n-M) = 0,   for all n,

with B_k = [ N_k  -D_k ],
which leads to a simple physical interpretation: the autonomous relation (1) holds if and only if the signal {s(.)} contains the inputs and outputs from a finite-dimensional linear system. We also see that the coefficients of a matrix fraction description may be concatenated into null vectors of the matrix ℛ_M. One subtle point does arise in this formulation: the dimension of the null space of ℛ_M may exceed r (the number of outputs) [2], in such a way that uniqueness of a determined matrix fraction description is not immediately clear. Some greater insight may be obtained by writing the subspace equations in the frequency domain. To this end, consider the power spectral density matrix
    S_s(e^{jω}) = Σ_{k=-∞}^{∞} E[s(n) s*(n-k)] e^{-jkω},   (p x p)

which is nonnegative definite for all ω. At the same time, let

    B(z) = B_0 + B_1 z^{-1} + ... + B_M z^{-M},   [(p-r) x p]
where the matrix coefficients B_k are associated with the autonomous signal model. By taking Fourier transforms of (2), one may verify

    B(e^{jω}) S_s(e^{jω}) = 0,   for all ω.   (5)

The row vectors of B(e^{jω}) then span the null space of S_s(e^{jω}) as a function of frequency. In case B(z) consists of a single row vector, it is straightforward to verify that the smallest order M for the vector polynomial B(z) which may span this null space [as in (5)] is precisely the smallest integer M for which the block Toeplitz matrix ℛ_M becomes singular [as in (2)]. For the system identification context studied thus far, one may verify that the spectral density matrix S_s(e^{jω}) may be decomposed as

    S_s(e^{jω}) = [ I_{p-r} ; H(e^{jω}) ] S_{s_1}(e^{jω}) [ I_{p-r}  H*(e^{jω}) ],

where

    S_{s_1}(e^{jω}) = Σ_{k=-∞}^{∞} E[s_1(n) s_1*(n-k)] e^{-jkω}

is the power spectral density matrix of the input sequence {s_1(.)}. This shows that the rank of S_s(e^{jω}) is generically equal to the number of free inputs to the system (= p - r), assuming further dependencies do not connect the components of {s_1(.)} (persistent excitation). As the outputs {s_2(.)} are filtered versions of {s_1(.)}, their inclusion does not alter the rank of S_s(e^{jω}).
Next, we can observe that

    [ N(e^{jω})  -D(e^{jω}) ] S_s(e^{jω}) = 0,   for all ω.

With a little further work, one may show that provided the r row vectors of the matrix [N(e^{jω}) -D(e^{jω})] are linearly independent for (almost) all ω (which amounts to saying that the normal rank of [N(z) -D(z)] is full), then the ratio [D(z)]^{-1} N(z) must furnish the system H(z). Note that if N(z) and D(z) are both multiplied from the left by an invertible matrix (which may be a function of z), the ratio [D(z)]^{-1} N(z) is left unaltered. As a particular case, consider the Gramian matrix

    [ N(e^{jω})  -D(e^{jω}) ] [ N*(e^{jω}) ; -D*(e^{jω}) ] = F(e^{jω}) F*(e^{jω}),

with the r x r matrix F(z) minimum phase (i.e., causal and causally invertible). It is then easy to verify that the row vectors of the matrix

    [F(e^{jω})]^{-1} [ N(e^{jω})  -D(e^{jω}) ]
are orthonormal for all ω, and thus yield orthonormal spanning vectors for the null space of S_s(e^{jω}). The system identification problem is then algebraically equivalent to finding orthonormal spanning vectors for the null space of S_s(e^{jω}).
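A small numerical illustration of this identification-by-null-space idea can be added here as a sketch. The system H(z) = 0.5/(1 - 0.8 z^{-1}) and all names below are invented for the example (scalar case, p = 2, r = 1, M = 1): the null vector of the block covariance matrix carries the matrix fraction coefficients in the pattern [N_0, -D_0, N_1, -D_1].

```python
import numpy as np

# Illustrative scalar case (p = 2, r = 1, M = 1), hypothetical system:
# H(z) = N(z)/D(z) = 0.5 / (1 - 0.8 z^-1), i.e. s2(n) = 0.8 s2(n-1) + 0.5 s1(n).
rng = np.random.default_rng(0)
s1 = rng.standard_normal(5000)          # persistently exciting input
s2 = np.zeros_like(s1)
s2[0] = 0.5 * s1[0]
for n in range(1, len(s1)):
    s2[n] = 0.8 * s2[n - 1] + 0.5 * s1[n]

# Stack [s1(n), s2(n), s1(n-1), s2(n-1)] and form the 4x4 covariance matrix.
X = np.stack([s1[1:], s2[1:], s1[:-1], s2[:-1]])
R1 = X @ X.T / X.shape[1]

# Null vector has the pattern [N0, -D0, N1, -D1]; normalise so that D0 = 1.
v = np.linalg.eigh(R1)[1][:, 0]
v = v / (-v[1])
N0, D1 = v[0], -v[3]
print(N0, D1)       # ~0.5 and ~-0.8: H(z) is recovered from the null space
```

Because the difference equation holds exactly, the sample covariance is singular to machine precision and the recovered coefficients match the simulated system.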
4 BROADBAND SOURCE LOCALIZATION
Source localization algorithms aim to determine the direction of arrival of a set of waves impinging on a sensor array. We review the basic geometric structure of this problem, in order to obtain the same characterization exposed in the system identification context. The outputs of a p-sensor array are now modelled as

    y(n) = A(z) s(n) + b(n),

where the elements of s(n) are mutually independent source signals, and where {b(.)} is an additive white noise vector. The columns of A(z) contain the successive transfer functions connecting a source at a given spatial location to the successive array outputs. Each column of A(z) is thus called a steering vector, which models spatial and frequential filtering effects proper to the transmission medium and array geometry. The problem is to deduce the spatial locations of the emitting sources, given the array snapshot sequence {y(n)}. The spectral density matrix from the sensor outputs now becomes

    S_y(e^{jω}) = Σ_{k=-∞}^{∞} E[y(n) y*(n-k)] e^{-jkω} = A(e^{jω}) S_s(e^{jω}) A*(e^{jω}) + sigma^2 I_p,

provided the noise term {b(.)} is indeed white. Here S_s(e^{jω}) is the power spectral density matrix of the emitting sources. Provided the number of sources is strictly less than the number of sensors, the first term on the right-hand side (i.e., the signal-induced component) is rank deficient for all ω. It turns out that its null space completely characterizes the solution of the problem [6], [8]. For if we find orthonormal spanning vectors for the null space of the signal-induced term, then we will have constructed the orthogonal complement space to that spanned by the columns of A(e^{jω}). This, combined with knowledge of the array response pattern versus emitter localization, is sufficient to recover the information contained in A(z), namely the spatial locations of the sources [8]. More detail on constructing orthonormal spanning vectors for this null space, in the context of adaptive filtering, is developed in [2], [6] and the references therein. We can observe some superficial similarities between the system identification problem and the source localization problem. In both cases, the usable signal component induces a rank-deficient power spectral density matrix, and in both cases, the information sought (a linear system or spatial location parameters) is entirely characterized by the null space of the singular spectral density matrix in question. Accordingly, algorithms designed for subspace system identification can be used for subspace source localization, and vice-versa. See, e.g., [5], [9].
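The rank-deficiency argument is easy to see in a narrowband analogue. The sketch below is an added illustration (the half-wavelength array geometry and the arrival angles are hypothetical): with fewer sources than sensors, the noise subspace of the sensor covariance is orthogonal to the steering vectors, which is what pins down the source locations.

```python
import numpy as np

# Narrowband analogue: q = 2 sources, p = 5 sensors. The signal-induced term
# A A* of the sensor covariance is rank deficient, and the noise subspace is
# orthogonal to the steering vectors.
p, sigma2 = 5, 0.01
angles = np.deg2rad([-20.0, 35.0])      # hypothetical arrival angles

# Steering vectors of a half-wavelength uniform linear array.
A = np.exp(-1j * np.pi * np.outer(np.arange(p), np.sin(angles)))

# Sensor covariance: unit-power uncorrelated sources plus white noise.
R = A @ A.conj().T + sigma2 * np.eye(p)

eigvals, eigvecs = np.linalg.eigh(R)
En = eigvecs[:, :p - 2]                 # noise subspace (p - q smallest eigenvalues)

print(eigvals[:p - 2])                  # all ~sigma2: the signal term has rank 2
print(np.abs(En.conj().T @ A).max())    # ~0: steering vectors are orthogonal to it
```

Scanning a steering vector over candidate angles and locating where its projection onto the noise subspace vanishes recovers the arrival directions, which is the essence of the subspace approach described above.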
5 THE UNDERMODELLED CASE
The development thus far has, for convenience, assumed that the order M of the autonomous signal model (1) was available. In practice, the required filter order M is highly signal-dependent, posing the obvious dilemma of how to properly choose M in cases where a priori information on the data is inadequate.
to show the result. Note also that, as expected, a vector in the null space of ℛ_M yields the coefficients of the ARMA model in question. Suppose that the actual sequences {s_1(.)} and {s_2(.)} are related as

    s_2(n) = Σ_{k=0}^{∞} h_k s_1(n-k).   (6)

To avoid an argument that says we can increase the chosen order M until we hit the correct value, we assume that the transfer function

    H(z) = Σ_{k=0}^{∞} h_k z^{-k}

is infinite dimensional (i.e., not rational). In this case, the covariance matrix ℛ_M will have full rank irrespective of what value we choose for the integer M. Consider then trying to find an ARMA signal model which is "best compatible" with ℛ_M. To this end, let two sequences {ŝ_1(.)} and {ŝ_2(.)} be related as

    Σ_{k=0}^{M} a_k ŝ_2(n-k) = Σ_{k=0}^{M} b_k ŝ_1(n-k),

where the coefficients {a_k} and {b_k} remain to be determined. We note here that, with ŝ(n) = [ŝ_1(n) ; ŝ_2(n)] and
    ℛ̂_M = E { [ ŝ(n) ; ŝ(n-1) ; ... ; ŝ(n-M) ] [.]* },

we shall have

    ℛ̂_M [ b_0 ; -a_0 ; b_1 ; -a_1 ; ... ; b_M ; -a_M ] = 0,   (7)

so that ℛ̂_M is always singular. Set now

    ŷ(n) = ŝ(n) + b̂(n),   b̂(n) = [ b̂_1(n) ; b̂_2(n) ],
where the disturbance terms {b̂_1(.)} and {b̂_2(.)} are chosen to render {ŷ(.)} compatible with the true data. In particular, the covariance matrix built from ŷ(n), ..., ŷ(n-M) takes the form

    ℛ̂_M + ℛ_b,

where ℛ_b is the covariance matrix built from {b̂_1(.)} and {b̂_2(.)}. This becomes compatible
with the true data provided we set

    ℛ_b = ℛ_M - ℛ̂_M.

Given only that ℛ̂_M is singular, though, a standard result in matrix approximation theory gives

    ||ℛ_b|| = ||ℛ_M - ℛ̂_M|| >= λ_min(ℛ_M).

As a particular case, consider the choice

    ℛ̂_M = ℛ_M - λ_min I.

This retains a block Toeplitz structure as required, but is now positive semidefinite. We
we have similarly a factor 1/(1 - λ_min), so that the matching of cross correlation terms from (8) may be expressed as

    ĥ_k = h_k / (1 - λ_min),   k = 0, 1, ..., M;
    ĥ_{-k} = 0 (= h_{-k}),     k = 1, 2, ..., M.

This shows that the first few terms of the impulse response of Ĥ(z) agree, to within a factor 1/(1 - λ_min), with those produced by the true system H(z). Similarly, we can also observe that

    E[s_2(n) s_2*(n-k)] = Σ_{i=0}^{∞} h_i h_{i+k} =: r_k,

if {s_1(.)} is unit-variance white noise. This gives the k-th term of the autocorrelation sequence associated to H(z). For the reconstructed model, we likewise have

    E[ŝ_2(n) ŝ_2*(n-k)] = (1 - λ_min) Σ_{i=0}^{∞} ĥ_i ĥ_{i+k} =: (1 - λ_min) r̂_k.

The matching properties (10) then show that

    r̂_k = (r_0 - λ_min) / (1 - λ_min),   k = 0;
    r̂_k = r_k / (1 - λ_min),             k = 1, 2, ..., M;

which reveals how the correlation sequences compare. A slightly different strategy is investigated in [7], which builds the function Ĥ(z) from an extremal eigenvector of a Schur complement of ℛ_M. This can improve the impulse and correlation matching properties considerably [7].
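The λ_min construction described above is easy to reproduce numerically. In this added sketch the data come from a deliberately non-rational system (h_k = 1/k!, truncated for simulation), so the covariance matrix is nonsingular for every M; subtracting λ_min I then yields a singular, positive semidefinite surrogate as described:

```python
import math
import numpy as np

# Data from a (truncated) non-rational system: R_M has full rank for any M.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(20000)
h = np.array([1.0 / math.factorial(k) for k in range(30)])   # h_k = 1/k!
s2 = np.convolve(s1, h)[:len(s1)]

# Block covariance of [s1(n), s2(n), s1(n-1), s2(n-1)]  (M = 1).
M = 1
X = np.stack([s1[M:], s2[M:], s1[:-M], s2[:-M]])
R_M = X @ X.T / X.shape[1]

lam_min = np.linalg.eigvalsh(R_M)[0]
print(lam_min)                          # strictly positive: no exact ARMA fit

# The shifted matrix keeps the (block Toeplitz) structure, is positive
# semidefinite, and is singular: its null vector supplies the "best
# compatible" reduced order model coefficients.
R_hat = R_M - lam_min * np.eye(4)
shifted = np.linalg.eigvalsh(R_hat)
print(shifted[0])                       # ~0: R_hat is singular
print(shifted.min() >= -1e-8)           # positive semidefinite
```

The extremal eigenvector of R_hat then plays the role of the null vector in equation (7), exactly as in the undermodelled analysis above.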
6 CONCLUDING REMARKS
We have shown how the system identification and source localization problems may be addressed in a common framework. In both cases, the desired information is characterized in terms of spanning vectors of the null space of a power spectral density matrix. Numerical methods for determining the null space have appeared in different state space formulations [2], [4], [6], [10], which are, for the most part, oriented around orthogonal transformations applied directly to the available data. We have also examined the influence of undermodelling. Some recent work in this direction [7] shows that subspace methods correspond to total least-squares equation error methods. This can yield weaker subspace fits in undermodelled cases compared to Hankel norm or H-infinity subspace fits. The reduced order system so constructed, however, is intimately connected to low rank matrix approximation, which in turn can be expressed in terms of
interpolation properties relating the impulse and correlation sequences between the true system and its reduced order approximant. More detail on these interpolation properties is available in [7].

References
[1] J. A. Cadzow, "Multiple source location: the signal subspace approach," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, pp. 1110-1125, July 1990.
[2] I. Fijalkow, Estimation de Sous-Espaces Rationnels, doctoral thesis, École Nationale Supérieure des Télécommunications, Paris, 1993.
[3] T. Kailath, Linear Systems, Prentice-Hall, Englewood Cliffs, NJ, 1980.
[4] M. Moonen, B. De Moor, L. Vandenberghe, and J. Vandewalle, "On- and off-line identification of linear state-space models," Int. J. Control, vol. 49, pp. 219-232, 1989.
[5] P. A. Regalia, "Adaptive IIR filtering using rational subspace methods," Proc. ICASSP, San Francisco, March 1992.
[6] P. A. Regalia and Ph. Loubaton, "Rational subspace estimation using adaptive lossless filters," IEEE Trans. Signal Processing, vol. 40, pp. 2392-2405, October 1992.
[7] P. A. Regalia, "An unbiased equation error identifier and reduced order approximations," IEEE Trans. Signal Processing, vol. 42, pp. 1397-1412, June 1994.
[8] G. Su and M. Morf, "The signal subspace approach for multiple wideband emitter location," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 31, pp. 1502-1522, December 1983.
[9] F. Vanpoucke and M. Moonen, "A state space method for direction finding of wideband emitters," Proc. EUSIPCO-94, Edinburgh, Sept. 1994, pp. 780-783.
[10] M. Verhaegen and P. Dewilde, "Subspace model identification (Pts. 1 and 2)," Int. J. Control, vol. 56, pp. 1187-1241, 1992.
PIPELINING THE INVERSE UPDATES RLS ARRAY BY ALGORITHMIC ENGINEERING
J.G. McWhirter and I.K. Proudler
still uses orthogonal transformations but also produces the optimum coefficients every sample time. Verhaegen [13] has shown that, provided the input is persistently exciting (i.e. is sufficiently wideband), this algorithm has bounded errors and should therefore be numerically stable. It is worth noting that the two algorithms discussed above can be classified in terms of the nomenclature of Kalman filtering. It is well known that RLS optimisation is equivalent to a Kalman filter in which the state transition matrix is unit diagonal. In these terms, the original QRD-based algorithm [1][4] constitutes a square-root information algorithm whereas the inverse updates method [9] constitutes a square-root covariance algorithm. Viewed in this way the inverse updates algorithm is not new; indeed, Verhaegen's analysis of this algorithm [13] predates its publication in the signal processing literature. In this paper we address the problem of pipelining the inverse updates algorithm. This is highly non-trivial since the basic algorithm requires a matrix-vector product to be completed before the same matrix can be updated. This limits the extent to which the algorithm can be pipelined and hence the effectiveness of any systolic implementation. In terms of the signal flow graph (SFG) representation used here, the algorithm exhibits a long feedback loop which defeats the usual methods for deriving a systolic array. We begin, in section 2, by reviewing the inverse updates method. In section 3, the basic algorithm is transformed into a form which has no long feedback loops using the emerging technique of algorithmic engineering (McWhirter [5], Proudler and McWhirter [10]). The derivation of a systolic array is then reduced to straightforward application of the cut theorem and retiming techniques (Megson [7]).
Two alternative systolic arrays are derived; the first is identical to the one originally presented (without proof) by Moonen and McWhirter [8]; the other was first presented by McWhirter and Proudler in [6].

2 INVERSE UPDATES METHOD

Consider the least squares estimation of the scalar y(n) by a linear combination of the p components of the vector x_p(n). The (p-dimensional) vector of optimum coefficients at time n, w_p(n), is determined by

    min_{w_p(n)} || y(n) + X_p(n) w_p(n) ||^2   (1)

where

    y(n) = [y(1), ..., y(n)]^T   and   X_p(n) = [x_p(1), ..., x_p(n)]^T.   (2)

The solution to this problem using QR decomposition is well known [2]. The optimum coefficient vector w_p(n) is given by

    w_p(n) = -R_p^{-1}(n) u_p(n)   (3)

where R_p(n) is a p x p upper triangular matrix and u_p(n) is a p-dimensional vector. These quantities may be calculated recursively via the equation
    Q̂_p(n) [ βR_p(n-1)   βu_p(n-1) ]  =  [ R_p(n)   u_p(n) ]
            [ x_p^T(n)    y(n)     ]     [ 0^T      α(n)   ]        (4)

where Q̂_p(n) is an orthogonal matrix and β is the exponential forgetting factor.
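Equation (4) amounts to annihilating the appended data row with a sequence of Givens rotations. The following is a minimal sketch of that update (an added illustration of the recursion, not the systolic implementation; all names are invented here), verified against the normal equations it encodes:

```python
import numpy as np

def qr_update(R, u, x, y, beta=0.99):
    """One step of equation (4): rotate the new data row [x^T, y] into
    [beta*R, beta*u] with Givens rotations, restoring triangular form."""
    p = len(x)
    M = np.zeros((p + 1, p + 1))
    M[:p, :p], M[:p, p] = beta * R, beta * u
    M[p, :p], M[p, p] = x, y
    for j in range(p):                  # zero the bottom row, left to right
        a, b = M[j, j], M[p, j]
        r = np.hypot(a, b)
        c, s = a / r, b / r
        M[j], M[p] = c * M[j] + s * M[p], -s * M[j] + c * M[p]
    return M[:p, :p], M[:p, p], M[p, p]   # R(n), u(n), alpha(n)

# Check the update against the normal equations it encodes.
rng = np.random.default_rng(0)
R = np.triu(rng.standard_normal((4, 4))) + 4 * np.eye(4)
u, x, y = rng.standard_normal(4), rng.standard_normal(4), 0.3
Rn, un, alpha = qr_update(R, u, x, y)
print(np.allclose(Rn.T @ Rn, 0.99**2 * R.T @ R + np.outer(x, x)))   # True
print(np.allclose(Rn.T @ un, 0.99**2 * R.T @ u + x * y))            # True
```

Because the rotations are orthogonal, the product of the post-array with its transpose equals that of the pre-array, which is exactly the exponentially weighted normal equation recursion.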
related to the Kalman gain vector). Secondly, the orthogonal matrix Q̂_y(n) can be generated from knowledge of the matrix R_y^{-T}(n-1) and the new data vector. Specifically, Q̂_y(n) is the orthogonal matrix which annihilates the vector ξ_y(n),   (8)

where

    ξ_y(n) = R_y^{-T}(n-1) [ x_p(n) ; y(n) ].   (9)

This can easily be proved as follows. Let

    Q̂_y(n) [ βR_y(n-1)       ]  =  [ U   ]
            [ x_p^T(n)  y(n) ]     [ v^T ]        (10)

From equation (9) it follows that

    v^T = 0^T,        (11)

and hence v = 0. If Q̂_y(n) is constructed as a sequence of Givens rotations which preserves the structure of the upper triangular matrix in equation (10), it follows that U = R_y(n). Hence Q̂_y(n) is equivalent to the orthogonal matrix defined in equation (5). The inverse updates algorithm can thus be summarised as follows. Given the new data x_p(n), y(n):

1. calculate ξ_y(n) (equation (9));
2. using ξ_y(n), calculate Q̂_y(n) (equation (8));
3. using Q̂_y(n), update R_y^{-T}(n-1) (equation (7));
4. extract the least squares coefficients from R_y^{-T}(n) (equation (6)).

We will now show how a systolic array to implement this algorithm may be designed fairly simply by means of algorithmic engineering.
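Equations (5)-(7) fall on pages not reproduced in this scan, so the following is only a sketch of the standard inverse updating recursion that the summary above describes: R^{-T}(n-1) is propagated directly, the rotations are generated by annihilating a ξ-type vector against a leading 1, and the gain emerges from the same rotated array. The function name is invented, and sign conventions follow the usual estimation form d(n) ≈ x^T(n) w rather than the paper's residual form:

```python
import numpy as np

def inverse_updates_rls(X, d, beta=1.0, eps=1e-6):
    """Sketch of inverse updating: propagate A = R^{-T}(n) by Givens rotations
    that annihilate u = beta^{-1} A x(n) against a leading 1; the Kalman gain
    emerges in the same rotated array, so no back-substitution is needed."""
    N, p = X.shape
    A = np.eye(p) / eps                 # R^{-T}(0), with R(0) = eps * I
    w = np.zeros(p)
    for n in range(N):
        x = X[n]
        M = np.zeros((p + 1, p + 1))
        M[0, 0] = 1.0
        M[1:, 0] = A @ x / beta         # the xi-type vector of equation (9)
        M[1:, 1:] = A / beta
        for i in range(1, p + 1):       # annihilate column 0 below the (0,0) entry
            a, b = M[0, 0], M[i, 0]
            r = np.hypot(a, b)
            c, s = a / r, b / r
            M[0], M[i] = c * M[0] + s * M[i], -s * M[0] + c * M[i]
        k = M[0, 1:] / M[0, 0]          # Kalman gain vector
        A = M[1:, 1:]                   # R^{-T}(n)
        w = w + k * (d[n] - x @ w)      # least squares coefficient update
    return w

# With beta = 1 the recursion reproduces the ordinary least squares solution.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 3))
w_true = np.array([0.5, -1.0, 2.0])
w = inverse_updates_rls(X, X @ w_true)
print(w)                                # ~[0.5, -1.0, 2.0]
```

The matrix-vector product A @ x inside the loop is exactly the dependency discussed in the introduction: it must complete before A itself can be rotated, which is what the algorithmic transformations of the next section are designed to break.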
3 ALGORITHMIC TRANSFORMATIONS

Algorithmic engineering [5][10] is an emerging technique for representing and manipulating algorithms based on the SFG representation. The power of this technique, for algorithm development, is twofold: firstly, the cells of the SFG are given precise meanings as mathematical operators, thus endowing the SFG, and any SFG derived from it, with a rigorous mathematical interpretation; secondly, unnecessary complexity in the SFG can be removed by the formation of 'block' operators. This latter concept leads to what may be termed a hierarchical SFG (HSFG) and allows the SFG to be simplified so as to reveal any pertinent structure in the algorithm. Once a suitable SFG has been derived, creating a systolic array implementation is then straightforward by means of standard techniques such as the cut theorem [7].

Figure 1. SFG for the inverse updates algorithm
A SFG for the inverse updates algorithm is shown in figure 1 for the case p=3. This SFG is obtained by combining SFGs for the three basic operations involved in the inverse updates algorithm: a matrix-vector product operator [5]; a rotation operator to update R_y^{-T} [12]; and the operator for the rotation calculation defined in equation (8). The first two operators are triangular in shape and can be conformally overlaid, combining the original SFGs into one. The mathematical definitions of the elementary operators (or cells) shown in figure 1 assume that the matrix Q̂_y(n) is to be constructed using Givens rotations. Note that the matrix R_y^{-T} is stored in the cells of the triangular block. The elements of this matrix, which has the decomposition shown in equation (6), are explicitly shown and for notational convenience are denoted by r_ij. Furthermore, we define the energy normalised weight vector ω̃_p by

    ω̃_p(n) = e_p^{-1}(n) ω_p(n).   (12)
The sequence of events depicted in figure 1 is as follows: at time n, the new data [x_p^T(n), y(n)] is input at the top of the triangular part of the SFG. It flows through the array, interacting with the stored matrix to form the vector ξ_y(n). This vector is accumulated from right to left and emerges at the left hand side of the triangular array. Here, the rotation matrix Q̂_y(n) is calculated and fed back into the triangular array where it serves to update the stored matrix R_y^{-T}(n-1). It is clear that the SFG can be pipelined in the vertical direction by making horizontal cuts (e.g. cut AB in figure 1). However, it cannot be pipelined in the horizontal direction due to the contraflowing data paths. Any vertical cut (e.g. cut CD) through the SFG will cut these lines in the opposite sense and so a delay applied to one path would necessitate an unrealisable 'anti-delay' on the other. The algorithm must be transformed so as to avoid this problem, e.g. by creating a delay on one of the contraflowing lines that can be paired with the anti-delay introduced by the action of cutting the SFG.
Figure 2. HSFG for the inverse updates algorithm

The structure shown in the SFG of figure 1 is too detailed for our purposes. In what follows we will only need to consider the structure of the first column of the triangular array. Figure 2 constitutes a HSFG based on figure 1 and shows this first column explicitly (labelled R_{y,1}^{-T}). The left hand block represents the operator that calculates the rotation parameters, whilst the right hand block represents a p x p block of multiply/rotate cells which stores the matrix R_{y,2}^{-T}. Note that the R_{y,2}^{-T} triangular block consists of p rows of multiply/rotate cells whereas both the rotation calculator and the first column contain (p+1) cells. As such, each of the latter two operators has been split conformally into a column of dimension p and a single cell (which corresponds to the top row of the SFG in figure 1). Again for the sake of clarity, the only outputs shown in figure 2 are the normalised weight vector (ω̃_p^T = [ω̃_{p,1}, ω̃_{p,2}^T]) and the normalisation factor (e_p^{-1}). This HSFG, although visually different to the SFG in figure 1, does not represent a change in the algorithm and accordingly, the data contraflow is still evident.

Consider the function of the R_{y,2}^{-T} triangular operator. From figure 1 it is easy to see that this operator performs two tasks:

1. matrix-vector product:

       ξ_2(n) = R_{y,2}^{-T}(n-1) [ x_2(n) ; y(n) ]   (14)

2. matrix update:

       Q̂_{y,2}(n) [ β^{-1} R_{y,2}^{-T}(n-1) ]  =  [ R_{y,2}^{-T}(n) ]
                   [ 0^T                     ]     [ κ_{y,2}^T(n)    ]   (15)
where the subscript '2' signifies that the quantity corresponds to the reduced order problem (i.e. without the first column). The problem with pipelining the algorithm is that the matrix R_{y,2}^{-T}(n-1) cannot be updated in time (equation (15)) until Q̂_{y,2}(n) is known, but the latter matrix depends on the vector ξ_2(n). In order to pipeline the algorithm this dependency can be broken as follows. Using equations (14) and (15) and defining

    ξ̂_2(n) = R_{y,2}^{-T}(n-2) [ x_2(n) ; y(n) ],   (16)

it can be shown that

    [ ξ_2(n) ; η_{y,2}(n) ]  =  Q̂_{y,2}(n-1) [ β^{-1} R_{y,2}^{-T}(n-2) ; 0^T ] [ x_2(n) ; y(n) ]
                              =  Q̂_{y,2}(n-1) [ β^{-1} ξ̂_2(n) ; 0 ],   (17)
where the term η_{y,2}(n) is defined by this operation. Equation (17) indicates that it is possible to calculate the matrix-vector product ξ̂_2(n) with an out-of-date matrix and still obtain the correct product ξ_2(n) by means of an extra rotation step. Figure 3 shows the SFG for this rotation operator.

Figure 3. SFG for rotation operator

Figure 4. HSFG after 1st algorithmic transformation. Rotation operator is defined in figure 3.
The small circular symbols will be explained later and should be ignored for the moment. The utility of the above observation is that the out-of-date matrix R_{y,2}^{-T}(n-2) does not require knowledge of ξ_2(n) in order to be updated; in fact it is the matrix Q̂_{y,2}(n-1) that is required. However, because ξ̂_2(n) can still be calculated using R_{y,2}^{-T}(n-2), the HSFG of figure 2 can be transformed into that shown in figure 4, which has a delay in the rotation parameter data line. As the right hand block now stores R_{y,2}^{-T}(n-2), its output is ω̃_{p,2}(n-1). The output ω̃_{p,1}(n) from the left hand column has been aligned in time by delaying it accordingly. In order to create a fully systolic implementation of the inverse updates algorithm it is necessary to create a delay on both of the horizontal data paths. One approach to introducing the missing delay is to invoke the "k-slowing lemma" [7] with k=2. This amounts to replacing each delay in figure 4 with two delays and reducing the input data rate by a factor of two (i.e. inputting zero data every second clock cycle). It is then possible to move one delay from the rotation parameter line to the other horizontal data path by applying the type of pipeline cut labelled AB in figure 4. The left hand column R_{y,1}^{-T} and the triangular array R_{y,2}^{-T} then constitute independent pipeline processing stages. If the algorithmic transformation is repeated to create a delay between every pair of adjacent columns in the original SFG (before 2-slowing), a complete set of pipeline cuts (equivalent to the one labelled AB in figure 4) may be applied to produce the systolic array defined in figures 5 and 6.
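The algebraic step behind equation (17) can be confirmed with stand-in data; in the added sketch below, any orthogonal matrix and triangular factor will do for the check, and all names are illustrative only:

```python
import numpy as np

# Equation (17) in miniature: rotating the out-of-date product
# beta^{-1} * R^{-T}(n-2) z with Q(n-1) yields the up-to-date product
# R^{-T}(n-1) z, because the matrix update and the product commute.
rng = np.random.default_rng(0)
p, beta = 4, 0.99
A2 = np.tril(rng.standard_normal((p, p))) + 3 * np.eye(p)   # stand-in R^{-T}(n-2)
Q = np.linalg.qr(rng.standard_normal((p + 1, p + 1)))[0]    # stand-in rotations

# Matrix update (15): [R^{-T}(n-1); kappa^T] = Q [beta^{-1} R^{-T}(n-2); 0^T].
S = Q @ np.vstack([A2 / beta, np.zeros((1, p))])
A1, kappa = S[:p], S[p]

z = rng.standard_normal(p)              # the stacked data vector [x; y]
xi = A1 @ z                             # correct product, equation (14)
xi_stale = A2 @ z                       # product with the out-of-date matrix

# Extra rotation step of equation (17) recovers the correct product.
top = (Q @ np.append(xi_stale / beta, 0.0))[:p]
print(np.allclose(top, xi))             # True
```

The identity holds purely by linearity, which is why the transformation changes only the timing of the SFG and not the numbers it computes.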
Figure 5. Systolic array for RLS by inverse updates
Figure 6. Definition of processing cells in figure 5

This array is identical to the one presented recently by Moonen et al. [8]. An alternative approach to creating the extra delay required in figure 4 is to apply the algorithmic transformation again, thereby generating the HSFG in figure 7. The rotation parameters are now delayed twice between the first column operator and the R_{y,2}^{-T} operator, whilst the matrix-vector product term is rotated twice in compensation. Two delays are now required on the output from the left hand column R_{y,1}^{-T} to align it in time with the output from the triangular array.

Figure 7. HSFG after 2nd algorithmic transformation

The HSFG in figure 8 is obtained by applying the pipeline cut AB to the HSFG in figure 7 and then moving the delay on the rotation data path for the rotation operator from input to output.

Figure 8. HSFG of figure 7 after pipeline cut and retiming

This move is valid provided that the rotation operator is modified to include extra delays in place of the small circular symbols in figure 3. The resulting "delayed" rotation operator is simply denoted by z^{-1}Q. In order to derive a fully pipelined systolic array the entire procedure applied to transform the HSFG in figure 2 to that in figure 8 must be repeated to create delays between each pair of adjacent columns in the original SFG. The resulting systolic array, which is identical to the one proposed by McWhirter and Proudler [6], is defined in figures 9 and 10. The different processing cells on the diagonal boundary of the array have not been defined explicitly since they are just special cases of the internal cells and can easily be deduced from them.
4 CONCLUSIONS

By means of algorithmic engineering, we have shown how to derive a fully pipelined systolic array for the inverse updates RLS algorithm. This proves to be a non-trivial task since the inverse updates algorithm involves a major computational feedback loop. Two distinct systolic array designs have been presented. The original one, defined in figures 5 and 6, was derived using a 2-slowing procedure and can only process one new data vector every two clock cycles. The other one, defined in figures 9 and 10, can process a new data vector every clock cycle but requires an extra rotation to
be performed in every cell. It has been estimated that the cells in figure 9 would require about 60% more silicon area than their counterparts in figure 5. Since both arrays require the same number of cells, it would appear that the one in figure 9 is more efficient (about twice the throughput for only 60% extra circuitry). However, since adjacent cells of the array in figure 5 are idle (i.e. processing zero data) on alternate clock cycles, it is possible in the normal way to combine them in pairs with little additional overhead and so reduce the hardware requirement by almost a factor of two. The array in figure 9 would then be less efficient, requiring almost twice as many cells, each about 60% bigger, in order to double the maximum throughput rate. Both arrays have been derived in this paper to illustrate how easily the different designs are obtained using the techniques of algorithmic engineering.

References
[1] W. M. Gentleman and H. T. Kung, "Matrix Triangularisation by Systolic Arrays", Proc. SPIE Real Time Signal Processing IV, Vol 298, pp 19-26, 1981.
[2] G. H. Golub and C. F. Van Loan, "Matrix Computations", North Oxford Academic Publishing Co. / Johns Hopkins Press, 1988.
[3] S. Haykin, "Adaptive Filter Theory", 2nd Edition, Prentice-Hall, Englewood Cliffs, NJ, USA, 1991.
[4] J. G. McWhirter, "Recursive Least Squares Minimisation using a Systolic Array", Proc. SPIE Real Time Signal Processing IV, Vol 431, pp 105-112, 1983.
[5] J. G. McWhirter, "Algorithmic Engineering in Adaptive Signal Processing", IEE Proc., Pt F, Vol 139, pp 226-232, 1992.
[6] J. G. McWhirter and I. K. Proudler, "A Systolic Array for Recursive Least Squares Estimation by Inverse Updates", Proc. IEE Int. Conf. on Control, Warwick (Mar 1994).
[7] G. M. Megson, "An Introduction to Systolic Algorithm Design", Oxford University Press, 1992.
[8] M. Moonen and J. G. McWhirter, "Systolic Array for Recursive Least Squares by Inverse Updating", Electronics Letters, Vol 29, No 13, 1993.
[9] C.-T. Pan and R. J. Plemmons, "Least Squares Modifications with Inverse Factorisation: Parallel Implications", J. Comput. and Applied Maths., Vol 27, pp 109-127, 1989.
[10] I. K. Proudler and J. G. McWhirter, "Algorithmic Engineering in Adaptive Signal Processing: Worked Examples", IEE Proc., Vis. Image Signal Proc., Vol 141, pp 19-26, 1994.
[11] R. Schreiber, "Implementation of Adaptive Array Algorithms", IEEE Trans. ASSP, Vol 34, pp 1038-1045, 1986.
[12] T. J. Shepherd, J. G. McWhirter and J. E. Hudson, "Parallel Weight Extraction from a Systolic Adaptive Beamformer", in "Mathematics in Signal Processing II", J. G. McWhirter (Ed), Clarendon Press, Oxford, pp 775-790, 1990.
[13] M. H. Verhaegen, "Round-off Error Propagation in Four Generally Applicable Recursive Least Squares Estimation Schemes", Automatica, Vol 25, pp 437-444, 1989.

© British Crown Copyright 1994
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 1995 Elsevier Science B.V.
HIERARCHICAL SIGNAL FLOW GRAPH REPRESENTATION OF THE SQUARE-ROOT COVARIANCE KALMAN FILTER
D.W. BROWN, F.M.F. GASTON
Control Engineering Research Centre, Department of Electrical and Electronic Engineering, The Queen's University of Belfast, Ashby Building, Stranmillis Road, Belfast BT9 5AH
robust than non-square-root forms, as they are less susceptible to rounding errors and prevent the error covariance matrices from becoming negative definite. A number of different architectures have been proposed in the literature and most have been outlined in a survey paper by Gaston and Irwin [5]. Algorithmic engineering has grown out of parallel processing techniques used in designing systolic arrays and sees the resulting diagrams as illustrations of the algorithm itself and not just a possible systolic architecture. It shows the data flow and computational requirements of a particular algorithm. However, the parallel algorithms are not necessarily unique and therefore one should be able to transform one parallel form of an algorithm to another using simple graphical techniques. McWhirter and Proudler have illustrated this in [7]. In this paper, we will demonstrate that all systolic square-root covariance Kalman filter architectures can be obtained from the corresponding hierarchical signal flow graph. In particular, the architectures proposed by Gaston, Irwin and McWhirter [8] and by Brown and Gaston [6] will be verified, using algorithmic engineering methodology, from the overall hierarchical signal flow graph. In the next section, the notation and defining equations for the square-root covariance Kalman filter are given, followed by a section illustrating hierarchical signal flow graphs. Section 4 develops the full hierarchical signal flow graph for the square-root covariance Kalman filter. Sections 5 and 6 illustrate the systolic architectures, [6] and [8], formed by considering different projections of this hierarchical signal flow graph.

2  SQUARE-ROOT COVARIANCE KALMAN FILTERING
The general Kalman filtering algorithm can be numerically unstable in some applications and for this reason several square-root algorithms have been proposed. The square-root covariance algorithm is summarised as follows [8]:
Q(k) [ V^{T/2}(k)              0
       P^{T/2}(k|k-1)C^T(k)    P^{T/2}(k|k-1)A^T(k)
       0                       W^{T/2}(k) ]
=
[ V_e^{T/2}(k)    V_e^{-1/2}(k)C(k)P^T(k|k-1)A^T(k)
  0               P^{T/2}(k+1|k)
  0               0 ]    (1)

where the left-hand matrix is the pre-array and the right-hand matrix the post-array, with

V_e(k) = C(k)P(k|k-1)C^T(k) + V(k)    (2)

and

x̂(k+1|k) = A(k)x̂(k|k-1) + A(k)P(k|k-1)C^T(k)V_e^{-1}(k)[z(k) - C(k)x̂(k|k-1)]    (3)

where x̂(k|k-1) is the (n x 1) predicted state estimate vector at time k given measurements
up to time k-1, z(k) is the (m x 1) measurement vector, A(k) is the (n x n) state matrix, C(k) is the (m x n) measurement matrix, W^{T/2}(k) and V^{T/2}(k) are the square-roots, or Cholesky factors, of the state and measurement noise covariance matrices, and P(k|k-1) is the (n x n) predicted state error covariance matrix. The Cholesky factors are usually taken to be positive definite and can be either upper or lower triangular, i.e.
V(k) = V^{1/2}(k)V^{T/2}(k)    (4)
From now on, all time scripts are removed and, for simplicity, [z(k) - C(k)x̂(k|k-1)]^T may be referred to as z'.
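The array update in equation (1) can be checked numerically by forming the pre-array and triangularizing it with a QR factorization, which stands in for the sequence of Givens rotations Q(k). The sketch below uses a made-up two-state, one-measurement system (all matrix values are our own, not from the paper):

```python
import numpy as np

# Hypothetical two-state, one-measurement system; all values are made up.
n, m = 2, 1
A = np.array([[0.9, 0.1], [0.0, 0.8]])      # state matrix
C = np.array([[1.0, 0.5]])                  # measurement matrix
P = 2.0 * np.eye(n)                         # P(k|k-1)
W = 0.1 * np.eye(n)                         # state noise covariance
V = 0.2 * np.eye(m)                         # measurement noise covariance

# Cholesky factors S with S^T S = X, playing the role of X^{T/2}
Pt = np.linalg.cholesky(P).T
Wt = np.linalg.cholesky(W).T
Vt = np.linalg.cholesky(V).T

# Pre-array of equation (1): (m+2n) x (m+n)
pre = np.block([
    [Vt,               np.zeros((m, n))],
    [Pt @ C.T,         Pt @ A.T],
    [np.zeros((n, m)), Wt],
])

# QR triangularization stands in for the Givens rotations Q(k)
_, post = np.linalg.qr(pre)          # (m+n) x (m+n), upper triangular
Ve_t = post[:m, :m]                  # V_e^{T/2}(k), up to sign
Pt_next = post[m:, m:]               # P^{T/2}(k+1|k), up to sign

# Compare with the conventional covariance recursions (2) and (3)
Ve = C @ P @ C.T + V
K = A @ P @ C.T @ np.linalg.inv(Ve)
P_next = A @ P @ A.T - K @ Ve @ K.T + W
assert np.allclose(Ve_t.T @ Ve_t, Ve)
assert np.allclose(Pt_next.T @ Pt_next, P_next)
```

Because only products of the form S^T S are compared, the sign ambiguity of the QR factors does not affect the check.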
3  HIERARCHICAL SIGNAL FLOW GRAPHS (HSFGs)
For the purposes of algorithmic engineering a hierarchical signal flow graph (HSFG) may be regarded as an instantaneous input-output processor array. For example, consider the case of matrix-matrix multiplication in equation 5.

C = AB    (5)
The corresponding HSFG is shown in figure 1. The input matrices, A & B, flow in the i and j directions respectively, while the product, C, is propagated in the k direction. The value of considering the 3-D HSFG is seen in figure 2, which shows the projected HSFG obtained by projecting figure 1 along the k-axis, i.e. along the direction in which the C matrix is propagated. This results in the product matrix being stored in memory with matrices A & B being passed through the array, which is illustrated diagrammatically by shading the "stationary" data. Figure 2 is not an HSFG in the strictest sense, because the data has to be fed in sequentially, so it is not an "instantaneous input-output processor". Despite this, these types of projection are valuable in determining the architecture and cell descriptions of the resulting systolic array. The cell operations depend not only on the actual function of the HSFG but also on the chosen projection, which explains why different systolic architectures can be generated from the same HSFG to produce the same overall mode of operation.
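The k-axis projection described above can be emulated in a few lines: the product matrix stays resident ("stationary") in the cell array while slices of A and B stream through, one outer-product accumulation per time step. This is a toy sketch of the mode of operation (sizes and values are our own):

```python
import numpy as np

n = 4
A = np.arange(n * n, dtype=float).reshape(n, n)
B = np.eye(n) + np.ones((n, n))

C = np.zeros((n, n))           # "stationary" data held in the cell array
for k in range(n):             # k is the projected (time) axis
    # at step k, cell (i, j) receives A[i, k] from the left and B[k, j]
    # from above, and accumulates their product in place
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)
```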
4  HSFG FOR THE SQUARE-ROOT COVARIANCE KALMAN FILTER
A full HSFG for the square-root covariance Kalman filter can be built up by considering the following steps: (i) formation of the pre-array in equation 1; (ii) error covariance update: P^{T/2}(k); (iii) state update: x̂(k).
4.1  Formation of the Pre-Array
The various product terms included in the pre-array fall into two categories:

(i) Post-multiplication by C^T.
(ii) Post-multiplication by A^T.

The computation of P^{T/2}C^T and [z - Cx̂]^T can be described using the HSFGs shown in figure 3.

4.1.1  C^T Products
In the left-hand HSFG, P^{T/2} and C^T are passed in together with the null matrix from above, and the product P^{T/2}C^T emerges from the bottom of the flow graph. Note that both P^{T/2} and C^T pass through unchanged in their respective directions. In the right-hand HSFG a similar calculation takes place: C^T and x^T are multiplied together and combined with z^T, which is fed in from above, to produce [z - Cx]^T, which emerges from the bottom of the HSFG. These two diagrams can be combined into one HSFG by joining the flow graphs along identical data flow directions, i.e. letting C^T flow uninterrupted from the left-hand to the right-hand HSFG, forming the HSFG shown in figure 4. It can be seen that the two product terms are generated from the bottom of the HSFG (-j direction). The input matrices are fed in along the i and -k directions and pass through unchanged. From now on, all outputs of unchanged data have been removed to clarify the diagrams.

4.1.2  A^T Products

The products P^{T/2}A^T and x^T A^T are generated in much the same way as before, with P^{T/2} and x^T being multiplied by A^T and the products emerging from the bottom of the flow graph, as shown in figure 5.
Figures 4 and 5 can also be combined by "gluing" the HSFGs together along common data flow directions, i.e. letting P^{T/2} and x^T flow directly from figure 4 to figure 5, producing an HSFG for the generation of all the product terms in the pre-array of equation 1, shown in figure 6. As can be seen, the product terms are produced from the bottom of the array and any null inputs are removed for clarity.
4.2  Generating the Error Covariance Update: P^{T/2}

Having generated all the product terms in the pre-array, the post-array can now be formed to update P^{T/2} by applying a set of orthogonal (Givens) rotations, equation (1), and x̂^T by a Schur complement calculation, equation (3). The generation of the updated error covariance matrix, P^{T/2}, can be described by the HSFG in figure 7. Rotating P^{T/2}C^T into V^{T/2} and passing the resulting Givens rotations across P^{T/2}A^T before it is rotated into W^{T/2} produces the updated error covariance matrix P^{T/2}. Also produced as by-products are the terms V_e^{T/2} and V_e^{-1/2}CP^T A^T, which are needed in the calculation of the updated state estimate, x̂^T.
4.3  Generating the State Update x̂^T(k)
The state update is performed by taking the Schur complement of the compound matrix in equation (6).
[ V_e^{T/2}(k)                V_e^{-1/2}(k)C(k)P^T(k|k-1)A^T(k)
  [z(k) - C(k)x̂(k|k-1)]^T    x̂^T(k|k-1)A^T(k) ]
=
[ A    B
  C    D ]    (6)
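The block operation used here can be illustrated generically: zeroing the (2,1) block of a compound matrix [[A, B], [C, D]] by a block row operation leaves the Schur complement D - CA^{-1}B in the (2,2) position. The sketch below uses random matrices rather than the Kalman filter quantities of equation (6):

```python
import numpy as np

# Generic illustration of the Schur complement: not the paper's data.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # invertible block
B = rng.standard_normal((3, 2))
C = rng.standard_normal((2, 3))
D = rng.standard_normal((2, 2))

M = np.block([[A, B], [C, D]])
# Block row operation: subtract (C A^{-1}) times the first block row
T = np.block([[np.eye(3), np.zeros((3, 2))],
              [-C @ np.linalg.inv(A), np.eye(2)]])
M2 = T @ M

assert np.allclose(M2[3:, :3], 0.0)                        # C block zeroed
assert np.allclose(M2[3:, 3:], D - C @ np.linalg.inv(A) @ B)
```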
If the submatrix C is zeroed by computing the Schur complement of the compound matrix, D - CA^{-1}B is produced. Therefore [z(k) - C(k)x̂(k|k-1)]^T is removed by rotating it ...
inserted, resulting in an identical architecture to that shown in figure 10. This demonstrates that the ad-hoc methods of systolic design can be replaced by a formal design methodology via the use of signal flow graphs and algorithmic engineering. While the above architecture is very efficient and fast, O(2n) time-steps per iteration, it does require feedback loops to produce the pre-array from the new P^{T/2} located in the triangular part of the array. In the next section, an architecture will be described briefly which does not have feedback loops but which has the same iteration time and higher cell efficiency.

6  THE SRCKF SYSTOLIC ARCHITECTURE OF BROWN AND GASTON
The architecture given in figure 11 is not unique, as the next example demonstrates. To obtain the systolic architecture documented in [6] is a more complex task than that given in the previous section. Three different projections are needed: (i) i-axis projection of the multiplication layer; (ii) j-axis projection of the state update layer; (iii) k-axis projection of the error covariance update layer. These projections are shown separately in figure 12. Note that the k-axis projection has been flipped upside down. These projections are in themselves valid systolic architectures, but can be combined on top of one another in the following way to produce a more efficient array. The shaded area of projection (ii) is identical to the results produced from projection (i) and can be combined as illustrated in figure 13. The products P^{T/2}A^T and x^T A^T overwrite P^{T/2} and x^T respectively, followed by the Schur complement calculation to update the state estimate, x̂^T, which in turn overwrites x^T A^T in memory. Finally, appending projection (iii) to figure 13 by storing the lower triangular W^{T/2} in a secondary memory under the existing lower triangular P^{T/2} will result in the updated error covariance matrix being formed in the correct position for calculations in the next iteration. It should also be noted that the measurement vector, z, which has been replaced by a unity matrix in memory, is now fed into the array with the C^T matrix, producing the architecture given in figure 14. This architecture is identical to that described at length in [6], again showing that a formal design method exists for the generation of systolic square-root Kalman filters.

7  CONCLUSIONS
To conclude, we have demonstrated that: 1. The SRCKF algorithm can be represented as an HSFG. 2. Using algorithmic engineering techniques, numerous systolic architectures can be obtained by projecting the 3D HSFG in various planes.
3. A formal design method for systolic architectures has been shown using HSFGs.
Acknowledgements

The authors gratefully acknowledge the support of the Defence Research Agency, Malvern, and the financial assistance given by the Department of Education for Northern Ireland.
References

[1] J.M. Jover and T. Kailath, "A Parallel Architecture for Kalman Filter Measurement Update and Parameter Estimation", Automatica, Vol. 22, No. 1, pp. 43-57, 1986.
[2] M.J. Chen and K. Yao, "On Realizations of Least-Squares Estimation and Kalman Filtering by Systolic Arrays", Proc. 1st Int. Workshop on Systolic Arrays, Oxford, 1986, pp. 161-170.
[3] P. Gosling, J.E. Hudson, J.G. McWhirter and T.J. Shepherd, "Direct Extraction of the State Vector from Systolic Implementations of the Kalman Filter", Proc. Int. Conf. on Systolic Arrays, Killarney, Ireland, May 1989, pp. 42-51.
[4] H.T. Kung and C.E. Leiserson, "Introduction to VLSI Systems", edited by C.A. Mead and L. Conway, Addison-Wesley, 1980.
[5] F.M.F. Gaston and G.W. Irwin, "Systolic Kalman Filtering: an Overview", IEE Proceedings-D Control Theory and Applications, Vol. 137, No. 4, pp. 235-244, 1990.
[6] D.W. Brown and F.M.F. Gaston, "Systolic Square-root Kalman Filtering without Feedback Loops", to be presented at IEEE European Workshop on Computer-Intensive Methods in Control and Signal Processing, Prague, September 1994.
[7] I.K. Proudler and J.G. McWhirter, "Algorithmic Engineering in Adaptive Signal Processing II - Worked Examples", to appear in IEE Proc. Vis. Image Signal Process.
[8] F.M.F. Gaston, G.W. Irwin and J.G. McWhirter, "Systolic Square Root Covariance Kalman Filtering", Journal of VLSI Signal Processing, Vol. 2, pp. 37-49, 1990.
[9] G.M. Megson, "An Introduction to Systolic Algorithm Design", Clarendon Press, Oxford, 1992.
[10] M. Moonen and J.G. McWhirter, "Systolic Array for Recursive Least Squares by Inverse Updating", Electronics Letters, Vol. 29, No. 13, pp. 1217-1218, 1993.
Figure 1: HSFG for Matrix-Matrix Multiplication

Figure 2: Projection along the k-axis

Figure 3: HSFGs for P^{T/2}C^T and [z - Cx]^T
Figure 4: HSFG for C^T Products
Figure 5: HSFG for A^T Products
Figure 6: HSFG for all Products
Figure 7: HSFG for the Generation of the Updated P^{T/2}
Figure 8: HSFG for the Generation of the Updated State Estimate
Figure 9: HSFG for the SRCKF (multiplication layer, state update layer, error covariance update layer)
Figure 10: Existing SRCKF Architecture
Figure 11: Systolic Architecture obtained by k-axis Projection
Figure 12: Projections of the three layers of the HSFG in figure 9

Figure 14: Systolic Architecture given by Brown and Gaston, [6]
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
A SYSTOLIC ALGORITHM FOR BLOCK-REGULARIZED RLS IDENTIFICATION
J. Schier
In [3], a block-regularized parameter estimator has been presented, compatible with the requirements of implementation on a pipelined systolic architecture. The throughput that it achieves is an order of magnitude higher than that of the general framework presented in [5], and half that of the standard RLS systolic array [11]. In this paper, we apply the concept of block-regularization to the RLS array with inverse updates [12]. As a result, we obtain a systolic array with increased robustness to weakly exciting data, completely pipelined and using only nearest-neighbour cell connections, which has the additional advantage of explicitly produced transversal filter weights.

2  IDENTIFICATION OF THE SYSTEM MODEL

2.1  Linear Regression Model
The system is modeled by the linear regression
y = Θ^T φ + e,    (1)
where the scalar output measurement y is assumed to be related through an unknown parameter vector Θ to a known n-dimensional vector φ, which is composed of the recent inputs, outputs and measurable disturbances contributing to the model output, and e is scalar Gaussian white noise with zero mean.

2.2  Parameter Estimation

2.2.1  Minimization of Cost Function. In system identification, we choose an estimate which minimizes the cost function J,
where P is the covariance matrix, V denotes the extended information matrix and Λ coincides with the remainder after the least squares estimation.
2.2.2  Notational Conventions. Since we work only with estimates in the following text, we shall not refer to them explicitly by the 'hat' symbol. Instead, we introduce the following notation where suitable: the 'tilde' symbol above a variable denotes the value before the data update (P̃), stacked 'bar' and 'tilde' symbols refer to the value after the data update but before the time update, the 'breve' symbol denotes the value after exponential forgetting (P̆) and the 'bar' symbol refers to the value after the time update (P̄).

2.2.3  Data Update. For the data update, we use the well-known formulae of the recursive least squares (RLS) identification

β = φ^T P̃ φ,    (3)
K = (1 + β)^{-1} P̃ φ,    (4)
e = y - Θ̃^T φ,    (5)
Θ = Θ̃ + K e,    (6)
P = P̃ - (1 + β)^{-1} P̃ φ φ^T P̃,    (7)

which is equivalent to

V = Ṽ + [φ^T  y]^T [φ^T  y].    (8)
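As a quick numerical check of the data update, the sketch below runs one covariance-form RLS step, (3)-(7), and the corresponding information-form step, (8), and verifies that they agree. The concrete sizes and values are made up; the symbols follow the text:

```python
import numpy as np

# Illustrative check, not the paper's array: one RLS data update in
# covariance form against the information-matrix form.
rng = np.random.default_rng(1)
n = 3
phi = rng.standard_normal(n)           # regressor
y = 0.7                                # output measurement
P_old = 4.0 * np.eye(n)                # covariance before update
theta_old = np.zeros(n)                # parameter estimate before update

# Covariance (RLS) form, cf. (3)-(7)
beta = phi @ P_old @ phi
K = P_old @ phi / (1.0 + beta)
e = y - theta_old @ phi
theta = theta_old + K * e
P = P_old - np.outer(P_old @ phi, P_old @ phi) / (1.0 + beta)

# Information form: V accumulates [phi y]^T [phi y], cf. (8)
V_old = np.zeros((n + 1, n + 1))
V_old[:n, :n] = np.linalg.inv(P_old)
d = np.append(phi, y)
V = V_old + np.outer(d, d)

# Parameter estimates from V must match the covariance-form update
theta_from_V = np.linalg.solve(V[:n, :n], V[:n, n])
assert np.allclose(theta_from_V, theta)
assert np.allclose(np.linalg.inv(V[:n, :n]), P)
```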
We do not have to compute Λ, since the estimates are independent of it. To compute the parameter estimates from the extended information matrix V, we divide it into the following submatrices

V = [ V_φ      v_φy
      v_φy^T   v_y ]    (9)
2.2.5  Block Regularization. Regularized exponential forgetting in the standard version is not suitable for a systolic implementation. To preserve pipelining of the systolic estimator, block regularization was proposed in [4, 2]. The idea is to keep the alternative parameters V* and Θ* constant over N ≥ n periods of identification, where n is the dimension of Θ:

V* = V*(k) = V*(k + 1) = ... = V*(k + N - 1),    (19)
Θ* = Θ*(k) = Θ*(k + 1) = ... = Θ*(k + N - 1),
and to include the addition of the alternative parameters defined in the time update (15, 16), in accumulated form, only after every N periods of identification.

1. Standard exponential update over N periods:
   Ṽ := V(1|0), λ(1,N) := 0
   for i := 1 to N
   a) Ṽ := λ(i)(Ṽ + [φ(i)  y(i)]^T [φ(i)  y(i)])
   b) λ(1,N) := λ(i)λ(1,N) + (1 - λ(i))    (20)
   end

2. Accumulated regularization in the N-th period:
   a) v*_y := V*_φ Θ*
   b) V(N+1|N) := Ṽ + λ(1,N)V*
   c) Θ(N+1|N) := V_φ^{-1}(N+1|N) v_φy(N+1|N)    (21)

where λ(1,N) > 0 is an accumulated forgetting factor.

2.3  Square-Root Implementation of the RLS Algorithm
Usually, we use the square-root version of the estimator, because it guarantees symmetry and positive definiteness of the covariance/information matrix, and because it can be implemented on a systolic array.

2.3.1  Square-root Decomposition of the RLS Algorithm. Let us introduce the triangular square-root decomposition of the covariance matrix by the formula
P = R R^T,    (22)
where R is an upper triangular matrix. Using this decomposition, the formulae of the RLS identification (3)-(7) transform to a matrix-vector multiplication

a = R^T φ,    e = y - Θ̃^T φ,    (23)

and to an inverse update

[ 1    0^T ]           [ 1    0^T ]
[ Θ̄    R̄  ]  =  G Ω Q [ Θ̃    R̃  ],    Q = Ω_n ... Ω_2 Ω_1,    (24)

Ω = diag{1, (1/λ)^{1/2} I},    (25)

where Q is an orthogonal matrix given as a product of elementary rotations Ω_1 ... Ω_n, with rotation Ω_i zeroing the i-th element of vector a with respect to the first element of the same column in the composed matrix; Ω is a weighting matrix and G is a non-orthogonal transformation used to update the parameters.
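A small check of the square-root idea behind (22): if P = RR^T, then RQ is an equally valid factor for any orthogonal Q, which is why orthogonal rotations can reshape the stored factor without changing the covariance it represents. The values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n))
P = X @ X.T + n * np.eye(n)            # symmetric positive definite
R = np.linalg.cholesky(P)              # one square-root factor of P
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal Q

R2 = R @ Q                             # rotated factor
assert np.allclose(R @ R.T, P)
assert np.allclose(R2 @ R2.T, P)       # same P, different factor shape
```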
2.3.2  Regularization as Input of Alternative Data. We can consider the regularization process to be an input of some alternative data. To show this, let us discuss the time update of the information matrix V. Using the partitioning (9), we shall introduce a square-root decomposition of matrix V* (** represents a scalar don't-care term) and express V* as a sum of data dyads
V* = [ V*_φ    v*_φy ]  =  Σ_{i=1}^{n} [ u*_i  u*_{y,i} ]^T [ u*_i  u*_{y,i} ],    (26)
     [ **      **    ]

where

[ u*_i  u*_{y,i} ] = i-th row of the matrix [ U*  u*_y ].    (27)
An analogous square-root decomposition may be used for V. We can write the addition of the regularizing matrix V* to the information matrix Ṽ in a recursive form

a) u*_y := U*Θ*
   for i = 1 to n
b)   Ṽ := Ṽ + (λ(1,N))^{1/2}[ u*_i  u*_{y,i} ]^T (λ(1,N))^{1/2}[ u*_i  u*_{y,i} ]
   end
c) V(k+N|k) := Ṽ    (28)
The first formula (28 a) results from (26) and (10). The relation for the recursive regularization (28 b) has the same form as the formula of the exponential update (20 b). The only difference is that the forgetting is not applied to the information matrix, but to the input data. We can conclude that the evolution of the covariance matrix P and of the information matrix V must be equivalent given the same input data. Hence, we can use the same regularizing data no matter which matrix we work with.
3  SYSTOLIC IMPLEMENTATION

In this section, we shall describe the systolic implementation of the regularization, which is the main contribution of this paper. To be able to do that, let us remind the reader of the systolic algorithm for the inverse-updated RLS identification [8, 9, 12].
3.1  Systolic Algorithm for RLS Identification with Inverse Updates
The square-root RLS algorithm (23, 24) is implemented on a lower triangular systolic array. The transposed factor of the covariance matrix, R', and the vector of parameter estimates Θ reside in the cells of the array, as shown in Fig. 1. The input vector β, with initial value β = [0 ... 0], accumulates the expression used in (4). The input vector a, initiated as a = [1 0 ... 0], is necessary for proper pipelining [12]; its first element is accumulated while being passed through the array.
Figure 1: Mapping of R' and Θ to the systolic array; input and output of the array
3.1.1  Function of the Cells. The function of the cells in the RLS array is described in Fig. 2; the forgetting factor is not included, for simplicity. The notation used in the figure refers to the data update; nonetheless, the same formulae are also used for regularization.

3.1.2  Propagation of Forgetting. If we assume λ to be time variable, we have to synchronize its changes in the array with the propagation of the rotations (25). For this reason, it is entered in the upper left cell and propagated through the array as shown in Fig. 3. Because λ(k) is used to compute the accumulated forgetting coefficient λ(1,N) (20 c), it cannot be entered in square-rooted and inverted form.

3.2  Implementation of the Block-Regularization

Implementation of the block-regularized forgetting in the systolic array for RLS identification with inverse updates involves implementation of the following mechanisms:

• Multiplication of the n rows of matrix U* (26, 27) with Θ* (28 a)
• Switching between the identified and the regularizing data
• Computation of the accumulated forgetting factor λ(1,N) (20 c)
• Switching of the exponential forgetting (when processing the regularizing data, the exponential forgetting is not used; cf. (20 b) and (21 b, c))
Figure 2: Function of cells in the RLS array (left column and internal cells, for the upper part and for the bottom row of the array)
Figure 3: Movement of λ through the RLS array

• Storing of Θ* during multiplication with U*, writing of new Θ*.
3.2.1  Selection of Θ* for the Block Regularization. For implementation reasons (simplification of control mechanisms), we choose Θ, computed n steps before the end of the data block, for Θ*.

3.2.2  Loading of U*. The parameter estimates Θ, which we use as the regularizing parameters Θ* for multiplication with the matrix U*, are computed in the bottom row of the array. Hence, a straightforward choice is to load U* into the array from the bottom, skewed in time, and to perform the multiplication also in the bottom row, as shown in Fig. 4.
Figure 4: Loading of matrix U* into the array and its multiplication with the parameters

Since the parameter estimates change with every new data input, while all rows of U* must be multiplied with the same Θ*, it is necessary to store Θ* = Θ(k), k being the time before the start of the multiplication, in registers. From the bottom row, the elements of U* are shifted up through the array to the diagonal, where they are entered as the alternative data.
The movement of U* has to be synchronized with the input of the data samples, so that U*_{11} arrives at the diagonal just after the last sample of the data block has been processed. To ensure this, U*_{11} has to be entered into the bottom row of the array n steps before the end of the data block.
3.2.3  Control Signal. To switch on and off the writing of the Θ estimates to the storage of Θ*, to switch between the real data samples and the regularizing data, and to switch on and off the forgetting, a control signal is used. This signal, aligned with U*, is first propagated upwards through the array. After entering the array, it controls the writing of the estimates Θ to the Θ* storage in the bottom row. In the diagonal cells, it controls the switching of the data entry (Fig. 5). After it has reached the diagonal, it is sent back from the upper left cell, in the same way as we propagate λ (Fig. 3), to switch on and off the forgetting.
Figure 5: Propagation of the control signal
3.2.4  Buffering of Input Data. Since the processing of the input data is interrupted by regularization for n periods every N periods, it is necessary to buffer the input data and to sample the identified system at a slower rate than the systolic estimator runs. This slowdown is equal to 1/2 in the worst case (for N = n).

3.2.5  Multiplication of U* with λ(1,N). As mentioned previously, λ is entered through the upper left cell of the array (Fig. 3). There we shall also compute the accumulated forgetting coefficient λ(1,N). To implement the product (λ(1,N))^{1/2}U* (28 b), we use the methods of algorithmic engineering [10]: instead of computing the product before loading U* into the array, we do so in the left column cells of the upper part of the array, before using vector a (23) to compute the rotations (25). We have yet to implement the product (λ(1,N))^{1/2}u*_y. This product is entered during
regularization instead of y, and through the computation of the prediction error e (23), it is used to compute the transformation w (Fig. 2):

e = ((λ(1,N))^{1/2}u*_y - Θ̃^T(λ(1,N))^{1/2}u*) / λ^{1/2} = ((λ(1,N))^{1/2}/λ^{1/2})(u*_y - Θ̃^T u*).    (29)

The fraction (λ(1,N))^{1/2}/λ^{1/2} is computed in the cell storing R_{11} (the bottom left cell of the upper part of the array).
4  SIMULATION EXAMPLE
The influence of the block-accumulated regularization on the robustness of the estimator is shown in the graphs in Fig. 6. The following simple system was identified:

y(k) = a_1 y(k-1) - a_2 y(k-2) + b_0 u(k) + b_1 u(k-1) + c e(k),    (30)

where a_1 = 0.05, a_2 = 0.2, b_0 = 0.2, b_1 = 0.3, c = 0.02, and e(k) and u(k) are white N(0,1) noise. Matrix U* was set to U* = 0.07I, and the weighting factor λ = 0.8.
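The simulation set-up of (30) can be reproduced with a plain exponentially weighted RLS loop. Note that the paper uses the block-regularized systolic version instead; this sketch only generates the data and checks that ordinary RLS recovers the coefficients under the persistently exciting input:

```python
import numpy as np

rng = np.random.default_rng(2)
a1, a2, b0, b1, c = 0.05, 0.2, 0.2, 0.3, 0.02
lam = 0.8                      # weighting factor from the text
N = 2000
u = rng.standard_normal(N)     # white N(0,1) input
e = rng.standard_normal(N)     # white N(0,1) noise
y = np.zeros(N)
for k in range(2, N):
    y[k] = a1 * y[k-1] - a2 * y[k-2] + b0 * u[k] + b1 * u[k-1] + c * e[k]

# Exponentially weighted RLS on the regressor [y(k-1), y(k-2), u(k), u(k-1)]
theta = np.zeros(4)
P = 100.0 * np.eye(4)
for k in range(2, N):
    phi = np.array([y[k-1], y[k-2], u[k], u[k-1]])
    K = P @ phi / (lam + phi @ P @ phi)
    theta = theta + K * (y[k] - phi @ theta)
    P = (P - np.outer(K, phi @ P)) / lam

# true parameter vector for this regressor is [a1, -a2, b0, b1]
assert np.allclose(theta, [a1, -a2, b0, b1], atol=0.1)
```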
The oscillations of R_{11} for zero input are due to the accumulated data updates.

5  CONCLUSIONS
In this paper, we have implemented the block-regularized exponential forgetting in the square-root RLS algorithm with inverse updates [12]. The principle of regularization consists of weighting the RLS cost function (2) with an alternative function specified by the user. This weighting protects the parameter estimates from numerical instability in the case of non-informative data, because they converge to the regularizing values in that case. The block-accumulated regularization accumulates the regularization step over several steps of identification, which is necessary for a pipelined systolic implementation. Unlike the standard regularization [5, 7], which is not suitable for systolic implementation, the block-regularization reduces the throughput of the systolic algorithm by only 1/2 in the worst case, compared with the exponentially weighted RLS. Other advantages are that the implementation preserves the compactness of the original array and that it directly provides the transversal filter weights.

Acknowledgements

This research was supported by Research Grant Nr. 102/93/0897 of the Grant Agency of the Czech Republic.

References

[1] L. D. J. Eggermont et al., editors. VLSI Signal Processing VI, New York, 1993. IEEE Signal Processing Society, IEEE Press. Proceedings of the IEEE Signal Processing
Figure 6: Comparison of exponential and regularized forgetting
Society Workshop, held October 20-22, 1993, in Veldhoven, The Netherlands.
[2] J. Kadlec. The cell-level description of the systolic block regularised QR filter. In Eggermont et al. [1], pages 298-306.
[3] J. Kadlec, F. M. F. Gaston, and G. W. Irwin. Parallel implementation of restricted parameter tracking. In J. G. McWhirter, editor, Third IMA International Conference on Mathematics in Signal Processing, University of Warwick, December 15-17, 1992. Oxford University Press.
[4] J. Kadlec, F. M. F. Gaston, and G. W. Irwin. Systolic implementation of the regularised parameter estimator. In K. Yao et al., editors, VLSI Signal Processing V, pages 520-529, New York, 1992. IEEE Signal Processing Society, IEEE Press. Proceedings of the IEEE Signal Processing Society Workshop, held October 28-30, 1992, in Napa, CA.
[5] R. Kulhavý. Restricted exponential forgetting in real-time identification. Automatica, 23:589-600, 1987.
[6] L. Ljung and S. Gunnarsson. Adaptation and tracking in system identification - a survey. Automatica, 26:7-21, 1990.
[7] L. Ljung and T. Söderström. Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA, 1983.
[8] J. G. McWhirter. Systolic array for recursive least squares by inverse iterations. In Eggermont et al. [1], pages 435-443.
[9] J. G. McWhirter. A systolic array for recursive least squares estimation by inverse updates. In International Conference on Control '94, University of Warwick, March 21-24, 1994. IEE.
[10] J. G. McWhirter. Algorithmic engineering in adaptive signal processing. IEE Proc., Pt. F, 139(3), June 1992.
[11] J. G. McWhirter and I. K. Proudler. The QR Family, chapter 7, pages 260-321. Prentice Hall International Series in Acoustics, Speech and Signal Processing. Prentice Hall International Ltd., 1993.
[12] M. Moonen and J. G. McWhirter. A systolic array for recursive least squares by inverse updating. Electronics Letters, 29(13):1217-1218, 1993.
[13] J. Schier. Parallel algorithms for robust adaptive identification and square-root LQG control. PhD thesis, Inst. of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, 1994.
[14] J. Schier. A systolic algorithm for the block-regularized RLS identification. Res. report 1807, Inst. of Information Theory and Automation, Prague, 1994. Also accepted for publication in Kybernetika (Prague).
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
NUMERICAL ANALYSIS OF A NORMALIZED RLS FILTER USING A PROBABILITY DESCRIPTION OF PROPAGATED DATA
J. KADLEC
Control Engineering Research Centre, Department of Electrical and Electronic Engineering, The Queen's University of Belfast, Ashby Building, Stranmillis Road, Belfast BT9 5AH, Northern Ireland
[email protected]

ABSTRACT. The normalized version of the QR algorithm for recursive least squares estimation and filtering is presented. An understanding of the numerical properties of a normalized RLS algorithm is attempted using a global probability analysis.

KEYWORDS. Systolic array, normalization, fixed point, probability.
1  INTRODUCTION
A normalized version of the QR algorithm [3], [7] for recursive least squares estimation and filtering is presented. All data and parameters of the main triangular section of the array have the guaranteed range of values [-1,1]. The array has the potential for a minimal-latency implementation, because the normalized section can use DSP, or VLSI, fixed-point hardware (look-up tables) [2]. An understanding of the numerical properties of the normalized RLS algorithm is attempted using a global probability numerical analysis. This approach derives the analytic formulas for the probability density functions (distributions) describing the data (normalized innovations) propagated in the normalized filter, and is used to formulate probability statements about the number of bits used in the fixed-point representation of propagated data. The derived analytic formulas for the probability distributions are verified by comparison with data histograms measured on a fixed-point normalized array. The array was simulated by C-coded functions under Matlab.
2 NESTED RLS IDENTIFICATION PROBLEMS
We consider the recursive least squares (RLS) identification of a single-output system described by the regression model

y(n) = \theta^T \varphi(n) + e(n)    (1)

where n is discrete time, the p-vector \varphi(n) is the data regressor, y(n) is the output signal, and e(n) represents (in the RLS context) the equation error. The unknown p-vector \theta of the regression parameters is estimated by the p-vector \hat\theta(n). To prepare the ground for the numerical analysis of the algorithms, we will operate with p(p+1)/2 different regression models (indexed by i = 1, 2, ..., m-1; m = 2, ..., p+1). Let us denote by \varphi_{1:p+1}(n) the vector of data measurements, partitioned as

\varphi_{1:p+1}(n) = [ \varphi_{1:i}(n) | \varphi_{i+1:m-1}(n) | \varphi_m(n) | \varphi_{m+1:p}(n) | \varphi_{p+1}(n) ]    (2)

The set of RLS models is given by

\varphi_m(n) = \theta_{i|m}^T \varphi_{1:i}(n) + e_{i|m}(n).    (3)

The standard, maximal-order RLS model (1) is part of the set (3) for m = p+1, i = p. The estimates \hat\theta_{1:i|m}(n) minimize the sum of weighted squares

J(\hat\theta_{1:i|m}(n)) = \sum_{j=1}^{n} \beta^{2(n-j)} ( \varphi_m(j) - \hat\theta_{i|m}^T(n) \varphi_{1:i}(j) )^2    (4)

where 0 < \beta <= 1 is the exponential forgetting factor.

The matrix H and the right-hand sides are partitioned among p processors. An example of a matrix partitioning strategy is shown in Figure 1. In Algorithm 2.1, the parallel computation of the matrices HX^{(0)}, HP^{(j)}, \beta_j, and \gamma_j is required to build in full the P^{(j)} and R^{(j)} matrices. The places in the algorithm where the matrices HX^{(0)}, HP^{(j)}, \beta_j, and \gamma_j are computed require interprocess communication and synchronization, and these places can penalize the efficiency of the parallel implementation of the Block-CG method. In our first implementation, we minimize the number of required communications using a master-slave computational approach in which the master performs the Block-CG Algorithm 2.1 with the help of p slaves to perform the HP^{(j)} products. In the Block-CG algorithm, the most expensive part in terms of computation is the calculation of the HP^{(j)} products, and in this implementation we only parallelized these products. We refer to this implementation as "Master-Slave: centralized Block-CG."

Figure 2: Master-Slave: centralized Block-CG implementation.

Figure 2 illustrates the flow of computations for the Master-Slave: centralized Block-CG implementation. As a second implementation, we consider a master-slave computing approach in which each of the p slaves performs iterations of the Block-CG Algorithm 2.1 on a set of matrices <H_i, X_i^{(0)}, K_i>. The role of the master in this case is to gather partial results from the
Parallel Block Iterative Solvers
R^{(j)T} R^{(j)} and P^{(j)T} H P^{(j)} products in order to build the \gamma_j and \beta_j matrices respectively. At the same time, each slave has information about the other slaves with whom it needs to exchange information to build locally a part of the full HP^{(j)} matrix. The implementation is illustrated in Figure 3. We will refer to this implementation as "Master-Slave: distributed Block-CG." Lastly, we develop an implementation based on an all-to-all computing model.

Figure 3: Master-Slave: distributed Block-CG implementation.

This time the motivation is to reduce the communication bottlenecks created by having a processor that acts as master and needs to receive messages from p slaves and broadcast the results back. In this implementation, we have an all-to-all communication for computing the \gamma_j and \beta_j matrices, which means that, after the communication, the same \gamma_j and \beta_j information is local to every processor. To compute the full HP^{(j)} matrix, each processor communicates with only the processors that have information relevant to its computations. Figure 4 is an illustration of this implementation of the Block-CG algorithm. We refer to this implementation as the "All-to-All Block-CG". In all three implementations the interprocessor communication has an impact on performance. Therefore, we analyse the amount of information that needs to be communicated in each implementation. Let m_k be the number of processors with whom the k-th processor must communicate in order to compute the full HP^{(j)} or HX^{(0)} matrix. Notice that each processor only needs to communicate a part of its local information with its m_k neighbour processors. In the
M. Arioli et al.
Master-Slave: centralized Block-CG implementation, these products are handled differently and each processor sends results back to the processor executing the master role. Table 1 summarizes the number of messages sent per iteration of the Block-CG. We observe in Table 1 that the Master-Slave: centralized Block-CG implementation needs

Table 1: Number of messages sent at every iteration.

the least number of messages per iteration. However, in the Master-Slave: centralized Block-CG the length of every message is n_k x s (the master processor sends P^{(j)} and each slave processor sends back HP^{(j)}). As stated before, in the All-to-All Block-CG and Master-Slave: distributed Block-CG implementations, the processors communicate directly with their neighbours to compute the HP^{(j)} products. These messages are in almost all
cases smaller than n_k x s, except when the matrix H is a full matrix. The length of the messages used to exchange the inner products is s x s. In the Master-Slave: centralized Block-CG, the master processor assembles the full HP^{(j)} from partial results sent by the slave processors. In the other two implementations, the assembly of the full matrices happens in parallel, because each slave processor builds the part of the full matrix it needs. Furthermore, the overhead of assembling the HP^{(j)} matrix in a centralized way increases as the number of subproblems and the degree of parallelism increase. The results shown in Tables 2 and 3 were run on a BBN TC2000 computer. We ran the
Laplace matrix 4096 x 4096 (block size = 4, 171 iterations). Elapsed time of sequential version = 279142.

Number of PEs | All-to-All: Elps. Time (Speedup) | MstrSlv: distributed: Elps. Time (Speedup) | MstrSlv: centralized: Elps. Time (Speedup)
1  | 278827 (1.001) | 279436 (0.999)  | -
2  | 143419 (1.946) | 143083 (1.951)  | 301884 (0.925)
4  | 71244 (3.918)  | 71393 (3.910)   | 278184 (1.003)
8  | 40755 (6.849)  | 38798 (7.195)   | 273320 (1.021)
12 | 40668 (6.864)  | 29747 (9.384)   | 279414 (0.999)
16 | 57759 (4.833)  | 25452 (10.967)  | 283649 (0.984)

Table 2: Test matrix generated from a discretization on a 64 x 64 grid of Laplace's equation. Times shown in the table are in microseconds.
LANPRO matrix 960 x 960 (block size = 4, 138 iterations). Elapsed time of sequential version = 64869.

Number of PEs | All-to-All: Elps. Time (Speedup) | MstrSlv: distributed: Elps. Time (Speedup) | MstrSlv: centralized: Elps. Time (Speedup)
1  | 64980 (0.998) | 65455 (0.991) | -
2  | 34063 (1.904) | 34347 (1.889) | 61942 (1.047)
4  | 19531 (3.321) | 19451 (3.335) | 53964 (1.202)
8  | 14108 (4.598) | 12667 (5.121) | 52175 (1.243)
12 | 20943 (3.097) | 11319 (5.730) | 53566 (1.211)
16 | 48054 (1.350) | 11874 (5.463) | 58400 (1.110)

Table 3: This matrix comes from the Harwell-Boeing Sparse Matrix Collection; it is obtained from a biharmonic operator on a rectangular plate with one side fixed and the others free. Times shown in the table are in microseconds.

experiments with 1, 2, 4, 8, 12, and 16 processors. We used two SPD matrices for running the experiments. The first matrix is the result of a discretization on a 64 x 64 grid of Laplace's equation. The matrix is sparse, of order 4096, and has 20224 nonzero entries. The second matrix, LANPRO, of order 960 with 8402 nonzero entries, comes from the Harwell-Boeing Sparse Matrix Collection [9]. The sequential time reported in Tables 2 and 3 is the result of running a sequential implementation of Block-CG without any routines for handling parallelism. The sequential implementation uses the same BLAS and LAPACK [1] routines as the parallel Block-CG
implementations. We can see in Tables 2 and 3 that the larger the problem size, the better the speedups we get with the Master-Slave: distributed Block-CG implementation. This is not the case for the Master-Slave: centralized Block-CG implementation, for which the performance decreases as we increase the size of the problem, and the overhead of monitoring the parallelism by the master processor negates all the benefits from performing the HP^{(j)} products in parallel. In the All-to-All Block-CG implementation, we have chosen to perform redundant computations in parallel instead of waiting for a master processor that gathers, computes, and broadcasts the results of the computations. As can be seen in Tables 2 and 3, an increase in the degree of parallelism penalizes the performance of this implementation, due to the accompanying increase in interprocessor communication. We conclude from these experiments that the Master-Slave: distributed Block-CG implementation performs better than the other two implementations, because the amount of work performed in parallel better justifies the expense of communication. Furthermore, we use this implementation to accelerate the rate of convergence of the Block Cimmino iterative solver presented in the next section.
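For reference, the kernel that all three implementations parallelize can be sketched sequentially as follows. This is a NumPy sketch in the style of the block conjugate gradient method of O'Leary [13], using the gamma and beta names of the text for the small s x s matrices whose assembly forces the interprocessor communication; the 1-D Laplacian test problem and block size are illustrative assumptions, not the matrices of Tables 2 and 3.

```python
import numpy as np

def block_cg(H, B, tol=1e-10, maxit=500):
    """Sequential sketch of Block-CG: solves H X = B for s right-hand
    sides at once, for SPD H."""
    n, s = B.shape
    X = np.zeros((n, s))
    R = B - H @ X                      # block residual
    P = R.copy()                       # block search directions
    for _ in range(maxit):
        HP = H @ P                     # the expensive H P^(j) product
        gamma = np.linalg.solve(P.T @ HP, R.T @ R)   # s x s step matrix
        X = X + P @ gamma
        R_new = R - HP @ gamma
        if np.linalg.norm(R_new) <= tol * np.linalg.norm(B):
            break
        beta = np.linalg.solve(R.T @ R, R_new.T @ R_new)
        P = R_new + P @ beta
        R = R_new
    return X

# Illustrative SPD test problem: 1-D Laplacian, block size s = 4.
n, s = 64, 4
H = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
B = np.random.default_rng(1).normal(size=(n, s))
X = block_cg(H, B)
assert np.allclose(H @ X, B, atol=1e-8)
```

The three parallel variants above differ only in where the HP product and the two small solves are computed and how their operands are communicated.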
3 PARALLEL DISTRIBUTED BLOCK CIMMINO
The Block Cimmino method is a generalization of the Cimmino method [7]. Basically, we partition the linear system of equations

A x = b,    (3.1)

where A is an m x n matrix, into l subsystems, with l <= m, such that

[ A^1 ]       [ b^1 ]
[ A^2 ]  x  = [ b^2 ]
[ ... ]       [ ... ]
[ A^l ]       [ b^l ]    (3.2)
The block method ([5, 2]) computes a set of l row projections, and a combination of these projections is used to build the next approximation to the solution of the linear system. We formulate the Block Cimmino iteration as:

\delta_i^{(k)} = A^{i+} b^i - P_{R(A^{iT})} x^{(k)}
             = A^{i+} ( b^i - A^i x^{(k)} )    (3.3)

x^{(k+1)} = x^{(k)} + \nu \sum_{i=1}^{l} \delta_i^{(k)}    (3.4)
In equation (3.3), the matrix A^{i+} refers to the Moore-Penrose pseudo-inverse of A^i, defined as A^{i+} = A^{iT} (A^i A^{iT})^{-1}. However, the Block Cimmino method will converge for any other pseudo-inverse of A^i, and in our parallel implementation we use a generalized pseudo-inverse [6], A^{i+}_{G^{-1}} = G^{-1} A^{iT} (A^i G^{-1} A^{iT})^{-1}, where G is an ellipsoidal norm matrix. P_{R(A^{iT})} is an orthogonal projector onto the range of A^{iT}. We use the augmented systems approach, [4] and [10], for solving the subsystems (3.3):

[ G     A^{iT} ] [ u^i ]   [ 0           ]
[ A^i   0      ] [ v^i ] = [ b^i - A^i x ]

with solution

v^i = -(A^i G^{-1} A^{iT})^{-1} (b^i - A^i x),  and  u^i = A^{i+}_{G^{-1}} (b^i - A^i x) = \delta^i    (3.5)
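A dense stand-in for this subsystem solve can be sketched as follows (Python/NumPy). A general-purpose dense solver replaces the sparse LDL^T factorization that MA27 performs in the actual solver, and the result is checked against the explicit pseudo-inverse formula for G = I; the random test block is purely illustrative.

```python
import numpy as np

def cimmino_projection(Ai, bi, x, G=None):
    """Compute delta^i = A^{i+}_{G^-1} (b^i - A^i x) by solving the augmented
    system  [G  A^iT; A^i  0] [u; v] = [0; b^i - A^i x]  of (3.5).
    A dense solve stands in for MA27's sparse LDL^T factorization."""
    m, n = Ai.shape
    G = np.eye(n) if G is None else G
    K = np.block([[G, Ai.T], [Ai, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n), bi - Ai @ x])
    sol = np.linalg.solve(K, rhs)
    return sol[:n]                      # u = delta^i

# Check against the explicit pseudo-inverse formula with G = I.
rng = np.random.default_rng(2)
Ai = rng.normal(size=(3, 7))            # full row rank block
bi = rng.normal(size=3)
x = rng.normal(size=7)
delta = cimmino_projection(Ai, bi, x)
explicit = Ai.T @ np.linalg.solve(Ai @ Ai.T, bi - Ai @ x)
assert np.allclose(delta, explicit)
```

Eliminating u from the first block row reproduces exactly the solution pair quoted in (3.5), which is why the augmented formulation avoids forming A^i G^{-1} A^{iT} explicitly.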
The Block Cimmino method is a linear stationary iterative method with a symmetrizable iteration matrix [11]. The use of ellipsoidal norms ensures the positive definiteness of the Block Cimmino iteration. An SPD Block Cimmino iteration matrix can be used as a preconditioning matrix for the Block-CG method; the use of Block-CG in this case accelerates the convergence rate of the Block Cimmino method. We recall that Block-CG simultaneously searches for the next approximation to the system's solution in s Krylov subspaces and, in the absence of round-off errors, converges to the system's solution in a finite number of steps. We use the Master-Slave: distributed Block-CG implementation presented in the previous section to develop a parallel block iterative solver based on the Cimmino iteration. First, we solve the system (3.5) using the sparse symmetric linear solver MA27 from the Harwell Subroutine Library [8]. The MA27 solver is a multifrontal method which computes the LDL^T decomposition. The MA27 solver has three main phases: Analyse, Factorize, and Solve. These MA27 phases are called from the parallel Block Cimmino solver. First of all, the parallel Block Cimmino solver builds the partition of the linear system of equations (3.1) into (3.2) and generates the augmented subsystems. The solver then examines the augmented subsystems to count the number of nonzero elements inside each of them and identifies the column overlaps between the different subsystems. The number of nonzero elements per subsystem gives a rough estimate of the amount of work that will be performed on the subsystem. The column overlaps determine the amount of communication between the subsystems. In addition, the solver gathers information about the computing environment, either supplied by the user or acquired from the message passing programming tool. Processors are classified into single processors, shared memory clusters, and distributed memory clusters.
We assume that the purpose of clustering a group of processors is to take advantage of a specific communication network between the processors. The information about the processors is sorted in a tree structure where the root node represents the start-up processor, intermediate-level nodes represent shared or distributed clusters, and the leaf nodes represent the processors. Processors in the tree are sorted from left to right by their computing power. The tree of processors, the augmented subsystems, the number of nonzero elements per subsystem, and the information from the column overlaps are passed to a static scheduler. The scheduler first sorts all the subsystems by their number of nonzero elements. Then, subsystems are assigned to processors following a postorder traversal of the tree of processors (i.e., first visit the leaf nodes, then their parent node in the tree). A cluster node receives
a number of subsystems to solve equal to the number of processors it has. In this case, a first subsystem is assigned to the cluster and the remaining ones are chosen from a pool of not-yet-assigned subsystems. To choose amongst the candidate subsystems, we consider the amount of column overlap between them and the subsystems already assigned to the cluster, and we then select the candidate subsystem with the highest factor of overlapping. This choice aims to concentrate the communication between subsystems inside a cluster. Every time a subsystem is assigned to a processor or cluster, we update a workload factor per processor. This workload factor is useful in the event that there are more subsystems than processors. The subsystems that remain in the not-yet-assigned pool after the first round of work distribution are assigned to the least loaded processor or cluster, one at a time; each time, the least loaded processor is determined from the workload factors. After assigning all the subsystems to processors, these subsystems are sent through messages to the heterogeneous network of processors. Each processor calls the MA27 Analyse and Factorize routines on the set of subsystems it has been assigned. Afterwards, it performs the Block Cimmino iteration on these subsystems, checking the convergence conditions at the end of every iteration. The same parallel computational flow from Figure 3 is used in the parallel Block Cimmino solver. The only difference is a call to the MA27 Solve subroutine to solve the augmented subsystems and compute a set of projections \delta^i. The scheduler may redistribute subsystems to improve the current workload distribution. This redistribution may take place after the MA27 Analyse phase, after the MA27 Factorize phase, or during the Block Cimmino iterations. Moreover, the user specifies to the scheduler the stages of the parallel solver at which redistribution is allowed.
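Stripped of the cluster handling and the overlap-driven tie-breaking, the core of this workload-factor scheme can be sketched as follows (Python; the subsystem names and nonzero counts are made up for illustration):

```python
# Hypothetical sketch of the static scheduling step: subsystems are sorted by
# their number of nonzeros and dealt out to the least-loaded processor, with
# the workload factor updated after every assignment.
def schedule(subsystem_nnz, processors):
    load = {p: 0 for p in processors}
    assignment = {p: [] for p in processors}
    # Largest subsystems first, so the biggest pieces of work are placed
    # while the loads are still balanced.
    for sub, nnz in sorted(subsystem_nnz.items(), key=lambda kv: -kv[1]):
        p = min(load, key=load.get)     # least-loaded processor so far
        assignment[p].append(sub)
        load[p] += nnz                  # update the workload factor
    return assignment, load

nnz = {"S1": 900, "S2": 850, "S3": 400, "S4": 350, "S5": 300}
assignment, load = schedule(nnz, ["P1", "P2"])
print(assignment, load)
```

The real scheduler additionally walks the processor tree in postorder and breaks ties using the column-overlap factors, so that heavily communicating subsystems land inside the same cluster.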
Given the high expense of moving a subsystem between processors (moving all the data structures involved in the solution of a subsystem, and updating the neighbourhood information), we recommend allowing redistribution only before the MA27 Factorize phase has started, because many data structures are created per subsystem during the solve phase, and sometimes the time to relocate these data structures across the network is greater than the time to let the unbalanced parallel solver finish its execution. In Table 4, we present some preliminary results of the Block Cimmino solver. We ran in a heterogeneous environment of five SUN Sparc 10 and three IBM RS/6000 workstations. We used one of the IBM workstations to monitor the executions (master processor). The first test matrix is GRE1107 from the Harwell-Boeing Sparse Matrix Collection [9]. This matrix is partitioned into 7 blocks (6 blocks of 159 rows and one of 153 rows), using a block size of 8 for Block-CG. As a second test matrix, we consider a problem that comes from a two-dimensional wing profile at transonic flow (without chemistry effects). The problem is discretized using a mesh of 80 by 32 points. This leads to an unsymmetric, diagonally dominant, block tridiagonal matrix of order 80 x 32 x 3. In this case we test with three different partitionings. We use a block size of 4 for the Block-CG algorithm only to increase the problem granularity, since the problem converges very fast even with a block size of one for Block-CG. The numbers inside parentheses in Table 4 show the relation between the execution time with a given number of slave processors and the execution time of the same problem with a single slave processor. We do not anticipate speedups in a network of workstations, and we expect the Parallel Block Cimmino solver to perform better in parallel heterogeneous environments where we can take advantage of clusters of processors and very different processing capabilities. Besides, we conclude that the Parallel Block Cimmino will provide in a "reasonable" time a solution to a problem that cannot be solved on a single processor.
N. Slaves | GRE1107 | Transonic Flow, 10 Blks | 16 Blks | 11 Blks
1 | 205449 (1.0) | 78079 (1.0) | 77757 (1.0) | 77579 (1.0)
2 | 161969 (1.3) | 53352 (1.5) | 75297 (1.0) | 48492 (1.6)
3 | 201517 (1.0) | 47479 (1.6) | 64349 (1.2) | 41222 (1.9)
4 | 249802 (0.8) | 33916 (2.3) | 44352 (1.8) | 42721 (1.8)
5 | 256320 (0.9) | 36200 (2.1) | 40966 (1.9) | 50895 (1.5)

Table 4: Preliminary results of the parallel Block Cimmino solver. Times shown in the table are in milliseconds.
References
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 1992.
[2] M. Arioli, I. S. Duff, J. Noailles, and D. Ruiz. A block projection method for sparse matrices. SIAM J. Scientific and Statistical Computing, 13:47-70, 1992.
[3] M. Arioli, I. S. Duff, D. Ruiz, and M. Sadkane. Block Lanczos techniques for accelerating the block Cimmino method. CERFACS TR/PA/92/70, Toulouse, France, 1992.
[4] R. H. Bartels, G. H. Golub, and M. A. Saunders. Numerical techniques in mathematical programming. In Nonlinear Programming, J. B. Rosen, O. L. Mangasarian, and K. Ritter, eds., Academic Press, New York, 1970.
[5] R. Bramley and A. Sameh. Row projection methods for large nonsymmetric linear systems. SIAM J. Scientific and Statistical Computing, 13:168-193, 1992.
[6] S. L. Campbell and C. D. Meyer, Jr. Generalized Inverses of Linear Transformations. Pitman, London, 1979.
[7] G. Cimmino. Calcolo approssimato per le soluzioni dei sistemi di equazioni lineari. Ricerca Sci. II, 9, I, pp. 326-333, 1938.
[8] I. S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse linear systems. ACM Trans. Math. Softw., 9:302-325, 1983.
[9] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection (Release 1). RAL 92-086, Central Computing Department, Atlas Centre, Rutherford Appleton Laboratory, Oxon OX11 0QX, 1992.
[10] G. D. Hachtel. Extended applications of the sparse tableau approach: finite elements and least squares. In Basic Questions of Design Theory, W. R. Spillers, ed., North Holland, Amsterdam, 1974.
[11] L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, London, 1981.
[12] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Nat. Bur. Standards J. Res., 49:409-436, 1952.
[13] D. P. O'Leary. The block conjugate gradient algorithm and related methods. Linear Algebra and its Applications, 29:293-322, 1980.
[14] D. Ruiz. Solution of large sparse unsymmetric linear systems with a block iterative method in a multiprocessor environment. CERFACS TH/PA/92/6, Toulouse, France, 1992.
EFFICIENT VLSI ARCHITECTURE FOR RESIDUE TO BINARY CONVERTER
G. C. CARDARILLI, R. LOJACONO, M. RE, M. SALERNO

Dept. of Electronic Engineering, University of Rome "Tor Vergata", Via della Ricerca Scientifica 1, Rome, Italy
cardarilli@utovrm.it

ABSTRACT. The Residue Number System (RNS) to binary conversion is a critical operation for the implementation of modular processors. The choice of moduli is strictly related to the performance of this converter and affects the processor complexity. In this paper, we present a conversion method based on a class of coprime moduli, defined as (n - 2^k, n + 2^k). The method and the related architecture can be easily extended to a large number of moduli. In this way the magnitude of the modular arithmetics used in the RNS system can be reduced. The proposed method allows the implementation of very fast and low complexity architectures.

KEYWORDS. Parallel architectures, RNS, arithmetic representation conversion, moduli set.

1 INTRODUCTION
The residue number system (RNS) is a very useful technique to improve the speed and arithmetic accuracy of digital signal processing implementations. It is based on the decomposition of a number represented by a large number of bits into reduced-wordlength residual numbers. These residual arithmetic blocks are independent of each other. Consequently, this approach reduces the carry propagation delay, speeding up the overall system. This fact makes RNS an interesting method for low-level parallelization. In addition, modular operations with high computational cost, as for example multiplication, can be sped up by using suitable isomorphisms stored in look-up tables. The main drawback for the use of RNS in high speed DSP is related to the conversion between the internal and the external number
representations. This conversion requires the translation from binary to RNS and vice versa, and uses two different types of converters. The input converter transforms binary numbers into a set of numbers corresponding to the RNS representation. The output converter is used for the inverse conversion: it transforms the numbers from RNS to binary. In general, both converters are critical in terms of speed and complexity, but the second one is more important for the definition of the overall system performance. This second conversion can be performed using different approaches based on two fundamental techniques: the Mixed Radix Notation (MRN) and the Chinese Remainder Theorem (CRT). While the first approach is intrinsically serial, the second one can easily be made parallel, but it requires a large dynamic range in order to represent the intermediate results [1]. Recently, several authors have developed a number of alternative methods to overcome the CRT problems. In particular, in [2] Premkumar proposed a method derived from the CRT for a particular choice of RNS moduli. He considered an RNS system defined by the three different moduli (2n - 1, 2n, 2n + 1). With these moduli he reduced the internal dynamic range of the converter and simplified the final modular operation. Different solutions based on other choices for the moduli set, as for example (2^n - 1, 2^n + 1), were also proposed. In this case, it is possible to use the elementary converter for defining an RNS system composed of a large number of moduli. The disadvantage of this method is the exponential growth of the moduli magnitudes, which makes the arithmetics of the RNS system complex and slow. In this paper, we present a conversion based on a class of coprime moduli, defined as (n - 2^k, n + 2^k). This system can be easily extended to a large number of moduli, limiting the magnitude of the modular arithmetics.
In addition, this method allows the implementation of very fast architectures based on simple operations.

2 BINARY TO RNS CONVERSION
Let us consider an RNS arithmetic based on two moduli m_1 and m_2. For this choice, the number X can be obtained from its residues r_1 and r_2 by using the classical CRT approach

X = < \hat{m}_1 <\hat{m}_1^{-1} r_1>_{m_1} + \hat{m}_2 <\hat{m}_2^{-1} r_2>_{m_2} >_M    (1)

where <X>_M represents the result of the modular operation X modulo M, with

M = m_1 m_2,    \hat{m}_1 = M / m_1,    \hat{m}_2 = M / m_2    (2)

The two quantities \hat{m}_1^{-1} and \hat{m}_2^{-1} are such that <\hat{m}_1 \hat{m}_1^{-1}>_{m_1} = 1 and <\hat{m}_2 \hat{m}_2^{-1}>_{m_2} = 1.
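Equation (1) and the inverse definitions above can be checked numerically as follows (Python sketch; pow(x, -1, m) supplies the modular inverses \hat{m}_1^{-1} and \hat{m}_2^{-1}, and the moduli 5 and 9 are just an example pair of the form n -/+ 2^k):

```python
def crt_two_moduli(r1, r2, m1, m2):
    """Classical CRT reconstruction of X from its residues, equation (1)."""
    M = m1 * m2
    mh1, mh2 = M // m1, M // m2          # m-hat_1, m-hat_2 of (2)
    mh1_inv = pow(mh1, -1, m1)           # <m-hat_1 * m-hat_1^-1>_{m1} = 1
    mh2_inv = pow(mh2, -1, m2)
    return (mh1 * ((mh1_inv * r1) % m1) + mh2 * ((mh2_inv * r2) % m2)) % M

m1, m2 = 5, 9                            # example coprime pair (7 - 2, 7 + 2)
for X in range(m1 * m2):
    assert crt_two_moduli(X % m1, X % m2, m1, m2) == X
```

Note that the intermediate sum before the final reduction modulo M can approach 2M, which is exactly the dynamic-range cost that the method of this paper avoids.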
The application of the CRT shown in [1] to the RNS to binary conversion leads to complex architectures. These architectures require a large wordlength and a complex modular operation mod M. The wordlength is related to the dynamic range of the partial results of equation (1). In our approach these problems are avoided by using a particular form of (1)
and considering a particular choice for the moduli m_1 and m_2. If the left side and the right side of (1) are multiplied by (m_2 - m_1), taking into account equation (2), we obtain

<X (m_2 - m_1)>_M = < m_2^2 \hat{m}_1^{-1} r_1 - m_1^2 \hat{m}_2^{-1} r_2 >_M    (3)

Regarding the definition of \hat{m}_1^{-1}, we can write <\hat{m}_1^{-1} m_2>_{m_1} = 1, i.e. \hat{m}_1^{-1} m_2 = (k_1 m_1 + 1), and consequently we obtain

\hat{m}_1^{-1} = (k_1 m_1 + 1) / m_2    (4)

There are an infinite number of values of k_1 that make the second member of equation (4) an integer number. It can easily be proved that among these values there exists a particular value k_1' such that k_1' m_1 + 1 < m_1 m_2. Using this value, equation (4) can be written without the modular operator, i.e.

<\hat{m}_1^{-1}>_{m_1} = (k_1' m_1 + 1) / m_2    (5)

A similar procedure leads to the expression

<\hat{m}_2^{-1}>_{m_2} = (k_2' m_2 + 1) / m_1    (6)

Substituting the above values in (3), we obtain

<X (m_2 - m_1)>_M = < m_2 (k_1' m_1 + 1) r_1 - m_1 (k_2' m_2 + 1) r_2 >_M    (7)

The modular operation present in equation (7) can be removed by introducing an additional term aM. Finally we obtain

X = ( m_2 r_1 - m_1 r_2 + aM ) / ( m_2 - m_1 )    (8)
If the difference (m_2 - m_1) is a power of two, namely 2^h, equation (8) can be easily evaluated. In fact, in this case the division is reduced to a right shift of h positions. In this work we consider a class of moduli defined as m_1 = n - 2^k, m_2 = n + 2^k, with n an odd number. For this choice, the two moduli (m_2, m_1) are coprime (see Appendix A) and the difference is (m_2 - m_1) = 2^{k+1}. This method can be extended to a larger number of moduli. In particular, for an RNS representation using four moduli, the use of equation (8) requires

(m_1, m_2, m_3, m_4) = (n - 2^k, n + 2^k, m - 2^j, m + 2^j)    (9)

and

m_4 m_3 - m_2 m_1 = 2^t    (10)

where m_1, m_2, m_3 and m_4 are coprime moduli. Equation (10) allows the recursive application of equation (8) to the three pairs (m_1, m_2), (m_3, m_4) and (m_1 m_2, m_3 m_4), as shown in Fig. 1. In order to prove the usefulness of the above procedure, we searched for four moduli according to (9) and (10). In particular, we searched for moduli with k = j, for which equation (10) becomes

m_4 m_3 - m_2 m_1 = m^2 - n^2 = 2^R    (11)
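The search described above can be sketched as follows (Python; the upper bound 200 and the choice k = j = 1 are illustrative assumptions, and the pairwise coprimality that Appendix B proves is simply checked with gcd here):

```python
# Search sketch for four-moduli sets (9) with k = j: odd n < m such that
# m^2 - n^2 is a power of two, as required by (11).
from math import gcd

def pow2(x):
    return x > 0 and (x & (x - 1)) == 0

hits = []
for m in range(3, 200, 2):              # odd m
    for n in range(1, m, 2):            # odd n < m
        if pow2(m * m - n * n):
            k = 1                       # illustrative; any k keeping n - 2^k > 1 works
            mods = (n - 2**k, n + 2**k, m - 2**k, m + 2**k)
            if min(mods) > 1 and all(
                gcd(a, b) == 1
                for i, a in enumerate(mods) for b in mods[i + 1:]
            ):
                hits.append((n, m, mods))

print(hits[:3])
```

The smallest solution found this way is n = 7, m = 9, giving the moduli set (5, 9, 7, 11): four moduli of very similar magnitude, in line with the observation about Tab. 1.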
With such a choice, it can be proved that (m_1, m_2, m_3, m_4) is a set of coprime moduli, as shown in Appendix B. The solutions of this search are shown in Tab. 1. In this table, some solutions corresponding to the wordlengths normally used in DSP applications are presented. It is worth noting that the moduli corresponding to these solutions are very similar in magnitude. This means that the four-moduli RNS is implemented by four arithmetic blocks with similar complexity.

3 HARDWARE IMPLEMENTATION
In order to point out the requirements of a hardware implementation of the proposed method, it is necessary to evaluate the dynamic range of the different terms of (8). With respect to the classical approach, our approach needs a reduced dynamic range.

Fig. 1: Architecture of a four moduli converter.

Fig. 2: Two moduli converter architecture.

Fig. 3: Control logic circuit.
A CASE STUDY IN ALGORITHM-ARCHITECTURE CODESIGN: HARDWARE ACCELERATOR FOR LONG INTEGER ARITHMETIC

C. RIEM, J. KÖNIG, L. THIELE
methods will be treated in some detail. Furthermore, the particular application leads to a concentration on the class of massively parallel architectures with some degree of regularity. An overview of the different mathematical approaches to the design of (piecewise) regular architectures is given in e.g. [23, 24]. Despite many generalizations, the computational model of Karp, Miller and Winograd [9] may still be considered to be the most important contribution. Recently, methods which aim at mechanical, provably correct synthesis are receiving more and more attention, see e.g. [12, 18, 25]. The overall process of mapping consists of a sequence of program transformations that is applied to an initial behavioral specification. The basis of the design process is the methodology adopted in CoMPAR [1]. The main properties with respect to the models and methods used are:

• Algorithms and architectures are described using LIRAN (Linear Recursive Algorithm Notation), a functional subset of UNITY [2] which has been extended to allow for functional hierarchy and type refinements.

• The basic model of algorithms can be described by a set of equations of the form

  \forall I \in I_v :  v[f(I)] = F_v( w[g(I)], ... )    (1)

  The three dots in the list of arguments of the function F_v represent similar arguments. The index set I_v is private to each equation. The index functions f(I) and g(I) are affine.

• The index sets of piecewise linear algorithms are linearly bounded lattices (LBL), see e.g. [21, 22]:

  I = { I : I = A\kappa + b  \wedge  C\kappa \ge d  \wedge  \kappa \in Z^t }

  where A \in Z^{n x t}, b \in Z^n, C \in Z^{m x t} and d \in Z^m. This particular algorithm model is closed under operations like localization of data dependencies, partitioning, multiprojection, scheduling and control generation. This property is not satisfied by many other models. Even after a first level of partitioning or local broadcasting, the representation can be processed further, e.g. for increasing the levels of hierarchy, to perform scheduling and allocation, or to generate control signals.

These methods will be explained through the design of an arithmetic processor CoSEMI, a Coprocessor for SEMInumerical algorithms, which was developed at the University of Saarland to help in computer algebra computations and cryptanalysis, such as RSA cryptography and factorization of large integers [26]. The term 'seminumeric' is borrowed from D. E. Knuth [10] "because they [the algorithms] lie on the borderline between numeric and
symbolic calculation". An overview of computer arithmetic is given in [7]; an in-depth discussion of digit recurrence algorithms for division can be found in [6]. The algorithms used in the coprocessor CoSEMI are based on [15]. The architecture developed using CoMPAR is characterized by
• computation of multiplication, Z-division, modular multiplication and GCD on long integers using a piecewise regular architecture,

• a scalable architecture,

• different nested levels of partitioning for load balancing and hardware matching, see e.g. [20],

• local broadcast for fast on-chip and local off-chip communication, see e.g. [3], and

• local control flow for operation switching, see e.g. [19].

CoSEMI has been fabricated and embedded into a SPARC workstation. The host interface is described in [11].

2 OPERATIONS ON LONG INTEGERS
In order to achieve high performance in digital arithmetic, the design of efficient division algorithms is of central importance. Several systolic approaches have been proposed, combining division with multiplication [26], with multiplication and square root computation [5], or with multiplication and computation of the greatest common divisor [15]. We now concentrate on division algorithms, as division is an essential part of both modular multiplication and computation of the greatest common divisor. On the other hand, multiplication can be treated as a simplified case of division, so that the major issue of integrating the required operations can be satisfied. Moreover, digit recurrence algorithms working in an MSD (Most Significant Digit) first fashion are chosen.

2.1 Division
Division with remainder of N by D is defined as follows: find integers Q and R such that the following equation holds:

N = Q·D + R    (2)

In order to uniquely determine Q and R, a further condition is necessary. This paper concentrates on division by smallest remainder, called Z-division (N-division, with positive remainder, is quite similar):

2·|R| ≤ |D|,  D ≠ 0
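For illustration only, Z-division can be sketched in a few lines of Python (our own sketch, restricted to D > 0; the CoSEMI coprocessor instead uses an MSD-first digit recurrence on long integers):

```python
def z_division(n, d):
    """Z-division: find q, r with n = q*d + r and the smallest absolute
    remainder, i.e. 2*|r| <= d (sketch restricted to d > 0)."""
    q, r = divmod(n, d)          # Python floor division: 0 <= r < d
    if 2 * r > d:                # steer r into the symmetric range
        q, r = q + 1, r - d
    return q, r                  # e.g. z_division(8, 3) -> (3, -1)
```

Note how the result differs from ordinary floored division exactly when the positive remainder exceeds half the divisor.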
Therefore,

T_R < 2 × (C_a + 2) × (log P + log N + 2).    (10)

□
It is important to notice that for a capacity K = P × (n̄ - C_a × (log P + log N + 2)), the response time of the machine is T_R < C_a × (log P + log N + 2), yielding that, for large enough values of P and N, T_R corresponds to log K. Another point to note is that all the proofs above are based on the hypothesis that n_i ≥ 2 + log N, for all i. This allows us to ensure that the balancing phase can be executed.
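The effect of a balancing phase can be illustrated with a toy computation (our own sketch: it only derives per-processor targets and surpluses from the counts n_i, whereas the actual strategy exchanges the data between neighbouring processors):

```python
def balance(loads):
    """Level the per-processor key counts n_i to within one item of the
    global average; returns the target counts and the surplus of each PE."""
    total, p = sum(loads), len(loads)
    base, extra = divmod(total, p)
    target = [base + (i < extra) for i in range(p)]     # first `extra` PEs hold one more
    surplus = [loads[i] - target[i] for i in range(p)]  # >0: must send, <0: must receive
    return target, surplus
```

The surpluses always sum to zero, so every excess item has a destination.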
T. Duboux, A. Ferreira and M. Gastaldo
This restriction is not severe, as it just means that the data structure is not completely empty. However, even if this condition is not satisfied, the balancing strategy can be applied. It has been shown that the resulting structure is not balanced immediately, but becomes more and more balanced in time. A very detailed study of the balancing algorithm when the structure is empty, in the case of SIMD implementations, with in-depth analysis, can be found in [5].

6
CONCLUSION
The main features of the dictionary machine proposed in this paper are its scalability and on-line characteristics. Indeed, such an architecture can be used as an embedded system coupled to a high speed I/O device for real time information processing. It has the same performance characteristics as the ones of [8], while providing a feasible and scalable architecture. In other words, we have a large capacity, a response time corresponding to the logarithm of the capacity, and a constant pipeline interval. A further question concerns the interval between two balancing phases. We proposed to perform a balancing phase at a fixed frequency, which is not necessarily optimal. To improve this, one could study the impact of an adaptive frequency for balancing, which would depend on the amount of data exchanged during the previous balancing phase.

References

[1] S. G. Akl. Design and Analysis of Parallel Algorithms. Prentice-Hall International, Inc., 1989.
[2] F. Dehne and M. Gastaldo. A note on the load balancing problem for coarse grained hypercube dictionary machines. Parallel Computing, 16:75-79, 1990.
[3] F. Dehne and N. Santoro. An optimal VLSI dictionary machine for hypercube architectures. In M. Cosnard, editor, Parallel and Distributed Algorithms, pages 137-144. North Holland, 1989.
[4] T. Duboux, A. Ferreira, and M. Gastaldo. MIMD dictionary machines: from theory to practice. In Bougé et al., editors, Parallel Processing: CONPAR 92 - VAPP V, number 634 in LNCS, pages 545-550. Springer-Verlag, 1992.
[5] M. Gastaldo. Dictionary Machine on SIMD Architectures. Technical Report RR 93-19, LIP ENS-Lyon, July 1993. Submitted for publication.
[6] H.F. Li and D.K. Probst. Optimal VLSI dictionary machines without compress instructions. IEEE Trans. on Computers, 39:332-340, 1990.
[7] A.R. Omondi and J. D. Brock. Implementing a dictionary on hypercube machines. In Int. Conf. on Parallel Processing, pages 707-709, 1987.
[8] T. Ottmann, A. Rosenberg, and L. Stockmeyer. A dictionary machine (for VLSI). IEEE Trans. on Computers, C-31(9):892-897, Sep. 1982.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
SYSTOLIC IMPLEMENTATION OF SMITH AND WATERMAN ALGORITHM ON A SIMD COPROCESSOR
D. ARCHAMBAUD, I. SARAIVA SILVA, J. PENNÉ
MASI Laboratory, University Paris 6, 4 place Jussieu, 75252 Paris Cedex 05, France
e-mail: {archambaud, dasilva}@masi.ibp.fr

ABSTRACT. We present a parallel algorithm that performs genetic sequence comparison on a systolic architecture. The Smith and Waterman algorithm provides an alignment between two sequences, and calculates accurate similarities between all possible subsequences; however, it requires huge amounts of calculation. Software tools used so far to compare genetic sequences are approximations of the Smith and Waterman algorithm, running reasonably fast but with reduced accuracy. We aim at implementing the exact algorithm in our architecture, to provide a fast and accurate sequence comparison tool. We describe the implementation of the main dynamic programming formula (easily parallelizable since the data dependence is topologically limited) and also the way to manage the gap penalties and the substitution matrix weights. The coprocessor board we are developing is a paginated set-associative memory with one-dimensional systolic capabilities. It is intended to be connected to a host machine via a standard PC bus. The board is made of multiple VLSI circuits, each one containing some processing elements. The I/O throughput between the host and the systolic row does not limit the board speed with regard to this application. KEYWORDS. Similarity, Smith and Waterman, Proteins, DNA sequences, VLSI implementation, SIMD coprocessor, Systolic architecture.
1
INTRODUCTION
Genome alignment is an important task in molecular biology. It consists in comparing every substring of two strings associated to proteins or DNA sequences. A basic algorithm has been described by Smith and Waterman [18] and has undergone several variations [1],[8]. The genome database is already huge and is still increasing. A prohibitive computing time has been reached on standard workstations: up to 400 days of CPU time are necessary for the
D. Archambaud, I. Saraiva Silva and J. Penné
exhaustive matching of an entire protein sequence database [7]. The time complexity of such an application is O(M²N), M and N being the sequence lengths. Gotoh reduced this complexity to O(MN) under some limitations [8]. A systolic implementation runs in time O(M+N). There are basically two hardware approaches to increase the processing power for a given problem. One can either design or use a dedicated computer with its own operating system and application software [5],[4],[10], or add an accelerator to an already existing machine [14]. The second solution has many advantages: it is a fast solution, since the device to design is reduced to a specialized coprocessor with an interface to the host computer; it is a cheap solution, since only one board is necessary; and it is a standard solution, as the accelerator can adapt to different architectures. The SIMD organization (single instruction stream, multiple data stream) is a low-cost way to implement a massively parallel architecture. It only requires a unique sequencer for all the execution units. Our accelerator board is composed of a control circuit (containing the sequencer and the interface with the host computer) and several cascadable execution circuits containing the processing elements [2]. We first present the Smith and Waterman algorithm and how it can be adapted to a linear systolic architecture. We then briefly describe our own system, the Rapid2 coprocessor, and show the way we implement the algorithm. As a conclusion, some performance evaluations are given. We wrote a simulation program which exactly simulates what is executed in the coprocessor at each clock cycle, and measures the time elapsed in each data transfer between the host and the board. Such a program helped us to make architecture choices and to write optimized and valid implementations of the algorithm.

2
SMITH AND WATERMAN ALGORITHM
The use of similarity measurements to compare two proteins or DNA sequences was first proposed by Needleman and Wunsch [15]. Mathematically the problem was presented as a two-dimensional array, where the value of each element H_{i,j} is a function of the comparison between the i-th and j-th amino acids of the two sequences. Pathways of non-zero values through the array correspond to the possible alignments of the two proteins. The basic problem of the comparison of biological sequences is searching for the alignment with the highest similarity. An alignment is a subset of the cartesian product of two sequences. Similarity measurements are based on two weight functions. The first, s(a_i, b_j), measures the weight of a substitution between a_i and b_j. The other function, w_k ≥ 0, is the weight of a gap (insertions or deletions) of length k. Gaps take place when one or more amino acids of one of the sequences are not present in the subset of the cartesian product A(a, b). The similarity of an alignment between two sequences "a" and "b" is the sum of the weights of all the substitutions minus the sum of the weights of all the gaps. Smith and Waterman [18] presented a new algorithm to compare biological sequences. Their approach gives a way to find the pair of segments of two sequences with the maximum similarity. A segment is defined as being a subsequence of a biological sequence; for instance, the subsequence a_i a_{i+1} ... a_j is one segment of length (j - i + 1) of a = a_1 a_2 ... a_n, where
Systolic Implementation of Smith and Waterman Algorithm
1 ≤ i ≤ j ≤ n. Let H_{i,j} be the maximum similarity of two segments that end in a_i and b_j, or zero when the calculated value for the similarity is negative. In its most general form, the similarity measurement in the Smith and Waterman algorithm is defined as:

H_{i,j} = max { H_{i-1,j-1} + s(a_i, b_j),  max_{k≥1} (H_{i-k,j} - w_k),  max_{l≥1} (H_{i,j-l} - w_l),  0 }

R_{j+1} = (R_j >> 1) AND mask[t_{j+1}]. An example is given in table 1, where the text is aabaacaabac and the pattern is aabac.

4.2
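With a linear gap weight w_k = k·g, the inner maxima over k and l reduce to single steps and the recurrence can be sketched sequentially (our own illustration, not the systolic implementation; the score function s and the gap penalty g are parameters of the sketch):

```python
def smith_waterman(a, b, s, g):
    """Best local alignment score under the simplified recurrence
    H[i][j] = max(H[i-1][j-1] + s(a_i, b_j), H[i-1][j] - g, H[i][j-1] - g, 0)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            H[i][j] = max(H[i-1][j-1] + s(a[i-1], b[j-1]),  # substitution / match
                          H[i-1][j] - g,                     # gap in b
                          H[i][j-1] - g,                     # gap in a
                          0)                                  # restart the local alignment
            best = max(best, H[i][j])
    return best
```

The zero term is what makes the alignment local: a negative-scoring prefix is simply abandoned.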
Approximate Matching
Wu and Manber show that the previous algorithm supports various kinds of errors (insertions, deletions or substitutions) when a fixed set of registers is maintained per error level. Actually, d additional arrays R^1, R^2, ..., R^d must be maintained to store all possible matches with up to d errors. To determine the transition from R^d_j to R^d_{j+1}, there are four possibilities for obtaining a match of the first i characters with ≤ d errors up to t_{j+1}:

- There is a match of the first i - 1 characters with ≤ d errors up to t_j and t_{j+1} = p_i. This case corresponds to matching t_{j+1}.
- There is a match of the first i - 1 characters with ≤ d - 1 errors up to t_j. This case corresponds to substituting t_{j+1}.
- There is a match of the first i - 1 characters with ≤ d - 1 errors up to t_{j+1}. This case corresponds to deleting p_i.
J. Champeau, L. Le Pape and B. Pottier
- There is a match of the first i characters with ≤ d - 1 errors up to t_j. This case corresponds to inserting t_{j+1}.

q2 = div(j + 1, 2)        if  -j + 2·q2 - 1 ≥ 0
q3 = div(2·q1 + 2, 3)     if  -i + 3·q3 - 3 ≥ 0
q4 = div(q3, 2)
a1(3·q3 - 3, 2·q2 - 1)    □

The example shows that div functions may be nested. The div functions and inequalities of the example form part of the specification of the iteration domain of the dependency for variable a1. Our goal is to describe iteration domains of variables as linearly bounded lattices. The method is based on the so-called Hermite normal decomposition [10]. Other approaches can be found in [9], [11]. To find the lattice defined by the integer divisions and inequalities involving the q's, we start by writing the div's as equations by setting the remainders r to zero.
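The four Wu-Manber cases described above (match, substitution, deletion, insertion) can be combined into a bit-parallel sketch (our own Python illustration; we shift left, so bit i-1 of R[k] set means the first i pattern characters match with ≤ k errors, whereas the formulation in the text shifts the other way):

```python
def agrep(text, pattern, d):
    """Report end positions of matches of `pattern` in `text` with up to
    d insertions, deletions or substitutions (one register per error level)."""
    m = len(pattern)
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    R, hits = [0] * (d + 1), []
    for j, t in enumerate(text):
        old = R[:]                                    # values "up to t_j"
        R[0] = ((old[0] << 1) | 1) & mask.get(t, 0)
        for k in range(1, d + 1):
            match = ((old[k] << 1) | 1) & mask.get(t, 0)  # t_{j+1} = p_i
            subst = (old[k - 1] << 1) | 1                 # substitute t_{j+1}
            delet = (R[k - 1] << 1) | 1                   # delete p_i
            insrt = old[k - 1]                            # insert t_{j+1}
            R[k] = match | subst | delet | insrt
        if R[d] & (1 << (m - 1)):
            hits.append(j)                            # a match ends at position j
    return hits
```

On the running example (text aabaacaabac, pattern aabac), the exact search reports the occurrence ending at the last character, and allowing one error also reports the near-match ending at position 4.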
P. C. Held and A. C.J. Kienhuis
Let N be a matrix whose rows are the normals of these equations. Let Q = (q1, ..., qm) be the vector of variables of the m divisions and let I be the vector of the iterators. We write the system of equations defined by the div's as:

N · (I, Q)^t = 0
Example 6.2 With I = (i, j)^t and Q = (q1, q2, q3, q4)^t, the system of equations of example 6.1 is:

i - 2·q1 = 0
j - 2·q2 = 0
2·q1 - 3·q3 = 0
q3 - 2·q4 = 0

Thus matrix N is

N = ( 1  0 -2  0  0  0 )
    ( 0  1  0 -2  0  0 )
    ( 0  0  2  0 -3  0 )
    ( 0  0  0  0  1 -2 )
We assume that the system has a solution. Otherwise, we would have removed this piece of code from the program by dead code elimination procedures. The system has m equations in n + m variables. Because each row k introduces variable q_k, it follows that the rows of N are independent. The nullspace of the system is thus n-dimensional, equal to the dimension of the iteration space. We will call the variables corresponding to the nullspace the free variables of the system. To find the solution, we use the Hermite normal decomposition [10]. This procedure gives us two matrices C1 and C2 such that:
N [C1 C2] = [H 0]

in which matrix H is called the Hermite normal form of N. Matrix H has an inverse because the rows of N are linearly independent. Observe that matrix C2 consists of the n nullspace vectors of N, as N·C2 = 0. So any linear combination of the vectors of C2 added to a given solution s will also be a solution of the system. Because we are only interested in the values of I, we decompose matrix C1 into C11 and C12, and decompose matrix C2 into matrices C21 and C22, where C11 and C21 consist of the rows corresponding to I. Now, the columns of matrix C21 are the lattice vectors. So the Hermite normal form directly gives us the lattice matrix L defined by the divs.
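The decomposition step can be sketched in pure Python (our own illustration, not the authors' tool: integer column operations triangularize N into N·U = [H 0], so the trailing columns of the unimodular U are the nullspace, i.e. lattice, vectors; normalization of the off-diagonal entries of H is omitted):

```python
def hermite_decomposition(N):
    """Find a unimodular U with N*U = [H | 0], H lower triangular;
    N is a list of m linearly independent integer rows of length n."""
    m, n = len(N), len(N[0])
    A = [row[:] for row in N]
    U = [[int(i == j) for j in range(n)] for i in range(n)]

    def colop(j, k, q):                    # col_j -= q * col_k, in A and U
        for row in A: row[j] -= q * row[k]
        for row in U: row[j] -= q * row[k]

    def swap(j, k):
        for row in A: row[j], row[k] = row[k], row[j]
        for row in U: row[j], row[k] = row[k], row[j]

    for r in range(m):
        while True:                        # gcd reduction along row r
            nz = [j for j in range(r, n) if A[r][j] != 0]
            if len(nz) <= 1:
                break
            j0 = min(nz, key=lambda j: abs(A[r][j]))
            for j in nz:
                if j != j0:
                    colop(j, j0, A[r][j] // A[r][j0])
        p = [j for j in range(r, n) if A[r][j] != 0][0]
        if p != r:
            swap(r, p)
        if A[r][r] < 0:                    # make the pivot positive
            for row in A: row[r] = -row[r]
            for row in U: row[r] = -row[r]
    return [row[:m] for row in A], U
```

Running it on the matrix N of example 6.2 yields two zero trailing columns in N·U; those columns of U are integer nullspace vectors and give the lattice.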
DIV, FLOOR, CEIL, MOD and STEP Functions
Example 6.3 Hermite normal decomposition of matrix N gives matrices C1 and C2, with

C2 = ( 6 0 )
     ( 0 2 )
     ( 3 0 )
     ( 0 1 )
     ( 2 0 )
     ( 1 0 )

The iterators i and j are defined by matrix C21, the first two rows of C2. Let Kf be the vector of free variables. We write I = (i, j)^t, with offset O still to be determined, as:

I = ( 6 0 ) Kf + O
    ( 0 2 )

7
LATTICE OFFSET
Next we have to find the lattice offsets. Let B = (b1, ..., bm)^t be the vector of the divisors of the integer divisions, with remainders r_k satisfying 0 ≤ r_k < b_k. An offset O must first of all be an integral solution of the system:
0 ≤ N (O, Q)^t ≤ B    (9)
Apart from these inequalities, there may be others in the program that restrict the values of the variables standing for the integer divisions. We disregard inequalities not involving Q, as they do not affect the lattice offset. Let ⟨Nq, Bq⟩ be the system of all inequalities involving Q. We assume that Nq·C2 = 0. When this assumption is satisfied, we may use the vectors of C21 as lattice vectors, because the variables corresponding to C2 are free. Let Kb be the vector of variables corresponding to matrix C1 and let Kf be the vector of variables corresponding to matrix C2. We define (O, Q)^t as

(O, Q)^t = C1·Kb    (10)
and substitute it in the polytope:

Nq·(O, Q)^t ≥ Bq    (11)

after which we obtain the polytope:

Nq·C1·Kb ≥ Bq    (12)
This polytope defines all offsets O = C11·Kb of the lattice and we call it the lattice offset domain. The number of offsets depends on the values of the divisors b. The lattices corresponding to the polytope in q are defined by:

I = C21·Kf + O    (13)
O = C11·Kb    (14)
Nq·C1·Kb ≥ Bq    (15)
The lattices are bounded by the remaining inequalities of the nested loop program. These inequalities together with a lattice define an iteration domain. A special case is when the offset domain contains a single point. Then the lattice description reduces to I = C21·Kf + O, and we do not have to enumerate the lattice offset domain.

Example 7.1 In example 6.1 there are three if-statements defining inequalities in q:

-i + 2·q1 - 1 ≥ 0    (16)
-j + 2·q2 - 1 ≥ 0    (17)
-i + 3·q3 - 3 ≥ 0    (18)

After the substitution I = C11·Kb and Q = C12·Kb, we get inequalities in the variables of Kb:

-k1 ≥ 1
k2 ≥ 1
-k1 - k3 ≤ 3    (19)

By the same substitution we get for the inequalities of the remainders: 0 ≤ ...

ĝ > 0, consider: π·z = 0 a hyperplane containing the domain D such that π is a vector in the space (lin(D) + ⟨d⟩) ∩ lin(D)⊥ (where ⟨d⟩ denotes the space spanned by d and lin(D)⊥ the space orthogonal to lin(D)); η = π·d and D1 = {z + l·h | z ∈ D, 0 ≤ l ≤ ĝ}; ... for each point z (with g(z) > 0) in D, the length of the routing path corresponding to z is equal to g(z).

4.2
Second Case: d ∈ lin(D) and dim(D) < n
The second uniformisation technique is given in the following theorem:

Theorem 4.2 Let us consider an atomic integral recurrence equation z ∈ D : U(z) = f(V(I(z))), where D ⊆ Z^n and I(z) = z + g(z)·d. If d ∈ lin(D) and dim(D) < n, there exists a system of conditional uniform recurrence equations in Z^n which is equivalent, over D, to the equation. □

Note that when the condition dim(D) < n of Theorem 4.2 is not satisfied, the equation needs to be reindexed in Z^{n+1} before the theorem may be applied. The resulting system of equations is then (n + 1)-dimensional. In this case, the system of equations is defined as follows. Let gmax = max_{z∈D} g(z) and ĝ = ⌊gmax/2⌋. If ĝ > 0, consider: π·z = 0 a hyperplane containing the domain D, such that π is in lin(D)⊥; the routing directions d̄ = d + π and d̂ = d - π; η = π·d; D1 = {z + l·d̂ | z ∈ D, 0 ≤ l ≤ ĝ} and D2 = {z + l1·d + l2·d̄ | z ∈ D1, 0 ≤ l1 ≤ 1, 0 ≤ l2 ≤ ĝ} ∩ {z | π·z ≥ 0}. The system of equations is:
z ∈ D :  U(z) = f(R1(z))
z ∈ D1, α(z) > γ(z) :  R1(z) = ∞
z ∈ D1, α(z) < γ(z) :  R1(z) = R1(z + d̂)
z ∈ D1, α(z) = γ(z), β(z) = 1 :  R1(z) = R2(z + d)
z ∈ D1, α(z) = γ(z), β(z) = 0 :  R1(z) = R2(z)
z ∈ D2, π·z > 0 :  R2(z) = R2(z + d̄)
z ∈ D2, π·z = 0 :  R2(z) = V(z)
z ∈ D1, π·z < ηĝ + θ :  α(z) = α(z + d̄)
z ∈ D1, π·z = ηĝ + θ :  α(z) = ⌊g(z - ĝ·d̄)/2⌋
z ∈ D1, π·z < ηĝ + θ :  β(z) = β(z + d̄)
z ∈ D1, π·z = ηĝ + θ :  β(z) = g(z - ĝ·d̄) mod 2
z ∈ D1, π·z < ηĝ + θ :  γ(z) = γ(z + d̄) - 1

We denote ⌊x⌋ the floor function returning the greatest integer less than or equal to x, and x mod c the modulo function returning the remainder of the integral division of x by c.
Reducible Integral Recurrence Equations
Fig. 5. Uniformisation (d ∈ lin(D), ĝ > 0)
z ∈ D1, π·z = ηĝ + θ :  γ(z) = ĝ. In this case, the routing paths have the shape of a "rooftop" over the routing domains D1 and D2, as illustrated in Fig. 5 (in a 2-dimensional space). Two routing variables, R2 and R1, are needed to pipeline the values of V according to the two directions d̄ (the "ascending" part of the paths) and d̂ (the "descending" part of the paths), respectively. A single displacement at the "top" of the path is required when the length of the path is odd, and is flagged by the control variable β. Variables α and γ have a similar function as in the previous case. When α(z) = γ(z), a change of direction for the routing path is required and the value carried by R2 is transferred to R1. The latter is then pipelined to the original domain D (subdomain of D1), where it is used for the computation of U.
5
THE KNAPSACK PROBLEM
As an example of our techniques, we consider the knapsack problem ([4, 3]), a classic combinatorial optimisation problem, which consists of determining the optimal (i.e., the most valuable) selection of objects of given weight and value to carry in a knapsack of finite weight capacity. If c is a nonnegative integer denoting the capacity of the knapsack, n the number of object types available, and wk and vk, respectively, the weight and value of an object of type k, for 1 ≤ k ≤ n, with wk > 0 and integral, a dynamic programming formulation of the problem is represented by the following system of recurrences:

(k, y) ∈ D1 :  F(k, y) = 0
(k, y) ∈ D2 :  F(k, y) = 0
(k, y) ∈ D3 :  F(k, y) = F(k - 1, y)
(k, y) ∈ D4 :  F(k, y) = f(F(k - 1, y), F(k, y - wk), V(k, y))
(k, y) ∈ D2 :  V(k, y) = vk
(k, y) ∈ D4 :  V(k, y) = V(k, y - 1),
9 This is one of the several variants of the knapsack problem. A complete presentation together with a number of applications is given in [4].
10 The system of recurrences corresponds to the so-called forward phase of the knapsack problem, in which the optimal carried value is computed. The corresponding combination of objects can be determined from the optimal solution in a second phase, known as the backward phase of the algorithm, which is essentially sequential and mainly consists of a backward substitution process on suboptimal values of F. An efficient algorithm for the backward phase is given in [3].
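The forward phase reads off directly as a sequential program (our own sketch of the dynamic programming recurrence, not the uniformised array; note that F(k, y - w_k) reuses row k, so each object type may be picked repeatedly):

```python
def knapsack(c, items):
    """Forward phase: F(k, y) = f(F(k-1, y), F(k, y - w_k), v_k)
    with f(a, b, v) = max(a, b + v); items = [(w_k, v_k), ...]."""
    f = lambda a, b, v: max(a, b + v)
    n = len(items)
    F = [[0] * (c + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        w, v = items[k - 1]
        for y in range(1, c + 1):
            if y < w:
                F[k][y] = F[k - 1][y]                    # object k does not fit (D3)
            else:
                F[k][y] = f(F[k - 1][y], F[k][y - w], v)  # domain D4
    return F[n][c]
```

For capacity 7 and object types of weight/value (2, 3) and (3, 4), the optimum packs two objects of the first type and one of the second, for a value of 10.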
L. Rapanotti and G.M. Megson
Fig. 6. The Knapsack Problem

with f defined as f(a, b, c) = max(a, b + c), for all a, b, c, and equation domains:

D1 = {(k, y) | k = 0, 1 ≤ y ≤ c}
D2 = {(k, y) | 1 ≤ k ≤ n, y = 0}

(k, y, z) ∈ D4,2, z > 0 :  R2(k, y, z) = R2(k, y - 1, z - 1)
(k, y, z) ∈ D4,2, z = 0 :  R2(k, y, z) = F(k, y, z)
(k, y, z) ∈ D4,1, z < ĝ :  α(k, y, z) = α(k, y - 1, z + 1)
(k, y, z) ∈ D4,1, z = ĝ :  α(k, y, z) = ⌊g(k, y + ĝ, z - ĝ)/2⌋
Fig. 7. Uniformisation of the Atomic Integral Data Dependence

(k, y, z) ∈ D4,1, z < ĝ :  β(k, y, z) = β(k, y - 1, z + 1)
(k, y, z) ∈ D4,1, z = ĝ :  β(k, y, z) = g(k, y + ĝ, z - ĝ) mod 2
(k, y, z) ∈ D4,1, z < ĝ :  γ(k, y, z) = γ(k, y - 1, z + 1) - 1
(k, y, z) ∈ D4,1, z = ĝ :  γ(k, y, z) = ĝ,
where ĝ = ⌊wmax/2⌋ and wmax is the maximum weight of the objects. The new domains are defined as:

D′1 = {(k, y, z) | k = 0, 1 ≤ y ≤ c, z = 0}
D′2 = {(k, y, z) | 1 ≤ ...

P = 1 -> pre( if Flag
              then if N < P * P then Undef_1 else 2*P
              else Undef_2 );
Transformation rules associated with the undefined signals Undef_x allow us, by giving them a value, to simplify the above equation:

P = 1 -> pre( 2*P );
Transformation of Synchronous Descriptions
(State-transition table: starting state P = 1, Flag = true; the transitions give the next values of P, B, H, D and Flag, valid on the current state, with condition D > 1 and E = D div 2.)

Ready  = (not Flag) and (D2 = 1);
Result = BD;
NltM2  = N < M2;
P2 = 1 -> pre( 4 * P2 );
M2 = B2 + BD + E2;
E2 = D2 div 4;
B2 = 0 -> pre( if Flag then 0 else if NltM2 then B2 else M2 );
BE = BD div 2;
BD = 0 -> pre( if Flag then 0 else if NltM2 then BE + E2 else BE );
D2 = 0 -> pre( if Flag then P2 else E2 );
tel

Figure 10: A LUSTRE program extracting the integer square root, equivalent to the program of figure 9 after transformations.
- to factorize common expressions as much as possible.

Finally the program of figure 10 is obtained. There are no longer multiplications, except by 4, but that could be done using shifts. This final implementation keeps the sizeable parallelism already existing in the specification. It can be shown that the computation time (identical to that of the specification) is proportional to O(log2 N).

3
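The underlying computation, an MSD-first digit recurrence for the integer square root using only shifts, additions and comparisons, can be sketched sequentially (our own illustration of the classical binary algorithm, not a transcription of the LUSTRE program above):

```python
def isqrt_bin(n):
    """Integer square root by binary digit recurrence: O(log n) steps,
    each using only a comparison, a subtraction and shifts."""
    root, bit = 0, 1
    while bit * 4 <= n:            # highest power of 4 not exceeding n
        bit *= 4
    while bit:
        if n >= root + bit:        # can the current result bit be set?
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root
```

Each iteration decides one bit of the result, which is why the running time is logarithmic in N.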
CONCLUSION
The use of program transformations in hardware design is well-known ([12, 13, 15, 19, 20] for example). The contribution of this paper to the domain is twofold. It shows that:

- the LUSTRE language is well-suited to support this approach;
- as suggested by our three examples, the transformational approach applies to quite different domains of hardware synthesis; the last example (an arithmetic architecture computing integer square roots) shows that we can reuse and formalize results from the software transformation field.

Opposite opinions are often encountered concerning the proposed approach. Some people like the interactive, exploratory feature and the capability left to the designer in front of the tool; others accept only fully automated tools. Some particular problems of retiming are in fact automatable [14]. The example of the filter (section 2.1) probably could be automated, since the strategy is quite simple and guiding. The example of mini-CPU pipelining probably could not, since it is necessary to know the characteristic equation of the memory in order to carry out the work. The example of the square root certainly could not, since it exploits the creativity of the designer.
In the example of the mini-CPU, the synthesis leads to a solution not known in advance. The designer tries to introduce the idea of buffering the last written elements, and corrective parts are brought to him without additional cost. On the contrary, in the example of the square root, the general idea is rather the verification of a sequence of preexisting transformations found in the literature. The rewriting system implemented by TRANSE can be seen as forward-chaining deduction, which was found, considering this last example, to be sometimes rudimentary. This example in fact underlines the present limits of the system. There is thus an oscillation between verification and genuine synthesis (invention). It would be interesting to incorporate into TRANSE the capabilities of deductive systems proceeding by demonstration of subgoals, for example LARCH [7], already used for architecture verification [20]. This study incidentally showed the great quality of LUSTRE, a purely functional, synchronous, data flow, equational language, for specifying VLSI architectures and formally working on these specifications. One could criticize some shortcomings at the level of expressing generic components. On the other hand, some possibilities of the language were not exploited here, in particular all the features concerning clocks, which allow the description of subsets working with separate clocks. This project is now joining a larger research project, named ASAR, grouping together six French teams and oriented towards architectural and system synthesis, the goal of which is to build a generic multi-formalism framework for architectural synthesis [2].

Acknowledgments This work has been supported by Etude DRET 346800 and French CNRS/PRC ANM. The exposition of this paper has greatly benefitted from the comments of referees and members of the conference committee.

References
[1] J. Arsac, Préceptes pour programmer, DUNOD, 1991.
[2] P. Asar, Towards a multiformalism framework for architectural synthesis: the ASAR project, in Proc. of the Codes/CASH'94 Conf., Grenoble, France, Sept. 1994.
[3] G. Berry, Esterel on Hardware, in Mechanised Reasoning and Hardware Design, Prentice Hall, 1992.
[4] R.T. Boute, Systems semantics: principles, applications, and implementation, ACM Trans. Prog. Lang. Syst., 10 (1988), pp. 118-155.
[5] Bull Systems Products, Proving "Zero Defect" with VFormal, 1993.
[6] D. Clément, Gipe: Generation of interactive programming environments, TSI, 9 (1990).
[7] S. J. Garland, J. V. Guttag, A Guide to LP, The Larch Prover, tech. rep., MIT Laboratory for Computer Science, Dec. 1991.

G. Durrieu and M. Lemaitre

[8] A. Gupta, Formal hardware verification methods: a survey, Tech. Rep. CMU-CS-91-193, Carnegie Mellon University, Pittsburgh, PA 15213, Oct. 1991.
[9] N. Halbwachs, P. Caspi, P. Raymond, D. Pilaud, The synchronous dataflow programming language LUSTRE, Proceedings of the IEEE, 79 (1991), pp. 1305-1320.
[10] P. N. Hilfinger, Silage, a high-level language and silicon compiler for digital signal processing, in IEEE Custom Integrated Circuits Conference, Portland, Oregon, May 1985, pp. 213-216.
[11] W. A. Hunt, B. C. Brock, A Formal HDL and its use in the FM9001 verification, in Mechanised Reasoning and Hardware Design, Prentice Hall, 1992.
[12] S. D. Johnson, Synthesis of digital designs from recursion equations, The MIT Press, Cambridge, Massachusetts, 1983.
[13] G. Jones, M. Sheeran, Designing arithmetic circuits by refinement in Ruby, Science of Computer Programming, 22 (1994), pp. 107-135.
[14] C. E. Leiserson, J. B. Saxe, Retiming synchronous circuitry, Algorithmica, 6 (1991), pp. 5-35.
[15] J. D. Man, J. Vanslembrouck, Transformational design of digital circuits, in EUROMICRO, Cologne, 1989.
[16] C. Mauras, Alpha: un langage équationnel pour la conception et la programmation d'architectures parallèles synchrones, thèse de doctorat, Université de Rennes I, décembre 1989.
[17] O. Coudert, J. C. Madre, A Unified Framework for the Formal Verification of Sequential Circuits, in Proc. of IEEE International Conference on Computer Aided Design'90, Santa Clara, California, Jan. 1990.
[18] K. K. Parhi, D. G. Messerschmitt, Pipeline Interleaving and Parallelism in Recursive Digital Filters - Part I, IEEE Trans. on Acoustics, Speech and Signal Processing, 37 (1989), pp. 1099-1117.
[19] J. G. Samson, L. J. M. Claesen, H. J. de Man, Correctness Transformations on the Hough Algorithm, in CompEuro 92, The Hague, May 1992.
[20] J. B. Saxe, J. J. Horning, J. V. Guttag, S. Garland, Using transformations and verification in circuit design, Formal Methods in System Design, 3 (1993), pp. 181-210.
[21] V. Stavridou, Formal Methods and VLSI Engineering Practice, The Computer Journal, 37 (1994), pp. 96-113.
[22] V. Stavridou, T. F. Melham, R. T. Boute, eds., Theorem Provers in Circuit Design, IFIP Transactions A: Computer Science and Technology, North-Holland, 1992.
[23] G. Thuau, B. Berkane, A unified framework for describing and verifying hardware synchronous sequential systems, Formal Methods in System Design, 2 (1993), pp. 259-276.
[24] Verilog, AGE/SAGA Langage LUSTRE, manuel de référence, tech. rep., VERILOG, 38330 Montbonnot, France, January 1994.
HEURISTICS FOR EVALUATION OF ARRAY EXPRESSIONS ON STATE OF THE ART MASSIVELY PARALLEL MACHINES
V. BOUCHITTÉ, P. BOULET, A. DARTE, Y. ROBERT 1

Laboratoire LIP, CNRS (U.R.A. no 1398)
École Normale Supérieure de Lyon
69364 LYON Cedex 07
vbouchit,pboulet,darte,yrobert@lip.ens-lyon.fr
ABSTRACT. This paper deals with the problem of evaluating High Performance Fortran style array expressions on massively parallel distributed-memory computers (DMPCs). This problem has been addressed by Chatterjee et al. under the strict hypothesis that computations and communications cannot overlap. As such a model appears to be unnecessarily restrictive for modeling state-of-the-art DMPCs, we relax the restriction and allow for simultaneous computations and communications. This simple modification has a tremendous effect on the complexity of the optimal evaluation of array expressions. We present here some heuristics, which we are able to guarantee in some very important cases in practice, namely for coarse-grain or fine-grain computations. KEYWORDS. Parallelism, array expressions, distributed memory, communications overlap.
1
INTRODUCTION
We focus in this paper on the evaluation of HPF (High Performance Fortran) style array expressions on massively parallel distributed-memory computers (DMPCs). The difficulty of such an evaluation is to choose the evaluation order and the placement of the intermediate results. Chatterjee et al. [9, 3, 4] have addressed this problem under the strict hypothesis that computations and communications cannot overlap. However, state-of-the-art DMPCs can overlap computations and communications, so we relax this restriction. This

1 Supported by the Project C3 of the French Council for Research CNRS, and by the ESPRIT Basic Research Action 6632 "NANA2" of the European Economic Community.
V. Bouchitté et al.
Figure 1: a simple expression tree

simple modification has a tremendous effect on the complexity of the optimal evaluation of array expressions.

1.1
The Rules of the Game
In this section we set the basic definitions and ground rules for our work, and illustrate them on a simple example. Our problem is to evaluate a binary expression T. We assume that T is given as a binary tree: commutative or associative rearrangement is not allowed, and there are no common subexpressions. Also, without loss of generality, we assume that there are no unary operators. Hence T can be captured as a locally complete binary tree: all internal nodes have indegree 2. All nodes but the root have outdegree 1. See in figure 1 the tree corresponding to the expression T = f0(f1(f2(A, B), f3(C, D)), f4(E, F)). For a moment we forget about distributed arrays and HPF expressions, i.e. we do not relate the nodes of the tree with data items. Rather, we give an abstract interpretation of our problem as follows: leaves represent locations, while internal nodes represent computations (we also say that such nodes are evaluated). An internal node can be computed if and only if both sons have been evaluated (leaves require no evaluation) and both sons share the same location. In the previous example, we have six leaves (nodes A, B, C, D, E and F) and five internal nodes fi, 0 ≤ i ≤ 4. If both sons of an internal node are not located at the same place, at least one communication is needed. For instance, to evaluate node f2, we can decide to transfer the data stored at location A to location B and evaluate node f2 at location B, or vice-versa. But we can also decide to have two communications, for instance from location A to location C and from location B to location C; then node f2 will be evaluated at location C: this could enable the computation of node f1 without communication at location C, provided that node f3 has also been evaluated at location C owing to a communication from leaf D to leaf C. What are the rules of the game for the abstract problem?
We have to evaluate the expression T as fast as possible, while respecting the partial order induced by the tree. Communications and computations can occur in parallel. More precisely:
Heuristics for Evaluation of Array Expressions
Table 1: a simple execution

We assume that one communication and one independent computation can take place simultaneously. We suppose that communications are sequential among themselves, and that computations are also sequential among themselves. In some sense, it is like a two-processor scheduling problem, but with one machine devoted to computations and the other to communications. See Table 1 for an example. Here we assume that all communications cost 3 units of time, while computation costs are listed in the table. Of course there is a priori no reason for the communication costs to be the same, but this turns out to be an important case in practice (see Section 1.2). We stress some characteristics of the problem:
- The location where the final result should be available is not specified in the problem. If T is an assignment involving a binary expression, we may be imposed the location of the result. This does not change the problem much, as one can always add one communication at the end.
- As for the model of evaluation, we assume two things:
- At most one computation can occur at a given time-step. This is a natural consequence of the original formulation of the problem in terms of Fortran 90 array expressions (see Section 1.2).

- At most one communication can occur at a given time-step. This hypothesis is more questionable, since it comes from modeling actual DMPCs.
- The same location can appear several times in the expression. This is not the case in our example, but we could replace leaf E by a second occurrence of, say, A. Then the expression can no longer be captured with a tree: instead we use a DAG (Directed Acyclic Graph).

1.2 HPF Array Expressions
The original motivation for the above problem comes from the evaluation of HPF array expressions. Consider again the expression
T = f0(f1(f2(A,B), f3(C,D)), f4(E,F)).
V. Bouchitté et al.
Assume that we have a 2D-torus of processors of size P × P (such as the Intel Paragon), and arrays a, b, c, d, e, f and res of size N by N. Consider the following loop nest:

for i = 3 to N-2 do
  for j = 7 to N-5 do
    res(i,j) = ((a(i-2,j+3) + b(i,j-6)) × (c(i-1,j+3) + d(i+2,j+5))) + (e(i-1,j+2) / f(i,j))

Here we have an assignment; we could easily modify the expression tree to handle it. Suppose we have the following HPF static distribution scheme:

CONST N = 100, P = 10
CHPF$ PROCESSORS PROC(P,P)
CHPF$ TEMPLATE T(N,N)
CHPF$ ALIGN a, b, c, d, e, f, res WITH T
CHPF$ DISTRIBUTE T(CYCLIC,CYCLIC) ONTO PROCESSORS
for i = 3 to N-2 do
  for j = 7 to N-5 do
    res(i,j) = ((a(i-2,j+3) + b(i,j-6)) × (c(i-1,j+3) + d(i+2,j+5))) + (e(i-1,j+2) / f(i,j))

The distribution says that all arrays are aligned onto the same template, which is distributed along both dimensions in a cyclic fashion. Each processor of the 2D-torus thus holds a section of each array. We can now make the link with the locations A to F which occur in the expression T. Consider for instance node f2:
f2: temp(i,j) = a(i-2,j+3) + b(i,j-6), 3 ≤ i ≤ N-2, 7 ≤ j ≤ N-5.

The array element a(i-2,j+3) is stored in processor proc_a(i,j) = (i-2 mod P, j+3 mod P), while array element b(i,j-6) is stored in processor proc_b(i,j) = (i mod P, j-6 mod P). Therefore each element b(i,j-6) must be sent according to a distance vector (-2, 9)^t if we decide to compute the temporary result temp(i,j) of node f2 in proc_a(i,j). Let τ(u,v) denote the translation of vector (u,v)^t. This communication amounts to a global shift along the grid of array b, to "align" the origin of array a translated by τ(-2,3) with the origin of array b translated by τ(0,-6) (because a and b are aligned onto the same template). As already said, nothing prevents us from evaluating node f2 in a different location: we can choose another "origin" and communicate both arrays a and b accordingly. We understand now why we have assumed that only a single computation can be done at each time step: all processors operate in parallel on different sections of the distributed arrays involved in the expression. However, we may have several global shifts in parallel, depending upon the communication capabilities of the target machine.
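The owner computation under the (CYCLIC, CYCLIC) distribution can be sketched in a few lines of Python (an illustration, not from the paper; the function names are ours):

```python
P = 10  # processor grid is P x P, as in the distribution directives above

def owner(i, j):
    # processor holding template element (i, j) under (CYCLIC, CYCLIC)
    return (i % P, j % P)

def shift(ref_a, ref_b):
    # distance vector moving each element of b onto the owner of the
    # corresponding reference of a: componentwise difference of the offsets
    return tuple(a - b for a, b in zip(ref_a, ref_b))

# node f2: temp(i,j) = a(i-2, j+3) + b(i, j-6), computed at proc_a(i,j)
v = shift((-2, 3), (0, -6))   # -> (-2, 9), the distance vector from the text
```

The shift depends only on the reference offsets, not on (i, j), which is why it is a single global shift of array b.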
Most current DMPCs are capable of performing at least one communication in parallel with computations, hence our key assumption that communications and computations do overlap.
Evaluating communication costs is much harder than evaluating computation costs, for many reasons:
- State-of-the-art DMPCs have dedicated routing facilities, so that the distance of a communication may be less important than the number of conflicts on communication links [5, 7, 15].
- Even in a simple case like the tree example, the size of the communications depends upon the distribution of the array and the alignment.

Therefore we assume that communication costs are fixed parameters that may be calculated according to some intricate formula, where important parameters include the size of the message, the start-up cost, the distance cost and principally the contention cost. However, an important practical case is to assume that these communication times are constant, because it would be very complicated to compute an approximation of them: indeed, we do not know what algorithms are used for the routing of the messages, and therefore we cannot guess when there would be contentions or conflicts. So we consider here the case when the communication times are constants (the maximum or an average of real costs). This models particularly well the communication patterns that come from uniform data dependences. Another important practical case is when all communication costs are smaller than any computation cost (coarse-grain formula). We deal with this case in Section 3.
1.3 Paper Organization
The paper is organized as follows. First, in Section 2, we survey previous work related to this paper. We then propose in Section 3 a heuristic that is optimal in a restricted but useful case, that of coarse-grain computations, and also in a sub-case of fine-grain computations. Finally, we give some conclusions in Section 4.
2 SURVEY OF PREVIOUS WORK
This work fits into some work done about parallel compilers dealing with data placement. Many people have dealt with the alignment problem: Anderson and Lam [1], Lukas and Knobe [12], Li and Chen [11], Feautrier [8], O'Boyle and Hedayat [13, 14], Ramanujam and Sadayappan [16], Huang and Sadayappan [10], and Darte and Robert [6].
In the field of parallel evaluation of array expressions, Gilbert and Schreiber [9] proposed an optimal algorithm for aligning temporaries in expression trees by characterizing interconnection networks as metric spaces. Their algorithm applies to a class of so-called robust metrics. (More machine-dependent experimentation would be necessary to determine an approximation formula giving the communication time as a function of the size of the data and of the communication pattern.)
Chatterjee et al. [3] extended Gilbert and Schreiber's work in two ways. They reformulated the problem in terms of dynamic programming and proposed a family of algorithms to handle a collection of robust and non-robust metrics. They also allowed non-conformable arrays by studying the case of edge-weighted expression trees. Chatterjee et al. [4] extended their previous work by dealing with a complete program and not only an array expression: they propose alignments for all objects in the code, both named variables and compiler-generated temporaries, and they do not restrict themselves to the "owner-computes" rule. We concentrate our work on array expressions, but unlike Chatterjee et al. we do not focus on the shape of the interconnection network and the minimum communication time, but rather on the largest overlapping of the communications by computations. Indeed, we consider another model of modern DMPCs: most machines are able to overlap computations and communications, and moreover their interconnection network is based on routers, so the communication time cannot be derived easily from the layout of the processors but mainly depends upon contentions.
3 HEURISTICS

3.1 Introduction
We propose in this section heuristics to solve the general problem. Fortunately these heuristics give an optimal time in useful practical cases. Here is the notation used in this section:
- expression-tree: an expression-DAG whose internal nodes form a tree.
- n: the number of internal nodes of the expression-tree.
- δ_calc(i): the time needed to compute the operation of node i.
- δ_com: the time needed to move the data from their position on one node to their position on another node.
- root: the root node of the expression-tree, i.e. the only node with outdegree 0.
- N: the set of the nodes of the expression-tree.
- L: the set of the leaves of the expression-tree.
- I: the set of the internal nodes of the expression-tree.
3.2 When All Leaves Have Different Locations

3.2.1 Hypotheses

We assume here that the leaves are all distinct. We are then in the case where the expression-tree is in fact a locally complete binary tree.
Property 1: For a locally complete binary tree with n+1 leaves, there are at least n communications to do.
3.2.2 Lower bound

Let

B = δ_com + max( Σ_{i ∈ I\{root}} δ_calc(i), (n-1)·δ_com ) + δ_calc(root)
B is a lower bound on the time necessary to compute the expression. The first stage of the execution is a communication moving one of the operands of the first computation to the location of the other operand, hence the first term δ_com. The last stage is the computation of the root of the tree, hence the term δ_calc(root). In between, the machine can only do one computation at a time, so all the computations are sequentialized, hence the term Σ_{i ∈ I\{root}} δ_calc(i). Likewise, the communications are also sequentialized, hence the term (n-1)·δ_com. As computations and communications are assumed to take place simultaneously, B is a lower bound on the execution time of the expression tree.
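The bound B is easy to evaluate; here is a small Python sketch (illustrative, not the authors' code; the computation costs are assumed values, with δ_com = 3 as in Table 1):

```python
def lower_bound(d_calc, d_com, root):
    # d_calc: dict mapping each internal node to its computation time
    n = len(d_calc)                       # number of internal nodes
    inner = sum(t for node, t in d_calc.items() if node != root)
    # B = d_com + max(sum of inner computations, (n-1)*d_com) + d_calc(root)
    return d_com + max(inner, (n - 1) * d_com) + d_calc[root]

# illustrative costs for the five internal nodes of Figure 1
d_calc = {'f0': 1, 'f1': 2, 'f2': 1, 'f3': 2, 'f4': 1}
B = lower_bound(d_calc, 3, 'f0')          # 3 + max(6, 12) + 1 = 16
```

With these costs the communications dominate ((n-1)·δ_com = 12 > 6), so the bound is driven by the sequentialized communications.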
3.2.3 The heuristic

The evaluation of one node. To minimize the number of communications, we compute each intermediate result at the location of one of the two operands it depends upon. So we have one communication and one computation for each node.

The general strategy. We consider a total order on the internal nodes of the tree. We then evaluate the nodes following the given order, as soon as possible. Here is the order ≺ that we consider:
1. The evaluation is done by depth level, starting with the deepest leaves; once one level has been computed, we go up one level. Each level is evaluated from the left node to the right node.
2. The communication direction is always from the left son, except when the right son is a leaf, in which case the communication direction is from the right son.
3. The communications are executed in the order ≺ of execution of the nodes, as soon as possible.

The optimal case. We assume that either all computation times are lower than the communication time, or that they are all greater than the communication time. In this particular but realistic case (fine-grain or coarse-grain computation), we are able to describe a set of strategies that give an execution time equal to the lower bound B.

Theorem 1: The evaluation proposed above gives an optimal execution time. The proof of Theorem 1 can be found in [2].
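The level-by-level order ≺ can be sketched in Python (our reading of the text, not the authors' code; the tree encoding is an assumption):

```python
def eval_order(node, depth=0, out=None):
    # node: leaf -> location string; internal -> (name, left, right)
    if out is None:
        out = []
    if isinstance(node, str):
        return out
    name, left, right = node
    out.append((depth, name))
    eval_order(left, depth + 1, out)      # left before right at each level
    eval_order(right, depth + 1, out)
    return out

tree = ('f0', ('f1', ('f2', 'A', 'B'), ('f3', 'C', 'D')), ('f4', 'E', 'F'))
# deepest level first; Python's stable sort keeps left-to-right order per level
order = [n for d, n in sorted(eval_order(tree), key=lambda p: -p[0])]
```

For the example tree this yields f2, f3 (deepest level, left to right), then f1, f4, and finally the root f0.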
3.3 When Leaves Can Share the Same Location

3.3.1 The case under study

We study here the case when the data are not necessarily all stored at different locations (in terms of processors). But we impose in this case that
there is no temporary storage of the data: a data array stored at location a may be moved to location b for a computation, but is not then available at location b for another computation. We can thus represent the expression by an expression-tree (see Section 3.1 for the definition).
3.3.2 An algorithm to determine the minimal number of communications

We present here an algorithm that computes the minimal number of communications needed to evaluate an expression-tree. The algorithm follows the dynamic programming paradigm: it is based on labeling each node of the expression-tree with the minimal number of communications needed to compute the subtree rooted at this node, together with the list of locations where this number can be reached. We denote this label (n_c, s_c) where s_c = {l1, l2, ..., lk}.
/* Initialization */
foreach l in leaves(DAG) do label(l) = (0, {l}) enddo
foreach n in internal_nodes(DAG) do label(n) = (0, ∅) enddo

/* First phase */
explore the DAG backward from the leaves to the root
  let c be the current node
  let (n1, s1) = label(left_son(c))
  let (n2, s2) = label(right_son(c))
  if s1 ∩ s2 ≠ ∅ then
    label(c) = (n1 + n2, s1 ∩ s2)
  else
    label(c) = (n1 + n2 + 1, s1 ∪ s2)
  endif
endexplore

/* Second phase */
let (nr, sr) = label(root)
choose a location l for the root from sr
label(root) = (nr, {l})
explore the internal nodes of the DAG forward from the root to the leaves
  let c be the current node
  let (nc, sc) = label(c)
  let (nf, {lf}) = label(father(c))
  if lf ∈ sc then
    label(c) = (nc, {lf})
  else
    choose a location l from sc
    label(c) = (nc, {l})
  endif
endexplore

Theorem 2: The algorithm described above gives the minimal number of communications needed to compute the input expression, and a strategy to allocate the intermediate results that uses this minimal number of communications. The proof of this theorem can be found in [2].
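The bottom-up phase of the labeling can be sketched in Python (an illustration, not the authors' code; a leaf is encoded as a location name and an internal node as a pair of subtrees):

```python
def label_tree(node):
    # returns (min #communications, set of candidate locations) for the subtree
    if isinstance(node, str):            # leaf: no communication, own location
        return 0, {node}
    n1, s1 = label_tree(node[0])
    n2, s2 = label_tree(node[1])
    common = s1 & s2
    if common:
        return n1 + n2, common           # the sons can meet at a shared location
    return n1 + n2 + 1, s1 | s2          # one extra communication is needed

# leaf E replaced by a second occurrence of A, as suggested in Section 1.1
tree = ((('A', 'B'), ('C', 'D')), ('A', 'F'))
n_comm, locs = label_tree(tree)          # 4 communications, best location 'A'
```

The second, top-down phase then fixes a single location per node, preferring the father's location when it belongs to the candidate set.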
3.3.3 The heuristic

We consider here a set of algorithms all based on the same scheme. The first assumption is that, to compute quickly, we should use as few communications as possible (this is not always optimal). To do this we use an allocation of the intermediate results given by the algorithm of the previous section. This allocation and the structure of the expression-tree induce a partial order on the evaluations.

The scope of the paper is limited to singly nested DO-like loops whose body is a basic block. A loop is represented as a labelled directed graph G(V, E, λ, δ), called a π-graph, where V is the set of nodes (operations), E is the set of edges (data dependences), and λ and δ are two labelling functions representing the iteration index (λ) of each operation and the number of iterations (distance) traversed by each data dependence (δ). The paper is organized as follows: Section 2 shows how a π-graph can be transformed while maintaining the semantics of the loop. Section 3 describes the two graph transformations used by UNRET. Bounds on the initiation interval (II) and the resource utilization of any schedule of a loop are estimated in Section 4. Section 5 explains the UNRET algorithm, as well as some details about scheduling. An algorithm to reduce the number of registers required to execute the loop is described in Section 6. Some results obtained on well-known examples are presented in Section 7, with conclusions in Section 8.
2 EQUIVALENT GRAPHS
Initially, a loop π = G(V, E, λ, δ) is represented by a π-graph in which λ(u) = 0 for all operations u ∈ V. However, a loop can be represented by different, though equivalent, π-graphs. Two π-graphs, π = G(V, E, λ, δ) and π′ = G(V, E, λ′, δ′), are equivalent [21] (represent the same loop) if, for all (u, v) ∈ E:
λ(v) − λ(u) + δ(u, v) = λ′(v) − λ′(u) + δ′(u, v)        (1)

Figure 1: Scheduling of equivalent π-graphs
Resource-constrained Software Pipelining
In general, the scheduling constraints imposed by dependences decrease as their distance increases. The example shown in Figure 1 depicts two equivalent π-graphs and their schedules (assuming all operations are additions that can be executed in one cycle, and the architecture has three adders). Edge labels identify the distance (δ) of each dependence. An unlabelled edge denotes a data dependence with distance 0. Such edges represent a data dependence between two operations belonging to the same iteration, and are called Intra-Loop Dependences (ILD). Edges e ∈ E with δ(e) > 0 represent data dependences between operations from different iterations; they are called Loop-Carried Dependences (LCD). On the other hand, node subscripts denote the iteration index (λ) of each operation. Thus, operation A_i denotes the execution of operation A at the i-th iteration. Both π-graphs in Figure 1 represent the same loop (Equation (1) is fulfilled for each dependence). Each iteration of the loop in Figure 1(a) requires two cycles to be executed due to the existence of ILDs (an ILD A_i → B_j states that operation A from iteration i must be executed to completion before starting the execution of operation B from iteration j). The LCDs are always honored because of the sequential execution of the steady state². Due to the existence of ILDs, no schedule of less than 2 cycles exists for the π-graph of Figure 1(a). This π-graph corresponds to the initial representation of the loop; its schedule requires neither prologue nor epilogue. However, the loop body in Figure 1(b) may be scheduled in only one cycle (II = 1) since no ILD exists³. This schedule contains operations belonging to two different iterations of the original loop (i and i+1), and the execution of the new loop requires the execution of a prologue and an epilogue.
3 GRAPH TRANSFORMATIONS

3.1 Dependence Retiming
Since ILDs constrain the scheduling of the loop body, we are interested in increasing their distance by transforming them into LCDs. The transformation dependence retiming is defined to achieve this goal. Given a dependence e = (u, v), dependence retiming transforms δ(e) according to Equation (1) by performing the following steps:
- λ′(u) := λ(u) + 1
- δ′(u, w) := δ(u, w) + 1, for all (u, w) ∈ E
- δ′(w, u) := δ(w, u) − 1, for all (w, u) ∈ E

Dependence retiming is equivalent to operation retiming as defined by Leiserson and Saxe in [13]. Dependence retiming yields a π-graph equivalent to the original one.

²The K iterations of the loop are executed by the steady state.
³The distance of dependences A → B and A → C has been updated by changing the iteration index of A according to Equation (1).
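The retiming steps can be sketched in Python (illustrative, not the paper's code; self-loops are assumed absent):

```python
def retime(lam, delta, u):
    # increment the iteration index of u; add 1 to the distance of every
    # outgoing edge of u and subtract 1 from every incoming edge
    lam2, delta2 = dict(lam), dict(delta)
    lam2[u] = lam[u] + 1
    for (a, b) in delta:
        if a == u:
            delta2[(a, b)] = delta[(a, b)] + 1   # outgoing edge of u
        elif b == u:
            delta2[(a, b)] = delta[(a, b)] - 1   # incoming edge of u
    return lam2, delta2

# the two ILDs of Figure 1(a): A -> B and A -> C with distance 0
lam = {'A': 0, 'B': 0, 'C': 0}
delta = {('A', 'B'): 0, ('A', 'C'): 0}
lam2, delta2 = retime(lam, delta, 'A')

# Equation (1) is preserved on every edge
for (a, b), d in delta.items():
    assert lam[b] - lam[a] + d == lam2[b] - lam2[a] + delta2[(a, b)]
```

After retiming A, both ILDs become LCDs of distance 1, which is exactly the transformation applied in Figure 1(b).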
F. Sánchez and J. Cortadella
3.2 Loop Unrolling
In general, finding an optimal schedule of a loop requires more than one instance of the loop body [9]. The loop unrolling transformation [21] of a loop π generates a new loop body π^K in which each operation and each dependence are repeated K times (the loop is unrolled K−1 times) [20]. The effectiveness of loop unrolling is illustrated by the example of Figure 1. Figure 1(a) shows a possible schedule when two FUs are available: one iteration is executed every two cycles (II = 2). However, if the loop is unrolled once, a schedule with a shorter initiation interval can be found by applying dependence retiming, as shown in Figure 2. In this schedule, 2 iterations are executed every 3 cycles (II = 3/2).
Figure 2: Schedule with 2 FUs after loop unrolling and dependence retiming

4 BOUNDS ON THE INITIATION INTERVAL AND THE RESOURCE UTILIZATION

4.1 Minimum Initiation Interval
The initiation interval of a loop schedule is limited by the set of FUs of the architecture and by the cycles (recurrences) formed by the dependences of the π-graph. Two lower bounds for the initiation interval can be distinguished:
- recMII: the minimum initiation interval due to the recurrences of the loop body.

  recMII = 0 if the loop has no recurrences, and otherwise

  recMII = max_{R ⊆ E} ( Σ_{(u,v) ∈ R} L(u) ) / ( Σ_{(u,v) ∈ R} δ(u, v) )

  where R is a cycle (recurrence) of the dependence graph.
- resMII: the minimum initiation interval due to the resources.

  resMII = max_{R_i} ( Σ_{u ◁ R_i} L(u) ) / n_i

  where R_i is a resource type of the architecture, n_i is the number of resources of type R_i available in the architecture, u ◁ R_i states that operation u is executed in a FU of type R_i, and L(u) is the latency of u.
The minimum initiation interval (MII) achievable by any schedule of the loop is the maximum of the two previous lower bounds [22]:

MII = max(recMII, resMII)
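The two bounds are straightforward to compute; here is an illustrative Python sketch (the example graph and costs are assumptions, not taken from the paper):

```python
def rec_mii(cycles, latency, delta):
    # cycles: list of recurrences, each given as a list of edges (u, v)
    if not cycles:
        return 0
    return max(sum(latency[u] for u, v in R) /
               sum(delta[(u, v)] for u, v in R) for R in cycles)

def res_mii(ops, latency, fu_type, n_fus):
    # for each resource type: total latency mapped onto it / available FUs
    totals = {}
    for u in ops:
        totals[fu_type[u]] = totals.get(fu_type[u], 0) + latency[u]
    return max(totals[t] / n_fus[t] for t in totals)

latency = {'A': 1, 'B': 1, 'C': 1}
delta = {('A', 'B'): 0, ('B', 'C'): 0, ('C', 'A'): 2}
cycles = [[('A', 'B'), ('B', 'C'), ('C', 'A')]]      # one recurrence
fu_type = {u: 'add' for u in latency}

mii = max(rec_mii(cycles, latency, delta),
          res_mii(latency, latency, fu_type, {'add': 2}))
```

Here both bounds happen to equal 3/2: the recurrence carries 3 cycles of latency over a distance of 2, and 3 unit-latency additions share 2 adders.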
4.2 Maximum Resource Utilization
A schedule in which each iteration takes MII cycles achieves a maximum utilization of the resources. The Resource Utilization (U) of a schedule depends on:
- the sum of the latencies of all operations in the loop, L = Σ_u L(u)
- the number of instances of the loop (K) involved in the schedule
- the number of available resources (R)
- the number of cycles of the schedule⁴ (II_K)

The resource utilization of any schedule of π^K in II_K cycles is the fraction

U = x/y = (L · K) / (R · II_K)

For a target resource utilization, the loop unrolling degree (K) and the expected initiation interval (II_K) of the schedule can be computed by solving the following 2-variable linear Diophantine equation [2]:

x · R · II_K − y · L · K = 0        (2)

5 UNRET
5.1 Farey's Series
Since the initiation interval of a loop schedule is bounded below by MII, an upper bound for the resource utilization (MaxU) also exists. For a given loop and a given target architecture, all the possible values of the resource utilization of a loop schedule can be ordered in decreasing order of magnitude starting from MaxU. This sequence is given by Farey's Series [23]. Farey's Series of order D (F_D) is the sequence (in increasing order) of all the reduced fractions with denominator ≤ D. For example, F_5 in the interval (0, 1] is the series of fractions:
F_5 = 0/1, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 1/1

Let x_i/y_i be the i-th element of the series. F_D can be generated by the following recurrence:
- The first two elements are respectively x_1/y_1 = 0/1 and x_2/y_2 = 1/D.

⁴Note that II_K is the number of cycles needed to execute K iterations of the loop. Every iteration is executed in II = II_K / K cycles.
- The generic term x_{K+2}/y_{K+2} can be calculated as:

  x_{K+2} = ⌊(y_K + D) / y_{K+1}⌋ · x_{K+1} − x_K
  y_{K+2} = ⌊(y_K + D) / y_{K+1}⌋ · y_{K+1} − y_K
Since U must be explored in decreasing order (starting from U = MaxU), and the range of U is U ∈ [0, 1], we are interested in the series 1 − F_D. For each value of U, the pairs (II_K, K) can be computed by solving Equation (2). Figure 3 shows an example of the generation of pairs (II_K, K) (the architecture has 4 adders, each of which performs an addition in 1 cycle). Figure 3(a) shows an example of a π-graph in which all operations are additions. Figure 3(b) shows a diagram representing all possible pairs (II_K, K). Each point in the diagram represents a possible schedule. Point A represents a schedule of 3 instances of the loop in 4 cycles. The existence of such a schedule depends on the topology of the dependences of the loop. Point B represents a time-optimal schedule. Point C represents a schedule with the same resource utilization as point A, but with a longer II_K (the initiation interval for each iteration is the same). Figure 3(c) shows the schedule found by UNRET, which corresponds to point A after dependence retiming.
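The Farey recurrence and the derivation of (II_K, K) pairs from a utilization value U = x/y can be sketched as follows (Python; illustrative, and the R and L values below are assumptions chosen to match the 4-adder example):

```python
from math import gcd

def farey(D):
    # Farey series of order D restricted to (0, 1], via the recurrence above
    x1, y1, x2, y2 = 0, 1, 1, D
    out = [(x2, y2)]
    while (x2, y2) != (1, 1):
        k = (y1 + D) // y2
        x1, y1, x2, y2 = x2, y2, k * x2 - x1, k * y2 - y1
        out.append((x2, y2))
    return out

def pair_for(x, y, R, L):
    # smallest (II_K, K) solving x*R*II_K = y*L*K, i.e. U = x/y = L*K/(R*II_K)
    a, b = y * L, x * R
    g = gcd(a, b)
    return a // g, b // g

# assumed setup for Figure 3: R = 4 adders, L = 5 unit-latency additions;
# U = 1 yields (5, 4), and U = 15/16 yields (4, 3), point A in the diagram
pairs = [pair_for(x, y, 4, 5) for x, y in [(1, 1), (15, 16)]]
```

The recurrence produces each fraction in reduced form, so no simplification pass is needed when enumerating utilization values.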
Figure 3: Exploration of resource utilization. (a) Example of loop. (b) Diagram representing the resource utilization for 4 adders. (c) Schedule found by UNRET.

Table 1: Farey's Series F_32 and legal pairs (II_K, K) associated with II_K — e.g. (5,4), (4,3), (8,6), (7,5), (3,2), (6,4).

T_k. Otherwise, i.e. if there are no NO replies but at least one RYES reply, then a RYES reply is sent. If T_t > T_k, then the RYES can be converted to YES.
Different Approaches to Distributed Logic Simulation

4.2 Deadlock Recovery with the Vector Method
In the conservative approach, an alternative to deadlock avoidance is to allow deadlock to occur, then detect and recover from it. Our implementation of deadlock recovery is based on Mattern's vector method [10]. Two variants of this deadlock detection algorithm have been implemented: a circulating control vector and a parallel version of the vector method. During deadlock detection the next event time is collected from each simulator. Deadlock is broken by computing the minimum of these times; all simulators with minimum next event times are restarted.
The circulating control vector. The vector method detects deadlock by having each process count the number of messages that are sent to and received from other processes. Each simulator S_i has a (local) vector L_i. If S_i sends a message to S_j, L_i[j] is incremented by one; if S_i receives a message, L_i[i] is decremented by one. A circulating control vector C collects this information on its way through the simulators. A simulator S_i that has received the control vector keeps it until it has to suspend its simulation because l_i[j] < T_s for some j. Then it updates C by adding its local vector to it, and the local vector is reset, i.e. C := C + L_i; L_i := 0. The control vector is passed to a process S_j with C[j] > 0. If C = 0 upon update, deadlock has been detected: all processes have suspended simulation and there is no event message in transit.
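A minimal sketch of the counting scheme (an assumption, not the testbed's code): each simulator counts sends and receives in a local vector, a suspended simulator folds its vector into the circulating control vector, and an all-zero result after an update means deadlock.

```python
class Sim:
    def __init__(self, n, idx):
        self.L = [0] * n          # local counter vector L_i
        self.idx = idx

    def send(self, dest_idx):
        self.L[dest_idx] += 1     # one more message on its way to dest

    def receive(self):
        self.L[self.idx] -= 1     # one message consumed by this simulator

def fold(C, sim):
    # C := C + L_i; L_i := 0; returns (C, deadlock_detected)
    C = [c + l for c, l in zip(C, sim.L)]
    sim.L = [0] * len(C)
    return C, all(c == 0 for c in C)

s0, s1 = Sim(2, 0), Sim(2, 1)
s0.send(1)                       # s0 sends an event message to s1
s1.receive()                     # s1 consumes it; then both suspend
C, dead = fold([0, 0], s0)       # C = [0, 1]: a message to s1 is accounted
C, dead = fold(C, s1)            # C = [0, 0]: deadlock detected
```

The invariant is that a positive C[j] means some message to S_j is still unaccounted for, so the vector is forwarded to S_j rather than declaring deadlock.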
4.3 Time Warp
In the Time Warp parallel simulator, state information is saved incrementally instead of periodically saving the state as a whole (checkpointing). Upon execution, events are not removed from the event list; instead, the signal value prior to event execution is stored in the event data structure. If a rollback to time t_r occurs, a forward search is started in the event list beginning at time t_r. The value of a signal s is restored from the first event affecting s that is found in this search. Incremental state saving is preferred to checkpointing in logic simulation because checkpointing would result in very inefficient memory usage, since each event changes only a small part of the system state. Both methods for undoing external events have been implemented: aggressive and lazy cancelation. With aggressive cancelation, an antimessage m− is sent for each event message m+ generated in the rolled-back period immediately upon rollback. With lazy cancelation, an antimessage m− is not sent before local virtual time (LVT) reaches the time stamp of m+; only if m+ is not generated once again in the resimulation (by resimulation we mean the renewed simulation of the rolled-back period of simulated time) will m− be sent. The idea behind lazy cancelation is that resimulation will regenerate most of the events undone in the rollback (strictly speaking, this assumption casts doubt on Time Warp's efficiency; however, several studies have shown that lazy cancelation can be more efficient than aggressive cancelation). Global virtual time (GVT) is approximated using Samadi's GVT2 algorithm [11]. Despite it being one of the earliest GVT algorithms, runtime measurements have shown a
P. Luksch
sufficiently close approximation of GVT. GVT2 outperformed a newer algorithm proposed by Lin and Lazowska [8] which does not require simulators to stop computation temporarily but requires more messages to be sent. In our implementation of GVT2, however, the requirement of stopping simulation could be relaxed, so that simulators may continue computation but must refrain from sending messages. In any case, investigating newer GVT algorithms such as the one proposed in [2] will be an interesting application of the test environment. Two extensions to the basic Time Warp mechanism have been implemented within our testbed. Motivated by the same assumption as lazy cancelation, optimized resimulation aims at reducing the number of element evaluations during resimulation, which is especially useful for circuits containing complex elements. Dynamic repartitioning attempts to compensate uneven load distribution by moving elements from a heavily loaded processor to a lightly loaded one. Even if static partitioning has generated equally sized partitions, load may be distributed unevenly if elements have different rates of activity or if the activity distribution in the circuit changes over time.
Optimized Re-Simulation. Assume an element E has been evaluated during the rolled-back simulation at (simulated) time t, resulting in an event e1 being generated. If, during resimulation, E is evaluated once again at time t, e1 will be generated again if the state of E is the same as in the corresponding evaluation before the rollback. The state of an element is defined as the vector of its input signals and its internal state variables. The idea of our optimization is to reuse the event generated before the rollback instead of evaluating the element once again when the above condition is met. More precisely, optimized resimulation works as follows. During "normal" simulation the simulator keeps track of the causal relationship between events and element evaluations, i.e. it stores information of the form "event e1 caused elements E1, E2 to be evaluated; evaluation of E1 generated e3, evaluation of E2 generated e4" (e3 and e4 are called follow events of e1, caused by the evaluation of E1 and E2, respectively). In addition, the element state has to be remembered for each evaluation. At rollback, local events are marked as "undone" instead of being removed from the list. If during resimulation element E1 is evaluated at time t, the simulator checks whether there is information stored about follow events. If so, it compares E1's state at the corresponding evaluation before the rollback to its current state. If the states are identical, e3 (which is a follow event of e1) is rescheduled by removing the "undone" mark. Only if there is no follow-event information stored, or the states do not match, must the element be evaluated.
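The follow-event check can be sketched in Python (illustrative, not the testbed's code; the element evaluation function and names are assumptions):

```python
history = {}   # (element, time) -> (recorded state, follow events)

def evaluate(element, time, state, eval_fn):
    key = (element, time)
    if key in history and history[key][0] == state:
        # states match: reschedule the recorded follow events by clearing
        # their "undone" marks instead of re-evaluating the element
        follow = history[key][1]
        for ev in follow:
            ev['undone'] = False
        return follow
    follow = eval_fn(state)            # real element evaluation
    history[key] = (state, follow)
    return follow

calls = []
def gate(state):                       # hypothetical element evaluation
    calls.append(state)
    return [{'undone': False}]

first = evaluate('E1', 5, (0, 1), gate)   # evaluated: one real call
first[0]['undone'] = True                 # rollback marks the follow event
again = evaluate('E1', 5, (0, 1), gate)   # same state: reuse, no new call
```

The second call performs no element evaluation at all: the recorded follow event is simply put back on the list, which is the saving the optimization is after.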
Dynamic Re-Partitioning. There are three alternatives for implementing load balancing
at simulation time: 1. There are several simulator processes on each node. The operating system determines the load on each node and migrates processes as necessary. However, currently no multiprocessor operating system supporting dynamic load balancing is available for production use. In addition, the Time Warp protocol makes it hard for an operating system to measure load since Time Warp simulators are ready to compute all the time. Moreover, optimal scheduling of several simulators on one node, i.e. lowest LVT first, cannot be implemented with any of the existing operating systems.
2. Each simulator processes several partitions, each of which has its own LVT. They are scheduled such that the partition with minimal LVT is simulated first. If load imbalance is detected, a heavily loaded process gives one or more of its partitions to a lightly loaded process. Load is measured as the minimum LVT of a simulator's partitions, as reported in the snapshots taken for GVT computation. As LVTs may move back and forth quickly due to speculative computations followed by rollbacks, mean values taken over a number of snapshots provide a more realistic image of the load distribution. Since partitions have their own LVTs, migrating them is relatively straightforward. However, as partitions may only be moved as a whole, their number must be much larger than the number of processors in order to be able to balance load exactly. Since the communication structure is fixed to a great extent by statically clustering elements into partitions, a good static partitioning policy is required.

3. To compensate load imbalance, a set of elements is selected from the partition of a heavily loaded simulator and is moved into that of a lightly loaded one. Load is measured by observing LVTs as in method 2. Compared to methods 1 and 2, element-wise repartitioning allows very fine-grained redistribution of computational load. Communication relations between processes can be rearranged freely because there are no restrictions due to static partitioning. However, migrating elements is not as straightforward as migrating partitions, which have their own LVTs. Usually, the source partition's LVT, T_src, is lower than the destination partition's LVT, T_dest. Therefore, the simulator processing the destination partition, S_dest, must perform a modified form of rollback to T_src in order to simulate unprocessed events for the "new" signals. (This rollback does not require local events to be undone.)
Unprocessed events for signals that migrate into the destination partition have to be sent from S, rc to Sd~t. Also, rollback at Sd~,t may require the signal history for [GVT, T, rc] to be transferred to Sd~st. In the current version of the testbed, method 3 has been implemented. Comparison of methods 2 and 3 will be an interesting point for applications of and extensions to our testbed.
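The partition-selection step of methods 2 and 3 can be sketched as follows. This is a minimal illustration, not the testbed's actual code; the function name, the snapshot format and the smoothing over several snapshots are assumptions based on the description above:

```python
from statistics import mean

def pick_migration(snapshots):
    """Pick a migration source and destination from a history of GVT
    snapshots, each mapping a simulator id to the minimum LVT over its
    partitions.  Averaging over several snapshots damps the oscillations
    caused by speculative execution followed by rollback."""
    ids = snapshots[-1].keys()
    load = {s: mean(snap[s] for snap in snapshots) for s in ids}
    src = min(load, key=load.get)   # lags furthest behind -> heavily loaded
    dst = max(load, key=load.get)   # runs furthest ahead  -> lightly loaded
    return src, dst

# Simulator 0 consistently lags behind simulators 1 and 2:
snaps = [{0: 10, 1: 40, 2: 55},
         {0: 12, 1: 38, 2: 60},
         {0: 11, 1: 45, 2: 58}]
print(pick_migration(snaps))  # -> (0, 2)
```

The lowest mean LVT marks the overloaded simulator because, under Time Warp, a lagging LVT is the symptom of too much work rather than of slow progress by choice.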
5 EXPERIMENTAL RESULTS
The testbed has been implemented on top of the machine-independent parallel programming library MMK, which has been developed in our research group and is currently available for the iPSC/2, the iPSC/860 and networks of Sun Sparc workstations. Runtime measurements have been performed on the iPSC distributed-memory multiprocessors, using the ISCAS89 benchmark circuits as workloads. Function decomposition has a theoretical speedup of 3-4. Parallelization overhead (without communication cost) has been measured to be less than 50%. Nevertheless, no speedup has been observed in our runtime measurements because of the implementation platform's high communication latency, which is about 600 µs for MMK on the iPSC/860 and 2 ms on the iPSC/2. For function decomposition to be efficient, communication latency must be low, or circuits must be very large so that data exchanged between pipeline stages can be packed into long messages while keeping the pipeline busy.
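The trade-off just described, latency versus message packing, can be made concrete with a crude throughput model. The function name and the per-event cost are illustrative assumptions; only the 600 µs latency figure comes from the text:

```python
def pipeline_rate(stage_time_us, latency_us, events_per_msg):
    """Events processed per microsecond by one pipeline stage when
    events_per_msg events are packed into each message: processing the
    batch costs events_per_msg * stage_time_us, and forwarding it to the
    next stage costs latency_us once."""
    return events_per_msg / (events_per_msg * stage_time_us + latency_us)

# With ~600 us message latency (iPSC/860), packing is essential:
unbatched = pipeline_rate(stage_time_us=5, latency_us=600, events_per_msg=1)
batched = pipeline_rate(stage_time_us=5, latency_us=600, events_per_msg=500)
print(batched > unbatched)  # -> True
```

As the batch grows, the fixed latency is amortized and the rate approaches 1/stage_time_us, which is why very large circuits are needed to keep the pipeline efficient on a high-latency platform.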
P. Luksch
Figure 3: Time Warp and Deadlock Recovery: experimental results. (Parameter s: number of simulation time units between application of successive input vectors to primary inputs.)
In our measurements, the performance of the parallelizations based on model partitioning has been shown to depend strongly on the circuit being simulated and on the stimuli applied to its primary inputs, as depicted in fig. 3 for some examples. Maximum speedups are about half the number of simulators involved in the simulation. However, in many cases no clear relationship can be established between the number of simulators and the achieved speedup. As the function of the ISCAS benchmarks is not known, random sequences of input vectors have been applied to the circuits at different frequencies. The parameter s in fig. 3 denotes the number of simulation time units between two successive input vectors. The examples shown suggest that Time Warp outperforms conservative synchronization with deadlock recovery. However, our measurements do not clearly favor any of the three approaches that have been analyzed. Circuit topology and stimuli have impacted performance much more than the method of synchronization did, for both of our static partitioning procedures. Runtime statistics revealed the reason for this rather unexpected behavior: load has been distributed very unevenly among the simulators. Further analysis has shown that activity rates vary by several orders of magnitude from element to element. Also, the "center of activity" within a circuit tends to move during simulation. In Time Warp, uneven load distribution has resulted in an extreme divergence of LVTs. Fig. 4 shows the result of an observation of LVTs and GVT with the TOPSYS distributed monitoring system. The GVT approximation is sufficiently close. One simulator increases its LVT without rollbacks, another proceeds at nearly the same rate but with frequent, short rollbacks. The other simulators periodically run far ahead of GVT and then roll back over long periods of simulated time.
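A common way to obtain such observations is to take GVT as the minimum over all LVTs and over the timestamps of messages still in transit, and to flag simulators that run far ahead of that bound. The sketch below is a generic illustration of this idea, not the TOPSYS or testbed implementation; all names and the window value are invented:

```python
def runaways(lvts, in_transit_timestamps, window):
    """A standard GVT lower bound is the minimum over all local virtual
    times and over the timestamps of messages still in transit.
    Simulators whose LVT has run more than `window` units of simulated
    time ahead of that bound are candidates for throttling."""
    gvt = min(list(lvts.values()) + list(in_transit_timestamps))
    return sorted(s for s, t in lvts.items() if t - gvt > window)

# Simulator 2 has run far ahead of the GVT estimate (95):
print(runaways({0: 100, 1: 130, 2: 900}, [95], window=500))  # -> [2]
```

Such a check is exactly what an optimism-limiting scheme needs: the flagged simulators are the ones whose state-saving memory grows without bound.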
As a result of running far ahead of GVT, these runaway processes use up all their memory for state saving if large circuits are simulated. In order to get such simulations to finish, Time Warp's optimism had to be limited by suspending simulators that are running short of memory whenever they are more than a predefined amount of simulated time ahead of GVT.

Figure 4: Time Warp: LVTs and GVT observed with the TOPSYS distributed monitoring system

6 CONCLUSIONS AND FUTURE WORK
A test environment has been designed which allows easy implementation of a great number of parallelization strategies by providing a comprehensive library of functions, and which enables an unbiased evaluation of different parallelization strategies. Four parallelizations have been implemented and analyzed. However, the number of runtime measurements has been limited by the instability of both the iPSC multiprocessors and the programming environment. Since some of the results obtained have been quite unexpected, further runtime measurements should be carried out in the future, including larger circuits and circuits of known function for which input stimuli can be provided that "make sense". From the measurements performed so far, the following conclusions can be drawn:
1. Given its limited potential for speedup and its sensitivity to communication latency, the function decomposition approach can be applied successfully only in combination with the model partitioning approach. In future multiprocessors where each node has several CPUs sharing a common memory, a simulator running on one node may be parallelized using function decomposition while the simulation is distributed among the nodes using the model partitioning approach.
2. Different activity rates must be accounted for in the static partitioning procedure. Most heuristic algorithms can be modified to use individual weight factors for elements and signals. Since in the design phase of a circuit a number of nearly identical simulations is typically run in sequence (e.g. for debugging the design), these weight factors can be obtained from statistics collected in a previous run at no extra cost. Dynamic repartitioning has proved to reduce the LVT divergence in Time Warp. However, further
measurements will be necessary in order to evaluate its effects comprehensively.

Topics for future research include using the testbed as a basis for the implementation and analysis of optimizations of the existing and of new parallelization strategies, and porting the testbed to a more widely used programming model, e.g. PVM or P4. Enlarging the set of hardware platforms on which the testbed is available will allow us to evaluate different multiprocessors with respect to their appropriateness for distributed discrete event simulation. Considering other application areas of discrete event simulation will show to what extent results obtained from logic simulation can be generalized to other types of simulation problems. Parallelization of a commercial simulator designed for modeling production processes in factories has just begun.

Acknowledgements

This work has been partially funded by the DFG ("Deutsche Forschungsgemeinschaft", German Science Foundation) under contract No. SFB 342, TP A1.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 1995 Elsevier Science B.V.
A SIMULATOR FOR OPTICAL PARALLEL COMPUTER ARCHITECTURES
N. LANGLOH, H. SAHLI†, A. DAMIANAKIS‡, M. MERTENS, J. CORNELIS
Vrije Universiteit Brussel, Dept. ETRO/IRIS, Pleinlaan 2, B-1050 Brussel, Belgium
nlangloh@etro.vub.ac.be
†also Ecole Royale Militaire, Brussels, Belgium; ‡also FORTH-Hellas, Crete, Greece
ABSTRACT. With the demonstration of optical data transcription of images and logic operations on images, it has been shown that it is feasible to build optical computer architectures with arrays of differential pairs of optical thyristors. This paper describes a simulator for the execution of (image processing) algorithms on arbitrary optical architectures. The results of the simulations will allow the estimation of the execution speed of different architectures and the improvement of the architecture itself. KEYWORDS. Optical computing, simulation, parallel computer architecture.
1 INTRODUCTION
The PnpN optical thyristor is one of the most promising elements for parallel optical information processing [4]. Currently, PnpN devices with a very good optical sensitivity (250 fJ) at an operation cycle frequency of 15 MHz [5] are available. These fast switching times are achieved through the fabrication of a new type of optical thyristor which can be completely depleted by means of a negative electrical pulse [3]. A physical implementation of an array of differential pairs of PnpN optical thyristors with the possibility of performing optical data transcription and optical logic has been shown in [10]. The ability to execute AND, OR, and NOT operations with these PnpN optical thyristor arrays allows us to design optical computer architectures capable of executing all possible Boolean functions [6].
N. Langloh et al.
Massively parallel optical computer architectures can be designed with arrays of differential pairs of PnpN optical thyristors. When the architecture is used for image processing, each differential pair of an array represents a pixel of the image. Executing a Boolean operation with this architecture means that every pixel of the same optical thyristor array undergoes an identical Boolean operation. In [6], it was shown using a worst-case analysis that, for images of at least 64x64 pixels, the calculation of an arbitrary Boolean function containing 10 different variables needs fewer clock cycles on an SIMD architecture based on optical thyristors than on a sequential architecture. The design of the architectures built with the PnpN thyristor arrays must be carried out carefully, so that they will be competitive with currently existing parallel and sequential (electronic) computers. We have therefore developed two simulators. A first prototype (OptoSim), capable of simulating a fixed SIMD architecture containing 6 optical thyristor arrays, has already been developed [8]. This simulator gives the sequence of operations that the optical thyristor arrays must perform to execute a program. The architecture is fully SIMD, because only one plane at a time can perform an operation. The simulator optsim that is currently being developed will not have this disadvantage. One of the objectives of this simulator is to simulate architectures consisting of several primitive computer architectures (standard cells), connected with each other through an optical communication (bus) structure. All of these standard cells must be able to perform operations simultaneously. At the level of pixel data, this architecture is still SIMD, but at the level of image data, it can be viewed as an MIMD architecture.
If the program to be executed can be partitioned such that many standard cells contribute simultaneously to the solution, then the degree of parallelisation will be some orders of magnitude higher than in [8].
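The plane-wise SIMD execution model, one instruction acting on every pixel of a plane at once, can be mimicked in ordinary software by packing each image row into an integer, so that a single bitwise operation processes a whole row. This is only a toy illustration of the execution model, not the simulator's data representation:

```python
WIDTH = 64                      # pixels per row; a plane is a list of rows

def plane_and(a, b):            # one instruction: AND over every pixel
    return [ra & rb for ra, rb in zip(a, b)]

def plane_not(a):               # pixelwise NOT, confined to WIDTH bits
    mask = (1 << WIDTH) - 1
    return [~ra & mask for ra in a]

# Two toy 64x64 planes (only the 4 lowest bits of each row are shown):
a = [0b1010] * 64
b = [0b0110] * 64
print(plane_and(a, b)[0] == 0b0010)  # -> True
```

The point of the comparison in [6] is exactly this: one plane-wide operation replaces 64x64 scalar operations, so even a modest per-operation cycle count wins for sufficiently large images.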
Outline. In section 2, the typical optical components which are used in the architectures will be described. In section 3, some examples of elementary optical computer architectures will be given. Section 4 will describe the implementation of a first simulator (OptoSim). In section 5, a hierarchical description of an optical parallel computer architecture will be given. Section 6 will describe the simulator optsim, which is currently still under development. In section 7, we will draw conclusions.
2 THE BASIC ELEMENTS
Several optical components are needed to build an optical computer architecture. Besides the PnpN thyristor array, which enables logic operations to be performed, one also needs elements which allow the blocking of optical signals (like a shutter) and the routing of optical signals to more than one destination (like a beam splitter). A system description of these components is given in this section.
2.1 The PnpN Thyristor Array
The basic component which allows logic operations to be performed is the completely depleted optical PnpN thyristor [3]. Depending on the anode-cathode voltage, the thyristor can be in one of the following four states. (1) When we apply a high positive voltage (around ten volts), a current flows through the device and it emits light; a large amount of charge also accumulates in the device. (2) When we apply a low positive voltage (a few volts), the device remains idle; it does not emit light, and it neither accumulates nor loses charge. (3) When we apply a zero voltage, the device accumulates charge proportional to the optical energy of the light that falls on the gate of the thyristor. The factor of proportionality depends on the wavelength of the light that shines on the gate of the device. (4) When we apply a high negative voltage, all charge accumulated in the device is extracted. The removal of the charge can happen in a few nanoseconds because the device is completely depleted.
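The four operating regimes can be captured in a small behavioural model. Voltage thresholds, units and all names below are illustrative assumptions, not device parameters:

```python
class PnpNThyristor:
    """Behavioural sketch of the four regimes described above; voltage
    thresholds and units are illustrative, not device parameters."""

    def __init__(self):
        self.charge = 0.0
        self.emitting = False

    def apply(self, voltage, optical_energy=0.0, sensitivity=1.0):
        if voltage >= 8.0:        # (1) high positive: conduct and emit light
            self.emitting = True
        elif voltage > 0.0:       # (2) low positive: idle
            self.emitting = False
        elif voltage == 0.0:      # (3) zero: integrate incident light
            self.emitting = False
            self.charge += sensitivity * optical_energy
        else:                     # (4) high negative: complete depletion
            self.emitting = False
            self.charge = 0.0

t = PnpNThyristor()
t.apply(0.0, optical_energy=2.5)  # accumulate charge
t.apply(-10.0)                    # reset in a single step
print(t.charge)  # -> 0.0
```

State (3) is what makes the device useful as a detector and memory at the same time: charge integrates until the next high-voltage or reset phase.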
Figure 1: A differential pair of optical thyristors and an electronic model of the pair

Figure 2: Logic representation of a differential thyristor pair. a) Logic representation of a thyristor pair vertically (resp. horizontally) oriented: logic "true" when the left (resp. upper) thyristor emits light; logic "false" when the right (resp. lower) thyristor emits light. b) Normally-connected and cross-connected electrodes (e.g. ncv is normally-connected vertical).

It is interesting to set up two thyristors as a differential pair [4] (see Figure 1). This pair then behaves like a "winner takes all" network: when we apply a high voltage over the differential pair, the thyristor with the most accumulated charge will conduct and emit light. The other thyristor will remain idle. In other words, only one of the two thyristors will send out light. If we adopt the convention that a logic "true" corresponds to the situation where one of the thyristors of the differential pair emits light, and a logic "false" to the situation where the other thyristor sends out light (see Figure 2a), then it is possible to perform an AND and an OR operation with the differential pair. The AND operation is depicted in Figure 3: (i) The AND Plane (array) and the Buffer Plane are reset with a high negative anode-cathode voltage. Then the contents of the A Plane are copied to the Buffer Plane and the AND Plane receives an optical bias (every pixel of the AND Plane receives a logic "false") from the Bias Plane. (ii) The contents of the Buffer Plane are transmitted to the AND Plane. Thereafter, the Buffer Plane is reset. (iii) The contents of the B Plane are copied to the Buffer Plane. (iv) The Buffer Plane transmits its content to the AND Plane, and then the Buffer Plane is reset again. (v) The contents of the AND Plane are sent to the Buffer Plane. According to the "winner takes all" principle, a pixel of the AND Plane will be logic "true" if the corresponding pixels on the A Plane and the B Plane were both logic "true", because the thyristor of the differential pair which corresponds to logic "true" received twice as much light (from both the A Plane and the B Plane) as the thyristor corresponding to logic "false" (which received light only from the Bias Plane). If at least one of the pixels on the A Plane or the B Plane was logic "false", then the logic "false" thyristor of the AND Plane receives more light than the logic "true" thyristor, and according to the "winner takes all" principle the logic "false" thyristor of the AND Plane will emit light. (vi) The contents of the Buffer Plane are copied to a C Plane.

Figure 3: The steps of an AND operation with optical input data in the A Plane and the B Plane, and the result stored in the C Plane

An OR operation is selected when the logic "true" thyristor is optically biased before the two optical input signals are put on the differential pair. The NOT and shifting operations need a differential pair with a special electrode configuration. Every thyristor of the differential pair has two top electrodes, so that it is possible to control the position of the optical output of the activated thyristor. The electrodes can be normally-connected or cross-connected (see Figure 2b). This makes it possible to have inverting logic and shifting [10].
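The "winner takes all" reasoning behind the AND and OR operations can be checked with a few lines of code. The charge bookkeeping below is a deliberate simplification (one unit of light per contributing plane, one unit of optical bias) and the function names are invented:

```python
def winner_takes_all(true_charge, false_charge):
    """The thyristor with the most accumulated charge conducts; ties are
    broken in favour of "false" (an arbitrary modelling choice)."""
    return true_charge > false_charge

def and_op(a, b, bias=1.0):
    # Each input pixel lights the thyristor matching its value; for AND
    # the optical bias goes to the "false" thyristor.
    true_charge = sum(1.0 for v in (a, b) if v)
    false_charge = bias + sum(1.0 for v in (a, b) if not v)
    return winner_takes_all(true_charge, false_charge)

def or_op(a, b, bias=1.0):
    # For OR the bias goes to the "true" thyristor instead.
    true_charge = bias + sum(1.0 for v in (a, b) if v)
    false_charge = sum(1.0 for v in (a, b) if not v)
    return winner_takes_all(true_charge, false_charge)

print([and_op(x, y) for x in (0, 1) for y in (0, 1)])  # -> [False, False, False, True]
```

Moving the single unit of bias from one thyristor to the other is all that separates AND from OR, which mirrors the text's observation that the operation is selected purely by where the optical bias is applied.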
2.2 The Beam Splitter

A cube beam splitter is a well-known optical component. When we send light into one of the faces of the cube, a fraction x (smaller than one) of the optical energy comes out on the other side of the cube, and a fraction 1 - x comes out of a third face of the cube, if we ignore the optical losses in the component.

2.3 The Shutter
A shutter can be used to decide electronically whether light will pass through the component. Such shutters are usually made with liquid crystal displays and have the disadvantage of being slow. But with arrays of differential pairs of PnpN optical thyristors and beam splitters, one can build an active high-speed shutter. Figure 4 describes such a shutter, containing only one optical thyristor array and one beam splitter. This shutter can change its state in a few clock cycles (less than 100 ns) instead of several milliseconds. When the optical thyristor array receives an optical input, it is possible to decide whether or not to send out this same optical signal through the beam splitter.
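A toy model of the active shutter, assuming a 50/50 beam splitter and lossless propagation (both assumptions, as are the class and attribute names):

```python
class BeamSplitter:
    """Splits incoming optical energy: a fraction x leaves one face and
    1 - x leaves the other (optical losses ignored, as in the text)."""
    def __init__(self, x):
        self.x = x
    def split(self, energy):
        return self.x * energy, (1.0 - self.x) * energy

class ActiveShutter:
    """A thyristor array behind a beam splitter: the array re-emits the
    signal it received only while the shutter is 'open'."""
    def __init__(self, splitter):
        self.splitter = splitter
        self.open = False
        self.stored = 0.0
    def receive(self, energy):
        through, _ = self.splitter.split(energy)
        self.stored = through
    def emit(self):
        return self.stored if self.open else 0.0

s = ActiveShutter(BeamSplitter(0.5))
s.receive(2.0)
print(s.emit())   # -> 0.0 (shutter closed)
s.open = True
print(s.emit())   # -> 1.0
```

Because the thyristor array actively regenerates the signal rather than attenuating it, the open/closed decision is an electrical one and can be taken in a few clock cycles.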
3 SOME BASIC ARCHITECTURES
Different optical computer architectures have already been examined. In [6], a basic optical computer architecture consisting of just three thyristor arrays and one beam splitter was examined (see Figure 5). It has been shown in [6] that this simple architecture was capable of performing all possible Boolean operations. A straightforward extension to grey-value image processing was also demonstrated.

Figure 4: A dynamic shutter

Figure 5: A basic architecture [6]
4 OPTOSIM: A FIRST COMPILER AND SIMULATOR
A simulator of an optical computer architecture containing six optical thyristor arrays and six beam splitters was developed in [8]. The simulator was also capable of compiling CLIP instructions [2] and SSL instructions [1] into the low-level instructions suitable for controlling the PnpN optical thyristor arrays. But the simulator has two major drawbacks: firstly, it can compile CLIP programs and SSL programs only for one specific computer architecture, and secondly, the execution time of the program is rather long. The simulator and the compiler are implemented on a Macintosh.
5 HIERARCHICAL DESCRIPTION OF AN OPTICAL PARALLEL COMPUTER ARCHITECTURE
Our purpose here is to present a more powerful simulator which makes it possible to: (1) simulate general-purpose optical parallel computer architectures, and (2) map image processing algorithms onto these architectures. An optical parallel computer architecture can be viewed as a distributed system formed by connecting several primitive processing units (standard cells) through a communication (bus) structure. We suggest a unified hierarchical approach for describing and implementing standard cells and more complex optical parallel computers. Such a hierarchical description considerably simplifies the analysis, the design and the implementation of the simulator. We distinguish four levels: (i) the physical implementation; (ii) the functional description; (iii) the graph representation; and (iv) the algebraic description.
Figure 6: Hierarchical standard cell description and corresponding simulator structure

These description levels can be summarized as follows (see Figure 6a): The physical implementation of the architecture (or the standard cell) is a scheme of how the architecture (or the standard cell) is built. It lists the optical components used to implement the architecture (or the standard cell), such as lenses, beam splitters, optical thyristor arrays, holograms, diffractive elements, shutters, etc. The functional description of the architecture (or the standard cell) only contains the
elements which are necessary to describe the operations that the architecture (or the standard cell) can perform. Optical elements like lenses are not present here because they just ensure that the light rays emitted by the optical thyristor arrays will not spread out. The graph representation of the architecture (or standard cell) describes the architecture (or standard cell) in terms of nodes and links between these nodes. Each node is an element capable of processing data (an element that must be described with internal state variables), like the optical thyristor array. A link represents the communication path between the nodes (the path transmitted light can follow), e.g. free air. Each node must be described by the transformation of data it can perform as a function of the input data and the internal variables. The algebraic description of the standard cell is the sequence of instructions this standard cell must perform to execute a Boolean operation. The algebraic description of the architecture is the sequence of operations all elements in the architecture must perform to execute a given program. A formal language has been defined for the design of algorithms.
6 OPTSIM: A SIMULATOR FOR GENERAL OPTICAL COMPUTER ARCHITECTURES
To develop the simulator we decided to follow a bottom-up approach, which we will describe below (see Figure 6b). For each level of the hierarchical description of section 5, we defined and developed dedicated tasks (processes). We started with the development of a simulator for the optical components, the process element, which allows the simulation of the internal and external state of an element after a given time Δt, given the internal and external initial state of the element. This corresponds to the functional description of the optical components (see section 5). The communication between the elements is simulated by the process graph. This corresponds to the graph representation of the architecture (see section 5). In order to obtain the complete functional description of the architecture, a third process, kernel, simulates the state changes of a complete optical computer architecture. The algebraic description of the architecture is given by two tasks: assembler and compiler. The process assembler allows the simulation of a sequence of instructions of the optical processor. It also has debugging capabilities which allow the step-by-step tracing of a program. The process compiler will allow higher-level languages to be developed. As shown, the simulator is viewed as a collection of concurrently executing processes. These processes communicate with each other using a message passing model. All of these processes have been developed on a UNIX workstation using the C language. The fact that the processes are executed concurrently and that the communication happens via a non-blocking message passing model allows the exploitation of whatever parallelism the workstation running the simulator provides.

6.1 The Processes
6.1.1 The Element Process
The process element allows the simulation of the optical components of the computer architecture. Each optical component is described by the transformation of its input data to output data as a function of its internal state and the applied instruction.
Given the kind of component (e.g. optical thyristor array, beam splitter, shutter, diffractive element, ...), a description of the current state of the element (e.g. the charge already accumulated in the junction of a PnpN thyristor), the optical inputs of the element, and the time period Δt during which the component will have these inputs, this process calculates the new state of the element and the optical outputs of the element after the time period Δt.
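The contract of the element process might be sketched as a single dispatch function. The two component models below are deliberately simplistic placeholders; the function name, state layout and component kinds are assumptions:

```python
def element_step(kind, state, inputs, dt):
    """Given the component kind, its current state, its optical inputs
    and a time step dt, return (new_state, optical_outputs).  Only two
    toy component models are shown; the real simulator also covers
    shutters, diffractive elements, etc."""
    if kind == "beam_splitter":
        total = sum(inputs)
        return state, [state["x"] * total, (1.0 - state["x"]) * total]
    if kind == "thyristor":
        new_charge = state["charge"] + sum(inputs) * dt
        emitting = state.get("mode") == "send"
        return {**state, "charge": new_charge}, [1.0 if emitting else 0.0]
    raise ValueError(f"unknown component kind: {kind}")

print(element_step("beam_splitter", {"x": 0.25}, [4.0], 0.1)[1])  # -> [1.0, 3.0]
```

Keeping the state explicit in the arguments and return value is what allows the kernel and graph processes to drive many elements without the element process holding any global knowledge of the architecture.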
6.1.2 The Graph Process
The process element contains the information of all the optical components of the optical computer architecture, but it does not know how these components are connected with each other. This is the task of the process graph. It contains a graph description of the architecture and makes sure that the optical input images of the component simulated by the process element are equal to the output images of the components to which it is connected. It also checks the coherence of the architecture.

6.1.3 The Kernel Process
The task of the process kernel is to calculate the new state of the optical computer after a time period Δt, given the current state of each component of the architecture. At first sight, this seems to be a straightforward problem. But there can be optical loops in the graph, in which case only an iterative process can calculate the new state of the architecture. Knowing that passive optical components only dissipate light and that active optical components generate light independently of the received optical energy, this problem can be solved easily. First, the components of the architecture are simulated assuming that they have no optical input. This way, the optical energy generated in the system is known. Then the process kernel iteratively asks the process element to calculate, for each component, the optical output, knowing that its optical inputs are the optical outputs of the components connected to it. The process kernel can then query the optical energy dissipated in each element. The process iterates until the total optical energy dissipated in the architecture is close enough to the optical energy generated.

6.1.4 The Assembler Process
With the process element, the process graph and the process kernel, it is possible to define the components and how these components are connected with each other in order to form an optical computer architecture.
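The kernel's energy-balancing iteration described in section 6.1.3 can be sketched as a fixed-point loop. The linear transfer-fraction model and all names below are assumptions for illustration, not the simulator's actual algorithm:

```python
def kernel_settle(emit, transfer, tol=1e-9, max_iter=1000):
    """Fixed-point iteration for architectures with optical loops:
    start from the actively generated light emit[i], then repeatedly
    recompute each element's output from the outputs of the elements
    feeding it.  transfer[i][j] is the fraction of element j's output
    reaching element i; with passive components these fractions are
    < 1, so the iteration converges."""
    n = len(emit)
    out = list(emit)
    for _ in range(max_iter):
        new = [emit[i] + sum(transfer[i][j] * out[j] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, out)) < tol:
            break
        out = new
    return out

# Two elements feeding each other through 50% couplings:
print(kernel_settle([1.0, 0.0], [[0.0, 0.5], [0.5, 0.0]]))
```

Because every loop loses energy at each pass, the recirculating light forms a geometric series and the loop converges; the convergence test plays the role of the kernel's "dissipated energy close enough to generated energy" criterion.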
It is also possible to calculate the new state of the architecture after a time period Δt, given the current state of the architecture and the instructions every component of the architecture must perform. But the level of these instructions is very low. E.g., the instructions for an optical thyristor (Reset, Receive, Idle, and Send) correspond to the voltages to be put over the thyristor (see section 2.1). The process assembler allows the user to give a sequence of instructions to be processed by the optical architecture. The process assembler sends the sequence, instruction by instruction, to the process kernel. It also makes sure that the new state of the components becomes the old state of these components for the next instruction.

6.1.5 The Compiler Process
It is clear that the process assembler will only generate sequences of very low-level instructions. But in most application domains, some sequences of instructions appear on different occasions in a program. The aim of the process compiler is to allow the user to construct high-level commands which will be translated into a sequence of low-level commands. The most prominent application where the computer
architectures based on the PnpN optical thyristors will be used is image processing. With the process compiler it will be possible to define operations on images as sequences of low-level instructions.

6.2 The Communication Between the Processes of OPTSIM
The processes are designed to be communication-driven. This means that a process is idle until another process asks it to perform a task. Each process can be treated as a software "black box" communicating with the outside world via message passing [11, 12]. This service is provided by the communication software, which handles the messages transmitted to (and received by) any process. If process A receives a request from another process B, this request contains (i) the name of the process B that started the request, (ii) an identification number that must be returned with every answer to the request, so that process B knows to which request the answer belongs, (iii) an answer identification number giving the request to which this message is the answer (this number is zero if the message is not an answer but a newly generated request), (iv) a command field containing the command to be executed (or the answer), and (v) a body field containing supplementary data. The communication software supports the notion of recovery functions. If a process has a request for another one, it uses its own communication software to send the request, and puts a recovery function call in a recovery function queue by passing (i) a pointer to a function, (ii) the answer identification number, and (iii) a status structure corresponding to the current execution context. When a process receives a request, it first checks the answer identification number. If this number is zero, the process assumes that it received a new command, so it interprets the command field and executes the associated command. If the answer identification number is nonzero, the process checks to which recovery function this answer corresponds, and starts executing the code of this function.
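The request format (i)-(v) and the recovery-function mechanism can be sketched as follows. This is a single-threaded toy in which "message passing" is a direct method call; all class and field names are invented:

```python
import itertools

class Process:
    """Each request carries (sender, request id, answer id, command,
    body); pending recovery functions are keyed by the id of the answer
    they are waiting for."""
    _ids = itertools.count(1)          # globally unique message ids

    def __init__(self, name):
        self.name = name
        self.recovery = {}             # request id -> (function, context)

    def request(self, other, command, body, on_answer, context=None):
        rid = next(self._ids)
        self.recovery[rid] = (on_answer, context)
        other.receive((self.name, rid, 0, command, body), self)

    def receive(self, msg, sender):
        src, rid, answer_id, command, body = msg
        if answer_id == 0:             # a new command: execute it and answer
            result = f"done:{command}"
            sender.receive((self.name, next(self._ids), rid, command, result), self)
        else:                          # an answer: run its recovery function
            fn, ctx = self.recovery.pop(answer_id)
            fn(body, ctx)

replies = []
a, b = Process("A"), Process("B")
a.request(b, "calculate", {}, lambda body, ctx: replies.append(body))
print(replies)  # -> ['done:calculate']
```

The saved context plays the role of the status structure in the text: it lets the requester resume where it left off without blocking while waiting for the answer.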
6.3 The User Interface
It is the user who controls the operations that the processes perform. Initially, the user communicates with the other processes through a command-line interface. With this interface, the user can send commands to the processes and can also receive the answers from the processes. The form of the command line is: "command" "body", where command is the command to be executed and body contains the parameters supplied to the command. For example:
element create
/* Allocate a new element and return an element id number. */

element load type
/* Load the type of an element (e.g. PnpN type). */
begin begin element_id end begin type_id end end

element request ports
/* Ask for the number of ports of an element. */
begin element_id end

element calculate
/* Calculate the optical output signals of an element. */
begin element_id end

graph load connection
/* Connect pairs of elements. */
begin begin element_id end begin port_id end end
begin begin element_id end begin port_id end end

graph calculate next element
/* Return the next element to be processed, and ask the element process to update the influenced inputs. */

It is obvious that a command line is largely insufficient as a user communication interface. Therefore, we are currently developing a graphical user interface which will allow the user to communicate with the other processes in a very intuitive manner. The graphical user interface is being developed in Motif under X Windows.
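For illustration, a minimal dispatcher for command lines of this form might look as follows in C. This is a hypothetical sketch: the handler table, the dummy return values and the prefix-matching rule are our own, and the simulator's actual command interpreter may work quite differently.

```c
#include <string.h>

typedef int (*CommandHandler)(const char *body);

/* Toy handlers standing in for the real ones; they only return
 * dummy values instead of talking to the element process.            */
static int element_create(const char *body)
{
    (void)body;
    return 1;   /* would allocate an element and return its id */
}

static int element_request_ports(const char *body)
{
    (void)body; /* body would carry: begin element_id end */
    return 2;   /* dummy port count */
}

/* Command table: the names follow the examples in the text. */
static const struct {
    const char    *name;
    CommandHandler handler;
} command_table[] = {
    { "element create",        element_create },
    { "element request ports", element_request_ports },
};

/* Match the command part at the start of the line; whatever follows
 * is passed to the handler as the body.                              */
int execute_line(const char *line)
{
    size_t n = sizeof command_table / sizeof command_table[0];
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(command_table[i].name);
        if (strncmp(line, command_table[i].name, len) == 0)
            return command_table[i].handler(line + len);
    }
    return -1; /* unknown command */
}
```

A line such as `element request ports begin 4 end` is thus split into the command `element request ports` and the body `begin 4 end`, mirroring the "command" "body" form described above.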
7
CONCLUSIONS AND FURTHER DEVELOPMENT
A first simulator and compiler for a fixed optical computer architecture has been implemented. It allows the user to write a program in CLIP or SSL and to translate the program into low-level instructions suitable for direct control of the PnpN thyristor arrays. A more powerful simulator, which permits the simulation of general-purpose optical computer architectures, is currently under development. This simulator allows us to: (1) describe several optical primitive processing units (standard cells), (2) simulate the functionality of optical components, and (3) map algorithms onto optical computer architectures. We have adopted a unified hierarchical approach for describing and implementing standard cells and optical parallel computers. First results show the usefulness of simulating even simple basic cells, whose functioning would otherwise be intractable. The next step will be the introduction of a new hierarchical layer which combines different basic optical processing units in order to build complex optical computer architectures which can act as parallel MIMD architectures and which will allow coarse-grain parallelisation of algorithms.
Acknowledgements This work is supported by a joint IMEC/VUB project and by the Human Capital and Mobility network "Vision Algorithms and Optical Computer Architectures", contract no. ERBCHRXCT930382. The authors also wish to thank the Applied Physics department of the Vrije Universiteit Brussel.
412
N. Langloh et al.
References
[1] K.H. Brenner, A. Huang, N. Streibl. Digital optical computing with symbolic substitution. Applied Optics 25, pp. 3054, 1986.
[2] M.J.B. Duff, T.J. Fountain. Cellular Logic Image Processing. Academic Press, 1986.
[3] P. Heremans, M. Kuijk, R. Vounckx, and G. Borghs. The Completely Depleted PnpN Optoelectronic Switch. Abstract submitted to Optical Computing '94, Edinburgh, August 1994.
[4] M. Kuijk, P. Heremans, R. Vounckx, and G. Borghs. The Double Heterostructure Optical Thyristor in Optical Information Processing Applications. Journal of Optical Computing 2, pp. 433-444, 1991.
[5] M. Kuijk, P. Heremans, R. Vounckx, and G. Borghs. Optoelectronic Switch Operating with 0.2 fJ/m2 at 15 MHz. Accepted for Optical Computing '94, Edinburgh, August 1994.
[6] N. Langloh, M. Kuijk, J. Cornelis, and R. Vounckx. An Architecture for a General Purpose Optical Computer Adapted to PnpN Devices. In: S.D. Smith and R.F. Neale (Eds.), Optical Information Technology: State of the Art Report. Springer-Verlag, pp. 291-299, 1991.
[7] N. Langloh. A Simulator for Optical Parallel Computer Architectures. HCM ERBCHRXCT930382 note, Vrije Universiteit Brussel, 1994.
[8] M. Mertens. Een compiler voor beeldverwerkingsalgoritmen op een PnpN optische computer (A compiler for image processing algorithms on a PnpN optical computer). Engineering Thesis, VUB, 1993.
[9] M. Mertens. A Simulator for Optical Parallel Computer Architectures: Description of a Standard Cell. HCM ERBCHRXCT930382 note, Vrije Universiteit Brussel, 1994.
[10] H. Thienpont, M. Kuijk, W. Peiffer, et al. Optical Data Transcription and Optical Logic with Differential Pairs of Optical Thyristors. Topical Meeting of the International Commission for Optics, Kyoto, Japan, April 4-8, 1994.
[11] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[12] Parallel C User Guide, Texas Instruments TMS320C40. 3L Ltd, 1992.
413

AUTHORS INDEX

Archambaud D., 155
Arioli M., 97
Arvind D.K., 203
Åström A., 215
Bouchitté V., 319
Boulet P., 319
Brown D.W., 37
Cardarilli G.C., 109
Catthoor F., 131
Catthoor F., 191
Champeau J., 245
Christopoulos C.A., 235
Clark J.J., 85
Cornelis J., 235
Cornelis J., 401
Le Pape L., 245
Lemaitre M., 307
Lojacono R., 109
Luksch P., 389
McWhirter J.G., 25
Megson G.M., 283
Mertens M., 401
Penné J., 155
Perrin G.R., 341
Pirsch P., 179
Pirsch P., 353
Popp O., 167
Pottier B., 245
Proudler I.K., 25
Rangaswami R., 295