Lecture Notes in Computer Science
Edited by G. Goos and J. Hartmanis

216

Christer Fernstrom   Ivan Kruzela   Bertil Svensson

LUCAS Associative Array Processor
Design, Programming and Application Studies

Springer-Verlag Berlin Heidelberg New York Tokyo

Editorial Board
D. Barstow  W. Brauer  P. Brinch Hansen  D. Gries  D. Luckham  C. Moler  A. Pnueli  G. Seegmüller  J. Stoer  N. Wirth

Authors
Christer Fernstrom, Ivan Kruzela, Bertil Svensson
Department of Computer Engineering, University of Lund
P.O. Box 118, 22100 Lund, Sweden

CR Subject Classifications (1985): B.1.4, B.2.1, B.3.2, C.1.2, D.3, F.2.1, G.2.2, H.2.6, I.4.0, I.4.3, I.4.6, I.4.7

ISBN 3-540-16445-6 Springer-Verlag Berlin Heidelberg New York Tokyo
ISBN 0-387-16445-6 Springer-Verlag New York Heidelberg Berlin Tokyo

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically those of translation, reprinting, re-use of illustrations, broadcasting, reproduction by photocopying machine or similar means, and storage in data banks. Under § 54 of the German Copyright Law where copies are made for other than private use, a fee is payable to "Verwertungsgesellschaft Wort", Munich.
© by Springer-Verlag Berlin Heidelberg 1986
Printed in Germany
Printing and binding: Beltz Offsetdruck, Hemsbach/Bergstr.
PREFACE
Performance requirements for computers are steadily increasing. New application areas are considered that pose performance requirements earlier thought unrealistic. In the history of computing, growing demands have to a substantial degree been met through increased circuit speed. However, in the most powerful computers of each time, parallelism has also been introduced, because improvements in circuit speed alone have not been sufficient to produce the required performance. The 40 year history of computing shows that concepts introduced in high-performance computers often become part of the design of more moderately sized (or at least more moderately priced) wide-spread computers a few years later. The rapid progress of Very Large Scale Integration (VLSI) technology also helped increase the use of parallelism.

New computer architectures often originate from the need to efficiently solve problems arising in some specific application areas. However, many architectures are of a general purpose kind or tuned specifically to these problem classes, which, in a way, demonstrate great similarities with each other. Thus, the need to discover efficient classes of parallel algorithms for solving problems from different areas on specific machines is evident. Existing programming languages are strongly influenced by classical computer architecture and thus not suited for expressing these algorithms. Therefore a need for new languages is also evident.

The necessity of abandoning the von Neumann architecture in the design of high-performance systems has been advocated by many authors. One of the most prominent is John Backus. He maintains that we are also hampered in our way of designing algorithms by the habit of always breaking them down into sequential form: "It is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand" [Backus78]. This view points to the importance of implementing radically new computer architectures and using them in practice. Many computational problems have engaged a large number of computer scientists for decades due to the continued relevance of these problems. With new architectures, some of these problems may be less important while others will become essential. For example, when working with a highly parallel computer, we may find sorting to be of little interest, yet the problem of routing large amounts of data between different parts of the machine without conflict now becomes a salient problem.

The LUCAS project (Lund University Content Addressable System) is an attempt to design and evaluate a highly parallel system while still keeping its size within the limits necessary for a university research project. The initial plans, greatly inspired by the monograph "Content Addressable Parallel Processors" by Caxton Foster [Foster76], were drawn in 1978. The project started in the autumn of that same year. After simulations and implementation of a prototype, the final design (with 128 processors and a general purpose interconnection network including the perfect shuffle/exchange) was decided upon in 1980 and fully implemented in 1982. In 1985 a dedicated input/output processor was added to the system.

The main objective of the LUCAS design and implementation was to allow a research vehicle for the study of architectural principles, programming methodology and applicability of associative array processors. With certain principles and design details fixed (such as the bit-serial working mode and the use of conventional memory circuits), the implementation of LUCAS allowed modification of architectural parts to suit certain applications. These parts include the instruction sets at the different levels, the input/output system, and the network that interconnects the processing units. The number of processing elements in the design is not limited in itself, but has been fixed to 128 in the implemented version used for application studies.

The algorithms that have been programmed and evaluated on the machine mainly concern three large areas - image processing, signal processing, and database processing. New programming tools and languages were developed to express parallelism and associativity.

This book is an attempt to compile the underlying thoughts, design principles, programming tools and experiences from the project. The greater part of the book is material from three PhD theses published in 1983 [Fernstrom83, Kruzela83, Svensson83a]. Also included is a summing up of continued work on an improved architecture tuned for signal processing (described in [Ohlsson84]) and on the design of a dedicated I/O Processor [Kordina85].

The book is organized as follows:

Part 1, Processor Design, starts with a chapter introducing parallel and associative processing. It continues with a rather detailed description of the LUCAS system architecture, followed by an overview of the basic instructions. Part 1 concludes with a comparison of LUCAS to related designs published in the literature.

Part 2 is devoted to programming aspects, both on the microprogramming and the application programming level. A new microprogramming language which greatly simplifies the mastering of the parallel computing structure is presented. A high level language (Pascal/L), suitable for expressing parallel algorithms, is also defined. Comparison with other proposed languages is made.

Part 3 of the book comprises three chapters on applications. The first of these - Chapter 7 - treats some well known problems implemented on LUCAS. The problems are taken from three important classes of computations, namely matrix multiplication, computation of the discrete Fourier Transform, and solution of graph theoretical problems. Chapter 8 discusses the use of LUCAS in relational data base processing and shows that many of the operations in this field can be efficiently implemented. Chapter 9 shows the implementation of image processing algorithms. Chapters 8 and 9 both compare the results with reported results from other designs.

Part 4, the epilogue, contains conclusions and a description of continued research. The proposal for an improved processing element with a bit-serial multiplier is included here, as are the conditions for VLSI implementation of the processor array.

Many people have been helpful during the work that resulted in this book. We want to thank Rolf Johannesson and the staff at the Department of Computer Engineering at the University of Lund. We are deeply indebted to Lennart Ohlsson and Staffan Kordina for the permission to include their results in the book. Anders Ardo has implemented the text formatting system which greatly simplified the work of preparing the manuscript. The Swedish National Board for Technical Development has provided financial support. Professor Dines Bjorner, who served as scientific advisor to the Board, has given us valuable constructive criticism. We are also grateful for the support from Lund Science Corporation, the University of Halmstad and Cap Sogeti Innovation in France for having made the publication of this book possible.

Christer Fernstrom   Ivan Kruzela   Bertil Svensson
CONTENTS

PART 1. PROCESSOR DESIGN

Chapter 1  Parallel and Associative Processing
1.1  INTRODUCTION  2
1.2  PERFORMANCE IN PARALLEL MACHINES  4
1.3  ASSOCIATIVE ARRAY PROCESSORS  6
     1.3.1  Associative Memories  7
     1.3.2  Bit-serial Working Mode  11
     1.3.3  A Bit-serial Associative Processor  13
1.4  INTERCONNECTION NETWORKS IN SIMD SYSTEMS  15
     1.4.1  Introduction  15
     1.4.2  The Perfect Shuffle  17

Chapter 2  LUCAS System Architecture  27
2.1  SYSTEM OVERVIEW  27
2.2  CONTROL UNIT  32
     2.2.1  Overview  32
     2.2.2  Instruction Timing  33
     2.2.3  Microprogram Sequencer  34
     2.2.4  Address Processor  35
     2.2.5  Common and Mask Registers  39
     2.2.6  I/O Buffer Register  39
     2.2.7  Status Register  40
2.3  PARALLEL PROCESSING ARRAY  41
     2.3.1  Processing Elements  41
     2.3.2  Memory Modules and Input/Output Structure  45
     2.3.3  Communication Between Elements  46
     2.3.4  I/O Processor  50
     2.3.5  Physical Description  52

Chapter 3  Basic Instructions  54
3.1  CLASSIFICATION OF INSTRUCTIONS  54
     3.1.1  Basic Types of Instructions Operating on the Associative Memory  54
     3.1.2  I/O Instructions  58
3.2  MOVES, PERMUTATIONS AND MERGES  60
     3.2.1  Introduction  60
     3.2.2  Basic Moves  60
     3.2.3  Use of the Interconnection Network  61
     3.2.4  Automatic Routing  63
3.3  SEARCHES, COMPARISONS AND LOGICAL INSTRUCTIONS  66
     3.3.1  Type <field> --> <selector>  66
     3.3.2  Type <field> <field> --> <selector>  67
     3.3.3  Type <field> <constant> --> <selector>  67
     3.3.4  A More Complex Search  68
3.4  ARITHMETIC INSTRUCTIONS  69
     3.4.1  Addition and Subtraction  69
     3.4.2  Multiplication  70
3.5  SUMMARY OF EXECUTION TIMES  74

Chapter 4  Comparison with Related Designs  75
4.1  STARAN  75
4.2  DAP  76
4.3  PROPAL 2  77
4.4  Vastor  77
4.5  CLIP4  78
4.6  MPP  78
4.7  CONCLUSION  79

PART 2. PROGRAMMING

Chapter 5  LUCAS Microprogramming Language  84
5.1  INTRODUCTION  84
5.2  MICROPROGRAMMER'S VIEW OF LUCAS  87
5.3  INTRODUCTION TO THE LANGUAGE  89
5.4  LANGUAGE ELEMENTS  93
     5.4.1  Constants  93
     5.4.2  Variables, Assignments  94
     5.4.3  Subroutines  95
     5.4.4  Microprograms  97
     5.4.5  Statements I - Program Flow Control  98
     5.4.6  Statements II - Array Operations  102
5.5  PROGRAM EXAMPLES  106
5.6  MICROPROGRAM COMPILER  109
     5.6.1  Introduction  109
     5.6.2  Intermediate Code  109
     5.6.3  Code Improvement  111

Chapter 6  Pascal/L - A High Level Language for LUCAS  114
6.1  INTRODUCTION  114
6.2  OVERVIEW OF PASCAL/L  117
6.3  LANGUAGE DESCRIPTION  119
     6.3.1  Declaration of Data  119
     6.3.2  Indexing of Parallel Variables  122
     6.3.3  Expressions and Assignments  123
     6.3.4  Control Structure  125
     6.3.5  Standard Functions and Procedures  127
     6.3.6  Microprograms  129
6.4  EXECUTION ON LUCAS  130
     6.4.1  Pascal/L Pseudo-machine  130
     6.4.2  Parallel Expressions  137
     6.4.3  Where Statement  138
6.5  PROGRAMMING EXAMPLES  142

PART 3. APPLICATION STUDIES

Chapter 7  Some Well-known Problems Implemented on LUCAS  145
7.1  INTRODUCTION  146
7.2  MATRIX MULTIPLICATION  146
     7.2.1  n x n Matrices, n Processors  147
     7.2.2  n x n Matrices, n^2 Processors  151
     7.2.3  n x n Matrices, More Than n But Fewer Than n^2 Processors  155
     7.2.4  n x n Matrices, More Than n^2 Processors  156
7.3  FAST FOURIER TRANSFORM  158
     7.3.1  The Discrete Fourier Transform  158
     7.3.2  The Fast Fourier Transform  159
     7.3.3  Implementation on LUCAS  162
7.4  THREE GRAPH-THEORETIC PROBLEMS  168
     7.4.1  Shortest Path Between Two Given Vertices. Unit Path Length  168
     7.4.2  Shortest Path Between All Pairs of Vertices in a Weighted, Directed Graph  171
     7.4.3  Minimal Spanning Tree  173
     7.4.4  Discussion  177

Chapter 8  LUCAS as a Backend Processor for Relational Database Processing  179
8.1  INTRODUCTION  179
8.2  RELATIONAL ALGEBRA ON LUCAS  181
     8.2.1  Introduction  181
     8.2.2  Representation of a Relation in the Associative Array  182
     8.2.3  Some Basic Operations in the Associative Array  185
     8.2.4  Internal Algorithms for Algebraic Operations  189
     8.2.5  Performance Analysis  198
     8.2.6  Comparison of LUCAS with Alternative Designs  207
8.3  INTERNAL QUERY EVALUATION IN A SIMPLE DATABASE COMPUTER  208
     8.3.1  Introduction  208
     8.3.2  Database  210
     8.3.3  Evaluation of a Query  212
     8.3.4  Discussion  217
8.4  COMPARATIVE PERFORMANCE EVALUATION OF DATABASE COMPUTERS  217
     8.4.1  Introduction  217
     8.4.2  Specification of Characteristics of Database Machines  219
     8.4.3  Database and Queries  220
     8.4.4  Response Times of LUCAS  223
     8.4.5  Performance Comparisons  225
     8.4.6  Influence of the Size of the Associative Array  227
     8.4.7  Conclusions  229
8.5  EXTERNAL EVALUATION OF JOIN  229
     8.5.1  Introduction  229
     8.5.2  System Description  230
     8.5.3  Algorithm and Timing Equations  232
     8.5.4  Discussion  235
8.6  CONCLUSIONS  237

Chapter 9  LUCAS as a Dedicated Processor for Image Processing  241
9.1  COMPUTATIONAL DEMANDS IN IMAGE PROCESSING  241
9.2  DIFFERENT ATTEMPTS TO MEET THE DEMANDS  243
     9.2.1  Fast Neighbourhood Access  243
     9.2.2  A Small Number of Special Purpose Processors  243
     9.2.3  A Large Number of Conventional Microprocessors  244
     9.2.4  A Very Large Array of Simple Processors  245
     9.2.5  LUCAS Compared to Other Machines  245
     9.2.6  The Advantages of Image Parallelism  246
9.3  ORGANIZATION OF PROCESSOR ARRAYS FOR IMAGE PROCESSING  247
     9.3.1  Introduction  247
     9.3.2  Two-dimensionally Organized Arrays  248
     9.3.3  Linearly Organized Arrays  248
9.4  IMAGE OPERATIONS ON LUCAS ORGANIZED AS A LINEAR ARRAY OF PROCESSING ELEMENTS  250
     9.4.1  Introduction  250
     9.4.2  Genuinely Local Operations. Small Neighbourhood Sizes  252
     9.4.3  Genuinely Local Operations. Larger Neighbourhood Sizes  262
     9.4.4  Semi-local Operations  265
     9.4.5  Measurements  273
     9.4.6  Global Transforms  277
     9.4.7  Input/Output  278
     9.4.8  Larger Images  278
     9.4.9  Comparison of Execution Times  280
9.5  CONCLUSIONS  283

PART 4. EPILOGUE  284

Chapter 10  Conclusions and Continued Research  285
10.1  GENERAL  285
10.2  A PROPOSAL FOR A MORE POWERFUL PE ARCHITECTURE  286
     10.2.1  The New Design  286
     10.2.2  Execution Times with the New Design  291
10.3  VLSI IMPLEMENTATION OF THE PROCESSOR ARRAY  292
     10.3.1  Off-chip Memory  293
     10.3.2  On-chip Memory  296
     10.3.3  No Interconnection Network  296
10.4  FINAL WORDS  297

Appendix 1.  ALU Functions  298
Appendix 2.  LUCAS Microprogramming Language  301
Appendix 3.  Pascal/L - Syntax in BNF  309

References  312
Part 1

PROCESSOR DESIGN

Chapter 1

PARALLEL AND ASSOCIATIVE PROCESSING
1.1 INTRODUCTION

The rapid development of computers during the last decades has pushed the state of the art in two different directions: computers are becoming smaller and they are becoming more powerful.

Advances in different fields have contributed to the development: technological progress has influenced speed, cost and size of the components, new algorithms have been developed for the basic operations, such as arithmetic operations, and new forms of organizing the entire systems are used, where parallel operation between the system components is exploited.

All these areas have had impact on the development of more powerful machines. Unfortunately we are approaching the speed limits of gates and flip-flops, which means that enhancement in circuit technology alone will only allow a relatively small gain in speed. It is clear that questions concerning the organization of systems, together with the development of new algorithms, will play an increasingly important role for further advances.

According to a classification scheme of Flynn [Flynn66], there are four different categories of computer organization. The basis of this scheme is that a processor of any kind processes data by a sequence of instructions. Based on the context of a data stream and an instruction stream, the following possibilities exist:
* SISD - Single Instruction stream Single Data stream
* SIMD - Single Instruction stream Multiple Data stream
* MISD - Multiple Instruction stream Single Data stream
* MIMD - Multiple Instruction stream Multiple Data stream
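The difference between the SISD and SIMD items in the list above can be made concrete with a small sketch. The following Python fragment is purely illustrative - the function names and data are invented here, and real SIMD hardware broadcasts the instruction to all units rather than looping:

    # SISD: one instruction stream working on one data stream;
    # the elements are processed one at a time.
    def sisd_add(a, b):
        result = []
        for x, y in zip(a, b):        # one element per instruction step
            result.append(x + y)
        return result

    # SIMD: one "add" instruction is broadcast to many processing
    # elements, each holding its own pair of data elements.  The
    # comprehension below only simulates what the hardware does in
    # a single broadcast step.
    def simd_add(a, b):
        return [x + y for x, y in zip(a, b)]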
The von Neumann architecture belongs to the SISD category. In an SIMD architecture each processing unit executes the same instruction, but on different data. In MIMD systems many processors cooperate to solve a common computational task, but the tasks assigned to the individual processors can all be different. The exact structure of the MISD architecture is not fully agreed upon. Some authors put pipelined processors in this category, others claim that pipelined processors belong to the SIMD category, in which case the MISD category becomes empty.

We will in the following only deal with the SIMD category of parallel computers. This class of computers is well suited for applications where the same (often rather simple) operation is performed on a large number of well structured data elements.

Different taxonomies for SIMD computers have been presented. We will borrow the following definitions from Thurber [Thurber76]:

SIMD processor - a computer architecture characterized by an SIMD orientation of data and procedure streams.

Array processor/parallel processor - an SIMD processor in which the cells usually bear some topological relationship to each other.

Associative processor - an SIMD processor in which the prime means of element activation is an associative process. (The meaning of this will be explained in Section 1.3.) Generally the cells of an associative processor have a loose topological relationship and are functionally very simple. The processor is usually designed around an associative memory system.

We will use the term associative array processor to denote an associative processor, as defined by Thurber, in which a communication network defines a topological relationship between the processing elements. As proposed by Slotnick [Slotnick82], we will use the term processing element, or PE for short, rather than "processing unit", since this suggests a simpler internal structure, as is commonly the case in SIMD systems.
We terminate this section with the observation that the name "array processor" sometimes is used to designate a processor which is "suitable for processing arrays". These "array processors" are usually pipelined back-end computers which serve as attached resources to minicomputers. In our terminology "array processor" stands for an SIMD organized processor as described above.
1.2 PERFORMANCE IN PARALLEL MACHINES

It is the need for larger capacity which is the reason for introducing parallelism in a computer system. Therefore it is important to have accurate methods to decide the influence of different design parameters on the capacity.

Three aspects of capacity are frequently referenced in the literature, namely the bandwidth, the speedup and the efficiency. By the bandwidth we mean the number of operations that can be performed in the system per time unit. The speedup indicates how much faster a computation is done in the parallel machine than if it was executed on a sequential computer. Efficiency, finally, measures the utilization of the parallelism in the machine for a certain computation.

To obtain a value of the bandwidth, we assume that a computation C consists of n operations which can be performed simultaneously. We assume further that the operations are independent and can be executed without any form of interaction. A space-time diagram shows the hardware utilization as a function of time. Figure 1.1 is a space-time diagram for the computation on an array processor with p processing elements.
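The three measures can be restated compactly. The Python lines below are an illustrative paraphrase of the definitions above, not formulas taken from the book's later analysis; the variable names are invented:

    # t_seq: execution time on a sequential computer
    # t_par: execution time on the parallel machine
    # p:     number of processing elements
    # n_ops: operations performed by the computation
    def bandwidth(n_ops, t_par):
        return n_ops / t_par              # operations per time unit

    def speedup(t_seq, t_par):
        return t_seq / t_par

    def efficiency(t_seq, t_par, p):
        return speedup(t_seq, t_par) / p

    # Example: 1000 independent operations on p = 128 PEs, one
    # operation per time unit: t_seq = 1000, t_par = 8 (the ceiling
    # of 1000/128), so speedup = 125 and efficiency = 125/128 = 0.98.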
Chapter 2

LUCAS SYSTEM ARCHITECTURE

2.1 SYSTEM OVERVIEW

Figure 2.1 Overview of the LUCAS system.
The Master Processor takes care of user interaction, input/output (together with the I/O Processor) and file handling, and is responsible for sending instructions to the Control Unit. It is interchangeable and an ordinary mini- or microcomputer can be used. Presently a Prolog Z80 microcomputer system is used, which includes a MATROX graphical display processor, a printer with graphic capabilities and a link to the LUNET data communication network at the University of Lund.
The Processor Array, or Associative Array, consists of 128 processors (Figure 2.2). The main parts of a processor are the Memory Module (MM) and the Processing Element (PE). The Memory Module is a 4096 bit memory, where one bit is accessible at a time. All Memory Modules receive the same address from the Control Unit, i.e. a bit-slice of the whole memory area is accessible at a time as indicated in Figure 2.2. The Processing Elements work on one-bit data and have four internal one-bit registers. Most often a PE uses data from "its own" Memory Module. However, an interconnection network is used to route data from the memories to the PEs. This allows data access from other modules as well. The system works in full synchronism and the Processing Elements all get the same control signals each clock cycle from the Control Unit.
Figure 2.2 Schematic drawing of the Processor Array (Memory Modules with I/O shift registers, Interconnection Network, Processing Elements, and Select First network).
The ensemble of Memory Modules may be seen as one single content addressable memory (associative memory). A typical case when this view is adequate is when a command is issued to all the Processing Elements to compare the memory contents with a certain template. Match/non-match is marked in one of the one-bit registers, the Tag register, in each PE. When a PE has its Tag set to one, we say that the PE is selected. Often, data from the Memory Modules of selected PEs are to be read out to the Master Processor. To allow sequential access to these, a multiple match resolver is included. This takes the form of a Select First network, which keeps the first selected Processing Element and resets the Tag registers in all the following PEs. The Select First network can be used iteratively to select the PEs in sequence.
With every memory module there is an 8-bit shift register, the I/O Register, used for input to and output from the memory module.

The Control Unit (Figure 2.3) receives instructions from the Master Processor and executes these on the Associative Array. Since the PEs work in a bit-serial fashion, the most obvious task for the Control Unit is to translate operations on data items (e.g. 8, 16 or 32 bit words) to sequences of bit operations. The Control Unit is microprogrammable. The microinstructions partly direct the sequencing of the Control Unit itself, partly direct the Processor Array. The set of control signals sent to the Processor Array consists of:

* the bit address to the Memory Modules
* a function code for the PEs
* an interconnection selector code
* control of memory writing, the I/O Registers and the Select First network

The Instruction Register, the IR, holds the current instruction. The locations and lengths of the operands are specified in the Parameter Registers, PR1...PR4. As an example, parallel addition of two vectors is shown in Figure 2.4. An important feature of LUCAS, illustrated in this example, is the ability to treat operands of different lengths with the same instructions. No extra bits that do not carry any information have to be brought through the computations.

An important part of the Control Unit is the Address Processor, which performs fast computations of addresses to bit-slices in the Memory Modules (increment, decrement, add constant, compare, etc.). The Control Unit also contains a Common (Comparand) Register and a Mask Register, each 4096 bits wide. The Common Register holds arguments which are used in the computations by all the PEs, e.g. search arguments or constants to be added to a data item in all the PEs. Through a test input to the Sequencer the contents of the Mask Register may influence the flow of microinstructions. The normal use of the Mask Register is to mask out certain bits in search operations.
Figure 2.3 LUCAS Control Unit (Instruction and Parameter Registers, Microprogram Sequencer, Address Processor, Microprogram Memory, Common and Mask Registers).
In applications where very fast input/output is required, the dedicated I/O Processor acts as a microprogrammable interface between various peripheral devices and the Processor Array. The I/O Processor includes a Buffer Memory and an Address Processor. The latter is capable of generating various address sequences to the Buffer Memory in order to reconfigure data according to the input/output conditions at hand. The Master controls the I/O Processor and the Control Unit in a similar way.
Figure 2.4 Illustration of the four-parameter instruction "add fields all": ADDFA source1, source2, dest, length.
2.2 CONTROL UNIT

2.2.1 Overview

The Control Unit, see Figure 2.5, has two parts: an interface to the Master and a microprogrammable execution unit which commands the bit-serial processing in the PEs.
Figure 2.5 LUCAS Control Unit.
LUCAS and the Master communicate via the following registers, which are mapped into the memory space of the Master in the current implementation:

* Instruction Register - IR
* Parameter Registers - PR1 ... PR4
* Status Register
* I/O Buffer Register
* Common I/O Register
* Mask I/O Register
* PE I/O Registers (one in each PE)
The Control Unit is microprogrammable with a microprogram memory of 4 k words. The microprogram memory works in two modes:

* In run mode, when it is logically organized as 4096 words, each 80 bits wide
* In load mode, when it is logically organized as 40960 words, each 8 bits wide

The mode of operation is determined by the Master Processor.

Microprogramming is horizontal, in the sense that each bit or group of bits always controls the same part of the hardware. This organization allows parallelism between microoperations on different parts of LUCAS.

In order to support simultaneous activities, several buses are used for communication within the Control Unit, between the Master and the Control Unit, and between the Control Unit and the I/O Register area in the Associative Array.
2.2.2 Instruction Timing

The Master stores instructions in the Instruction Register, with parameters in the Parameter Registers. When the Master tries to write data into any of these registers, the Bus Control logic senses if the Instruction Register is empty, in which case data is loaded and the Master gets an acknowledge signal. If the Instruction Register already contains an instruction whose execution has not yet started, the Bus Control logic sends a non-acknowledge signal, which results in the Master entering a wait state. As soon as the Instruction Register becomes free, the new instruction is loaded and an acknowledge signal is sent to the Master. The Instruction Register together with the Parameter Registers are referred to as the instruction pipeline.
protect
Instruction
the
contents
of
the
Parameter
Registers
from
being
overwritten~
the
and P a r a m e t e r Register area is not released a u t o m a t i c a l l y when the execution
of a new m i c r o p r o g r a m starts~
but is under m i c r o p r o g r a m control.
There is one other mode of communication between the Master Processor and the Control Unit, namely the "Control Unit Driven Interrupt" mode (CDI mode). In this mode, the programs that are executed in the Master and in the Control Unit can be seen as two asynchronous concurrent processes. They are asynchronous even though the Control Unit reads all its instructions by interrupting the Master, since the instructions are located in the Master's memory and this is transparent to the program in the Master. The two concurrent processes share two common resources: the I/O registers and the Status Register in the Control Unit. To guarantee exclusive access to the common resources, software semaphores must be used.

So far, the possibilities of the CDI mode have not been fully investigated. If it will be more used, more efficient ways of synchronization and protection of common resources should be considered. The reason for including this mode, which seems more difficult to handle and involves more overhead, is that once an efficient solution to the synchronization problem has been found, programs can be written without the need for the instruction scheduling mentioned earlier. As soon as instructions for the Control Unit appear, they are moved to an instruction buffer where they are fetched by the interrupt routine. In a way this instruction buffer would act as a many-leveled instruction pipeline.
2.2.3 Microprogram Sequencer

The Microprogram Sequencer, an Am2910 [Mick and Brick 80], generates the addresses to microinstructions in the Control Memory. At the beginning of a cycle, the current microinstruction (control word) is loaded into the Pipeline Register. The control word is divided into several fields, each controlling one part of the hardware. The Sequencer itself is controlled by three fields: one field defines the instruction to the Sequencer, one field specifies the address in case of branch instructions and one field selects a test condition to be used for conditional branching. The Test multiplexer gates different signals to the test input of the Sequencer:

* The current value of the Mask Register output.
* The state of the Busy flag, which indicates if the Instruction Register has been loaded.
* Zero Address Processor status.
* Zero Loopcounter status.
* Some/None status from the PEs.
2.2.4 Address Processor

A 16 bit field of the microinstruction is used to control the activity in the Associative Array: operations performed in the ALUs, the set-up of the interconnection network, input/output etc. Since all computations are bit-serial, it is important that the Control Unit can generate the bit addresses needed at a high rate.

For example, to add two fields in the Associative Array (cf. Figure 2.4), the following sequence of bit addresses should be generated by the Control Unit:

source1     - bit 0     / used to fetch bit 0 of source 1
source2     - bit 0     / used to fetch bit 0 of source 2
destination - bit 0     / used to store the result
source1     - bit 1
source2     - bit 1
destination - bit 1
...
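To make the bit-serial working mode concrete, the following Python sketch mimics the loop that this address sequence drives. It is an illustration only - the memory model and the function name are invented, and on LUCAS all PEs perform the loop body simultaneously under Control Unit command:

    # Each word models one PE's 4096-bit Memory Module as a list of bits.
    def add_fields(memory, src1, src2, dest, length):
        for word in memory:                # done in parallel by the PEs
            carry = 0                      # the C register of the PE
            for i in range(length):        # least significant bit first
                a = word[src1 + i]         # fetch bit i of source 1
                b = word[src2 + i]         # fetch bit i of source 2
                word[dest + i] = a ^ b ^ carry        # store the sum bit
                carry = (a & b) | (a & carry) | (b & carry)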
The part of the Control Unit which handles this is the Address Processor. The Address Processor is a 12-bit processor capable of doing the integer arithmetic needed for address computations. A new address is generated each clock cycle, stored in the Address Register and used to access the Associative Memory, while the next address is being calculated. In this way the Address Processor introduces no delay in the execution of microprograms.

The Address Processor is implemented with three 4-bit bit-slice microprocessors (Am2901A) and an external data memory organized as a LIFO stack. It has 16 internal registers and an ALU. The stack holds 32 12-bit values. Operands to the ALU are the internal registers, the LIFO stack, the Parameter Registers, the I/O Buffer Register (see Section 2.2.6) and a field of the control word.
Operations can be performed on:

- any single operand
- any pair of internal registers
- any register and the top element of the stack
- any register and the data presented on the data input

The result of an operation is always stored in the Address Register. It may also be written back into one of the internal registers and it can be pushed on the stack.
Figure 2.6 The Address Processor.
In order to reduce the number of bits needed to control the Address Processor, a subset of sixteen instructions has been chosen and a PROM is used to code a 4-bit field of the microinstruction into the 10 bits needed to control the Am2901A. In addition to the instruction to the Address Processor, the microinstruction contains two 4-bit fields which independently select two of the internal registers for operands (REGA and REGB). Table 2.1 summarizes the instruction set of the Address Processor. In the table, REGB stands for the internal register addressed by the REGB field of the microinstruction and (REGB) for the contents of this register. DATA stands for the value of the data inputs.
Mnemonic    Internal Operation          Output
TBZ         None                        (REGB)
LDBD        DATA -> REGB                DATA
LDBA        (REGA) -> REGB              (REGA)
INCB        (REGB)+1 -> REGB            (REGB)+1
DECB        (REGB)-1 -> REGB            (REGB)-1
ADBAD       (REGA)+DATA -> REGB         (REGA)+DATA
TAEQB       None                        (REGB)-(REGA)
ADDBA       (REGB)+(REGA) -> REGB       (REGB)+(REGA)
SUBBA       (REGB)-(REGA) -> REGB       (REGB)-(REGA)
AINCB       (REGB)+1 -> REGB            (REGA)
ADECB       (REGB)-1 -> REGB            (REGA)
AADAD       (REGA)+DATA -> REGB         (REGA)
AADBA       (REGB)+(REGA) -> REGB       (REGA)
ASUBBA      (REGB)-(REGA) -> REGB       (REGA)
ALDBD       DATA -> REGB                (REGA)
PLADR       None                        DATA

Table 2.1 Address Processor Instructions
An instruction from the Master to the Control Unit often includes parameters, which typically indicate the location of data within the PEs. These parameters are stored by the Master in the Parameter Registers, and once the execution of the instruction has started, the contents of the Parameter Registers can be loaded into the internal registers of the Address Processor through the data input.

A zero-indicator senses an all-zero output from the Address Processor and generates the ZAP (Zero Address Processor) signal, which is fed into the Test multiplexer. This allows the use of the internal registers as loop counters for microprogram looping.
the use of the internal registers as loop counters for m i c r o p r o g r a m looping. Since
the
inner
loops
of
bit-serial
arithmetic
operations
tend
Processor busy in each clock cycle (recall the example above)~ been incorporated
in the Control
Unit,
This 12-bit
sources as the data input to the Address Processor.
to
keep
the
Address
an extra loopcounter has
counter is loaded f r o m
the same
In this way i t is possible to load Jt
d i r e c t l y f r o m a Parameter Register to be used in unnested loops,
in nested loops the
i n i t i a l loopcounter value is held in one of the internal registers of the Address Processor. To set up the innermost [oop~ into
the
Ioopcounter.
Address Processor,
this value is pushed on the stack and then popped back
Loopcounters
for
the
outer loops are handled internally
in the
39
2.2.5 Common and Mask Registers

The Control Unit also includes a Common and a Mask Register. They are of the same size as the PE memories, 4096 bits, and are addressed by the contents of the Address Register. Input and output of data is handled through their corresponding I/O Registers, which are 8-bit shift registers, also accessible from the Master. When an I/O Register has been loaded from the Master, the data is moved into the Common or Mask Register under microprogram control. The output from the Common Register is sent to the PEs on the COMMON line. The Mask output is connected to the Test multiplexer.

In operations where one of the operands is a scalar, for example in parallel search operations or when adding a constant to a field in the Associative Array, the Common Register is used to hold this operand. The role of the Mask Register is to mask out certain bits when the operations are performed. In bit-serial operations, the Sequencer can use the mask value to conditionally skip certain addresses by "short-circuiting" the loops.

The need for a Mask Register in the Control Unit is not evident, because its task can apparently be handled by the Address Processor. However, the fact that data from the Associative Array can be loaded into the Mask Register during the execution of a microprogram makes it useful in several algorithms, for example:

Multiplication. If we want to multiply all the elements of a field with the same value, the value is moved to the Mask. Multiplication can now be performed using an algorithm which skips over strings of 0's and 1's, resulting in a significant speed-up over the standard algorithm (adding and shifting in each step), which is used to multiply two fields on LUCAS. (Note that this is particularly important on a bit-serial computer, since addition to the partial product takes much longer time than simply shifting it.)
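The run-skipping idea rests on the identity 2^j + 2^(j-1) + ... + 2^i = 2^(j+1) - 2^i: a whole run of ones in the multiplier costs one subtraction and one addition instead of one addition per bit. The Python sketch below (a Booth-style recoding in our own formulation, not LUCAS microcode) shows the principle for a non-negative multiplier m:

    def multiply_skipping_runs(x, m):
        result, i, prev = 0, 0, 0
        while m or prev:
            bit = m & 1
            if bit == 1 and prev == 0:     # a run of ones starts at bit i
                result -= x << i
            if bit == 0 and prev == 1:     # the run ended just below bit i
                result += x << i
            prev, m, i = bit, m >> 1, i + 1
        return result

    # multiply_skipping_runs(5, 0b0111100) == 5 * 60: the four-bit run
    # costs one subtraction and one addition instead of four additions.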
2.2.6 I/O Buffer Register

An important communication link in the Control Unit is the I/O Buffer Register. It is used to move data from a selected PE to the Control Unit. First one PE should be selected. This is accomplished either by a complete associative search or by loading the Tag Registers with a bit slice from the Associative Memory, where the result of a previous search may be stored. To complete the operation, the Control Unit activates the Select First chain, in order to make the selection unique, and sends an IORS (I/O Read Selected) signal to the PEs. The selected PE puts the 8 bit contents of its I/O Register on the I/O Data Bus to be loaded into the I/O Buffer Register.
Once the value has been loaded into the I/O Buffer Register it can be used for several purposes:

* The contents of the I/O Buffer Register may be written into the I/O Registers in all the PEs. This possibility permits a very flexible communication pattern between the PEs: one associative search operation selects a data source, a second search selects one or several destinations.

* The I/O Buffer Register may be read by the Master. This implements the data output register of the Associative Memory.

* Its contents can be copied into the I/O Register of the Common or the Mask Register. In this way it is possible to perform linked search operations, where the first associative search selects the key to the next search operation.

* The 8 bits of the I/O Buffer Register, padded with zeroes to the left, may be loaded into the Loopcounter or into the Address Processor, allowing indirect address links and loop values to be stored in the Associative Memory.
2.2.7 Status Register

The Master Processor may at any time interrogate a Status Register in the Control Unit. The Master has only read access to this register. The status information given contains the following six bits:

BUSY - Indicates that the instruction pipeline is full.

NONE - When this status bit is TRUE, none of the PEs has its Tag set. After an associative search it is often useful to know whether any word matched the search criterion.

ZAP - This bit indicates Zero Address Processor status.

ZLOOPC - This bit indicates Zero status of the Loopcounter in the Address Processor.

S1, S2 - Two general purpose signals from the Control Unit. Their values are defined by a two-bit field of the control word. They can for example be used to indicate the progression in a running microprogram.
2.3 PARALLEL PROCESSING ARRAY

2.3.1 Processing Elements

A Processing Element (Figure 2.7) consists of four parts:

* A set of one-bit registers: T (tag), R (result), C (carry), and X (auXiliary).
* An Arithmetic Logic Unit (ALU).
* A Data Selector.
* Part of a network that implements the SELECT FIRST and SOME functions.
Figure 2.7 A Processing Element.
2.3.1.1 Registers

The four registers, T, R, C, and X, have slightly different features.

The T (Tag) register has its output connected to the write control logic of the corresponding Memory Module. Thus it can be used as activation control, inhibiting change in the memory of those PEs where T equals zero. Furthermore, the Tag register is connected to the SELECT FIRST network, which on a control signal from the Control Unit resets all Tag registers but one, namely the one with the lowest number in the linear ordering of the PEs. The Tag register is also the input to the SOME network. This network indicates to the Control Unit whether some or none of the Tags are set. Finally, the Tag register controls the I/O Register of the Memory Module in a way described below.

The R (Result) register is the only register from which data may be written directly into the memory.

The C (Carry) register is a general purpose register. In arithmetic operations it is used to hold the carry bit.

The X (auXiliary) register is also general purpose. Its output is not directly connected to the ALU. Instead, one of the Data Selector inputs is used, as shown in Figure 2.7.
2.3.1.2 Arithmetic Logic Unit

A bipolar Programmable Read Only Memory (PROM) serves as arithmetic logic unit. The size of the PROM is 1024 words of 4 bits each. Five inputs are used to specify the function; thus 32 functions are available. All PEs receive the same function code from the Pipeline Register in the Control Unit. The remaining five inputs are data inputs.

Since the functions are specified by the contents of a PROM they can easily be altered to suit a specific application. An "ALU assembler" has been written to generate the PROM contents from the boolean expressions defining the four outputs. A general purpose function set which has proved to be applicable in a wide variety of computational areas, including image processing, signal processing and data base processing, is listed in Appendix 1.
2.3.1.3 The Data Selector

The Data Selector allows a Processing Element to receive data from eight different Memory Modules according to the wiring of the Interconnection Network. One of the eight sources is fixed to be the output of the associated Memory Module (the D input). The remaining seven inputs can be wired to any source. All PEs receive the same Data Select code.
2.3.1.4 The SOME and SELECT FIRST networks

The output of the SOME network indicates to the Control Unit whether some or none of the Tag registers are set. It has the structure shown in Figure 2.8, giving a depth of 16 gates when the number of PEs is 128.

When the Control Unit issues the SELECT FIRST command, only one (the first) Tag register is to remain set to one, all others are reset. This is done by an iterative combinatorial network with the states of all the Tag registers as inputs and CLEAR signals to the registers as outputs. To reduce the depth of the network, the outputs of the 8-input gates in the SOME network are used as look-ahead. The structure of the SELECT FIRST network is shown in Figure 2.9.

Figure 2.8 The SOME network.

Figure 2.9 The SELECT FIRST and SOME network of a board of eight consecutive Processing Elements.
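Behaviourally, the two networks compute simple functions of the 128 Tag bits. The Python below is a functional sketch of these definitions; it deliberately ignores the gate-level tree and look-ahead structure just described:

    def some(tags):
        # SOME: does any PE have its Tag register set?
        return any(tags)

    def select_first(tags):
        # SELECT FIRST: keep the lowest-numbered set Tag, clear the rest.
        out = [0] * len(tags)
        for i, t in enumerate(tags):
            if t:
                out[i] = 1
                break
        return out

    # select_first([0, 1, 0, 1]) -> [0, 1, 0, 0].  Applying it repeatedly,
    # clearing the found Tag in between, visits the selected PEs in order.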
2.3.2 Memory Modules and Input/Output Structure

With each of the Processing Elements there is a Memory Module of 4096 bits and a facility for input and output of data. Figure 2.10 shows one of the modules.

Figure 2.10 A Memory Module with its I/O Register.
The memory, which is an ordinary 4096 x 1 read/write memory chip with 55 ns access time, receives a 12 bit address from the Control Unit and two different write signals. When Write-All is activated, all Memory Modules are written into, whereas Write gives writing in a module only if the Tag register of the corresponding Processing Element is set to one.

The 4096 bit memory will sometimes be referred to as a "word" of the Associative Memory. PEs with the Tag register set will be referred to as selected PEs, and the associated words as selected words.
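The two write modes can be summarized in a few lines. The Python sketch below is illustrative only; the memory is modelled per PE as a list of bits:

    def write_slice(memory, address, bits, tags, write_all):
        # Write-All stores the bit in every Memory Module; plain Write
        # stores it only where the corresponding Tag register is one.
        for word, bit, tag in zip(memory, bits, tags):
            if write_all or tag:
                word[address] = bit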
l/O
Registers are used for
input and output
A s s o c i a t i v e A r r a y is in the f o r m of vectors, one m e m o r y time,
word.
The m e m o r y
of data.
words can only input or output one bit
but this may be done in all the 128 words simultaneously.
transform
the data f o r m a t b e t w e e n 8 bits in parellel~
bits in parallel,
Usually,
in the orthogona! d i r e c t i o n .
data in the
w h e r e each i t e m of the v e c t o r is stored in of data at a
The 128 ][/O Registers
as seen by the Master,
and 128
46
A data input process can be divided into two phases: one to f i l l the I/O Registers from the Master processor or the I/O Processor, Array.
The first
one to shift the contents into the Associative
phase needs one w r i t e cycle of the Master or the t/O Processor to
transfer 8 bits,
the second phase transfers 128 bits in one Controt U n i t clock cycle.
The Master w r i t e cycle is several times longer than the clock e y r i e period.
The I/O
Processor w r i t e cycle is of the same length as the clock cycJe.
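This format transformation is a corner turning of a 128 x 8 bit matrix. A minimal Python sketch of the idea follows; the array sizes come from the text, everything else is invented for the illustration:

    def corner_turn(bytes_in):
        # bytes_in: 128 bytes written by the Master, one per I/O Register.
        # Returns 8 bit-slices of 128 bits each, as consumed by the array;
        # shifting out one slice costs one Control Unit clock cycle.
        assert len(bytes_in) == 128
        return [[(b >> bit) & 1 for b in bytes_in] for bit in range(8)]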
2.3.3 Communication Between Elements

Two distinct paths for communication between PEs are present: the I/O Bus and the Interconnection Network.
2.3.3.1 I/O Bus

Communication over the I/O Bus is used to send data from one selected PE to all or to a subset of the PEs, to the Common or to the Mask Register. One byte is moved from its source to its destination(s) as follows:

1) Copy the source field into the I/O Registers.

2) Select the source PE by setting its Tag register to one.

3) Perform a READ-SELECTED operation, which copies the I/O Register of the selected PE to the I/O Buffer Register.

4) Perform an IOWRALL operation. The I/O Buffer Register contents is broadcast to all the I/O Registers, including those of the Common and the Mask Registers.

5) If the destination is Common or Mask, then perform a bit-serial input from the corresponding I/O Register. If the destination is one or several PEs, then set the corresponding Tags and perform a tag-masked input from the I/O Registers.

In applications where LUCAS acts as an associative memory, the I/O Buffer Register is used as a data register for input and output. To retrieve data, the Master reads the I/O Buffer Register after step 3 in the procedure above. Data which should be written into the Associative Memory is first placed in the I/O Buffer Register, whereafter steps 4 and 5 are executed.
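The core of the procedure (steps 2-4) fits in a few lines. The Python below is a behavioural sketch with invented data structures; steps 1 and 5 are the bit-serial shifts between memory and the I/O Registers and are omitted:

    def move_byte(io_regs, tags):
        # Steps 2/3: Select First makes the selection unique, then
        # READ-SELECTED copies the chosen PE's I/O Register.
        src = tags.index(1)              # lowest-numbered selected PE
        io_buffer = io_regs[src]
        # Step 4: IOWRALL broadcasts the I/O Buffer Register contents.
        for i in range(len(io_regs)):
            io_regs[i] = io_buffer
        return io_buffer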
2.3.3.2 Interconnection Network

The interconnection network comprises a 128 bit wide bus and the Data Selector in the PEs. The outputs of the PE memories are connected to the interconnection bus, where one line is reserved for each PE. On the input side, one input of the Data Selector is used to connect each PE with its memory, and a strapping area on the PE boards allows the remaining seven inputs to be connected to any of the 128 lines on the bus (see Figure 2.11).

Conceptually, communication over the interconnection network differs in several aspects from communication over the I/O Bus:

* The source and the destination of all links are fixed and not data dependent.
* Transport of data is done in parallel between different source-destination pairs, resulting in a permutation of the input data.
* Communication between PEs, where no direct link exists, can be obtained by multiple passes through the network.
Interconnection structures suitable for specific applications can be wired on LUCAS. However, there is also a need for a general purpose network, capable of permuting data in a flexible way, hopefully useful in many application areas. As a general purpose interconnection network for LUCAS, the perfect shuffle+exchange structure has been chosen. The main reasons are: (1) the generality of the network, (2) the few connections needed - only two per PE - and (3) the efficiency of the network for calculation of the Fast Fourier Transform (as shown in [Pease68] and [Stone71]), which is useful in both signal and image processing.

Figure 2.11 Interconnection structure (Memory Modules 0...127 connected to Processing Elements 0...127 over a 128-bit data bus).
The Data Selector of each PE has two inputs from this network - one "shuffle" input and one "shuffle+exchange" input - as shown in Figure 2.12. If the S input of the Data Selector is used, a perfect shuffle is made. If the N (= Neighbour's shuffle) input is used, both perfect shuffle and exchange are made. Thus the exchange operation takes no extra time.

Using the Tag registers as a mask, individual box control may be used on LUCAS. This will be described later.

The perfect shuffle+exchange structure with individual box control is generally regarded as the most useful general purpose interconnection scheme for SIMD computers, since it is capable of realizing the most frequently used permutations ([Stone71], [Lawrie76], [Yew and Lawrie 81]). In spite of this, LUCAS is - to our knowledge - the first operational SIMD computer to use this network.
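The wiring can be stated as an index permutation: with n = 2^k PEs, PE i takes its S input from the Memory Module whose number is the k-bit index of i rotated right by one position, and its N input from the module feeding its neighbour i XOR 1. The Python below is our formulation of this standard definition, not a description of the board wiring:

    def shuffle_source(i, n):
        # Right-rotate the log2(n)-bit index: the module feeding PE i.
        k = n.bit_length() - 1
        return (i >> 1) | ((i & 1) << (k - 1))

    def perfect_shuffle(data):
        n = len(data)
        return [data[shuffle_source(i, n)] for i in range(n)]

    def shuffle_exchange(data):
        # The N input: the neighbour's shuffle, i.e. shuffle then exchange.
        n = len(data)
        return [data[shuffle_source(i ^ 1, n)] for i in range(n)]

    # perfect_shuffle(range(8)) -> [0, 4, 1, 5, 2, 6, 3, 7], the classic
    # interleaving of the two halves of the input vector.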
Figure 2.12 Connections to the Data Selectors for the realisation of the Perfect Shuffle+Exchange network.
The Interconnection Network on LUCAS also includes data paths from the neighbours above and below, as shown in Figure 2.13. These are often used in image processing. One input to the Data Selector is the output from the X-register of the PE.

All Data Selectors receive the same Data Select code from the Control Unit. Therefore, there is no direct possibility to implement individual box control, because this requires that some Data Selectors choose the Shuffle input, and some the Shuffle+Exchange input. It has to be done in two steps, using the state of the Tag register to decide the box setting. In the first step, the R registers of all words are loaded from memory via the Shuffle input. In the second step, the R registers are again loaded from memory, but only in selected words, and now using the Shuffle+Exchange input.
Figure 2.13 Implemented connections to the Data Selector of the Processing Element (Direct, Shuffle, Shuffle+Exchange, Above, Below, X-register).

From a general point of view, data communication over the interconnection network is useful when the interconnection structure is regular and independent of the data values. It has been used in e.g. matrix operations, image processing and the Fast Fourier Transform. Communication over the I/O Bus is used in applications where the associative feature of LUCAS is predominant, which means that the current data values determine which PEs should communicate. It has mainly been used in database processing.
2.3.4 I/O Processor

Computing systems using parallel processing often become I/O bound, i.e. the speed of the system is limited by the capacity of the input/output channel. To avoid this, a dedicated I/O Processor [Kordina83] has been attached to LUCAS, increasing the data transfer rate from 0.8 Mbytes/second to 10 Mbytes/second. This allows e.g. fast data transfer from a disk drive or from an A/D converter sampling a video signal.
Often, the data comes from the peripheral device in a format not suitable for direct loading into the Associative Array of LUCAS. For example, an image is usually read line-wise but should in some cases be stored in another format in LUCAS. The I/O Processor is designed to take care of such necessary rearrangements of data. It is equipped with a Buffer Memory and an Address Processor capable of generating various address sequences. For rearrangement of data, the address sequences when writing and reading data to/from the Buffer are different.

The I/O Processor is constructed with Am2901 bit-slice processors as the Address Processor and with an Am2910 microprogram sequencer. Figure 2.14 shows a block diagram of the processor. Multi-level pipelining is used because of the speed requirements.

Figure 2.14 The I/O Processor.
The I/O Processor is controlled by the Master Processor through a set of registers mapped into the memory space of the Master. The registers have the following functions:

The Instruction Register (IR) receives the start address of the microprogram to be executed. The register is 8 bits wide.

Parameter Register 1 (PAR1) is used to specify the number of bytes to be read or loaded each time the microprogram is executed. The register is 12 bits wide.

Parameter Registers 2...5 (PAR2...5) are general purpose registers and the use of them depends on the task. For example, in one task where data is loaded into the I/O Registers, the parameters would represent: the start address of the Buffer Memory where data is read, the step size of the buffer address, the start address of the I/O Registers to receive data and, finally, the number of times this microprogram should be executed.

Microprogramming of the I/O Processor is similar to microprogramming of the Control Unit.
2.3.5 Physical Description

The Processor Array, consisting of 128 Processing Elements with associated Memory Modules and input/output circuitry, occupies 16 double layered printed circuit boards (Photo 2.1 (a)). Each board of eight processors measures 220 x 230 mm (Photo 2.1 (b)). It contains a total of 70 IC packages - 59 implement the processors and 11 are signal buffers and I/O address decoding circuitry.

The Processor Boards are mounted in the top rack of a 2000 x 800 mm cabinet (Photo 2.2). Immediately below resides the Control Unit, which occupies two boards. A third board in this rack contains the circuitry necessary for the loading of microprograms into the Microprogram Memory and also a display of the microinstruction word, which was used for debugging of the hardware.

The next rack contains the Master Processor. It is a Z80 system built from Prolog STD-bus boards. It also includes a floppy disk unit and a graphical display system.
Photo 2.1 (a) The 16 Processor Boards (b) A Processor Board comprising eight Processors
Photo 2.2 LUCAS
Chapter 3

BASIC INSTRUCTIONS

In this chapter we will describe the basic types of instructions that can be performed on data in the Processor Array. Examples of the different instruction types are given and execution times are calculated.
3.1 CLASSIFICATION OF INSTRUCTIONS

At each moment one bit from each Memory Module is accessible by the PEs. Taken together these bits form a bit-slice of the total memory. The total memory will often be referred to as the associative memory (AM). A number of consecutive bit-slices form a field of the AM. (Since the memory chips are of random access type, the constraint that the bits must be consecutive is not necessary, but this is normally the case.) The contents of the Tag registers may be used to mask out certain words in an operation. We use the term selector to refer to the bit vector describing the contents of these registers. A selector may be stored in a one-bit field of the associative memory.
3.1.1 Basic Types of Instruct!0ns Operat!qg on the Associative Memory The instructions used to manipulate data in the associative memory may be classified according to the types of operands and the types of results. constant,
stored
in the
associative memory. of operand).
Common
P,egister~
or
a vector,
An operand may either be a stored in
a field
of the
(We will use "field" and "vector" alternatingly to denote this type
The result of an operation is either a field or a selector,
i.e a one-bit
field. A selector can always be used in an operation to mask out some PEa. be regarded as an operand,
present in all operation types.
this when we list the different types below.
Thus~ it may also
We do not e x p l i c i t l y indicate
55
This v i e w results in the following six basic instruction classes,
each one illustrated in
Figure 3.1.
(a)
-->
Examples:
Increment field Copy field Permute field (the v e c t o r in the destination field is a certain permutation of the v e c t o r in the source field)
(b)
--> <selector>
Examples;
Maximum value of a field Minimum value of a field
(c)
-->
Examples:
Add fields M u l t i p l y fields M a x i m u m of fietds (The
elements
pairwise
and
of the
the
source
maximum
vectors
of
them
are is
compared
put
in
the
destination field) AND between fields
(d)
--> <selector>
Examples:
Field equal to field (The selector w i l l mark those words where the elements of the two fields are identical) Field greater than field (The
selector
elements
of
will
mark
those
source
field
1 are
words greater
corresponding elements of source field 2)
where
the
than
the
56
(e)
-->
Examples:
Add constant M u l t i p l y by constant Subtract constant Subtract from constant AND w i t h constant
(f)
--> <selector>
Examples:
Exact match Closest match Greater than constant
These basic instruction
types can be combined to
get other types.
consider the merging of two vectors to produce a third vector. selector is used.
In those words,
where the selector is 0,
As an example,
To control the merge,
a
the element of v e c t o r 1 w i l l
be chosen; in those words where the selector is 1 the elements of v e c t o r 2 w i l l be chosen (see Figure 5.2). many different,
i f other inputs to the Data Selector than the direct input are used, useful merges can be obtained.
The merge operation actually is the
result of two successive -to- operations, d i f f e r e n t selectors,
using d i f f e r e n t source fields and
one selector being the complement of the other.
57
(a)
source
(c)
source
2
I
source
2
sou r c e
~
~ <se]ector>
P#/J~
of
field>
~
~ <se|ector>
]
SOU r c e
types
I
(e)
F _ ~ u r e 3.1
dest
(d)
source
dest
sou rce
I
~
dest
(b)
source
dest
instructions
operating
on t h e a s s o c i a t i v e
memory
58
source
source
destination
2
selector
0
A
A
I
I
B
I
0
2
C
2
0
3
D
D
I
4
E
E
I
5
F
5
0
6
G
6
0
7
H
H
I
Figure 3.2 Illustration of a Merge operation
The specification of an instruction includes the operation and up to four parameters. basic instruction types listed above require between two and four parameters. and (f) require two parameters (field address and field length).
Types (a),
require three parameters (two field addresses and the field length). requires four parameters (three field addresses and the field length). special case%
Types (b) (d) and (e)
Type (c),
finally,
Of course,
the value of parameters may be implied in the operation - e.g.
destination field being one of the source fields,
The
in the
or the length always being eight bits.
3.1.2 [/O Instructions For transfer of data between the associative memory and the outside wortd~ between the associative memory and the Common and Mask Registers~ are used.
However~
associative memory. storage,
and aiso
the I/O registers
these can also be used for data transfer between words of the In the l a t t e r case the I/O Buffer Register is used as temporary
as described in Chapter 2.
Instructions that involve the I/O registers are the following:
59
(a)
-->
(b)
--> A selector may be used as a mask.
(c)
--> This instruction puts the contents of the selected I/O register in the I/O Buffer Register. The selector must have one single "1 ".
(d)
--> 4I/O registers>
<Mask I/O register>
This instruction puts the contents of the source register in all 1/O registers simultaneously.
I/O register instructions can likewise be combined to form instructions with other formats. As an example,
consider the case of moving the value of the selected word of one field
to all selected words (according to another selector) of another field.
This instruction,
which may be expressed as --> is a concatenation of the following ones; --> --> --> --> using the second selector as mask.
in the following sections (5.2 through 5.4) we will give examples of basic, instructions.
widely useful,
Most of the example instructions have been implemented on LUCA5 and
they form part of a general purpose instruction set for the machine.
60
3.2 MOVES,
PERMUTATIONS
A N D MERGES
3.2.1 I n t r o d u c t i o n in
this
section
array.
we
Actually~
associative array.
present some basic merely
instructions
m o v i n g data,
without
for
moving data Jn the
changing
it,
is of
little
Since the c o m m u n i c a t i o n n e t w o r k is b e t w e e n the m e m o r i e s and the PEs,
data
array,
use in the
The reason for m o v i n g data is to put i t in position for o p e r a t i n g on it.
and operation can in many cases be combined. pure
associative
movement
takes place.
However,
One e x a m p l e
data m o v e m e n t
there are also situations where
is input
and output
to
and f r o m
the
another is the case when the data r o u t i n g preceding an o p e r a t i o n must be dane in
several steps because the possibilities o f f e r e d by the i n t e r c o n n e e t i o n n e t w o r k are l i m i t e d . A c o n t e x t where data m o v e m e n t s are c o m m o n is data base processing. moved
within
However, much
the
array
to
form
new
as is shown in Chapter
fiuples
8,
in the
tables
D a t a items are
in a r e l a t i o n a l
data base.
the need to move data is still r e l a t i v e l y small,
tess than when the same data structures
are stored in a c o n v e n t i o n a l c o m p u t e r .
3.2.2 Basic Moves An e x a m p l e of a pure move instruction is the "Load Outdata A l l " instruction L D O A source which
outputs
slice,
plus
an 8 - b i t time
for
field
to the I/O
parameter
registers.
loading,
etc,
This takes one clock in
total
11
cycle per bit
clock
cycles.
The
m i c r o p r o g r a m is included in Section 5.4.6.2 as an e x a m p l e of m i c r o p r o g r a m m i n g . A
pure
move
between
two
fields
in
the
associative
array
is
accomplished
by
the
instruction MOVE source,
destination,
length.
Each bit slice takes t w o clock cycles to move (one read, loading ete,
the t o t a l t i m e is 2b+6,
one write).
With
parameter
w h e r e b is the f i e l d length.
The i n s t r u c t i o n BROADCAST
source,
selector,
destination,
takes the contents of the source field
length
in the f i r s t selected m e m o r y word and broadcasts
it to the d e s t i n a t i o n f i e l d of ell words selected by the
"selector" b i t - s l i c e .
First,
data
61
is moved f r o m the source field to the [/O registers of all words, l/O register to the ]/O Buffer Register, destination
field,
24
clock
cycles
then from the selected
next to all I/O registers, are
needed
to
and finally to the
broadcast
an
8-bit
word,
3.2.3 Use of the [nterconnection N e t w o r k The perfect shuffle+exchange network of LUCAS is illustrated in Figure 3.3.
Memory Modules
Figure 3.3 The
principle
of
PEs
the
perfect
shuffle
+
exchange
network
in
LUCAS
In r e a l i t y - as described in Chapter 2 - the exchange function is controlled by choosing the 5 (as in Shuffle) or N (as in Neighbour's shuffle) input to the Data Selector of the PE.
Since all Data Selectors are controlled by the same signals,
chosen for all PEs.
Thus~
interchange state in others~ steps.
the same input must be
if we want to use the straight state in some PEs and the as shown in Figure
3.3,
the transfer must be done in two
The state of the Tag register is used to control which input is taken.
3.3 the Tags are "1" in the PEs that are to use the shuffle+exchange input. are transferred to selected PEs using this input.
data
Then the Tags are complemented and
data are transferred to selected PEs using the shuffle input. receive data over the shuffle input.
In Figure First~
Alternatively,
all PEs first
In PEs with T=t
these data are then o v e r w r i t t e n
w i t h the data arriving over the shuffle+exchange input.
The l a t t e r procedure is slightly
faster,
since the Tags need not be inverted.
62
3.2.3.1 Merges Since
the
transfer
f e t c h e d from
described above is a t w o - s t e p
different
contain data f r o m
fields in the t w o steps.
both the source fields,
and "O's in the Tags.
As an exampl%
process,
In t h a t
data may equaUy well
be
case the destination f i e l d wilt
i n t e r m i x e d according to the p a t t e r n of
"l"s
Figure 3.4 shows how the upper halves of t w o
fields are merged to form a new field.
[]
a~
a~
[]
b~
al
a4
[]
bl
[]
a2
al
[]
b2
[]
a3
a5
[]
b3
[]
a4
a2
[]
b4
[]
a5
a6
[D
b5
[]
a6
a3
[]
b6
[]
a7
a7
[]
b7
[]
a~
Step 2
Step I
O v e r w r i t e s e l e c t e d from s h u f f l e + exchange input
Load all from shuffle input
Figure 3.Z; Merging the upper halves of two fields
In a merge operation any of the eight inputs to the Data Selector can of course be used. A merge operation is fully characterized by * the addresses of the two source fields * the address of the destination f i e l d * the length of the fields * the Data Selector input used for data from source field 1 in PEs with T=I * the Data Selector input used for data from source field 2 in PEs with
T=O
* the selector (the contents of the Tags). All
these
cannot
be specified
in
the
instruction.
The
Data
Selector
inputs
must
be
63
implicitly specified by the operation code. into t h e Tags in advance.
We also assume t h a t the s e l e c t o r ~s loaded
Our example operation using the N and S inputs,
respectively,
is then expressed MEP, QENS source 1, In words,
source 2,
dest,
length.
we express this; Merge (source 1 shuffled+exchanged) with (source 2 shuffled)
into dest where T=I/0.
With t h e same selector, inputs instead of N and S,
the operation MERGEAD,
which uses the Above and D i r e c t
will merge the e v e n - n u m b e r e d e l e m e n t s of the two sources
into the d e s t i n a t i o n field. The execution t i m e for merge operations on b-bit data is 5b+6. clock
cycles,
or 6.0 microseconds using a 5 MHz
With b=8 this makes 50 Another kind of
merging
operation is the following: The elements of two vectors are compared pairwise,
and the
m a x i m u m of them is put in the destination field.
clock.
This merge is not controlled by the
Tag contents but by the data itself.
3.2.4 A u t o m a t i c Routincj
As has been shown in [Lawrie75] many frequently useful p e r m u t a t i o n s can be done by the Omega network,
which in turn can be i m p l e m e n t e d by n passes through a single-stage
p e r f e c t shuffle+exchange network c o n n e c t i n g 2 n m e m o r i e s with 2 n processors. gives a very simple
algorithm
to
find the switch s e t t i n g s necessary
Lawrie also
for performing
a
certain permutation (see Section 1.4.2). When choosing the switch setting of a certain exchange element according to this scheme, conflicts
may
of
simultaneously.
course
arise
Both inputs may,
when
many
data
for example,
items
are
routed
in
the
want to use the lower output.
network Figure
3.5 shows a c o n f l i c t - f r e e permutation; Figure 3.6 shows one which cannot be implemented in n stages.
In the t a t t e r figure,
the unique path bringing a data i t e m from source 00t
to destination 010 has one connection - the last one - in common w i t h the unique path f r o m 01t to 011.
Therefore a c o n f l i c t occurs.
64
source
dest
dest 1
0o0
,00
o00
010
I11
010
01t
101
011
100
011
100
1
101
~
101
110
000
110
111
010
111
Figure 3.5 A c o n f l i c t - f r e e permutation.
The routing of source 101 to destination 110 is
specifically indicated source
dest
000
001
001
0t0
010
100
011
011
100
000
101
110
110
101
111
111
Figure 3.6 A
0
permutation
of
2.3
elements
that
cannot
be
performed
in
3
passes
Many of the permutations that cannot be performed in n passes can be performed in 2n passes - maybe all can.
Y e w and L a w r i e [ Y e w and Lawrie81] have suggested an extension
of I_awrie's algorithm~
which is capable of deciding switch settings for a large class of
permutations that can be performed in 2n passes.
The algorithm works as follows:
When a conflict occurs in the application of L a w r i e ' s algorithm, upper input is given the privilege to decide.
the control bit of the
This is the same as only using the control
bit of the upper input in all situations. Unfortunately,
Y e w ' s and L a w r i e ' s algorithm does not work for all permutations that
65
can be carried out in 2n passes.
The permutation in Figure 3,6 can be realized in 2n
passes,
but Y e w ' s and L a w r i e ' s algorithm does not find the
as shown in Figure 3.7,
necessary switch settings.
However,
used repeatedly,
it finds a switch setting for 4n
passes.
dest
source
dest
000
flOl
OOl
OlO
010
lO0
010
Oll
011
01%
IO0
OOO
1O0
lOl
llO
II0
lOl
111
111
000 OOl
101 II0 111
Figure 3.7 Realization of the permutation of Figure 3.6 in 6 passes
The strategy used to find the switch settings in Figure 3.7 can be expressed as follows: When a c o n f l i c t
occurs,
let the data i t e m
transferred on correctly; i f routed~
that has earlier been routed c o r r e c t l y be
both have been c o r r e c t l y routed earlier~
or both wrongly
choose the "straight" state.
3.2.4.1 I m p l e m e n t a t i o n
Algorithms
of
the
above kind can be used to
desired destinations. used
to
set
the
a u t o m a t i c a l l y route data
items to the
The destination address is simply sent along with the data and is routing
switches.
The
algorithm
of
Yew
and
Lawrie
has
been
implemented on LUCAS in the following way: Each word of the memory is assumed to contain one item of a data field and one i t e m of
a control
Furthermer%
field.
The l a t t e r
contains
the binary
address of
the destination word.
each ward is assumed to have a field containing the address of the word.
For i:= n-1 to 0 do Load the Tag registers w i t h the pattern 101010. 10.
Begin Merge (position i of control field shuffled) with (position i of shuffled+exchanged) where T=I/0. Put the result in the Tags. (Now,
control field pairs of PEs
66
have the same Tag contents). Merge (data+control field shuffled+exchanged) with (data+control field shuffled) where T=I/0. Put the result back in the same fields. End Compare control field and address field. procedure.
If any mismatch is found,
repeat the above
The Master Processor can determine the success of the routing by reading the "SOME" bit of the Status Register. The instruction needs four parameters, ROUTE data,
3.3 SEARCHES,
control,
address,
The format is the following: data length
COMPARISONS AND LOGICAL iNSTRUCTiONS
Search operations on a set of data elements look for elements with specific properties, e.g.
the element(s) with the greatest numerical value,
argument,
those exactly matching the search
or pairs that consist of identical elements.
These examples represent three
different instruction types according to the classification given in part 3.t.1, types (b),
(f)~
and (d),
namely
respectively.
3.3.1 Type --> <selector> The instruction "Maximum of field, MAXT field address~
tagmasked"
length
marks in the Tag registers the word(s) (among the selected ones) with greatest value in the
field.
It
proceeds from
most significant
bit
to
least significant bit
discarding
candidates with a "0" in the current position if there are candidates with a "1". microprogram is included among the examples in Chapter 5. cycles.
Thus,
The
The time is 3b+5 clock
the maximum element in a vector of t28 8-bit data items is found in (3*8
+ 5)*200 ns = 5.8 microseconds,
using a 5 MHz clock.
67
3.3.2 Type --> <selector> The instruction "Compare fields, COMPFT source1,
source2,
tagmasked" length
marks in the Tag registers those words - of the selected ones - where the contents of the t w o fields are identical. R f r o m M-input, ALU.
This is accomplished by a series of successive L R M A (load
all) and CRMT (Compare R to M-input,
The t o t a l t i m e is 2b+6 clock cycles.
4.4 microseconds,
With b=8,
tagmasked) instructions to the 22 cycles are needed,
which is
using a 5 MHz clock.
The instruction "Field greater than field, GREAFT source1,
source2,
tagmasked"
length
marks in the Tag registers those words - of the selected ones - where the contents of field 1 is greater than the contents of field 2. the result only the sign bit is used.
A c t u a l l y a subtraction is made,
]f o v e r f l o w occurs in the subtraction,
inverted so that a correct result is obtained also in this case. clock cycles,
which for b=8 yields 23 clock cycles,
but of
the sign is
instruction t i m e is 2b+7
or 4.6 microseconds,
using a 5 MHz
clock,
3.3.3 Type , --> <selector> The instruction "Compare field to Common, COCOT field_address,
tagmasked"
length
marks in the Tag registers those words - of the selected ones - where the contents of the examined field is identical to the contents of the same field in the Common Register.
Since bits f r o m both operands can be input to the PEs simultaneously,
this comparison
only takes one clock cycle per bit.
In t o t a l ,
b+4 clock cycles are needed.
search
a
of
for
an
8-bit
data
item
in
vector
128
items
is
Thus,
accomplished
in
a 2.4
microseconds. The instruction "Greater than Common, GREACOT field._address,
tagmasked"
length
marks in the Tag registers those words - of the selected ones - where the contents of a certain
field is greater than the contents of the same field in the Common Register.
The A C I M A function of the A L U (see instruction list in Appendix 1) is used to subtract the Common Register output bit f r o m the Memory output bit, and starting with a "1" in the Carry register,
one bit each clock cycle
The t o t a l t i m e is b+4 clock cycles also in
68
this case. We consider two versions of closest match: minimum arithmetic distance and minimum Hamming distance. The instruction "Closest match, CMACOT field_address,
arithmetically,
to Common,
tagmasked"
length
computes the difference between the Common Register value and the field value in each word,
then forms the absolute values of these differences,
minimum.
and finally searches for the
The t i m e is 7b+constant.
The search for minimum Hamming distance, CMHCOT field_address,
length,
counter_length
proceeds by first forming the exclusive OR between the Common Register value and the field
value,
then counting the number of ones in each of these results,
searching for the minimum value among the counts. time,
the second takes 2bc+constant,
used in each word. is not needed). cycles.
where c is the counter length.
The t i m e can be reduced because,
and finally
The first phase takes 2b+constant
initially,
The third phase is the search for minimum,
(A counter field is
the full counter length which takes 3c+5 clock
In t o t a l 2b+3c+2bc+constant t i m e is used for the operation.
3.3,4 A More Complex Search
As an example of a more complex search instruction, The
maximum
we consider the following:
value in a m a t r i x stored one row per memory word is formed by the
instruction M A X M A T m a t r i x start address,
no.
of columns,
data length.
The execution of this instruction proceeds in the following way:
Find the maximum element of the first column. Move this value to t h a t field of the Common Register that has the same address as the second column. Search for a greater element in this column. If there is one, move its value to that field of the Common Register that has the same address as the third column, else move the old value to this place. etc.
69
At the end of this procedure, value.
a field of the Common Register contains the maximum
A bit slice in the Associative Array and a register in the Address Processor are
constantly updated to keep track of where the maximum value so far is to be found.
3./4 A R I T H M E T I C In
this
section
INSTRUCTIONS
we
describe
how
basic
instructions
for
addition,
subtraction
and
multiplication are implemented on LUCAS. 5.4.1 Addition and Subtraction The instruction "Add fields, ADDFT sourcel~
tagmasked"
source2,
destination,
length
adds the contents of one source field to another, field.
writing the result in the destination
The implementation is a repetition of the following three ALU instructions (see
Appendix t),
constantly incrementing the address pointers. LRMA(sourcel) ADMA(source2) WRRT(destination)
Putting the length in the Loopcounter of the Address Processor makes i t possible to perform the loop counting and test without overhead. is 3b+8.
The total time for the instruction
This means that 128 pairs of 8-bit numbers are added in 6.4 microseconds~
using a 5 MHz clock.
The microprogram can be studied as an example in Chapter 5.
The implementation of the subtract-fields instruction SUBFT source1,
source2,
dest,
length
which subtracts source field 2 from source field 1, instead of ADMA,
differs from the above only in that,
the ADMIA function of the ALU is used,
is set before starting.
and that the carry register
This realizes a subtraction according to the t w o ' s complement
method. Addition of a constant to a field,
with the constant stored in the Common Register~
done with the instruction "Add Common to field~ ADCOFT source,
destination~
This takes shorter time~
is
tagmasked"
length
because bits from both arguments can be input to the ALU
70
simultaneously.
The ALU function used is ACMA~
and the execution time is 2b+6 cycles.
Subtraction o f a constant or subtraction from a constant are made in a similar way~ using the ALU functions ACIMA and ACMIA~
respectively.
Multiplication
3.4.2
3.4.2.1 Multiplication of Two Fields Multiplication of two fields of t w o ' s complement numbers MULFA source1,
source2,
destination,
length
is implemented using Robertson's algorithm in the following way. given in Chapter 5). one,
is used.
(The microprogram is
A product field of twice the length of the source fields,
minus
The bits of the multiplier in the source2 field are scanned from right to
l e f t and put in the Tags.
For each bit~
the multiplicand in the source1 field is added to
the proper positions of the product field,
but only in those words where the Tag = 1.
The sign bit in t w o ' s complement numbers actually has weight = -1.
Therefore,
the
into the product field,
but
final step is a subtraction in words with the sign bit equal to one. The execution time is the fallowing (b is the length of the operands): S t e p 1 : Loading the multiplicand~
adjusted to the right~
only in those words where the rightmost bit of the multiplier is "1".
Double sign bit is
used. Time : 3b+3 cycles. Step 2 : Clearing of the rest of the product field. Time : b-2 cycles. Step 3
:
Repeated
conditional
addition
of
the
multiplicand
on
the
to
the
product
field.
Time : (b-2)(3(b+1)+8) = 3b2+Sb-22 cycles. Step
4
:
Conditional
subtraction
based
sign
bit
of
the
multiplier.
Time : 3b+12 cycles. Total time : 3b2+12b-9 cycles. As an exampt%
b=8 gives 279 cycles,
or 55.8 microseconds.
3.4.2.2 Multiplication of a Field by a Constant Multiplication of a field by a constant has a shorter execution time.
The reason is that
71
the number of additions can be reduced; only when the m u l t i p l i e r contains a "1" at the c u r r e n t position,
an addition has to be made.
When m u l t i p l y i n g by a constant,
the constant may be put e i t h e r in the Mask Register or
in a M e m o r y Module of the Processor A r r a y .
Both can be scanned by the C e n t r e ] U n i t
and the bits tested for zero or non-zero value. a r r a y is to be tested, Tag register,
When a b i t of a constant stored in the
a selector is used to point out the word,
the bit is loaded to the
and the SOME-line is i n t e r r o g a t e d .
The average t i m e for m u l t i p l i c a t i o n
by scalar w i l l be half the t i m e for m u l t i p l i c a t i o n
of
fields. A
further
multiplier
reduction is receded
of
the
into
number
a form
code is such a f o r m [Hwang79]. where 1. stands for -1.
of
that
add-type
operations
has f e w e r non-zero bits.
In an SD number,
be
achieved
if
the
The signed-digit (SO)
the allowed digit set is ( 1 ,0,1),
The weights of digits are the usual powers of two.
The SD code of a number is not unique.
For example,
the decimal number 7 can be
coded into a 5 - d i g i t SD v e c t o r in t h r e e d i f f e r e n t ways." 00111, One of
can
these has a m i n i m a l
number
of non-zero
digits,
0100 1 ,
and 1 1 00 1 .
but in general not even the
m i n i m a l f o r m is unique. A
minimal
signed-digit
SD
vector
that
(CSD) v e c t o r
.
contains
no adjacent
Such a v e c t o r
non-zero
digits
is called
a canonical
can be obtained by the f o l l o w i n g
procedure
(Figure 3,8): Starting m-1
at the least significant
"O's preceded by a " !
bit,
replace sequences of m " l " s ,
" and succeeded by a "1"
I 0 1 1 1 0 1 1
i 1 0 1 1 1 ! 0 - 1
/ 1 1 0 0 0 - 1 0 - 1
/ "1
0 -1
0
O
0 -t
0 -1
Figure 3.8 Receding a number to canonical signed digit code
where m > 1,
by
72
By the f o l l o w i n g method~
we can a r r i v e at a m i c r o p r o g r a m to c o n v e r t a v e c t o r stored in
a f i e l d of the associative m e m o r y to CSD f o r m • The procedure described above is reatized by an a u t o m a t o n scanning the bits of B f r o m r i g h t to l e f t and w i t h the state t r a n s i t i o n graph shown in Figure 3,9.
00/0 ~o/o
10/1 oit~
Bi+lBi/Di 00/1 Figure 3.9 State t r a n s i t i o n graph for bit serial recoder
so is the i n i t i a l state. "1"
is found~
Di=l
state is changed.
is output.
When B i + I = B i = I
As tong as Bi=l
cause us to o u t p u t 1 ~ we r e t u r n to s0~
As tong as B i : 0 we stay in So,
is encountered~
we stay in Sl,
stilt staying in s 1,
o u t p u t t i n g D.=O.l
When a single
Di= 1 is o u t p u t and the
outputting Di=l.
Only if t w o consecutive
Bi=0 ~
Bi+l=l
will
"O's are found wilt
o u t p u t t i n g a "1"
+ Coding so as q=0 and s 1 as q=l gives the f o l l o w i n g n e x t - s t a t e function q : + q
= qBi+ 1 v qB i v B i + I B i
The output is three-valued. uI 0
We choose the f o l l o w i n g code=
u2 0
1
0
1
1
1
1
This gives the expressions +
ul=q u 2 = q + B i ~ where + is add modulus 2 If we equip the
ALUs w i t h functions t h a t suit these expressions exactly~
the conversion
73
can be made at maximum speed,
which is three cycles per bit-slice (one read and two
writes).
If we want to use the standard A L U function set,
needed.
The state is held in both the R and C registers.
6 cycles per bit-slice w i l t be The following sequence of
functions implements the above functions (j and j + l are the bits w r i t t e n into when bit no. i is scanned): Contents
of
registers
C
R
q XORP4vlA~ a d d r e s s WRRA,
i
address
j
LP4vIA, a d d r e s s
i
AEt4~,
i+1
q+B
i
+
address
q (=u) 1
+
q(=u)
LRCA
1 WRRA,
address
j+l,
incr
j,
loop
The relevance of doing the conversion to CSD form in the Processor Array w i l l be seen in Chapter 7, algorithm, Therefore
where m a t r i x m u l t i p l i c a t i o n is described.
each element of one of the matrices is to be used as a m u l t i p l i e r once. it
is convenient to
simultaneousiy.
a
CSD
do the
conversion to
CSD form
of
an entire
column
This is probably a rule for data produced in the array: if one i t e m is
going to serve as a m u l t i p l i e r , Using
During the execution of the
coded
many items are.
multiplier,
the
number
of
add-type
subtractions) in a multiplication w i l l on the average be b/% m u l t i p l i e r [Hwang79].
Therefore,
operations
(additions
and
where b is the length of the
the average t i m e for multiplication by a CSD coded
scalar wilt be one third of the t i m e for
m u l t i p l i c a t i o n of fields.
m u l t i p l i c a t i o n takes on the average 19 microseconds,
For exampte~
8-bit
with a 5 MHz clock.
In Chapter 10 we wilt describe the design of a new processing element that includes a multiplier, length.
resulting
in a m u l t i p l i c a t i o n
time
that
grows only linearly
with
the data
74
3.5 SUMMARY O F EXECUTION TIMES
Table 3.1 gives a summary of execution times of basic operations that we have dealt with in this chapter,
To be able to compare the speed of the 128-PE LUCAS with the speed
of a sequential computer we have also calculated the time in which a sequential computer is required to perform the same operation on one data item~ depending on the instruction type, operation",
or EST,
For example, running with
if
or on a pair of data items 9
We call this time the "Equivalent Sequential Time per
It is given in nanoseconds.
a sequential computer is to have the same performance as LUCAS
a 5 MHz clock,
it
including data fetch and storag%
must be able to add two 8-bit numbers in 50 ns,
addressing and loop counting.
8-bit Microseconds
data EST
16-bit Microseconds
data EST
MOVES AIxDMERGES Load Outdata Move
(8-bit
only)
field
Broadcast Merge
field
fields
2.2
17
4.4
34
7.6
59
4.8
38
9.6
75
6.0
47
10.8
84
5.8
45
10.6
83
SEARCHES AND CCF43ARES Maximum of Compare Field
field
Compare Field
4.4
34
7.6
59
4.6
36
7.8
6t
to Corrrnon
2.4
19
4.0
31
t h a n Corrrnon
2.4
19
4.0
3t
6.4
50
11.2
88
4.4
34
7.6
59
fields
greater field
greater
than
field
ARITHVETIC
Add
(Sub)
fields
Add Corfmon to f i e l d Multiply
fields
Multiply
field
by Corrmon ( a v e r . )
Multipl field by C S D - c o d e d Common ~ a v e r a g e )
56
436
190
28
219
95
1484 742
19
145
63
492
Table 3.1 Execution times for some basic instructions on LUCAS running with 5 MHz clock
frequency.
EST
is
the
Equivalent
Sequential
Time
per
operation
Chapter /4 COMPARISON WITH RELATED DESIGNS
The
interest
increasing. 777
in
associative
computing
started
and Hwang and Briggs
associative computers have been mad%
but
the
fifties
Surveys on the topic can be found in [Thurber75~
Parhami73~
actually
in
used,
so far
lack
of
has
been steadily
Thurber76,
You and Feng
Numerous proposals for
the design of
yet only a few real machines have been built and
Many application areas for
the
changing,
84].
and
cost efficiency
associative computers have been identified, use.
This situation is now
due to the development in semiconductor technology,
has l i m i t e d
their
A few machines with
characteristics similar to those of LUCAS have been built.
We w i l l briefly r e v i e w the
main features of six such designs and compare them to LUCAS.
Two of the machines
have been implemented in a university environment (Vastor and CLIP4), (STARAN~
DAP and PROPAL 2) are commercial products~
have
built,
been
STARAN~
MPP
DAP~
processor arrays~
finally~
PROPAL
is
built
by
industry
Three of them
but very few copies of them for
dedicated
use
by
NASA.
2 and Vastor can be considered as being general purpose
whereas CLIP4 and MPP are designed exclusively for image processing.
4.1 STARAN STARAN
[Botcher78]
from
processor array to be built.
Goodyear
Aerospace
Corporation
was
the
first
The first machine was demonstrated in 1972.
bit-serial
tt was built
from standard off-the-shelf integrated circuits. An array in STARAN
consists of 256 Processing Elements.
together,
Each Processing Element has three flip-flops.
from
memory and one bit
the
memory words are 256 bits long~ A
key element of
[Batcher77].
the
STARAN
from
either
Several arrays can be used
Any Boolean function of one bit
of the flip-flops
can be performed.
The
in later versions extended up to 9 kbits, computer is the muitidimensionat access (MDA) store
Elements of data are scrambled in a certain way as they are w r i t t e n into
m e m o r y so t h a t data can be accessed in d i f f e r e n t directions. word or a 256 bit
slice are accessed equally easy.
For example~
a 256 bit
This is achieved by an ingenious
scheme that always puts two bits of the same word in different memory chips and at d i f f e r e n t addresses,
Therefore~
modifications of the bit-addresses to each memory chip,
and passing of data through a so called Flip network,
are needed.
76
The Flip n e t w o r k also accounts for the routing of data between Processing Elements (see Section 1.4.2).
i t has many characteristics
network implemented on LUCA5.
The S T A R A N network is of multistage type,
LUCAS uses a single-stage network, is d i f f i c u l t
in common with the p e r f e c t shuffle/exchange
individual box control,
whereas
as discussed in Section 1.4.2
to achieve on STARAN but can be programmed on LUCAS as shown in Section
3.2.3. Input/output
takes place over a 32-bit wide data bus.
The Flip network allows d i f f e r e n t
subsets of the PEs to be reached in one I/O operation. 32 bits
to
be stored
parallel i n p u t / o u t p u t , In
conctusion~
in parallel
in one word.
A
Also,
provision
the M D A store allows the is made also for
256 bit
but this requires special purpose peripheral configurations.
STARAN
and L U C A S
have
many
features
in common.
The
STARAN
a r c h i t e c t u r e had great influence on the design of LUCAS.
4.2 D A P The
Distributed
company
in
Array
Processor
England.
The f i r s t
-
DAP
[Reddaway79]
experimental
model
was
developed
by
was operable in 1976.
the
ICL
The f i r s t
customer model was delivered to London U n i v e r s i t y in 1980. The pilot D A P has 1024 PEs arranged in a 32 x 32 array. PEs in a 64 x 6/4 configuration. the North, The
PEs
South,
The customer model has 4096
Each PE can communicate with its nearest neighbours in
East and West directions.
comprise
three
one-bit
registers:
Activity,
Associated with each PE is a /4 kbit memory module. same b i t address~ no address m o d i f i c a t i o n
Carry
and
Result
registers.
A l l memory modules receive the
and p e r m u t a t i o n of data,
as in 5 T A R A N ,
is
"highways" in the v e r t i c a l
and
done. For
the purpose of input/output
h o r i z o n t a l directions are used. processor array, same column.
and data broadcasting,
The w i d t h of a highway is the same as the side of the
thus allowing a PE to send data to all PEs in the same row or in the Thus,
DAP allows 32-bit wide I/O transfer in the pilot version and 64-bit
wide transfer in the customer version. A
special
memory
feature module
arrangement space
of
entirely
the
with of
in which Master
different
DAP
the
any 8-bit computer
ways.
accessed at a t i m e r
is that the t o t a l i t y
Host
While
computer. slice
of
This the
via the I/O on L U C A S
of memory is
in
memory registers. an 8-bit
a
modules forms
way
similar
to
a standard the
LUCAS
can be mapped into the address However, portion
of
data
is accessed in
each memory
word
is
on D A P a 32-bit slice over a row or column of the PE is available.
77
The difference lies in the fact that with LUCAS the Host accesses the t/O data registers which
are the
medium
for data f o r m a t
conversion.
With
DAP
the PEs are accessed
directly.
4.3 PROPAL 2 PROPAL 2 [Cimsa79] is a processor array computer designed and marketed by the French company CIMSA.
The number of processors can be between 8 (one board) and 2048 (8
racks of 32 boards),
arranged w i t h one-dimensional connectivity.
is
PE
included
in
each
together
with
a
16-bit
work
A "condition" f l i p - f l o p
register.
A
Boolean one-bit
a r i t h m e t i c / l o g i c unit takes one bit from the work register and the other bit f r o m either the condition flip-flop,
the associated 256-bit memory word or a parallel input.
An 8-bit
shift register is included which speeds up m u l t i p l i c a t i o n by holding the partial products. The work registers play an important role in interprocessor communication. of
a register
can be transferred
in parallel to
the
The contents
neighbouring PE above or below.
End-around c o n n e c t i v i t y can be used between the top and bottom PEs under control of the Host. Input and output can be performed in two ways: One is through the ensemble of work registers which acts as a 1d-bit wide shift register. the PEs is suspended.
During the shifting,
processing in
The other I/O method uses a one-bit input line of each PE.
Thus,
the parallelism of this path can be e x t r e m e l y large. PROPAL
2 is controlled by a microprogrammable minicomputer.
word is divided into two halves of 48 bits each. itself,
The microinstruction
One is used to centre[ the minicomputer
the other controls the parallel processor.
4.4 Vaster Unlike all other systems surveyed in this section, designed at the University of Toronto~
the Vaster computer [Loucks et aL82]
uses a standard,
off-the-shelf processing element
- the Motorola MC14500B~
which is marketed as a control unit for industrial applications.
The number of PEs is 256.
Each PE has a I02/4 bit memory word and all memory words
receive the same bit address. A shift register with one bit per PE is the medium used for interword communication and input/output. required.
Passing data between PEs takes considerable time unless mere shifting is
The conversion from byte oriented to bit-slice oriented data necessary at input
can be done by the shift registers,
but only in every 8th word at a time.
This makes
input/output quite complicated. The size and cost of Vaster are about the same as of LLICAS.
All other machines that
78
we consider represent much larger i m p l e m e n t a t i o n a l efforts. boards are used and a component cost
In the case of Vaster,
of $2000 is r e p o r t e d for a full
16
256 PE system.
(Only 16 PEs have a c t u a l l y been implemented). P e r f o r m a n c e results r e p o r t e d for Vaster indicate that L U C A S is about 10 t i m e s faster on operations t h a t
do not require i n t e r - P E
communication
and more than a hundred t i m e s
f a s t e r on tasks t h a t require such c o m m u n i c a t i o n .
4,5 CLIP4
CLIP4 [ D u f f 7 9 ] is one system in a series of processor arrays which have been constructed at U n i v e r s i t y College,
London.
[t is e x c l u s i v e l y aimed at image processing.
are arranged as a 96 x 96 array.
92t6 PEs
Each PE c o m m u n i c a t e s w i t h eight neighbours,
four of
which are diagonal. 32 bits of processor
memory that
we
are associated w i t h consider.
The
each PE,
PEs
contain
which three
is much less than any o t h e r one-bit
registers
each.
Two
independent Boolean functions of t w o inputs each can be p e r f o r m e d simultaneously.
4.6 MPP The
Goodyear
Areospace
c o n t r a c t e d by NASA
Corporation,
who
designed
and
marketed
in 1979 to build a "massively parallel processor",
for the processing of satellite images.
5TARAN,
was
MPP [Botcher82],
It has an a r c h i t e c t u r e similar to DAP w i t h 128 x
128 processing elements connected in a t w o - d i m e n s i o n a l arrangement. Each PE has associated a 1 kbit m e m o r y word. the DAP
PEs.
Each one contains five
Boolean logic unit.
Furthermore,
The PEs are slightly more p o w e r f u l than
one-bit registers,
a full adder and a t w o - i n p u t
a v a r i a b l e - l e n g t h shift register is included to speed up
multiplication. An a d d i t i o n a l one-bit register, PE.
When inputing data (i.e.
S,
images),
a r r a y and stored in the S-registers. until,
finally,
interrupted
used exclusively for data routing,
I28 bits are input in parallel at one edge of the
Data is shifted f r o m column to column of S-registers
the whole image is stored
and the
input
is included in each
in the S-registers.
data stored in the memories.
Then the processing is
Input and o u t p u t can proceed
simultaneously so t h a t an input image is shifted in at one edge w h i l e at the same t i m e an o u t p u t image is shifted out at the opposite edge. The PEs of MPP can be said to be connected in t w o d i f f e r e n t ways: For processing a two-dimensional
connection
is used,
128 separate 128-bit shift registers.
whereas for i n p u t / o u t p u t
the PEs are connected as
79
4.7 CONCLUSION The machines surveyed share one i m p a r t a n t property: they all comprise a large amount of b i t - s e r i a l l y working processing elements.
It is probably not the case that l i m i t e d budgets
and technological limitations are the main reasons for this~ of the generality of the bit-serial approach. are in fact rather similar~ others. The
but rather an underst&nding
The processing elements of all the designs
although some of them are slightly more powerful than the
The processing elements of LUCAS fall somewhere near the middle of the scale.
differences
between
the
various
designs
arrangement and the input/output structure.
mainly
concern
the
interconnection
This mirrors the fact that the designers have
had d i f f e r e n t application areas in mind for the d i f f e r e n t machines.
Machines p r i m a r i l y
intended for image processing t y p i c a l l y use a two-dimensional interconnection network and fast input/output, more elaborat%
while machines designed for general purpose show other~ schemes for communication.
As for PE interconnection~ that
LUCAS is most similar to STARAN,
which is the only design
uses a network akin to the perfect shuffle/exchange network.
does not
sometimes
have the
complicated
memory
storage method that
However~
STARAN
utilizes.
LUCAS The
input/output system is also quite different. LUCAS is similar to OAP and MPP in the s i m p l i c i t y of data storage but d i f f e r e n t with regard to interconnection and input/output, The size and cost of LUCAS are like those of Vastor° more expensive,
All other machines are larger and
Part 2. PROGRAMMING
All
applications
programs
on L.LJCA5
are
divided
into
two
parts:
one part
which
is
executed sequentially in the Master Processor and one part which executes in parallel on data }orated in the Associative Array. the
Control
Unit
of
LUCAS.
The parallel part consists of microprograms for
The sequential pert,
assembly language or in high-level language, l/O.
In addition to
this,
the Master
instructions to the Control Unit.
which
may be w r i t t e n
either in
takes care of scalar processing and external
program
is also responsible for
providing the
input and output of data to and from the Associative
Array involves both the Master Processor and the Control Unit: the Master toads or reads the
PE
I/Q
Registers
and the Control
Unit
executes microprograms for
between the PE memories and the [ / 0 Registers.
moving data
81
LUCAS has four programming levels with d i f f e r e n t support tools:
PRYING LEVEL
BASIC SUPPORT TOOL
HIGH-LEVEL
Procedure Library
L~AGE
~ACHINE CODE
File
Micro Program Assembler
CODE
Table
The lowest level,
OTHER TOOL
Pascai/L
lv~cro Library
MICRO CODE
PICO
ELABORATE SUPPERT TOOL
LUCAS M i c r o Prograrrming Language
Handler
Microprogram Debugger
Generator
"picocode",
is used to define the instruction set of the PE ALUs.
These are implemented as programmable read-only memories (PROM),
which allows a
change in the instruction set in order to tune LUCAS to a certain application. generator has been developed,
and produces the PROM contents in the form of a binary hex file. standard
format
accepted
A table
which takes Boolean expressions of the inputs to the ALU
by most
PROM
programmers.
The file f o r m a t is a
The current
instruction
set,
which is of general purpose type and supports both a r i t h m e t i c and Boolean operations,
is
listed in Appendix 1. Microprogramming is supported by several tools:
*
Microprogram assembler
*
High-level like language for microprogramming
* Microprogram debugger
The microprogram assembler was developed at an early stage of the design phas%
when
it
It
proved
to
be
an
important
tool
for
debugging and testing
the
hardware.
is
implemented in the form of a macro definition library to a c o m m e r c i a l l y available macro processor (MAC from D i g i t a l Research) and runs on the Master system. A high-level like language has been designed for microprogramming LUCAS.
It allows
82
microprograms to be w r i t t e n on an "algorithmic level", does not need to deal with
which means that the programmer
low level hardware control (enabling buffers,
waiting for
signals to propagate through the system etc.).
It uses a control structure which is similar
to
etc.).
high-level
languages (if-then-else,
while
An
optimizing
compiler
produces
microcode of approximately the same quality as hand coded microcode. A Microprogram Debugger allows microprograms to run at full speed on LUCAS with a breakpoint
facility.
The
contents
of
both
Associative Memory may be displayed,
the
changed,
Microprogram
Memory
and
of
the
loaded from or w r i t t e n to disk.
The
program has four functional modules;
* Microprogram
Memory
loaded from disk, in
symbolic
Monitor.
This
module allows microprograms to
be
the contents of the microprogram memory to be displayed
form
(microprogram
disassembler)
and
the
contents
of
the
microprogram memory to be changed using symbolic notation (microprogram line assembler). * Execution
Monitor.
This
module
allows
Registers and start a microprogram. case the
microprogram
runs
at
the
user
to
load the
Parameter
A breakpoint may be inserted in which
full
speed up to
this point,
where the
execution is halted. * Associative Array Monitor.
The contents of the PE memories and of the
Common and the Mask Registers may be displayed and updated using several different numbers).
formats (hexadecimal,
ASCII characters,
integers or fixed point
The PE registers may be inspected and an area of the Associative
Array may be stored on disk or loaded from disk. may be output to a graphical display processor,
The PE memory contents
resulting in a "pixel mapped"
image on the TV screen of the display unit. * Communication
Module.
communication
network
Allows at
transparent
the
University
communication of
Lund,
with
the
LUNET.
data The
microprogram compiler runs on a VAX 11/780 which is connected to LUNET. The communication module allows mieroprograms to be developed on the VAX and
transferred
directly
to
the
microprogram
memory
of
For machine level programming a collection of standard microprograms, used rio implement
the parallel part
of
most
programs,
LUCAS.
which can be
has been developed.
These
instructions constitute the standard software interface between the Control Unit and the Master Master.
Processor.
Assembly
programs
are w r i t t e n
in the assembly language for
A macro library defines constants and macros,
with the Control Unit.
the
which are used to communicate
The instructions to LUCAS have the following general format:
83
INSTRLJCTION~
INSTRUCTION serve
PI~
P2~P3~
P4
is the instruction name and the Pis are the optional parameters,
as operand descriptors
to
the
instruction,
The
meaning of
which
the parameters is
standardized as follows: P1
gives the location of the first source operand
P2
gives the location of the second source operand
P3
gives the location of the destination
P4
specifies the length of the operand(s)
A File Handler is defined as a collection of subroutines which can be linked to user programs.
This provides a file handling capacity for the Associative Array.
In order to use LUCAS
in a high-level language environment~
a procedure library has
been defined which constitutes a high-level software interface to a Pascal system~ runs on the Master. simple,
which
Since the interaction between LUCAS and the Master is e x t r e m e l y
the interface,
which is w r i t t e n in assembly code~
could be kept small (less than
1K bytes). This approach allows programs w r i t t e n
in a high-level language to interact with LUCAS
w i t h o u t too much e f f o r t needed for the implementation,
Once the basic structur%
as the memory allocation scheme for the Associative Array~ can
is decided~
easily be added when new microprograms are w r i t t e n .
results in a performance reduction for various reasons.
such
new procedures
However~
the
approach
Memory allocation is done at run
t i m e instead of at compile t i m e and every instruction which is sent to LUCAS results in Pascal procedures being called.
Also~
error checking is minimal and operations on field
variables can not be described in the f o r m of expressions. To overcome these deficiencies~ This
language~
which
is
an
a high-level language has been developed for LUCAS. extension
to
Pascal~
is
described
in
Chapter
6.
Chapter 5 LUCAS MICROPROGRAMMING LANGUAGE
5.1 INTRODUCTION LUCAS
is used in applications where its
parallel
can
be utilized
standard minicomputer.
to
obtain
capabilities of
a significant
processing structured
increase
in processing
image
and
database
processing.
However,
it
which is the case in for is
inefficiencies in the implementation of the bit-serial algorithms. often very tight,
a
The bit-serial nature of the computations performed by LUCAS is
both convenient and efficient for processing low-precision data, example
data in
speed over
t y p i c a l l y 2-5 machine cycles,
extremely
sensitive
to
The innermost loops are
and one or two unnecessary instructions
in the loops have a tremendous impact on the processing speed.
This means that most
users w i l l be forced to w r i t e some parts of their programs in microeode. Disadvantages of programming in microcode are well known,
and not specific for LUCAS.
The programmer must be very well acquaintanced with the underlying architecture of the machine
and
pipetining.
take If
into
consideration
microprogramming
microoperations, simultaneously.
is
all
peculiarities
horizontal
(as
in
with
for
LUCAS)
which control d i f f e r e n t parts of the machine~
example it
is
timing
possible
to
and let
overlap and be executed
Deciding which operations could overlap requires a detailed knowledge of
the internal workings of the computer and optimal or near optimal solutions are very hard and t i m e consuming to find.
On the other hand,
it is important to e x p l o i t this possible
paratlelism. With
an increasing number of microprogrammable computers appearing,
microprogram
development tools is growing.
microprogramming Js appealing. such as block structure,
the interest in
The use of high-level like languages for
A language wh}ch incorporates typical high-level qualities,
procedures and structured control statements of type " i f - t h e n -
else" for selection and "while" for repetition, which f a c i l i t a t e both development,
w i l l support modern programming methods,
maintenance and documentation of programs.
Another
i m p o r t a n t aspect is that the language compiler could be responsible for the tedious task of
forming
the
complete
microinstructions
between the microoperations.
with
optimal (or near optimal)
overlapping
Su['~h a language is not automatically machine independent
but it is obvious that a unified approach to the design of languages of this kind could be adopted.
85
Current
research in the area of
microprogramming techniques has two directions: the
design of suitable programming languages and the development of methods for optimizing the microcode. The request for highly efficient microcode influences the definition of languages. designers agree that the firm demand for machine independent languages, in
the
area
of
high-level
languages,
is
not
applicable
to
Most
which prevails
microprogramming.
Ramamoorthy and Tsuchiya formulate their point of view in [Ramamoorthy74]= "The
desirable properties of
compromise explicit
between
and
a high-level
machine
implicit
microprogramming language must be a
dependency,
parallelism,
and
the
ease
of
innate
detection
and
"naturalness"
representing
required of
all
programming languages to establish effective man-machine communications."
It
deserves to
be mentioned
that
the
independence of high-level languages, different computers.
essential
point
in the argument for
is that programs should be transportable between
It is reasonable to assume that the need for this,
microprogramming,
is
very
machine
small.
Therefore,
microprogramming languages differ in their details.
it
is
less
in the context of important
if
two
What is important is that they use
similar basic constructs and that a unified machine independent methodology far their implementation is developed with
machine independent techniques for
compilation and
optimization. The
MPG
system [Baba and Hagiwara 81] is an example of
approach to high-level microprogramming.
a machine
independent
A program w r i t t e n in the ianguage (MPGL) has
two parts: a machine description (MDS) and an algorithm description (ADS).
The MD5 is
used to initialize the compiler and to prepare it for translation of the ADS part,
which
consists of machine specific algorithms. A
similar
technique has been proposed by DeWitt
[DeWitt76].
His language includes
"extension statements" and "extension operators" which allow the definition of new data types and operations.
The core language includes a high-level like control structure,
integers and p r i m i t i v e assignment statements.
The result is similar to the MPG system: a
program consists of a declaration part where the extensions are defined and an algorithm part which is machine dependent. According
to
Dasgupta
[Dasgupta80],
a
p r i m i t i v e and still be called high-level if independent characteristics, "describe and name,
microprogramming language could
be quite
it has certain machine- and implementation-
such as structured control constructs and a possibility to
arbitrarily,
microprogrammable data objects".
His approach to
machine independent microprogramming takes the form of a generic family of languages, called a schema .
The schema includes all the possible constructs and data types that
can be used when the schema is instantiated into a language for a particular machine.
86
This
instantiation
defines
the
specific
properties
of
the
machine
declarations of the data objects which are available (registers~ flops
ere.)
and specifications
of
the
memories~
a p r i m i t i v e data typ% stack~
The LUCAS
the
form
of
stacks~
flip-
possible operations on these data objects.
The
schema includes several high-level con[rot structures (if-then-else~
array~
in
which is the bit,
cas%
whi[%
repeat)~
and five structured data types: sequence of bit,
record and associative memory. microprogramming language~
which w i l l be presented in this chapter was
defined with the following design objectives: 1)
It should have a machine independent f r a m e w o r k defining a genera[ program structure which resembles that of high-revel languages.
z)
The machine dependent characteristics which are added should allow a change of the PE
instruction
set
without
the
need to re-define
the
language and w i t h
minimal
changes needed in the compiler,
3) The semantics of the language should be close enough to the machine to allow an efficient
implementation,
where the cede produced is of the same quality as hand
w r i t t e n microcode The language has a s~ructure where microprograms and subroutines they use are grouped together in modules.
Variables and constants may be declared either on the module level
or locally in the microprograms or the subroutines. level type.
The control constructs are of high-
The language allows microprograms to be w r i t t e n without dealing with low-
level hardware c o n t r o l
Experience from microprogramming in microprogram assembly code on LUCA5 has shown t h a t the part of the system which is the most d i f f i c u l t to program is the Control Unit~ and especially the Address Processor w i l l become a bottleneck if not properly handled. The use of local and global variables in the tanguage~
a r i t h m e t i c expressions on variables
and automatic handling of parameter passing to subroutines make the Address Processor disappear from the programmer's view. In fact,
no explicit programming of the Control Unit needs to be done at art,
means t h a t the programmer deals w i t h a much less complex architecture. with
the high-level like control
constructs
powerful tool for microprogram development. good~
which simplifies the documentation.
of the language~
This,
which together
have resulted in a very
The readability of programs is also very
87
5.2 MICROPROGRAMMER'S VIEW O F LUCAS
This section describes the architecture of LUCAS as seen by the microprogrammer when using the microprogramming language. According to this view, LUCAS consists of an array of 128 processing elements (PEs), 128 memory modules (PE memories), a Common Register, a Mask Register, an I/O Buffer Register and an Interconnection Network (see Figure 5.1). Processing of data is done in a bit-serial fashion by the PEs.

Figure 5.1 The Microprogrammer's view of LUCAS.
Figure 5.2 Processing Element (PE) in LUCAS.

Figure 5.3 Bit-slice and field in the Associative Array.
A PE (see Figure 5.2) comprises an ALU, four one-bit registers, an eight-bit input/output register and a multiplexer for data selection. One of the registers, the Tag (T), is used to select the PE. When the Tag has the value ONE, the PE will execute every instruction it receives. If the Tag has the value ZERO, only instructions which are destined for all the PEs will be executed and other instructions will be interpreted as no-ops by the PE.

All data paths are one bit wide. This means that only one bit of data is accessible at a time from the PE memories, the Common, the Mask and the I/O Registers. This is commonly referred to as a bit-slice of data. Every operation on data which is more than one bit wide must be executed as a sequence of operations. Data of this kind, which is composed of a number of bit-slices, is called a field (see Figure 5.3).
Instructions to the PEs are of the following kind:

Input/Output. (1) Write the I/O Register output value into the PE memory. (2) Shift the memory output value into the I/O Register. (3) Similar instructions for the Common and Mask I/O Registers.

Register operations with data from PE memory. Instructions are executed in the ALU. Operands are the four one-bit registers and the memory output bit, which is connected to the ALU via the multiplexer (see Figure 5.2). The result of the operation is stored in the four registers.

Register operations with data from Common. The same type of operations, leaving the result in the registers, but now the output from the Common Register is used as an additional operand.

Memory write. The value of the R Register is written into the PE memory. The data does not pass through the Interconnection Network, so data is written into the corresponding PE memory.

I/O Buffer Register operations. (1) Copy the value in the I/O Register of a selected PE into the I/O Buffer Register. (2) Broadcast the I/O Buffer Register contents to all the I/O Registers, including those of Common and Mask.

Operations on the Tag Registers. Several of the operation types described above are affected by and affect the Tags. In addition to these, there are instructions to set all Tags (to ONE) and to reset all Tags except one.
5.3 INTRODUCTION TO THE LANGUAGE

The entire program text presented for compilation constitutes a compilation unit. The syntactical construct that represents the basic compilation unit is the module. A module starts with a module heading and terminates with the keyword endmod. The body of the module has two parts, a declaration part and a submodule part.

An entity declared in the declaration part of a module is known, or visible, within the module and may be referenced from any point following its declaration. Declared entities are of the following types:

constants
variables
subroutines

The submodule part can take two forms. It can consist of one or several new modules, nested within the basic module, which is then called the global module, or it can be composed of a number of microprograms.

The microprograms are the only entities visible outside the compilation unit. A microprogram defines a machine level instruction for LUCAS and is invoked from the Master computer when it sends an instruction to LUCAS. Once a microprogram has started, it runs to completion without being interrupted. It terminates its execution with a microcode branch to a predefined location and LUCAS informs the Master that a new instruction may be sent. The only place inside a compilation unit where a microprogram is visible is in its own body - it can be explicitly aborted.

The concept of a subroutine differs from that of a microprogram in that the subroutine is visible inside the module where it has been declared, but not outside the compilation unit. A subroutine may be invoked from a microprogram, from another subroutine or recursively from itself.

Both microprograms and subroutines can have parameters and may declare local constants and variables. Figure 5.4 gives the syntax diagrams for the general structure of a compilation unit.

The scope of a declaration is defined as the part of the program text over which the declaration has an effect, i.e. where the name of the declared entity is known. Figure 5.5 illustrates the general scope rules of the language.
Figure 5.4 The general structure of a compilation unit.
Figure 5.5 Scope rules in nested modules.
A slightly modified form of the Backus-Naur form (BNF) will be used to represent the syntax of the language. This notation has the following metasymbols:

"<" ">" enclose non-terminals.
"::=" should be interpreted as "means" or "can be rewritten as".
"|" should be interpreted as "or".
A pair of slashes, "/ /", is used to denote repetition of the enclosed symbols zero or more times. When the terminating slash is indexed, such as "/x", the meaning is that the enclosed symbols may be repeated at most x times.
"<empty>" denotes the empty string.

Programs are written in a free format (Pascal-like notation), with the "standard" lexical rules for identifiers, numbers, etc. Comments are surrounded by curly brackets or bracket-asterisk pairs: "(*" "*)".
5.4 LANGUAGE ELEMENTS
5.4.1 Constants

A constant is an identifier which is associated with an integer value at the time of its declaration. This identifier may later be used in the program text in place of the integer value it represents. The constant declaration has the following form:

<constant declaration> ::= const <identifier> = <sign><integer> /;<identifier> = <sign><integer>/ ;
<sign> ::= <empty> | -
<integer> ::= <number> | <constant identifier>

The association between the identifier and the value is valid throughout the module, unless the same identifier is redeclared in a module, subroutine or microprogram which is nested within the module where it was originally declared.

const c1 = 100;
      c2 = 256;
5.4.2 Variables, Assignments

Declaration of variables is either implicit or explicit. Parameters to subroutines and microprograms (see Sections 5.4.3 and 5.4.4) are treated as locally declared variables within the body of the subroutine/microprogram. The format of the explicit variable declaration is:

<variable declaration> ::= var <identifier> /,<identifier>/ ;

A variable may be assigned values in the range 0 to 4095. The same rules of scope apply to variables as to constants.

var v1,v2,v3;

A variable is used in the following contexts:

* Pointer to data in the Associative Array. The current value of the variable is used to indicate a bit-slice in the Associative Array.

* Test variable in control constructs. The value of the variable is tested in the condition part of the control statements.

A value assigned to a variable in one microprogram is valid in any other microprogram in the same module until either the variable is re-assigned or a microprogram of some other module is executed.

In the current implementation of the language, a maximum of sixteen variables may be visible at any point in the compilation unit. (This is due to the allocation scheme for the variables, which are located in the registers of the Address Processor.) However, this restriction does not normally cause any problem if local variables are used whenever possible.

The assignment statement has the following general form:
<assignment statement> ::= <variable> := <sign><integer> |
                           <variable> := <variable> |
                           <variable1> := <variable1> <operator> <integer> |
                           <variable1> := <variable1> <operator> <variable>
<operator> ::= + | -

This means that a variable can be assigned:

* the value of a constant or number, optionally negated,
* the value of any variable (including itself),
* its current value plus or minus a constant or a number,
* its current value plus or minus the value of any variable (including itself).

Variables can also be pushed onto or popped from a predefined stack by use of the standard procedures SPUSH and SPOP.

v1:=256;
v1:=v4;
v1:=v1+256;
v1:=v1+v2;
v1:=v1+v1;
SPUSH(v1);
SPOP(v5);
5.4.3 Subroutines

A subroutine declaration consists of a subroutine heading, local declarations and an executable body.
<subroutine declaration> ::= <subroutine heading> <declaration part> <statement part>
<subroutine heading> ::= subroutine <identifier>; | subroutine <identifier> ( <formal subroutine parameter list> );
<formal subroutine parameter list> ::= <identifier> /,<identifier>/
<declaration part> ::= <empty> | <constant declaration> | <variable declaration>
<statement part> ::= begin <statement list> end
<statement list> ::= <statement> /;<statement>/

The parameter passing convention is call-by-value. The identifiers which are (implicitly) declared in the formal parameter list of the subroutine heading are conceptually equivalent to local variables of the subroutine. They are initialized from the calling program part, whereas any additional local variables have undefined values when the execution of the subroutine starts.

The subroutine may be invoked from any executable part of the module where it is declared. Actual parameters can be variables, constants or numbers.

<subroutine call> ::= call <identifier> | call <identifier> ( <actual parameter list> )
<actual parameter list> ::= <actual parameter> /,<actual parameter>/
<actual parameter> ::= <variable> | <sign><integer>
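As an illustration, the following is a minimal sketch of a subroutine declaration and a corresponding call. The name CopyField and the actual parameter values are chosen here for illustration only, and the body uses two PE instructions (LRMA and WRRA) which are introduced in Section 5.4.6.1.

Subroutine CopyField(Source,Dest,Length);
begin
  iterate Length times begin
    LRMA(Source,DIRECT);   (* load the source bit into R in all PEs *)
    WRRA(Dest);            (* write R back at the destination bit-slice *)
    Source:=Source+1;
    Dest:=Dest+1;
  end
end;

call CopyField(0,16,8);    (* copy an 8-bit field from bit-slice 0 to bit-slice 16 *)

Since the parameter passing is call-by-value, the increments of Source and Dest inside the body do not affect the variables of the caller.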
5.4.4 Microprograms

The declaration of a microprogram is similar to the subroutine declaration, only the heading is different.

<microprogram heading> ::= microprogram <identifier>; | microprogram <identifier> ( <formal microprogram parameter list> );
<formal microprogram parameter list> ::= <microprogram parameter> /,<microprogram parameter>/3
<microprogram parameter> ::= <empty> | <identifier>

The formal parameter list includes a maximum of four parameters, separated by ",". A parameter which occurs in the list is implicitly declared as a local variable to the microprogram.

The difference between a subroutine and a microprogram lies in the invocation procedure. A subroutine is called from inside the module where it is declared, whereas no mechanism is defined in the language for calling a microprogram. Instead, microprograms are invoked from programs at another level: the machine code level of the Master processor.

The parameter passing is of call-by-value type; the Master initializes the parameter variables before the microprogram is started. The hardware interface between LUCAS and the Master allows at most four parameters to be passed through the Parameter Registers of the Control Unit. As seen in the format definition of the parameter list, a parameter may be left blank, which denotes that no value is passed through the corresponding Parameter Register.

Microprogram M1(p1,p2); means that parameters are passed through Parameter Registers 1 and 2.

Microprogram M2(p1,,,p2); means that parameters are passed through Parameter Registers 1 and 4.

A microprogram terminates its execution by branching to a predefined routine, OPFETCH, which reads the next instruction coming from the Master and starts the corresponding microprogram.
Communication of values between two microprograms is possible via variables declared on the module level. If a microprogram needs more than four parameters, it is put in a module together with an auxiliary microprogram, which assigns its parameters to variables declared on the module level, as follows:

Module M1;
var par1,par2,par3,par4;

Microprogram Loadparam(p1,p2,p3,p4);
begin
  par1:=p1; par2:=p2; par3:=p3; par4:=p4
end;

Microprogram Mic(par5,par6,par7,par8);
begin
  ...   (* If Loadparam has been executed, par1 - par8 *)
  ...   (* will be defined when Mic starts *)
end;
endmod.
5.4.5 Statements I - Program Flow Control

5.4.5.1 General

The body of a subroutine or a microprogram contains executable statements which are grouped in a statement list. A statement list is a (possibly empty) list of statements separated by semicolons.

Two basic groups of statements are defined: those that specify operations on data in the Associative Array and those that control the execution flow. Statements can be of the form "compound statements", in which case a statement list, preceded by the keyword begin and followed by the keyword end, replaces a single statement. Statements may also be empty.

In previous sections we have already come upon two kinds of statements: the assignment statement and the subroutine call. These will not be further discussed.
5.4.5.2 Conditions

Most of the constructs used for program flow control are conditional in that they specify two possible ways to proceed in the execution depending on the value of the condition part of the construct. Conditions can take the values "true" or "false".

<condition> ::= <variable> <relation> <variable> | <variable> <relation> 0 | TRUE | FALSE | SOME | NONE | ZMASK(<address>) | NZMASK(<address>)
<relation> ::= = | <>
<address> ::= <variable> | <integer>

The first two conditions test if two variables have the same value and if a variable has the value zero, respectively. TRUE and FALSE are predefined conditions with the values "true" and "false". SOME has the value "true" if at least one PE has its Tag Register set. NONE is the complement of SOME. ZMASK(address) is "true" if the Mask Register has the value zero in position address. NZMASK is the complement of ZMASK.

(When comparing the possible test conditions in the language with the test conditions in the Control Unit, as described in Chapter 2, it is noted that the conditions which are generated in the Control Unit do not appear in the language. However, these conditions are implicitly tested by the control mechanisms of the language.)
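As a sketch of how these conditions appear in context, the following fragment uses the while construct of Section 5.4.5.4 to scan the Mask Register for the first position holding a zero. The variable m is illustrative only, and the fragment assumes that such a position actually exists:

m:=0;
while NZMASK(m) do
  m:=m+1;   (* on exit, ZMASK(m) is "true" *)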
5.4.5.3 If-Then-Else

The if-then-else construct is used to select one of two possible paths in the program flow. In an abbreviated form of the construct, the if-statement, the else part is left empty. It is possible to nest several if-then-else statements, in which case each else should be associated with the most recently encountered then.
<if statement> ::= if <condition> then <statement> else <statement> |
                   if <condition> then <statement>
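A small sketch may clarify the nesting rule; the variables m and v1 are illustrative only, and the else below belongs to the innermost then:

if SOME then
  if ZMASK(m) then
    v1:=0
  else
    v1:=1;   (* reached when SOME is "true" and ZMASK(m) is "false" *)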
5.4.5.4 Loop Constructs

The language has three loop constructs for specifying repetition of statements: while, repeat and iterate.

<while statement> ::= while <condition> do <statement>

Before each repetition of the statement part, the condition is evaluated. If it is "true", the statement will be executed. If it is "false", the loop terminates and the execution continues with the statement following the while construct.

<repeat statement> ::= repeat <statement list> until <condition>

The repeat statement is similar to the while statement. The difference is that the condition which is used to control the repetition is tested at the end of the loop, not at the beginning as in the while statement. A minor difference is that the construct specifies repetition of a list of statements rather than of one single statement. The keyword-pair repeat - until serves the additional purpose of statement brackets, replacing a begin - end pair.

In many cases, the number of times a loop should be repeated is known in advance. This is especially common in bit-serial processing, where the basic loops have to be executed a fixed number of times, depending on the precision of the operands. Using a while statement, such a loop has the following form:
b:=noofbits;
while b>0 do begin
  ...
  b:=b-1
end;
We note that the loop control variable, b in the example above, is accessible within the loop, where it for example may be used as a pointer to data in the Associative Array. However, very often the loop control variable need not be accessed in the loop since it is used merely to control the iteration. This kind of loop can be very efficiently implemented on LUCAS (by the use of special-purpose loop counters). A loop construct of this kind is defined in the language:
<iterate statement> ::= iterate <count> times <statement>
<count> ::= <variable> | <integer>
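With iterate, the fixed-count loop shown above with a while statement reduces to the following sketch (the loop body is elided as before, and no loop control variable is needed):

iterate noofbits times begin
  ...
end;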
5.4.5.5 Exit

An exit statement specifies a structured termination of a loop, a subroutine or a microprogram. Within a loop the exit statement will cause an immediate termination of the loop and execution will continue with the first statement following the loop structure. Exit from a subroutine means that the control is transferred to the calling program. The effect of an exit from a microprogram is a branch to the OPFETCH microprogram.

iterate b times begin
  CMCT(direct,fielda);
  if NONE then exit;
  fielda:=fielda+1;
end;
The example shows the innermost loop of a parallel search operation, where data from a field in the Associative Array is compared to the contents of the Common Register.
(The CMCT instruction compares the Common Register contents to the contents of the PE memory at bit address "fielda".) Normally the loop is executed b times, where b is the word length, but with the use of the exit statement, the execution of the loop is terminated when all Tags are false.

In the case of nested loops, a simple exit causes termination of the smallest loop enclosing the exit statement. However, any enclosing loop, subroutine or microprogram may be terminated by specifying its name in the exit statement. In the case of subroutines and microprograms, the name used is the name given in the declaration of the subroutine/microprogram. A loop structure may be given a local name by using a label, which precedes the loop heading.

LOOPA: while b<>0 do begin
  repeat
    ...
    if b=0 then exit(LOOPA);
    ...
  until c=0;
  ...
end;
5.4.6 Statements II - Array Operations

5.4.6.1 PE Instructions

The PE instructions embrace operations performed on the registers and on the memory in the Processing Elements. The instructions are of three kinds:

Without parameters. These instructions use the PE registers both as operands and to store the result.

With one parameter. These instructions use the PE registers as operands but store the result in the PE memory. The parameter specifies the PE memory address.

With two parameters. These are instructions where one of the operands comes from the interconnection network. The first parameter gives the PE memory address of the source bit. The second parameter specifies the permutation of data over the network.
The PE instruction set may be altered by reprogramming the ALU PROMs. The current instruction set is given in Appendix 2. The following are examples of PE instructions:

LTRA             Load T from R in All PEs
LTRT             Load T from R Tagmasked (in selected PEs)
WRRA(adr)        Write R into the PE memories in All PEs at bit address "adr"
LRMA(adr,ABOVE)  Load R from Multiplexer in All PEs. ABOVE specifies that the data should come from the memory of the PE immediately above in the Associative Array.
5.4.6.2 Input and Output

Input and output of data is physically handled by the I/O Registers, which are loaded either from the Master Processor, the I/O Processor or by microcode. The language includes instructions for this purpose.

Output of data is accomplished by shifting the I/O Registers while specifying the bit-slice address of data to be output.

RSHIFT(address)

Normally one byte of data is output at a time, starting with the least significant bit.

Microprogram LDOA(location);
begin
  iterate 8 times begin
    RSHIFT(location);
    location:=location+1
  end;
end;

Input of data is handled in a similar fashion.
WRIA(address) or WRIT(address)

WRIA causes the I/O Register output bit to be written into the memory of all PEs. WRIT is the tag-masked correspondence to WRIA. None of these instructions actually shifts the I/O Register. This must be carried out by means of the SHIFT instruction, which causes all the I/O Registers to be shifted right one step.

iterate 8 times begin
  WRIT(location);
  SHIFT;
  location:=location+1;
end;

Note that input and output of data may be performed in one single loop. This will result in a more efficient code than if two separate loops are used.

iterate 8 times begin
  WRIT(inadr);
  RSHIFT(outadr);   (* send data to the I/O Register and shift *)
  inadr:=inadr+1;
  outadr:=outadr+1;
end;
5.4.6.3 Common, Mask and I/O Buffer Operations

The Common Register and the Mask Register are both 4096 bit random access memories, similar to the PE memories. They receive the same bit address as the PEs. The output from the Common Register may be used as an operand in certain PE instructions and the Mask output is used in conditions (see Section 5.4.5.2).

Both the Common and the Mask Register communicate with a corresponding I/O Register. These I/O Registers are either loaded from the Master Processor, the I/O Processor or by microcode, exactly as the PE I/O Registers. Data is output to the I/O Registers with the same instruction that outputs data to the PE I/O Registers: RSHIFT(address). However, note that, since the data in the Common and Mask Registers is static, output of data is normally not meaningful.
Input of data is accomplished with the instructions

WRCOM(address) and WRMASK(address)

A write instruction does not shift the I/O Registers. This must be specified separately.

iterate 8 times begin
  WRCOM(location);
  SHIFT;
  location:=location+1
end;

The I/O Buffer Register provides a flexible communication link between PEs, between PEs and the Common and the Mask Registers, and also between PEs and the Master. It can be loaded either with the I/O Register contents of a selected PE or from the Master.

LDIOBS    Load I/O Buffer Selected. Load the I/O Buffer Register from the I/O Register of a selected PE (must be uniquely selected).

IOBWRALL  Copy the I/O Buffer contents to all the I/O Registers.
Example

Move one byte at location "source" in the tag-selected PE to location "destination" in every PE where the R Register is ONE.

Microprogram BroadcastSelected(source,destination);
begin
  iterate 8 times begin
    RSHIFT(source);
    source:=source+1;
  end;
  SELF;       (* Make selection unique *)
  LDIOBS;
  IOBWRALL;
  LTRA;       (* Load Tags from R *)
  iterate 8 times begin
    WRIT(destination);
    SHIFT;
    destination:=destination+1;
  end;
end;
5.5 PROGRAM EXAMPLES

The first example shows a microprogram for the operation MAXT - Maximum of Field Tag Masked - see Section 3.3.1.

Microprogram MAXT(Field,Length);
(* Parameter Field is address to most significant bit *)
begin
  iterate Length times begin
    CMOT(Field,DIRECT);   (* old Tags -> X. Select ONEs from remaining *)
    if NONE then
      LTXA;               (* restore old Tags *)
    Field:=Field-1;
  end;
end;

The next example is the Add Fields Tag Masked - ADDFT - operation from Section 3.4.1:
Microprogram ADDFT(Source1,Source2,Dest,Length);
begin
  CCA;                     (* Clear the Carries *)
  iterate Length times begin
    LRMA(Source1,DIRECT);  (* Load source bit in R *)
    ADMA(Source2,DIRECT);  (* Add second source bit to R *)
    WRRT(Dest);            (* Write Dest Tag-masked *)
    Source1:=Source1+1;
    Source2:=Source2+1;
    Dest:=Dest+1;
  end;
end;
The next example shows a complete module with routines for 2's complement multiplication of integer fields. The module contains the microprogram MULFA and two subroutines which implement addition and subtraction of fields of arbitrary length. The algorithm is described in Section 3.4.2. The steps in the description of the algorithm are indicated.

(*                                 *)
(* Module for field multiplication *)
(*                                 *)

Module FieldMult;

Subroutine AddFieldsT(S1,S2,D,L);
begin
  CCA;
  iterate L times begin
    LRMA(S1,DIRECT);
    ADMA(S2,DIRECT);
    WRRT(D);
    S1:=S1+1; S2:=S2+1; D:=D+1;
  end
end;

Subroutine SubFieldsT(S1,S2,D,L);
begin
  SCA;
  iterate L times begin
    LRMA(S1,DIRECT);
    ADMIA(S2,DIRECT);
    WRRT(D);
    S1:=S1+1; S2:=S2+1; D:=D+1;
  end
end;
Microprogram MULFA(Multiplicand,Multiplier,Destination,Length);
var D,M,K;
begin
  (* STEP 1 *)
  LTMA(Multiplier,DIRECT);
  Multiplier:=Multiplier+1;
  M:=Multiplicand;
  D:=Destination;
  (* First iteration needs no addition since part. prod. = 0 *)
  iterate Length times begin
    LRMA(M,DIRECT);
    ANDTRA;
    WRRA(D);
    M:=M+1; D:=D+1;
  end;
  WRRA(D); D:=D+1;   (* double sign bits *)

  (* STEP 2 *)
  (* Clear rest of part. prod. field *)
  K:=Length-2;
  iterate K times begin
    WRRA(D); D:=D+1;
  end;

  (* STEP 3 *)
  (* Multiplication loop. Use double sign bits *)
  D:=Destination+1;
  Length:=Length+1;
  iterate K times begin
    LTMA(Multiplier,DIRECT);
    Multiplier:=Multiplier+1;
    call AddFieldsT(D,Multiplicand,D,Length);
    D:=D+1;
  end;

  (* STEP 4 *)
  LTMA(Multiplier,DIRECT);
  call SubFieldsT(D,Multiplicand,D,Length);
end;
endmod.
5.6 MICROPROGRAM COMPILER
5.6.1 Introduction

The translation process for a microprogramming language resembles that of an ordinary high level language in many aspects and standard compilation techniques may be employed in several parts of the compiler. The main difference lies in the code generation scheme. The horizontal microinstruction format provides a potential for parallelism between the functional parts of the machine which must be utilized in order to assure maximum performance. A number of strategies for recognizing the parallelism in microprograms and several algorithms have been proposed for microcode optimization. The term compaction is most often used in this context, since the algorithms aim at a reduction in the size of the code without any claims of obtaining an optimal solution.

The compiler which has been developed for the LUCAS microprogramming language uses a compaction algorithm based on the First-Come First-Served algorithm [Dasgupta and Tartar76]. This algorithm has been modified to give a better performance and also to allow microoperations which execute during more than one machine cycle in pipelined operations.

The compiler has five phases: lexical analysis, syntax analysis, generation of intermediate code, code improvement and code assembly. The first three phases constitute the first compilation pass, i.e. they are performed while the compiler reads the source code file. The code improvement phase performs several passes over the intermediate code, which is scanned back and forth during this phase. The last phase is basically a two-pass assembler which produces the final microcode.

The presentation given here is a brief outline of the compiler. For further details the reader is referred to [Fernstrom83].
5.6.2 Intermediate Code

The first three phases use standard compilation techniques to produce the intermediate code. The intermediate code consists of a list of pseudo-microinstructions (PMIs). A PMI is similar to a microinstruction but with only a limited number of its fields defined. It controls the smallest meaningful activity in some part of the machine and consists of a microoperation (MO) together with its possible parameters (which are also microoperations). For example:

CJP ZAP 100    Conditional jump to address 100 on zero Address Processor status

is a PMI which controls the Sequencer of LUCAS. It consists of the MO CJP, which is an instruction to the Sequencer, and its two parameters: the MO ZAP, which controls the Test multiplexer, and 100, which defines the value of the Sequencer's data field.

The following procedures are defined for the generation of intermediate code:

Gen       The single parameter of this procedure is the name of an MO. The procedure generates a new PMI containing this MO.

Join      Join is similar to Gen but places the MO in the last generated PMI.

Datajoin  This procedure has two parameters: the name of a microinstruction field and an integer value (number or variable). The referenced field of the last generated PMI is assigned the integer value.

Insert    This procedure is used to insert a PMI in the list of PMIs already created. It has two parameters: a pointer in the list and the name of an MO which will be put in the generated PMI. Subsequent calls to Join and Datajoin will operate on the inserted PMI.

Chain     Chain is used to "connect" a sequence of MOs when code for pipelined (= multicycle) operations is generated. During the code improvement phase the MOs are moved according to certain rules. Calling Chain assures that the last generated PMI and the next one produced will keep their consecutive order during code improvement.

The PMI in the example above (CJP ZAP 100) is generated by the following calls to the code generating procedures:

Gen(CJP);
Join(ZAP);
Datajoin(Sequencerdata,100);

All the variables of a compiled program are allocated in the sixteen internal registers of the Address Processor. Allocation is done when a variable is entered in the symbol table and it is thereafter never reallocated. This simple solution is possible since all the registers have exactly the same function.

When a subroutine is called, the compiler checks for overlapping in the allocation of the registers between the calling and the called procedure. If such an overlap exists, the Address Processor stack is used to save the overlapping registers of the calling program.

The intermediate code is a symbolic form of the microcode and would, if processed by a microprogram assembler, produce executable but inefficient microcode.

The intermediate code is a doubly linked list of PMIs. This structure has been chosen to allow convenient insertion and deletion of the elements and also to allow scanning of the list in both directions. The final code which is presented to the microcode assembler (the last phase of the compiler) consists of a list of microinstructions (MIs). An MI is formed in the code improvement phase by merging PMIs according to certain rules.
Two methods for improving the code are used. activities,
The first one preserves the order of art
i f operation j proceeds operation k in the PMI list,
in the resulting code or they are located in the same ML packing of the microcode.
either j still preceeds k
We refer to this method as
Packing is useful for debugging microprograms.
The order of
execution follows the order of operations in the source code and stilt a f a i r l y e f f i c i e n t code is used as compared to executing the PMI list.
A more e f f i c i e n t
microcode is
obtained if the reorganization of the i n t e r m e d i a t e code allows a change of the r e l a t i v e order between the PMIs.
Such a scheme which guarantees that the resulting MI list is
semantically equivalent to the original PMI list is catted a code compaction . The packing process starts with an i n i t i a l l y empty MI list.
Beginning with the first PMI~
subsequent PMIs are merged into one single Mi until a conflict occurs.
The produced MI
is
during
appended to
compilation pas%
the
MI
list.
Actualty~
packing
can
be performed
in which case no list of PMIs has to be created.
outline of the packing algorithm=
the
first
The following is an
1t2
generate empty MI;
EMIT:=false; CONFLICT:=false;
while PMI list not empty do
begin
  get next PMI from list;
  if PMI has label then
    CONFLICT:=true    (* PMI is a branch target, so must be in new MI *)
  else if resource conflict between MI and PMI then
    CONFLICT:=true
  else if MI must execute before PMI then
    CONFLICT:=true
  else begin
    add PMI to MI;
    if PMI is a branch then EMIT:=true;
    if PMI is part of a pipelined sequence of operations
      with subsequent operations then EMIT:=true;
  end;
  if EMIT or CONFLICT then
  begin
    append MI to list;
    generate empty MI;
    if CONFLICT then add PMI to MI;
    EMIT:=false; CONFLICT:=false;
  end;
end; (* while *)
if MI not empty then append MI to list;
MIs and PMIs are implemented as lists of sets. The sets define the data dependencies, the resources needed, the fields of the microinstruction that are used, etc. When a new MI is generated its elements are empty sets. Adding a new PMI to the MI consists of forming the union between the sets of the MI and the PMI and assigning the resulting sets to the MI.

The meaning of the two Boolean variables EMIT and CONFLICT is as follows: EMIT is set true when the last PMI was added to the MI but a new MI must be generated for the next PMI. CONFLICT is set true when the last PMI could not be added to the MI.

Local code compaction deals with the compaction of basic blocks, where a basic block is defined as a sequence of consecutive operations (PMIs in our terminology) which may be entered only at the beginning and which is jump-free, except possibly at its end.

It has been shown [Landskov et al.80] that the problem of finding the optimal solution to the local compaction problem is NP complete. However, several non-optimal algorithms with less computational complexity have proved to be very useful in practice.

An important concept in the compaction process is the data dependency relation between the PMIs. Let i and j be two PMIs where i precedes j in the original PMI list. If there is a data interaction between i and j, we say that j is data dependent on i. It is clear that the compaction algorithm must assure that the data dependency relations are kept intact when the MI list is produced.

The compaction algorithm in the compiler is based on the First-Come First-Served (FCFS) algorithm [Dasgupta and Tartar 76]. This algorithm is as follows:

1) The PMIs are added to an initially empty list of MIs. Every PMI is moved up as far as it can go in the list using the rule that a PMI can be moved ahead of an MI if it is not data dependent on any of the PMIs in that MI. When a data dependency occurs, the PMI has reached its rise limit.

2) Search downwards in the list to find an MI where the PMI may be added with respect to resource conflicts (two PMIs need the same resource or occupy the same field in the microinstruction). If no such MI exists, a new MI containing the PMI is appended to the list.

3) If no rise limit was found in 1), the PMI was not data dependent on any PMI in the MI list and it may be added to any list element. If there is no MI to which the PMI can be added without a resource conflict, the PMI is placed in a new MI at the top of the list. Placing it at the top rather than at the bottom of the list will keep it from blocking any subsequent PMI due to a data dependency restriction.

Practical experiments [Davidson et al.81] have shown that the code obtained is of the same quality as the code produced by the other non-optimal algorithms. In addition, the FCFS algorithm has advantages in both speed and simplicity. Implementations of the algorithm are described in [Mezzalama et al.82, Baba and Hagiwara 81].

In the compiler we have extended the FCFS algorithm by introducing an additional pass. In this pass an attempt is made to push PMIs forward in the MI list as far as possible. If during this pass all the PMIs of an MI are removed, the MI is removed from the list. Using this Extended FCFS (EFCFS) algorithm, the microcode obtained is often more compact than with the original FCFS and never less compact.

In addition to the EFCFS algorithm, other methods of a more heuristic nature are used for code improvement. The improvement techniques used give good results and optimal code is most often obtained.
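The extra EFCFS pass can be outlined in the same informal style as the packing algorithm above. The outline below is a sketch of one possible formulation; the notion of a "fall limit", mirroring the rise limit of step 1, and the exact order of the tests are our assumptions:

for each MI, taken from the last in the list towards the first, do
begin
  for each PMI in the MI do
  begin
    (* the fall limit is the first MI further down the list containing *)
    (* a PMI which is data dependent on this PMI *)
    search downwards for an MI above the fall limit which can take the PMI
      without a resource conflict;
    if such an MI exists then move the PMI to it;
  end;
  if the MI is now empty then remove it from the list;
end;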
Chapter 6

PASCAL/L - A HIGH-LEVEL LANGUAGE FOR LUCAS

6.1 INTRODUCTION

Highly parallel machines of the associative array computer type are often programmed in assembly like languages. The reason is that machines of this kind tend to be unique and differ in several important aspects. For example:

* Number of processing elements. This ranges from 64 elements on ILLIAC IV [Barnes et al.68] to over 16,000 on the MPP [Batcher80].

* Complexity of the processing elements. On ILLIAC IV, the PEs are powerful pipelined processors, which perform floating point arithmetic in hardware. At the other end of the spectrum we find machines like STARAN [Batcher79], the MPP, DAP [Reddaway79] and LUCAS, which are all bit-serial processors.

* Interconnection structure. The topology of the processing array is defined by the PE interconnection network. Examples of interconnection networks are the two-dimensional grid on ILLIAC IV, the Staran Flip network and the Perfect Shuffle/Exchange network on LUCAS.

Common to the languages for SIMD machines is the possibility to declare and to operate on distributed data, i.e. multi-element data where the single elements are located in different processing elements or in different words of an associative memory. However, we note that most of the languages which have been proposed for highly parallel computers are directed towards a specific machine. This reflects the fact that if a high level language is to be used, it must fit the architecture well to be efficient.

With the introduction of ILLIAC IV, the need for SIMD oriented high level languages arose. Several languages were proposed, of which at least three were implemented and used: CFD [Stevens75] - based on FORTRAN, Glypnir [Lawrie74, Lawrie et al.75] - based on Algol 60 and IVTRAN [Millstein and Muntz 75] - based on FORTRAN.
The first language designed for the ILLIAC IV was TRANQUIL [Kuck68], which was specified at the same time as the design of the ILLIAC IV system. TRANQUIL is based on Algol 60, with some minor deletions, and includes extensions for parallel execution of statements and declaration of arrays that are stored over the PEs. Using its data declarations, it is possible to specify the layout of variables in the PE memories (STRAIGHT or SKEWED mapping). A PARTITION declaration is used to partition previously declared arrays in several ways and to form subarrays. Parallel execution of one (e.g. a for-loop) or several statements is specified by means of a SIM statement:

SIM BEGIN (S1; S2; ... Sn) END
The implementation of the language was never completed.

Glypnir was the first implemented language for ILLIAC IV. It is a less ambitious language than TRANQUIL, which supports a more general form of parallelism where arrays can be of any size and the compiler is responsible for "squeezing" them to fit the ILLIAC IV memory. In contrast to this, all distributed data variables in Glypnir have 64 elements in the parallel dimension (ILLIAC IV has 64 processing elements), and it is the programmer's task to map smaller or larger arrays into this form.

Control statements are extended, as compared to Algol, in that they may also be used to control parallel execution. The statement:

IF <Boolean expression> THEN <Stmt-1> ELSE <Stmt-2>

results in Stmt-1 being executed in the PEs where the corresponding elements of the Boolean expression are TRUE and Stmt-2 in the PEs where the elements are FALSE.

The Glypnir compiler is said to generate a relatively efficient code even though no optimization is included. A factor 1.5 to 3 in execution speed is reported as compared to assembly programming.
IV I V T R A N
system [ M i l i s t e i n
and Muntz
75~
Millstein73]
is based on the
f o l l o w i n g assumptions"
* The presumptive user is accustomed to programming in F O R T R A N .
The new
language must be of F O R T R A N type. * It
should be possible to
use the system for
existing programs~
written
in
standard F O R T R A N . This resulted
in a new [anguage~
defined in terms of parallel extensions to F O R T R A N .
in order to allow standard F O R T R A N programs to be used~
a pre-processor was added to
the IVTRAN compiler, in which parts of the source program which could have been expressed in IVTRAN are rewritten. This part of the compiler, which is called the "Paralyzer" (Parallelism Analyser and Synthesizer) [Presberg and Johnson 75], produces an IVTRAN program from the FORTRAN source program. The form of the IVTRAN program is either source code (intended for the interested user) or - since the Paralyzer works on an intermediate form of the program which is produced by the parser - a form suitable for the next compiler phase.

Parallel Pascal [Reeves et al.80, Reeves and Brunet 80, Reeves et al.81] is one of the languages which have been proposed for the MPP. The language is defined in terms of extensions to Pascal. As compared to standard Pascal, the new language is extended in several ways:

* Data can be declared as "parallel", which means that it should be located in the parallel processing array of the MPP.

* Expressions can be formed with parallel arrays.

* Several standard functions, which may be used with parallel arrays, are included. These are defined for all sizes and shapes of arrays. Functions include Shift and Rotate of arrays any number of steps along one or several of their dimensions. Reduction functions, based on the primitive reduction functions in APL, are also included.

* To specify that an operation should be performed in a subset of the processing elements of the parallel processing array, Parallel Pascal extends the meaning of the if-then-else construct in a way similar to Glypnir.

As part of the Phoenix Project [Feierbach and Stevenson 78], a language called Actus has been defined [Perrott79]. The language, which is based on Pascal, is in many aspects similar to both Parallel Pascal and to Pascal/L. However, it includes constructs, such as independent indexing in the processing elements, which could not be efficiently implemented on either the MPP or LUCAS.

Special purpose languages have also been described for use in important application areas for associative array processors. Pixal [Levialdi et al.80] consists of parallel extensions to Algol 60 and is directed towards image processing. Its FRAME construct is used to specify an environment for each cell in the array upon which operations are performed.

Another language which is suitable for image processing is PascalPL [Uhr79]. It is defined in terms of extensions to Pascal. Parallel operations (instructions which are executed simultaneously by two or more processing elements) can be included inside procedures which are declared as parallel. A STRUCTURE specification, which is similar to the
FRAME construct in Pixal, is defined.

A language which implements a very flexible parallel indexing scheme is APLISP (A Language for Image and Speech Processing) [Mueller et al.80]. Here the parallel arrays are treated as sets and subsets are chosen by "index sets". Operations on the index sets, such as the Cartesian product, intersection or concatenation, allow a powerful indexing of the operands, including both alignment and "window specification" similar to the ones defined in Pixal and PascalPL.

Several languages for database processing on associative computers have also been proposed [Resnick and Larson 75, Bratsbergsengen et al.79, Love75].
6.2 OVERVIEW OF PASCAL/L

There are two different approaches to the design of a high-level language for a parallel computer. Either the parallelism of the computer has a correspondence in the syntax of the language and special constructs are used to express parallel operations on data, or the language does not contain any primitives for parallel processing, in which case the compiler is responsible for detecting inherent parallelism in programs that are written in a sequential language. Both have advantages and disadvantages.

In the second approach the user does not have to learn a new language. Existing programs can directly be moved to the parallel computer. Programs can be developed and tested on an ordinary sequential computer before they are transported to the parallel machine. The language is also independent of the parallel structure of any particular machine.

However, if the parallelism is not apparent in the language, the user will not be motivated to design algorithms which are suitable for parallel computation. The language forces the programmer to transform an inherently parallel algorithm into sequential code. It is also unlikely that a completely sequential algorithm could be transformed to run efficiently on a parallel machine. Thus, in the interest both of efficiency and of understanding of parallel algorithms, we have favoured a language where the parallelism is visible.

For several reasons it is preferable to use an existing sequential language as a base when defining the new language:

* Sequential operations are indeed necessary and must be included in the language anyway.
* The implementation may be simplified in that existing compilers can be modified to accept the new language.

* The user needs to learn relatively few new concepts.

* The use of the language is not restricted to parallel algorithms and the same language can be used to program the entire system including compilers and operating system.
designing
a high-level
considered for
the
language for
LUCAS,
several d i f f e r e n t
choice of a suitable sequential language.
languages were
APL deals with
parallel
arrays of data in a very generai way and many of the ideas in APL are relevant to parallel processing on an StMD computer.
However,
the dynamic data structures in APL
and the powerful operations on these would make it very d i f f i c u l t to achieve an e f f i c i e n t implementation.
FORTRAN
is
currently
applications where LUCAS may be used.
the
most
used
language
On the other handy
makes i t unsuitable for database processing~
in
many
of
the
its poor data structures
which is one of the pilot application areas
for LUCAS. Pascal
is a well
structured
language w i t h
powerful
makes it suitable for many d i f f e r e n t applications.
control and data structures
which
It has strong typing of variables,
and
a large amount of error detection is possible both at compile time and at run time. Compilers for
Pascal are r e l a t i v e l y uncomplicated to impiement.
chosen so that only one symbol lookahead is needed~ techniques.
The syntax has been
enabling the use of simple parsing
To f a c i l i t a t e the code generation and to allow compilers to be portabl%
an
i m p l e m e n t a t i o n scheme with code generation for a stack oriented virtual machine is used. The
fact
that
portable
compilers
-
written
in
Pascal
-
exist,
simplifies
the
i m p l e m e n t a t i o n on d i f f e r e n t machines. We decided t h a t the new language, terms
of
extensions to Pascal.
corresponds to operations,
the
PascaI/L(UCAS) [Fernstrom82],
should be defined in
The extensions are chosen so that
processing capabilities of
LUCAS.
the new language
This means t h a t
where one instruction operates on several data items,
typical
SIMD
can be specified.
A
characteristic property of associative processing is the a b i l i t y to designate the part of data which w i l l be subject to parallel computations in terms of properties of the data, regardless of where it is stored. Since
the
use
architecture,
of
LUCAS
is restricted
to
algorithms
which
are well
suited
for
the
only constructs which can be e f f i c i e n t l y implemented have been included.
Floating point a r i t h m e t i c ,
for example,
is not included.
The following extensions to Pascal are defined:
* Declaration of variables that will be allocated to the Associative Array. In the following these will be referred to as "parallel variables", whereas "scalars" or "sequential variables" stand for variables which are located in the memory of the Master Processor.

* An indexing scheme to access parts of parallel variables.

* Expressions and assignments involving parallel variables.

* An extended control structure, allowing the use of parallel variables as control variables.

* Standard functions for data alignment, input and output of parallel variables.
6.3 LANGUAGE DESCRIPTION

6.3.1 Declaration of Data

The one-dimensional organization of the Associative Array makes it especially suited for operations on one- and two-dimensional arrays. In principle, arrays of any dimension could be represented in LUCAS, but the natural storing scheme, where adjacent array elements also are physical neighbours, would be lost. Pascal/L is therefore restricted to arrays of one or two dimensions.

Parallel variables are characterized by their dimension and their range. The number of subscripts in the declaration defines the dimension of the variable. The range can be seen as a measure of parallelism and is given by the size of the first subscript.

There are two kinds of parallel variables: selectors and parallel arrays.
6.3.1.1 Selectors

A selector defines a Boolean vector over the Processing Elements and is intended to control the parallelism of operations. (At execution time this is accomplished by setting the Tags in those PEs where the corresponding selector element has the value TRUE.)

<selector type> ::= selector [<index range>] | selector [<index range>] := ( <initial value> /,<initial value>/ )
<index range> ::= <constant>..<constant>
<initial value> ::= <index list> => <Boolean value>
<index list> ::= <constant> | <constant>..<constant> | <constant>..<constant> step <constant>
<Boolean value> ::= true | false

We use the same form of BNF as in Chapter 5 to represent the syntax. For example:

var SEL : selector [0..99];

declares a selector with the range 0..99, i.e. a selector with elements in the first 100 PEs.

var SEL : selector [0..99]:=(0..98 step 2 => true);

declares a selector with the range 0..99 where all the elements with even indices are initiated to the value TRUE and all others to the value FALSE.
6.3.1.2 Parallel Arrays

A parallel array consists of a fixed number of components which are all of the same type and which are located in the Associative Array of LUCAS. Parallel arrays can be of one or two dimensions. The size of the first subscript in the declaration is referred to as the range of the array.

A parallel array has the property that when the first index is incremented by one in an array reference, while keeping a possible second index unchanged, the new array component will be located in the PE whose index is one higher than the PE originally referenced, but on the same address within the PE memory. This means that for any fixed value of the second array index, all the components are located in a field of the Associative Array. The definition implies that in a two-dimensional array all components of a row are located in the same PE, while the components of a column are located in different PEs.
A component of a parallel array can be of any of the following types: signed integer, unsigned integer, fixed point number, Boolean, character or string. When declaring an array with components of any of the first three types, a precision is specified in the declaration. The precision gives the number of bits used in the computations in the case of integers, and the number of bits on each side of the "fraction mark" (binary point) in the case of fixed point numbers. The maximum length of a string component (number of characters) is given in the declaration.

<parallel array type> ::= parallel array [<index range>] of <parallel type> |
                          parallel array [<index range>,<index range>] of <parallel type>
<parallel type> ::= <parallel type identifier> | <parallel standard type> | record <parallel field list> end
<parallel type identifier> ::= <identifier>
<parallel standard type> ::= integer(<number>) | unsigned integer(<number>) | fixed(<number>,<number>) | Boolean | char | string(<number>)
<parallel field list> ::= <parallel record section> /;<parallel record section>/
<parallel record section> ::= <identifier> /,<identifier>/ : <parallel standard type>

For example:

var PARA : parallel array [0..99,0..2] of integer(32);

declares a two dimensional array, where the components are of type signed integer with a precision of 32 bits (including the sign bit). Components of the array are located in the first 100 PEs with 3 components in each PE.
var OBSERVATION : parallel array [1..64] of
                    record
                      SITE : string(20);
                      TEMP : fixed(8,4);
                    end;

declares a one dimensional array where each PE from 1 to 64 stores a record. Each record has two fields, the first is a string with a maximum of 20 characters, the second a fixed point number where the integer part has a precision of 8 bits (including the sign bit) and the fraction part a precision of 4 bits.

In correspondence with the Pascal terminology, the word "field" is used to denote the components of a record. We have already used the same word when referring to "fields" in the Associative Array of LUCAS. However, this should cause no confusion since in a parallel array of record, a logical field always is allocated to a physical field in the Associative Array.
6.3.2 Indexing of Parallel Variables

Indexing of a variable of array type in Pascal is used to single out one component of the array. When operating on a parallel variable, Pascal/L makes it possible to reference several components along the parallel dimension at the same time. This set of elements is referred to as the referenced range of the variable, as compared to the declared range which is specified in the declaration of the variable.

The indexing scheme allows simultaneous access to a column or a subset of the column components of a two dimensional array, but not to any other part of the array, such as a row or a diagonal. The components of a row are all located in the same PE, which means that they must be sequentially processed. To access a diagonal, or any other part of the array where the components are located in different PEs - except a column - the PEs would have to include an index register to allow different addresses being used in different PEs.

Several ways of specifying the first index exist. If an entire column is referenced, this is denoted by a star "*" in the index position. A part of a column is referenced either by constant indices, in which case the referenced part is known at compile time, or by a selector expression. When a selector expression is used, the array components referenced are identified by the indices where the expression takes the value TRUE. A one-dimensional parallel array may be used without any index at all (and no brackets), in which case all components of the array are referenced.
<parallel variable> ::= <parallel variable identifier> |
                        <parallel variable identifier> [ <first index> ] |
                        <parallel variable identifier> [ <first index>,<expression> ]
<parallel variable identifier> ::= <identifier>
<first index> ::= * | <constant> | <constant>..<constant> | <selector expression>

Examples:

P0            Select all elements of the one-dimensional variable P0.
P1[*,0]       Select column 0 of P1. P1 is a two-dimensional parallel variable.
P1[S,0]       Where S is a selector: select a subset of column 0 of P1.
P1[2..80,0]   Select a subset of column 0 of P1.
is possible to
type
conflict
combine sequential and p a r a l l e l variables in expressions as long as no
occurs.
This means for
example that
it
is allowed to
form
expressions
w h e r e a scalar i n t e g e r is combined w i t h a p a r a l l e l a r r a y of integers. An
expression,
which
sequential value. p a r a l l e l vaIu%
only
contains
An expression,
sequential
variables
and
constants
results
in
a
which includes at least one p a r a l l e l v a r i a b l e results in a
unless the p a r a l l e l variable(s) is used as a p a r a m e t e r t o a f u n c t i o n which
returns a scalar result. In the c o m p u t a t i o n
of a p a r a l l e l expression,
all r e f e r e n c e d ranges to p a r a l l e l variables
must be i d e n t i c a l and any sequential value is p a r a l l e l i z e d to this range b e f o r e e v a l u a t i o n of the expression.
This means t h a t
4+PARA[*] results in /4 being added to all the components of P A R A and
4+PARA[2.. 5] results in 4 being added to components 2~3,4 and 5 of P A R A .
124
There are four kinds of assignment statements: 1)
The left hand side and the right hand side are both scalars.
This is the normal
Pascal assignment statement 2)
The left hand side is a parallel variable and the right hand side is a sequential expression.
In this case all the components within
the referenced range of the
parallel variable are assigned the value of the scalar expression. 3)
The l e f t hand side is a sequential variable and the right hand side is a parallel expression. value
4)
The referenced range of the parallel variables must be such that the
of
the
The l e f t
right
hand
side
expression
includes
one
component.
hand side is a parallel variable and the right hand side is a parallel
expression.
The referenced components of the left hand side variable are assigned
the corresponding elements of the right hand side expression. expression must be equal to,
or overlap~
The following program exemplifies different kind of assignments. Program Assign; vat ODD :
selector[0..127]:=(1..127 step 2 => true);
EVEN, :
selector[0_127];
P1,P2 : paral!e[ array[0,.127] o f integer(16); I :
integer;
begin EVEN:=no___ktODD; PI[EVEN]:=P2*2;
(* Both sides parallel. Same range *) (* Both sides parallel. The range (* of the rigth hand side expression (* overlaps the referenced range of PI *)
PI[ODD]:=0;
(* Left hand side parallel~ right hand (• side scalar. All the odd elements of (* P1 are assigned the value zero *)
I:=P215];
(* Left hand side scalar, right hand (~ side parallel, but the referenced (* range includes one single element ~)
SEL:=P1 > P2;
The range of the
the referenced range of the l e f t hand side
variable.
SEL
single
(* Both sides parallel. Same range *)
125
h=P2[SEL]~
(* L e f t hand side is scalar, r i g h t hand (* side is paraUet. SEL must have one (* single component w i t h the value TRUE *)
end.
5.5.& C o n t r o l Structure
Pascal contains five structured if
,
case ,
while ,
actions taken
~
constructs which control the sequential program flow: the
and fo_.Lr statements.
In a sequential Pascal program,
all the
can be ordered according to the t i m e i n t e r v a l in which they occur.
This
ordering defines the program flow and is directed by the control statements and by the order
in which
LUCAS
differs
interval,
each
statements are w r i t t e n from in
this
in that
a different
in the program.
as many
The execution of programs on
as 128 actions
Processing
Element.
In
may occur
the
same
constructs in Pascal determine the execution along the t i m e dimension, included
in Pascal/L to allow the control
during the same
way
as the
control
new concepts are
of selection and r e p e t i t i o n along the parallel
dimension. The construct:
if then <true-statement> else
in Pascal selects one of two d i f f e r e n t paraUel s t a t e m e n t , determine
if
the
paths in the program flow.
the Boolean expression yields a selector. true-statement
or
the
false-statement
in the corresponding
Elements of the selector will
be
executed
on
the
corresponding data elements. In a global perspective this means that
both paths w i l l
be f o l l o w e d and that
true-statement
are
but
different
PEs.
and
the
Rather
false-statement than to
extend the
executed~
on
different
meaning of the if-then-else
data
both the and
construct~
in we
define a parallel selection with the foUowing form: where <selector expression> do <true-statement> elsewhere
where the elsewhere-part is optional. Analogous to the Pascal case statement,
Pasca[/L defines a parallel form
of the case-
126
statement,
in accordance to the where-do-elsewhere construct,
not result in one execution path being fotlowed~
but all,
the parallel case does
each working on d i f f e r e n t data.
The form of the parallel case statement is: case where <parallel expression> of : <statement>i : <statement>i
: <statement>i others
: <statement>;
end~ where the others-part
is optional.
compound statements~
i.e.
Like
in Pascal,
statements may be of the form
a list of statements surrounded by a begin - end -pair.
In a similar way an extension to the Pascal while do <statement>
is defined to control repetition for parallel data: while and where <selector expression> d o <statement>
Here the statement is repeated as long as the selector expression takes the value TRUE in
any element,
However,
during
each
repetition
of
the statement~
the
selector
expression also decides in which PEs the statement should be executed. The following example shows how V[I] modulus N[[] can be calculated for every element of the t w o vectors V and N using repeated subtractions: vaT. V,N
:
paratie[ array[0..127] o f integer(16);
whiie and where V >= N d_£ V:=V-N;
127
6.3~5 Standard Functions and Procedures
6.3.5.1 Data Alignment in expressions and assignments where the components of the parallel variables are located in different PEs,
the variables must be aligned.
The kind of alignment needed is defined
by the programmer in terms of standard functions,
which correspond to the possible data
movements over the interconnection network in LUCAS. shift(<parallel variable identifier>,][) rotate(<parallel variable identifier>,I) The first
of these functions shifts a parallel variable I steps along its first dimension,
placing component N in position N+L
Zero-elements are shifted in from the edge.
The
rotate function is similar to the shift function except that the elements that are shifted out at one edge are shifted in at the opposite edge of the parallel variable. shuffle(<paraJlei variable identifier>) exshuffte(<parallel variable identifier>) The
elements
of
the
parallel
variable
Shuffle/Exchange network on LUCAS. the declared range 0..127. component N with Section 1.4.2).
index
permuted
according
to
the
Perfect
The first function performs a shuffle of the elements,
placing
index n0nln2...n k in position Shuff|e(N) with index nln2...nkn 0 (see
The second function performs a shuffle followed by a pairwise exchange
of the elements, with
are
These functions are only defined for variables with
placing component N with
nln2,..nkn0" ,
where
nO"
index n0nln2...n k in position £xshuffle(N)
denotes
that
the
last
bit
in
the
index
is
complemented.
6.3.5.2 Selector Operations first(<selector expression>) This function is used to find the first component of a selector expression with the value TRUE.
It returns a new selector with only this element TRUE.
next(<selector i d e n t i f i e r ) ) The next-procedure assigns the value FALSE to the first true element of the parameter,
128
which must be a variable.
This is useful when processing selected elements sequentially.
any
The any-function returns the value FALSE if a previous call to the first-function or the next-procedure returned an all-false selector,
otherwise it returns the value TRUE.
some(<setector expression>)
A call to the some-function evaluates the selector expression and returns the value TRUE if it contained at least one TRUE element,
otherwise the value FALSE is returned.
Example:
vat PAR1 : p a r a l t e l array[0..99] o f unsigned integer(12);
SEL
: .selector[0..99];
SUM
: integer;
begin SUM:=0; SEL:=PAR1 > 10; white some(SEL) do begin SUM:=SUM+PARIEfirst(SEL)]; next(SEL); end;
In the example SUM gets the sum of all the elements of PAR1 whose values are greater than 10.
6.3.5.3 Input and Output The Pascal standard procedures read and w r i t e are extended to allow input and output of parallel variables. out.
Details of how this should best be accomplished have not been worked
As a preliminary a t t e m p t
the
procedures are extended so that
variables denoting parallel arrays as parameters~ be read or w r i t t e n .
they
may take
meaning that whole parallel arrays may
129
6.3.6 Microprograms It
is
possible to
explicitly
microprogramming language.
invoke
a microprogram
which
has been written
in the
This allows a significant speedup for parts of the program
that can be expressed in microcode,
i.e.
parts which only include parallel operations.
Examples of such operations are matrix multiplications and image operations. A microprogram should be declared in the declaration part of the program. of
the
declaration
is
similar
to
the
syntax
of
microprogram
The syntax
headings
in
the
micraprogramming language( see Section 5.4.4): <microprogram declaration> ::= microp£ogram <microprogram parameter list> I external~ <microprogram parameter list> ::= <empty> 1 ( <microprogram parameter> /~<microprogram parameter>/3 <microprogram parameter> ::= <empty> I
A standard function which is used in conjunction with microprograms is the following:
location(<parallel variable identifier>) This function results in an integer value,
which indicates the bit address to the least
significant bit of the parallel variabte in the Associative Array. Invocation of microprograms is similar to procedure calls. va.._Zr MI~M2~M3 : parallel array[0..127~0..127] o__finteger(16)~ Microprocjram Matmult(A~B~C~precision); externall begin ..o°
Matmult(Loeation(M1)~Location(M2)~Location(M3)~l6)i o...
end__;
130
6.4 EXECUTION ON LUCAS In most impiementations of Pascal,
a v i r t u a l stack-oriented "pseudo-machine" is used as
the target computer for the generation of i n t e r m e d i a t e code by the compiler (p-code). order to execute a compiled program, interprets
the p-code,
In
either a software emulator of the v i r t u a l machine
or a final compilation phase translates the p-code into actual
machine code. In this section we w i l l define some of the extensions to a Pascal pseudo-machine, are needed to implement Pascai/L. Pascal/L pseudo-machine,
which
We w i l l present a part of the instruction list for the
which is adequate to describe the execution of two i m p o r t a n t
constructs in Pascal/L; parallel expressions and t h e where-do-elsewhere statement. The use of the Pascai/L pseudo-machine as the t a r g e t computer for the Pascal/L compiler has the additional advantage that an emulator for the pseudo-machine can be implemented on any standard computer, moved to LUCAS.
This means that programs can be tested before they are
The tests can include relevant performance estimations and extensive
error-checking on the p-code level,
6.4.1 Pascal/L Pseudo-machine The Pascal/L pseudo-machine has several registers and uses three distinct memory areas as shown in Figure 6.1.
On LUCAS,
two of the memories are located in the memory
area of the Master Processor (the Program Memory,
the Stack) and the third in the
Associative Array (the Parallel Memory). The Program Memory holds the instructions of the program being executed. PC,
A register,
points to the instruction that w i l l be executed next,
The Stack contains sequential variables, in the Stack,
the Program Memory and the Parallel Memory.
locations in the Stack: SP, AP,
the
sequential temporaries and pointers to locations
activation
the stack pointer,
pointer,
to
the
Two registers point to
points to the top-of-stack element and
activation
record
of
the
currently
executing
procedure. The Parallel Mern0r X holds parallel variables and parallel temporaries. Parallel Memory
is a 128-element vector and is defined by a descriptor ,
located in the Stack. format
Each entry in the
specification
which
is
A descriptor consists of a pointer to the Parallel Memory and a giving
the
precision
of
the
variable.
The
Parallel
Memory
is
131
organized in the f o r m of two stacks: the Parallel Stack and the Range' Stack . The register PSP points to the top element of the Parallel Stack, variables
and
parallel
temporaries
(used
during
expression
which contains parallel
evaluation
to
held
the
i n t e r m e d i a t e results). Each parallel variable has an associated bit-slice r the
declared
range
of
the
variable.
Each
the range indicator ~
temporary
on
the
which indicates
Parallel
Stack
has a
corresponding range indicator giving its actual range. The register RP points to the top element of the Range Stack. the
Current
(evaluation) Range is stored,
paralle! control statement,
e.g.
This is a bit-slice where
This range is set either when executing a
the where-do-elsewhere statement~
or as the result of
an indexing operation. The Stack is essential far the evaluation of expressions and is used to reference all the operands.
In stack-oriented machines all operands are pushed onto the stack from where
they
removed
are
by
the
arithmetic
and
logical
completed the result is pushed back onto the stack.
operators.
Once
the
operation
is
When operating on parallel variabtes~
it is often enough to push a descriptor on the Stack without also pushing the variable itself
onto
the
Parallel Stack.
In many simple
expressions (like adding two
parallel
variables and storing the result in a third) this w i l l result in a considerable reduction of the overhead involved.
132
Stack
Variables
Program Memor
128
Parallel Memory
Figure 6.1 The Pascal/L pseudo-machine,
The value of an entry in the Stack may have several interpretations=
a scalar value * a pointer to another entry in the Stack * a pointer to a location in the Program Memory * a descriptor to a variable in the Parallel Memory a descriptor to a temporary on the Parallel Stack, Upon procedure entry~
a local data area for the procedure is created both in the Stack
and in the Parallel Memory. area for
scalar
indicating
the
On the Stack this takes the form of a reserved memory
variables and for declared
ranges of
descriptors to the local parallel variables. the
local
parallel
variables are
Bit-s|ices
loaded and selector
133
variables are i n i t i a t e d if needed. The instructions needed to demonstrate expression evaluation and the where-do-elsewhere construct w i l l now be described. the
Stack
and OP for
instruction.
In the f o l l o w i n g ,
the Stack
entry
which
TOS stands for the entry on top of
is addressed by the operand f i e l d of an
(TOS) and (OP) stand for entries in the Parallel Stack whose descriptors are
TOS and OP r e s p e c t i v e l y .
LOAD
type,lev,disp
This instruction puts a variable on top of the Stack. on the values of " l e v " and " d i s p ' .
The location of the variable depends
Lev indicates the number of static levels to traverse
in order to find the a c t i v a t i o n record and disp is the offset within the a c t i v a t i o n record to find the variable.
Together they define the OP entry in the Stack.
value of the t y p e - p a r a m e t e r , Load a scalar. and
the
Load
value
a parallel
The value of the scalar is pushed on the Stack. is
Depending on the
d i f f e r e n t actions are taken:
put
variable.
in
the
location
indicated
by
the
(SP is incremented new
value
of
5P.)
The descriptor of the variable is pushed on the Stack,
but the variable is not moved to the Parallel Stack.
LIT
value
This i n s t r u c t i o n loads the literal specified in the p a r a m e t e r
COPY
on
the Stack.
type
Push TOS onto the Stack,
i.e.
make a duplicate of the element on top of the Stack.
Depending on the t y p e - p a r a m e t e r the f o l l o w i n g actions may be taken; O
Copy a scalar.
The value of TOS is pushed on the Stack.
Copy a parallel variable.
The descriptor which is in TOS is pushed on the Stack,
but the variable is not moved to the Parallel Stack.
Copy a parallel t e m p o r a r y .
The descriptor which is in TOS is pushed on the Stack,
a copy of the t e m p o r a r y is pushed on the Parallel Stack.
134
type,lev,disp
STORE
This
instruction
parameter~ O
stores TOS in the OP location.
Depending on the value of the type-
the following actions may be taken:
TOS
and
OP
are
both
scalars.
Copy
TOS and OP are both parallel variables.
TOS
into
location
OP
and
First compute a selector by p e r f o r m i n g
the operation A N D between the declared range of OP and the Current by register
RP.
Then use this
operation in the Parallel Stack. selector, if
i.e.
not~
TOS
a
selector while
performing
Range - as a field
copy
Check that the declared range of TOS overlaps the
that TOS is defined in every component which has been copied~
raise
is
TOS.
They are both represented on the Stack by
descriptors to entries in the P a r a l l e l Stack.
indicated
pop
a
parallel
run
time
error.
temporary
and
Pop OP
is
the a
TOS
parallel
descriptor
off
variable.
They
the are
represented on the Stack by descriptors to entries in the Parallel Stack. the same actions as in 1),
and
Stack. both
Perform
then pop (TOS) o f f the Parallel Stack by adjusting PSP.
TOS is scalar and OP is a parallel variable.
Compute a selector as above and use
it
the
while
performing
a
field
load
in
TOS is a parallel variable and OP is a scalar. that
this
error. location.
selector
has one single TRUE
Parallel
Stack.
Pop
Compute a selector as above.
element,
and if
not,
TOS. Check
raise a run t i m e
Use the selector to read out the variable element and store it in the OP Pop the TOS descriptor off the Stack.
TOS is a parallel temporary and OP is a scalar.
Perform the same actions as in 4),
then pop (TQS) o f f the Parallel Stack.
ST[N
type
This store-indirect instruction is similar to S T O R E , second element of the Stack (TOS-I),
but the target address is in the
and not specified in the instruction.
parameter has the same meaning as in the S T O R E
instruction.
The type-
135
NOT/NEG
These
type
are
unary
negation.
instructions
for
forming
They operate on TOS,
the
Boolean
complement
and
the
arithmetic
pop the Stack and push the result back on the Stack.
Depending on the value of the t y p e - p a r a m e t e r we have" g
TOS
is
a
scalar.
Perform
TOS is a parallel variable. entry
It
in the Parallel Stack.
Stack which
the
operation
and
replace
TOS
with
the
result.
is represented on the Stack by a descriptor to an
Execute the instruction
is described by TOS.
on the entry
in the Parallel
Push the result on the Parallel Stack with
the
declared range of TOS stored as range i n d i c a t o r and replace TOS by a descriptor to the new element. TOS is a parallel t e m p o r a r y .
[t is represented on the Stack by a descriptor to an
entry
Execute the instruction
in the Parallel Stack.
Stack.
on this value in the Parallel
Leave the result in the same location of the Parallel Stack w i t h o u t changing
the range indicator.
ADD/SUB/IvLILT/D[V/IvlOE)
type
These instructions represent binary operations which operate on the two top elements of the
Stack.
Similar
to
the previous
instructions,
the t y p e - p a r a m e t e r
operands are scalars or if one or both operands are parallel, scalars, at
least
indicates if
both
in the case where both are
the result is pushed onto the Stack a f t e r the two operands have been popped. one
of
Associative A r r a y ,
the
operands
is
paratle],
then
the
operation
is
performed
in
If the
leaving the resuit on the Parallel Stack a f t e r the operands have been
popped. The t y p e - p a r a m e t e r can take any of the values 0 to 8,
as seen in Table 6.1.
136
TOS-1
is
= = = = = = = =
parallel temporary
parallel variable
scalar TOS i s = = = = = =
scalar
1
2
parallel variable
4
5
7
8
parallel
temporary
Table 6.t The value of the type-parameter in binary operations
SETR~E
Compute
type a new value of Current
Range by performing a Boolean AND
Current Range and the selector in (TOS). the Range Stack.
between the
Push the old value of the Current Range onto
Depending on the value of the type-parameter,
the following actions
are also taken:
0
TOS
is
a
parallel
variable.
TOS is a parallel temporary.
Pop
the
TOS
descriptor
off
the
Stack.
Pop (TOS) off the Parallel Stack and pop the TOS
descriptor off the Stack.
POP~ Restore Current Range to a previous value by popping the Range Stack.
137
SWAPRANGE
Exchange the Current Range with the top element of the Range Stack.
6.4.2 Parallel Expressions In order to use the pseudo-machine for evaluation of a parallel expression in a Pascal/L program~
the expression is translated by the compiler to a form of postfix notation.
This form is ideal when a stack is used to compute the expression, simple rule may be used,
since the following
while scanning the expression from left to right"
If the next symbol is an operand then push its value on the stack,
else (it is an
operation) use the element(s) on top of the stack as operand(s) to the operation, pop the operand(s) off the stack and push the result.
When starting the computation, reached, as
well
the stack is empty and when the end of the expression is
the result is the only element [eft on the stack. as
constant%
the
notation
is
extended
so
Since we deal with variables
that
an operand no
longer
is
represented by its value but by an instruction which should be executed in order to put the value on top of the stack. This describes a commonly used technique for the intermediate code in language compilers and is the philosophy behind the Pascal p-code [Wirth71].
The code generated from the
Pascal/L compiler consists of instructions similar to those described in Section 6.4.1. Without dealing w i t h how the transformation process works~
we w i l l took at an example
of a parallel assignment statement in Pascal/L;
var P1,P2 ODD I
:
parallel array[0..127] o f integer(32)~ : selector[0..127];
: integer~
begum
PI[ODD]:=P2-P1 *(2+I);
The statement PI[ODD]:=P2-PI~(2+I) has been translated into parallel p-code and w i l l be executed on the Pascal/L pseudo-machine. Figure 6.2, refer to.
In the p-code program~
which is shown in
we have replaced the lev/disp-parameters w i t h the name of the variable they
138
Instr. .
.
.
1 2 3 4 5 6 7 8 9 10 11 12
.
.
.
.
.
.
.
.
.
Parameter .
.
.
.
.
LOAD LOAD SETRANGE LOAD LOAD LIT LOAD ADD IvUI_T SUB STIN POPRANGE
.
.
.
.
.
.
.
.
.
.
.
.
.
TOS becomes .
.
.
.
1,P1 1,ODD 0 1,P2 1,P1 2 O,I 0 1 7 2
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
descriptor t o p a r . v a r . PI descriptor to selector ODD descriptor t o p a r . v a r . P1 descriptor t o p a r . v a r . P2 descriptor t o p a r . v a t . P1 scalar ( t h e v a t u e 2) scalar (value of I) scalar (value of 2+I) descriptor to par. temp, descriptor to par. temp. <empty> <empty>
Figure 6.2 Parallel p-code for the statement PI[ODD]:= P2-P1*(2+I).
6.4.3 Where Statement The general idea for executing a control statement of the form: where selector expression do statement-a elsewhere statement-b on the Pascai/L pseudo-machine is that the selector expression is used to calculate two new values of Current Range which are used when executing statement-a and statement-b respectively.
By using the Range Stack to save the old value of Current Range,
the
problem of how to handle nested where-statements (and similar constructs) is solved. Upon entry of a where-statement,
Current Range is pushed on the stack and restored
after the statement has been executed. A first a t t e m p t to translate the where-do-elsewhere construct results in the p-code given in Figure 6.5. (R5),
In the figure the Current Range (CR) and the contents of the Range Stack
with the top element to the left,
are shown.
139
Instr. .
.
.
.
.
.
.
.
.
.
.
.
Parameter .
.
.
.
.
.
.
0
selector
1 2 } 4 5 6 7 8
expression COPY blOT SETRPC,~E SWAP~ SETRANGE statement-a POP~E SWAP~
9 10
statement-b POPFUXNGE
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
CR .
.
.
.
.
.
.
.
.
.
.
RS .
.
R0 1 or 1 or 1
Z 2
0 or
t
R0 R0 Rb R0 Ra Ra R0 Rb
R0 Rb R0~Rb R0,Rb Rb R0 RO
Rb R0
Figure 6.3 Preliminary translation of the where-construct.
Before executing the instruction on line 1, (TOS).
Depending on the type of expression,
parallel temporary. Stack.
we assume that the selector expression is in
The first
this is either a parallel variable or a
instruction produces a copy of this value on the Parallel
Instructions 2 and 5 invert the value and calculate a new range,
the Current Range. following
SWAPRANGE
statement-a.
After
by masking with
This range will be used during the execution of statement-b. and
SETRANQE
execution
of
operations
statement-a,
calculate the
Range
the
Current
Stack
is
Range
popped,
The for and
SWAPRANGE sets the Current Range to the previous calculated value for statement-b. Execution of the where-statement terminates with restoring Current Range to its initial value. In Section 6.5.4 where the control statements of Pascal/L were introduced~ discuss the semantic aspects of executing them on an SIMD computer.
we did not
Intuitively,
we
feel that the execution of the do-part and the elsewhere-part of a where-statement ought to be independent and that no order should exist between statement-a and statement-b. However, processed
when one
executed
after
the
on other.
LUCAS The
as described above, following
example
the
two
illustrates
statements
why
the
are
scheme
presented is insufficient to assure that the result corresponds to the desired semantics of the construct.
140
vat
P1,P2 = parallel array[0..5] o f integer(52); ODD
: selector[0..3]:=(1,5 => true);
begin °.*
where ODD d o Pl:=rotate(P2,1) elsewhere P2:=rotate(Pl,1);
Assume that the i n i t i a l values are as shown in Figure 6.4 (a). execution of the statements, decide
that
the
independent of statement-a
result
the result w i l l be d i f f e r e n t
of
whether
executing
statement-a
one
Depending on the order of
as seen in the figure.
statement,
say
statement-b,
has been executed or not,
does not change any variables until
statement-b
we
must
If we
should require
has been executed.
be that And
similarly the other way around. Figure 6.4 (b) shows the result in the case where both statements calculate their results and
update
the
independent Range.
variables
of
when
the order
of
both
are
terminated.
execution since
they
Note
that
use d i f f e r e n t
these
updates
values on the
are
Current
Figure 6.4 (c) shows the result when statement-a is executed before s t a t e m e n t - b
and Figure 6.4 (d) when statement-b is executed before statement-a.
index .
.
.
.
.
P1 .
.
.
.
.
.
P2 .
.
.
.
.
P1 .
.
.
.
.
.
.
.
.
P2 .
.
.
.
P1 .
.
.
.
.
.
.
.
.
P2 .
.
.
.
P1 .
.
.
.
.
.
.
.
P2
.
.
.
.
0 1
a b
A B
a A
d B
a A
C B
a d
d B
2 5
c d
C D
c C
b D
c C
A D
c b
b D
a)
b)
c)
Figure 6.4 Result depends on the order of execution. excution.
(c)
statement-a
executed
d)
(a) i n i t i a l values.
before
statement-b.
(b) independent (d)
statement-b
executed before statement-a.
In order to obtain independence between the two statements, be used to store the new values of assigned parallel variables, shown in Figure 6.5.
t e m p o r a r y locations must resulting in the p-code
141
[nstr. .
.
.
.
0 1 2 3 4 5 6 7 8 9 I0 11
.
.
.
.
.
.
Parameter .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
CR .
.
.
.
.
.
.
.
.
.
RS .
.
.
.
.
.
.
.
.
R0
selector expression COPY 1 or 2 NOT 1 or 2 SETRANGE 1 SWAPRANGE SETRANQE 0 or 1 statement-a (mod i f i e d ) POPRANGE SWAPRANGE statement-b POPRANGE copy temporaries to parallel variables
RO RO Rb R0 Ra Ra
R0 Rb RO~Rb RO~Rb
R0 Rb Rb R0
Rb R0 R0
Figure 6,5 Translation of the where-construct.
While executing statement-a~
all parallel variables which appear on the l e f t hand side of
an assignment (statement-a may be a compound statement) are copied to a t e m p o r a r y area.
For each variable copied~
used when updating the variable. Statement-a is now executed,
there is also an "modify-selector"~
Which w i l l tater be
This selector is initiated to an all-false value. but the following
modifications have been made in the
code:
Ail
assignments
are
to
the
temporary
locations
of
the
variables.
The
corresponding modify-selector is updated to r e f l e c t which elements have been changed in the temporary location. When using a variable which have been copied to a temporary location~ value Statement-b
is is
taken
from
executed with
t e m p o r a r y locations~
this
Iocation~
no changes.
Finally
not
from
variables are
the
its
variable.
updated from
their
using the corresponding modify-selectors as indices in the updates.
A similar technique is employed for the parallel case-statement.
142
6.5 PROGRAMMING EXAMPLES
Examples of programs written in Pascal/L can be found in the following chapters, only two examples wiU be presented here.
and
They are both related to the application
studies in the chapters 8 and 9. Example 1.
Outer Perimeter of Object s
An algorithm for
finding the outer perimeter of objects in a binary image works as
follows (see Example 14 of Section %4.4): 1) 2)
Mark
one
Propagate
picture the
element
marker
to
at
the
neighbours
edge in
the
which
belongs
edge column
to which
the
background
belong
to
the
baekgound 3)
Copy
markers
to
elements
4)
Propagate the markers to neighbeurs in the column which belong to the background
5)
Scan back
and forth
in
over the
the
next
entire
column
image until
This algorithm can be expressed in Pascal/L. as follows:
Program Perimeter; (* Find outer perimeter of objects in Boolean image [ *) vat I,M : parallel array[O..127,0..127] of Boolean; finished : Boolean; k : integer; (*used to indicate columns*) begin (* edge column *)
white and where not(M[*,O]) and not(l[*,O]) and (shift(M[*,O],1) or shift(M[*,O],-1)) do M[*,O]:=true; finished:=false; white not finished do begin finished:=true; (* scan l e f t *) for k:=l to 127 do begin M[*,k]:=net([[*,k]) and M[*,k-1];
which
belong to
the background
no new markers are
produced
143
while and where not(M[%k]) and not(it%k]) and (shift(M[*,k],l) or shift(M[*,k],-t)) do begin M[%k]:=true; finished:=false; end~ end; (* scan right *) if not finished then for k:=126 downto 0 do begin M[**k]:=not(I[*~k]) and M[%k+l]; while and where not(M[*~k]) and not(It%k]) and (shift(M[*,k],t) or shift(M[%k],-1)) do begin M[%k]:=true; finished:=false; end; end; end; end.
Example 2.
Project
One commonly
used operation
in
the
relational data
base model :s the
PROJECT
operation: PROJECT R1 OVER A GIVING R2 where R1 and R2 are relations and A an attribute of R1. relation~
R2,
This operation creates a new
from R1 by discarding attributes other than A.
After that~
all redundant
tuples are removed from R2. Each relation has a corresponding mark selector which indicates where tupies are defined. A description of the operation can be found in [Kruzela83].
Program Project~ vat R I M A R K : selector[0..127]: (* shows where R1 defined *) R2MARK : selector[0..127]; (* shows where R2 defined *) TEMP1 = selector[0..127]; (* marks remaining tuples in R1 *) TEMP2 : selector[g..127]~ (* marks all duplicates of the (* tupte that is under comparison *) RI : parallel array[O..127] of record A,B,C : string(20); end~ INSTANCE : string(20);
144 begin ..... (* relation R1
is input and R I N I A R K
TEMPI :=R1MARK; INSTANCE:=RI[first(R1MARK)].A;
is initiated *)
(* select first instance of (* attribute A *)
while any do begin TEMP2:=(INSTANCE=RI[TEMPI].A); (* select duplicates *) TEMPI[TEMP2]:=not TEMP1; (* mark as analylzed *) R2MARK[first(TEIx4P2)]:=true; (* the first is included in R2 *) INSTANCE:=RI[first(TEMP1)].A; (* get the next distinct instance end~ (* of attribute A *) end.
Part 3
APPLICATION STUDIES
Chapter 7 SOME WELL-KNOWN PROBLEMS IMPLEMENTED ON LUCAS
7.1 INTRODUCTION It
is
often
seen
that
computations
performed
in
separate application
common m a t h e m a t i c a l tools and computational techniques.
If
areas rely
on
it can be shown that a
particular computer design is well suited for the application of one or several widely used tools,
it means that the computer may be useful in many application areas.
In this c h a p t e r we study the i m p l e m e n t a t i o n on LUCAS of three important classes of computations,
namely
matrix
multiplication,
computation
of
the
discrete
transform (DFT) by means of the fast Fourier transform (FFT) algorithm, graph theoretic problems.
Fourier
and solution of
They all represent tools and techniques that are not l i m i t e d to
any specific realm of computation. Studies of m a t r i x multiplication on parallel computers have been made in connection with the DAP project [F|anders et ai.77] and in [Pease'/7]. very d i f f e r e n t from the one used on LUCAS, data
routing
problem.
The approach
The DAP interconnection scheme is
which results in d i f f e r e n t solutions to the
taken by
Pease on a proposed cube-connected
processor array also differs widely f r o m the methods reported here. The FFT algorithm for the calculation of the DFT has been known since 1965 [Cooley and Tukey65].
It is well known that it can be mapped e f f i c i e n t l y onto a perfect shuffle-
connected processor array [Pease68,
Stone71].
However,
the implementation on LUCAS
is probably the first one in practice. Graph theoretic problems are relevant in many application areas. open to e f f i c i e n t
Very often they are
solution on parallel computers [Quinn and Dee 84].
solving the shortest path problem on LUCAS is given.
An algorithm for
On other graph theoretic problems
we also demonstrate how algorithms designed to be e f f i c i e n t on conventional computers can be adapted to a parallel computer like LUCAS. The three computational areas dealt with in this chapter put rather diverse demands on LUCAS.
Matrix
multiplication
and
Fourier
transformation
utilize
interconnection network to a high degree and require that it be effective,
the
parallel
so that full
147
parallelism carried
can be m a i n t a i n e d all the t i m e ,
out
performed
without
conflicts.
The graph
efficiently,
and
also
that
LUCAS
turns
out
to
performed.
7.2 M A T R I X
which
means that c o m m u n i c a t i o n
theoretic some
problems require
that
unconventional
data
quite
meet
these
diverse
demands
must be
searches be passing fairly
be well.
MULTIPLICATION
The m u l t i p l i c a t i o n
of t w o n by n e l e m e n t m a t r i c e s consists of the f o r m a t i o n of n 2 inner
products of pairs of n - e l e m e n t vectors.
An inner product of t w o v e c t o r s is defined as
* = a l b t + a2b 2 +...+ anb n. When m u l t i p l y i n g t w o matrices,
A and B,
the e l e m e n t of the i:th row of column no.
is f o r m e d as the inner product of the vectors comprising row no.
j
i of A and column no.
jofB. The
traditional
sequentially,
method
for
multiplication
one a f t e r the other.
of
two
matrices
computes the inner products
Each inner product is likewise c o m p u t e d sequentially,
as a sequence of m u l t i p l i c a t i o n s and additions. Paratlelizing number
the
of
computation
processors
can
available
be
made
compared
in many to
the
various ways. size
of
the
Depending matrices,
on the different
approaches may be f a v o u r a b l e .
7.2.1 n x n M a t r i c e % We f i r s t
n Processors
consider the case when n processors are a v a i l a b l e .
The most obvious way of
using the p a r a l l e l i s m is to c a l c u l a t e the n m u l t i p l i c a t i o n s of an inner product c o m p u t a t i o n simultaneously.
This is known as the " i n n e r - p r o d u c t
m e t h o d " [Hackney and 3esshope 81].
The r e m a i n i n g addition of the terms of each inner product can be made by n processors in
O(log
n)
time
shuffle+exchange addition
is small
using
network.
a
suitable
Since,
compared
to the
a d d i t i o n step is not v e r y severe. The t w o vectors that we m u l t i p l y w i t h each other, the A - m a t r i x
communication
on a b i t - s e r i a l multiplication
However,
network,
computer,
time,
the
perfect time
the reduced p a r a l l e l i s m
in the
in order to compute an inner-product must be aligned
is stored one column in each processor's m e m o r y ,
the B - m a t r i x
If
must be
This w i l l align the rows of A w i t h the columns of B.
If the m a t r i c e s are loaded into the m e m o r y for m u l t i p l i c a t i o n only, However,
for
there are o t h e r problems w i t h this approach:
so t h a t corresponding elements are available to the same processor.
stored one r o w in each m e m o r y .
any problem.
e.g
the e x e c u t i o n
this need not cause
if the m a t r i c e s are c r e a t e d in the parallel a r r a y or are subject
148
to
other
operations that
perhaps demand the
same storage method for
alignment problem has to be solved in the array. Multidimensiona! accessible.
Access
Memory
both,
the
STARAN has this possibility through its
[Batcher77],
where
both
rows
and
columns
are
This is not possible on LUCAS.
Instead of computing each inner product with the largest possible parallelism, products can be computed simultaneously.
many inner
( A total of n2 inner products are to be
computed). Referring to Figure 7.1,
in order to form the first column of the result matrix,
the
following n inner products must be computed=
* 00 01 02 0,n-1 00 10 20 n-l,O * 10 11 12 1,n-1 00 10 20 n-l,O
* 00 10 20 n-l,0
These expressions show that the first column of A is to be multiplied by BOO, column by B10 , accumulated,
the third by B20 ,
etc.
the second
The results of these multiplications are
and the final results will appear at the correct positions.
A0O A01 A 0 2 ' ' ' A 0 , n _ I
BOO BOl B 0 2 " ' ' B 0 , n _
A10 A l l
B10 B l l
A12'''
B12..-
A20 A21 A 2 2 ' ' '
B20 B21 B 2 2 . . -
An_l, 0 -.-
Bn_l, 0 ..-
An_l,n_ 1
1
Bn_l,n_ 1
Figure 7.1 Two matrices.
Thus,
the k:th column of the product is formed by successively multiplying each column
of A by the elements of the k:th column of B,
constantly accumulating the results.
multiplications are made as multiplications by a scalar, of vector alignment. method".
The
which frees us from the problem
Figure 7.2 illustrates the algorithm,
called the "middle-product
149
AOOB00+A01B10÷A02B20+" •.
A00BoI+AglBlI+A02821+-.-
AooB02+A01B12+Ao2B22 +'''
At 0B00+A~ t B t 0+A12820 +" " "
kl0B01+kl~B}%*kl2B2l +---
A|0Boz+Al1812+AI2822+ ' ' "
A2OBo0+A21B10+A22B20+" ..
A20Bot+A2~BIl+A22821+---
A20B02+A21B12+A22822 +''"
A30B00+A31 B I 0+A32B20+" " "
A30B01+A31BII+A32B21+--- A30Bo2+A31B12+A32B22÷''"
I teration 1;
Substep A
S u b - Substep |
step C
I terat i o o 2:
Substep A
Substep B
Sabstep (:
I tera-
Sub-
Sub-
Sub-
tion 3:
step
step
step
A
B
C
Figure 7.2 M a t r i x
multiplication
algorithm
that
produces
the
result
column
by
column
(middle-product method).
We a r r i v e d at the new scheme by stating t h a t column
at a t i m e ,
produce the result
we wanted the result to come out one
then analyzing which data and which computations were needed to in this form.
Continuing this line of reasoning,
we may ask if we
can find a method to produce the whole result m a t r i x simultaneously, iterations,
where each i t e r a t i o n produces an n x n m a t r i x ,
the result m a t r i x .
i.e.
proceed in n
and the last one produced is
We want all inner products to " g r o w " simultaneously,
as i l l u s t r a t e d in
Figure 7.3. As can be seen f r o m the figure,
this can be done w i t h o u t problems.
[n fact this is just
doing things in a d i f f e r e n t order compared to the method described above. problems (or,
rather,
lack of problems) are e x a c t l y the same.
The access
This is what is called
the " o u t e r - p r o d u c t method". The two methods described work equa|ly well if the matrices are not square.
The only
constraints on the size are that the number of rows must be smaller than or equal to the number
of
processors,
middle-product conservative,
method
and that which
the
memory
produces
the
space h o r i z o n t a l l y result
column
by
is large enough. column
is
since the B m a t r i x can be successively o v e r w r i t t e n if desired.
more
The space
150
Iteration I:
A00Bo0+AoIBI0+A02B2o÷---
AooB01+A0}BII+A02B2% +--.
A00B02÷A01812+A02B22÷'-"
AIoB00+AIIBI0+AI2B20+---
AIoB01+AIIBII+AI2B2I +.-*
AIOB02+AI|BI2+AI2B22+*'-
AzoB00+AzIBI0+A22B20+---
A20801+A21BII+Ax2B21 +--,
A20B02+A21B12+A22B22÷--"
A30B00÷A31BI0+A32B20+-'-
A30B01+A31BII+A3zB21 +''"
A30B02+A31B12+A32B22+""
Sub-
Substep C
Substep
step B
A
Iteration 2: Iteration 3:
Substep
Substep
A
B
Substep
C
Sub-
Sub-
step
step.
Substep
A
B
C
Figure 7.3 M a t r i x m u l t i p l i c a t i o n a l g o r i t h m that computes all inner-products "simu[taneously" (outer-product method),
Written in Pascal/L the middle-product method looks as follows:
Program M A T R I X M U L T ; vat A,B,C: parallel array[0..127,0..127] of integer(8); row: selector[0..127]:=(0=>True); acol,bcoh integer; begin (*clear C *) for bcoh=0 to 127 do C[%bcol]:=0; (* for each B-column *) for bcol:=0 to t27 do for acol:=0 to 127 do (* m u l t i p l y all columns of A *) begin C[*,beol]:=C[*,bcol] + A [ * , a c e l ] * B [ r o w , b c o l ] ; row:= r o t a t e ( r o w , I ) ; (* w i t h each element of B-column *) end; end.
An estimation of the time required to m u l t i p l y two t28 x 128 element matrices of b - b i t data can be made as follows: The number of multiplications by scalar are 1282 = 21/4.
If recoding of the m u l t i p l i e r
using canonical signed-digit code is done (see Chapter 3),
a m u l t i p l i c a t i o n will - on the
151
average
consist of b/3 additions [Hwang79].
-
cycles.
Each m u l t i p l i c a t i o n
Register, bits.
Each addition takes a t i m e close to 3b
is preceded by a transfer of the m u l t i p l i e r to the Mask
which takes approximately 4b cycles.
Thus,
It is followed by an addition over 2b
the m u l t i p l i c a t i o n t i m e is a p p r o x i m a t e l y
214 * (b/3 * 3b + 4b + 6b) = 214(b2+10b) cycles. For
b=8 this
is approximately 2.4"106 cycles.
a p p r o x i m a t e l y 0.5
seconds,
overhead t i m e
With
not
200 ns clock cycles the t i m e
included.
The
time
for
receding
is is
negligible.
7.2.2 n x n Matrices~
n
2
Processors
We next consider the case when we have n2 processors available for the m u l t i p l i c a t i o n of two n x n matrices. each other,
Since there are n 2 inner products to be computed independently of
the n 2 parallelism can always be fully utilized.
before m u l t i p l i c a t i o n may cause some overhead. of
doing the
computation on a perfect
shuffle-connected
method is a computation in place method, final
place.
The
other
However,
data alignment
We have studied two a l t e r n a t i v e methods
i.e.
processor array.
The first
each inner product ig computed in its
This requires a significant amount of shuffling before each multiplication. method
computes
processing elements.
the
different
terms
of
an
inner
product
in
different
The final rearrangement of the result is done in the summation
phase. The first
method is an outer-product method,
illustrated in Figure 7.3. iterations,
i.e.
a further paraltelization of the one
All inner-products are computed simultaneously in a t o t a l of n
in each iteration,
all substeps are done in parallel.
The method is described
in detail in [Svensson83a]. The second method,
which appears to be favourable,
middle-product method described in Figure 7.2.
is a further paratlelization of the
Here,
the entire computation of one
inner product is finished before the next one is started.
A further paralletizatien of this
method
will
finish
the
computation of
n different
inner
products before starting
the
computation of n new ones. In order to perform all substeps of an i t e r a t i o n in parallel, Figure 7,4 is needed.
In i t e r a t i o n no.
Each row is column no. in Figure 7.5. performed.
k of B.
k the m a t r i x B (k) with n identical rows is formed.
This can be done in [og2n broadcast-shuffle%
In the first step in the figure,
In the second step,
the general case~
row no.
the alignment depicted in
as shown
the operation "Broadcast Upper" (BU) is
the operation "Broadcast Lower" (BL) is performed.
In
k = of B is spread to all columns to f o r m
B (k) by the following procedure:
152 for j:=p-1 downto 0 do if kj=0 then BU else BL; Example : Row no. sequence BL~BU,BL,BL.
of
a 16-row matrix is broadcast to
all
columns by the
The formal proof (which is simple) is given in [Ohlsson and Svensson 83]. AO0 At0 A20
An-1~0 A01 A11 A21
x
x
An-1~1 A0k Alk A2k
B
Bok B0k B0k
00 B10 B20
Bok B1k B1k B1k
Bn-110 B01 Bll B21
Blk
I1 ~
--
x
Bkk Bkk Bkk
>
Bkk
Ao,n-1 AI,n-1 A2,n-1
Bn-1,k7 Bn-1,k Bn-l,k >-.~
An-1,n-1 A
x
B0k
I
An-l,k
•
Bn-I ~1
Bkk
I Bn-I,k Bo,n-1 BI,n-1 B2,n-1
o
Bn-1,k B(k)
Bn-I,n-1 B
Figure 7.4 Alignment required for computation of column k according to the middleproduct method. "x" mark rows where products contributing to element R1, k of the result matrix are computed. "*" marks the row where the element R1, k is to be stored.
153
BOO
BOI
BIO
BOI
B20
BO|
B30
BOI
BOI
BII
BII
BII
B21
BII
B31 B02 B12 B22 B32 B03 B13 B23 B33
BII B21 B21 B21 B21 B31 B31 B31 B31
Figure 7.5 Formation of B (1) from B.
The n products contributing to element no. starting in word i.
i of a column are situated n words apart,
The summation process needed to form the element can be done in
parallel with those summations that form the other elements in the column, using the perfect shuffle/exchange network in log2n steps,
as depicted in Figure 7.6a.
Putting the result at the final destination requires log2n additional shuffles, shown in part b of the same figure. parallel for all columns.
As can be seen however~
(For a more formal treatment,
it is done which is
these can be done in
see [Ohlsson and Svensson 83]).
154
R02 R12 R22 R32
(a)
(b)
Figure 7.6 a) Computation of Ro=roo+ro1+ro2+r03 R1=r10+r11+r12+r13 R3=r20+r21+r22+r23 R4=r30+r31+r32+r33 b) Putting the result at the final destination (column 2 assumed)
To sum up the
amount of
shuffling
needed,
we note that neither the pre-alignment
procedure nor the addition procedure u t i l i z e the full parallelism if only one column is t r e a t e d at a time.
However,
the columns can be treated partly simultaneously.
reduces the number of stages for each of the procedures.
This
The t o t a l number of passes of
a bit-stiee through the shuffle/exchange network of I_UCAS in order to align arguments for multiplication, the data length):
add the contributions and rearrange the result is the following (b is
155
Pre-alignment;
2(n-1 )*3b
Summation;
(n-1)*Sb
Post-alignment=
tog2n*2b
Each pass takes one clock cycle.
7.2.5 n x n Matriees~
The m u l t i p l i c a t i o n t i m e is a p p r o x i m a t e l y 5b 2 cycles.
M o r e Than n But F e w e r Than n.2,Processors
L U C A S w i t h its 128 PEs does not f i t into the scheme of Section 7.2.2~ an even square.
when the number of PEs is e.g. In
such
a cas%
the
associative m e m o r y .
n2/2~
n 2 elements
or more general n2/m~
of
a matrix
are
where m is a p o w e r of 2.
distributed
over
m
fields
in the
Figure 7.7 i l l u s t r a t e s the case when n=4 and m=2.
We adopt the m i d d l e - p r o d u c t
method.
To align elements for the c o m p u t a t i o n of column
k~
the m a t r i x B (k) is f o r m e d through broadcasts of column k.
all
such
matrices
since I 2 8 is not
In this section we study the usefulness of the i n t e r c o n n e c t i o n n e t w o r k
amounts
to
m*2(n-1)
broadcasts
if
the
The t o t a l t i m e to f o r m procedure
is
para]lelized
maximally. In each PE,
m m u l t i p l i c a t i o n s are made and the products added.
added over the i n t e r c o n n e c t i o n
network
and adding can again be parallelized~
in log2(n/m) steps.
The sums produced are
This procedure of shuffling
to yield a t o t a l of 2n(m-1) addition steps.
Finally,
log2n shuffles of each of the m fields are made. We see that the m i d d l e - p r o d u c t a l g o r i t h m of Section 7.2.2 is well adopted to the case of fewer
PEs.
The full
parallelism
is used t h r o u g h o u t
the e n t i r e a l g o r i t h m .
processing t i m e m times longer than if n 2 processors were available.
This gives a
156
A00
A02
BOO
B01
B21
A10
A12
B10
B01
B21
A20
A22
B20
B01
B21
A30 A01 A11 A21 A31
A32 A03 A13 A23 A33
B30 B01 B11 B21 B31
B01 B11 B11 B11 B11
B21 831 B31 B31 B31
Matrix A
Part of matrix B
A00B01+A02821 A10B01+AI2B21
CO0
CO0
C0I
C0I
Cl0
C10
C20
Cll
Cll
C30
C20
C01
C21
021 C30 C31
Cii C21 C31
A20B01+A22B21 A30B01+A32B21 A0tB11+A03B31
AIIB11+AI3B31 A21BI1+A23B31 A31B11+A33B31 Computations made in PEs
After broadcasting column I
C31
After addition over Results from interconnection column 0 and I network merged
After 2 shuffles
Figure 7.7 Illustration of part of the computations when two 4x4 matrices are multiplied on an 8 PE array,
7.2.4 n x n Matrices~
More Than n 2 Processors
We also briefly consider the case when there are more PEs available than there are e|ements of a matrix.
For example,
this is the case when 128 PEs are used for
multiplication of 8x8 matrices. Let there be m*n 2 PEs available,
where m is a power of 2,
each fill the upper n Z words of a field.
The A- and B-matrices now
If fut] m*n 2 parallelism is to be utilized the
matrix elements must be broadcast in a way that aligns them properly for multiplication. If we spread column k of B as in Figure 7.57
we will automaticatly~
with m*n 2 PEs)
157
also spread columns k+l,k+2,...,k+m-1 multiple of m or k=0. shuffles,
elements
to the rest of the field,
provided that
are
aligned
so
that
all
multiplications
needed
k+l~...,k+m-1 of the result m a t r i x can be done simultaneously. interconnection network~ result
matrix
stored
k is a
A f t e r spreading A by means of log2n broadcasts and a sequence of for
columns
rearrangement of the result is needed in order to have the
in the same order as the input matrices.
complicated than in the earlier described cases. the whole result m a t r i x in parallel,
However,
This process is more
since it is done only once for
the e x t r a t i m e caused by this is negligible.
A more detailed description of the m*n 2 case is given in [Ohlsson and Svensson83]. number of
k,
A f t e r addition over the
cycles required to
multiply
two
n x n matrices of
b-bit
data with
The m*n 2
processors is the following: Pro-alignment:
2*log2m*3b
Multiplication:
(n/m)*3b 2
+ 2(n/m-1)*3b
Surrrnation:
(n/m-1)*5b
+ [og2m*3b
Post-alignment:
2*(21og2n
+ tog2m)*2b
+ log2m(n/m)*2b
To give a sense of the amount of inevitable overhead time in an implementation~ algorithm has been programmed and tested on LUCAS.
the
The measured execution times
exceed the absolute lower bounds for this type of processor (the formulas given above) with
typically
20
-
3Q%.
The
amount
computation t i m e is for 16-bit data t8%
of
pure
data
alignment
and for 8-bit data 2 6 % .
compared
to
total
Table 7.1 gives the
execution times in microseconds for m u l t i p l i c a t i o n on LUCAS of 8 by 8 matrices with b-bit data.
For comparison~
the same task has been programmed in assembly language
on a conventional VAX 11/780 computer. 3600 microseconds~
The execution t i m e obtained was a p p r o x i m a t e l y
regardless of the number of bits.
b=8
b=12
b=16
Pre-alignment:
55
81
1 06
Multiplication:
255
489
799
Summation:
3/4
48
63
Post-alignment:
48
70
93
392
688
1061
TOTAL:
Table 7.1 Execution times,
158
7.:~ FAST FOURIER TRANSFORM efficiently
computing the discrete
Fourier transform (DFT) of a time series (discrete data samples).
The fast
Fourier transform (FFT) Js a method for
The DFT has properties
that are analogous to those of the Fourier integral transform~
which can be used to
determine the frequency spectrum of a continuous~
time varying signal.
The publication
of the FFT method [Cooley and Tukey 65] meant a revolution in signal processing,
since
the time needed to compute the DFT on a digital computer is reduced by orders of magnitude.
A straightforward calculation of the DFT (according to the definition) on a
sequential computer takes O(N 2) time,
where N is the number of samples,
Q(Nlog2N) time is needed when the FFT method is used. parallel processing. First, on the
Using N processing elements,
the processing time will be O(tog2N).
we will give a short description of the DFT and the FFT algorithm. description given in [IEEE
implemented on LUCAS.
G-AE
67].
whereas only
The algorithm is well suited for
Then we will
It is based
show how the
FFT
is
The interconnection structure plays an important rote in the
computation.
7.3.1 The Discrete Fourier Transform if
a digital
computer is to be used for
necessary that the data be sampled.
analysing a continuous waveform then it
is
The minimal sampling rate needed in order to obtain
a true representation of the waveform is twice the highest frequency present in the waveform. Assume the time series obtained has length N. series. 1.
Denote by X k the kth sample of the time
The DFT of the time series consists of N complex coefficients,
Each A r is obtained by the formula N-1 Ar
=
~ Xke-2~rj rk/N
(7.3.1)
k=0 Using the shorthand notation W = e-2~j/N the expression for A r becomes N-1 Ar =
~ XkWrk k=O
The inverse of (%3.2) is
r=0,1,...,N-1
(7.3.2)
At,
r=0,1,..,N-
159
N-1 k ~ArW- r
Xk = (l/N)
k=0,1,...,N-1
(7.3.3)
r=0 This relationship is called the inverse discrete Fourier transform (IDFT). The DFT and the IDFT are of similar form,
implying that a parallel machine suitable for
computing one can be used for computing the other by simply exchanging the roles of X k and At,
and making appropriate scale-factor and sign changes.
(DFT(Ar)*)* ,
In fact,
IDFT(A r) =
where * denotes the complex conjugate.
7.3.2 The Fast Fourier Transform The FFT is a clever computational technique to compute the DFT coefficients.
The DFT
of a time series is here obtained as a weighted combination of the DFTs of two shorter time series.
These,
point is needed.
in turn,
are computed in the same way,
This is the point vaiue itself,
Suppose that the time series Xk~ Zk,
k=0,1,...,N-l~
each of which has only half as many points.
even-numbered points (X1,X3,Xs,..) ,
(X0,X2,X4,...)
see Figure 7.8.
and
Zk
until the DFT of a single
according to expression (7.3.2). is divided into two functions,
Yk and
The function Yk is composed of the
is composed of
the
odd-numbered points
160
0 o 0 0 0
Xk
o
0 0
0
0
0
0 0
0 0
0 0 0
o
0 0 '
I
I
I
2
i
4
I
i
6
I
L
8
I
~
10
I
'
12
I
i
I
14
i
I
16
'
18
I
;
20
~
'
',
22
0 0
0
0
0
Yk
0
o
0
b
0
0
I
2
3
4
5
6
7
8
0
9
I0
0
0 o
Zk
o
o
o
o
o
4
5
o
o
0
I
2
3
6
7
8
FiBure 7.8 D e c o m p o s i t i o n of the t i m e series X k i n t o two~
9
I0
h a l f as long~
series Yk and
Zk•
Now~
i f B r and C r denote the d i s c r e t e F o u r i e r t r a n s f o r m s o f Y k and Z k r e s p e c t i v e l y ,
is easily shown t h a t the d i s c r e t e F o u r i e r transform~ A
= r Ar+N/2
From
(7.3./4)
transform
and
B
r = Br
(7.3.5)
+ WrC
0 < r 0 < r
£ - wrc r
the
first
N/2
and
Ar~
o f X k can be w r i t t e n
< N/2
(7.3.4)
< N/2
(7.3.5)
last
N/2
it
points
of
the
discrete
Fourier
of X k (a sequence having N samples) can be easily obtained f r o m the D F T of
Y k and Zk~
both sequences of N / 2 samptes.
Figure 7.9 i l l u s t r a t e s this f o r the case N=8.
161
X0:Y0---~
BI
X2:YI---,~
.\
/~f~
AI
DFT X4=Y2---,.-
\\,X/ !
X6=Y3~ XI=Z0----,,.
I
X3=Z 1
r= / / Y \
DFT X5=Z2----,..X7=Z3---~,-
.....
-'J"-O
A7
Figure 7.9 Signal flow graph illustrating how calculation of an 8-point DFT can be reduced to the catculation of two 4-point DFTs. A number w i t h i n a square represents m u l t i p l i c a t i o n by W raised to the number. In the lower hail, the value arriving by the dotted tine is subtracted from the value arriving by the solid line. In the upper half the two values are added.
We can use this technique repeatedly, tong sequences.
Accordingly,
i.e.
we can in turn divide X k and Yk into half as
the computation of B k (or C k) can be reduced to the
computation of sequences of N/4 samples.
These reductions can be carried out as tong as
each function has a number of samples that is divisible by 2. be a power of two.
Normalty~
We w i l l l i m i t the discussion to that case.
The computation is illustrated by the signal flow graph }n Figure 7.10.
N is chosen to
162
Xo
\)®
....... ~
~
~®,0
X4
X2 X6
---
~
"'~-"%-,
\ g ifm
.... ~ + ~ ~ 7 ~ -{~ True); MERGE1, MERGE2~ MERGE% MULI~ MUL2~ MUL~ XUPPER,XLOWER : parallel array [0..127] of fixed(1.NOOFBiTS); begin where ODD do MERGEl:=exshuffle(X.RE) elsewhere MERGEI:=shuffle(X,IM); MULl ;= MERGEI*QMEGA[%I].RE; where ODD do MERGE2:=exshuffle(X.IM) elsewhere MERGE2:=shuffle(X.RE)~ MUL2 := MERGE2*OMEGA[%I].IM; where ODD do MUL:=MUL1-MUL2 elsewhere MUL:=MUL1 +MUL2; where ODD do MERGE3:=shuffte(X.IM) elsewhere MERGE3:=exshuffle(X.RE); XUPPER := MERGE3+MUL; XLOWER := MERGE3-MUL; where ODD do X.RE:=below(XUPPER) elsewhere X.RE:=XLOWER! where ODD do X.IM:=XUPPER elsewhere X.IM:=above(XLOWER); end; (*FFTiteration*) begin ... (*sample values are input to X.RE*) X.IM := 0; for I := 1 to NOOFITERATIONS do FFTiteration(I); SQUARE.RE := X . R E * X.RE~ SQUARE.IM := X.IM * X.IM; RESULT := SQUARE.RE+SQUARE.IM; (*power spectrum is now in array RESULT*) end; (*FFT*)
The entire algorithm has also been written as a microprogram. The execution time for a 128 samples FFT when all data are 8-bit is 0.2 ms per iteration, making a total of 1.4 ms. The multiplications take 70% of the total execution time. Since the multiplication executes in a time proportional to the square of the data length, the ratio grows with increased data length. Noting that the real and imaginary parts of the coefficients in the first two iterations have the values zero and plus and minus one only, the multiplications in these iterations can be omitted. This reduces the execution time from 1.4 to 1.1 ms.

LUCAS can be used for computation of the FFT with full parallelism also when the number of samples does not match the size of the array. When the number is larger, e.g. 1024, samples that are 128 units apart are put in the same memory word. This makes it possible to compute the first iterations of the algorithm entirely within the PEs. Assuming 2^n * 128 sample points, LUCAS will need 2^n * (log2(128) + n) iterations of the kind described above to compute the FFT. The following table shows how this number and the execution time grow with the number of samples. (Reduction for the two initial iterations is made.)

    n    no. of samples    no. of iterations    time (ms)
    0         128                  7               1.1
    1         256                 16               2.6
    2         512                 36               6.0
    3        1024                 80              13.6
    4        2048                176              30.4
    5        4096                384              67.2

Table 7.2 Execution times
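The iteration counts in Table 7.2 follow directly from the formula above; a small Python check (the time column is not recomputed here, since it also reflects the omitted multiplications of the initial iterations):

for n in range(6):
    samples = 2**n * 128
    iterations = 2**n * (7 + n)       # log2(128) = 7
    print(samples, iterations)        # 128 7, 256 16, 512 36, ...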
When the number of samples is smaller than the number of PEs, more than one FFT calculation can be performed at a time. For example, when the number of samples is 32, one sequence of 32 samples is put in memory words 0,4,8,..., another sequence in words 1,5,9,..., still another in 2,6,10,..., etc. The FFT of all four sequences can be calculated simultaneously.
As noted above, the result data from the FFT algorithm appear in bit-reversed order. To get the data out to the host computer in natural order in a simple manner, we have equipped LUCAS with an "address bit reversal" facility. The Master Processor can choose any of two buffers to pass the address to the I/O data registers of the processor array. One of the buffers transfers the address without any changes, the other buffer reverses the bits of the address. Thus data can be brought in or out in bit-reversed order by ordinary block moves.
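For readers who want the addressing trick spelled out, here is a small Python sketch of the bit reversal buffer; the only assumption is the 7-bit address width of the 128 memory words:

def reverse_bits(addr, width=7):      # 7 bits address 128 memory words
    out = 0
    for _ in range(width):
        out = (out << 1) | (addr & 1)
        addr >>= 1
    return out

natural_order = [reverse_bits(a) for a in range(128)]
# reading word natural_order[a] for a = 0, 1, 2, ... delivers the
# bit-reversed FFT results in natural order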
LUCAS has been used for spectral analysis of speech in real time [Ohlsson82, Fernstrom et al. 85]. The sampling frequency needed in order to cover the significant frequencies of speech is 10 kHz. Real time analysis based on 128 samples requires that the computation be performed in 12.8 ms, which is the time to gather 128 samples. The 1.1 ms needed by LUCAS is well within this limit.
Ohlsson has suggested an improvement of the processing elements of LUCAS [Ohlsson84a, Ohlsson84b] in order to make multiplications faster and thereby make the processor even more attractive for signal processing. In Chapter 10 the suggested improvements will be presented.
7.4 THREE GRAPH-THEORETIC PROBLEMS

Problems that can be identified as graph-theoretic show up in diverse areas, e.g. traffic planning and network analysis. A common task is to find the shortest path between two vertices of a graph. The connection between two vertices may be uni-directional or bi-directional. In the first case the graph is called a directed graph. Also, a cost (or path length) may be associated with each path. Such graphs are called weighted. Solutions of problems of this kind often take the form of searching large trees or updating matrices. Opportunities to exploit the kind of parallelism offered by LUCAS are rich.

As examples we will consider algorithms for the solution of two different shortest path problems on LUCAS. In the first problem, paths between vertices are all bi-directional and all have the length 1 (if they exist). The task is to determine the length of the shortest path between two specified vertices. In the second problem the paths are uni-directional and an individual length is associated with each path. The task is to produce a distance matrix showing the lengths of the shortest path between all pairs of nodes. We will also consider an algorithm for finding the minimal spanning tree of a graph, i.e. that subset of edges of the graph that connects all vertices with minimal total edge weight.
7.4.1 Shortest Path Between Two Given Vertices, Unit Path Length

Figure 7.14 shows a graph that we will use as an example to illustrate the proposed algorithm. From each vertex, lines are drawn to vertices that can be reached directly, i.e. with path length 1. A compact way of representing the graph on LUCAS is by means of an "adjacency matrix", shown in Figure 7.15. A "1" in the matrix indicates that there is a direct connection between the vertices in the row and column. Since all paths are bi-directional, the matrix is symmetrical around the main diagonal. In LUCAS the matrix is stored one row per memory word, one column per bit-slice.

Figure 7.14 A bi-directional graph.
As an example, we want to find the length of the shortest path between vertex no. 2 and vertex no. 11. We do this by successively building the tree of vertices reachable from 2 in one, two, three,... steps. To start with, the vertices that can be reached in one step are marked in the Tags. In the next step the logical OR of the contents of those rows that now are tagmarked is formed and the result is stored in a "mark word", indicating vertices reachable in exactly two steps. Vertical OR-ing of all bit-slices marked in the mark word gives a row, which is stored in the Tags. The Tags now indicate which vertices can be reached in three steps. This procedure is continued, alternating between horizontal and vertical OR-ing of rows and bit-slices, respectively, until finally the destination vertex is reached. In this case we arrive at the destination vertex after six steps.

In each iteration the entire matrix has to be traversed bit-slice by bit-slice. Thus, the time to perform one iteration is proportional to the number of vertices, n. The number of iterations is the same as the length of the shortest path, l.

    Execution time = constant * l * n
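A sequential Python sketch may clarify the procedure. It treats the adjacency matrix as a list of rows of 0/1 values and performs one OR-expansion per step; on LUCAS the same expansions alternate between row-wise and bit-slice-wise OR operations, but the result is the same:

def shortest_path_length(adj, source, destination):
    # adj is the adjacency matrix as a list of rows of 0/1 values;
    # vertices are numbered from 0 in this sketch
    n = len(adj)
    tags = adj[source][:]             # vertices reachable in one step
    steps = 1
    while not tags[destination]:
        mark = [0] * n                # the "mark word"
        for row in range(n):          # OR of the rows marked in the Tags
            if tags[row]:
                mark = [m | a for m, a in zip(mark, adj[row])]
        tags = mark
        steps += 1
        if steps > n:                 # destination unreachable
            return None
    return steps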
Figure 7.15 Adjacency matrix of the graph in Figure 7.14. Pointers administrated by the Address Processor are shown, and also the changing contents of the Tags and Mark word as the algorithm proceeds.
We will now make the description more precise by giving it in the high level microprogramming language. The number of parameters needed in the instruction is 6. Thus, the parameters must be loaded to the Control Unit in two passes. One extra register of the Address Processor is needed to count the number of iterations.
Microprogram SHORTPATH
  (*matrixstart, matrixwidth, source, destination,
    destinationselector, markwordselector*)
Begin
  counter := 1;                           (* Clear iteration counter *)
  LTMA(source,direct);                    (* Mark in Tags, vertices reachable in one step *)
  ANDTMA(destinationselector,direct);     (* See if destination reached *)
  If SOME then exit(SHORTPATH);
LOOP:
  While TRUE do
  Begin
    counter := counter + 1;
    x := 0;
    Iterate matrixwidth times             (* For each bit-slice, form logical OR of marked words *)
    Begin
      LTMT(matrixstart+x,direct);
      CRA;
      if SOME then CORA;
      LTMA(markwordselector,direct);
      WRRT(matrixstart+x);                (* Write result in markword *)
      x := x + 1;
    End;
    LTMT(destination,direct);             (* Check if destination reached *)
    if SOME then exit(LOOP);
    counter := counter + 1;
    x := 0;
    CRA;
    Iterate matrixwidth times             (* Each bit-slice marked in mark word contributes to horizontal OR *)
    Begin
      LTMA(markword_tag,direct);
      LTMT(matrixstart+x,direct);
      if SOME then ORRMA(matrixstart+x,direct);
      x := x + 1;
    End;
    LTRA;
    ANDTMA(destinationselector,direct);   (* Check if destination reached *)
    if SOME then exit(LOOP);
    LTRA;
  End;
End;
From the information in Figure 7.15, gathered during the computation, it is possible to trace back which route or routes that give the shortest path length. Logical AND between bit-slice no. 11 and the Tag contents from iteration no. 5 gives "1"s at rows 9 and 13. Thus, there are two paths to 11, one via 9 and the other via 13. ANDing row 9 and the mark word from iteration no. 4 gives that the path that passed 9 goes via 10 or 12, etc. Now, to be able to perform this back-tracking we see that successive mark words and Tag contents must be saved. This is easily done and adds very little to the total execution time.
7.4.2 Shortest Path Between All Pairs of Vertices in a Weighted, Directed Graph

In a weighted, directed graph the paths between vertices are uni-directional and there is a length associated with each path. Figure 7.16 shows an example of such a graph. We will consider the problem of finding the shortest path between all pairs of vertices. The graph is given in the form of a matrix. The matrix corresponding to the graph in Figure 7.16 is given in Figure 7.17. Note that the absence of a direct path between a pair of vertices is marked "infinite" (if) in the matrix.

Figure 7.16 A weighted, directed graph.

Figure 7.17 The distance matrix of the graph in Figure 7.16.
To solve the problem on LUCAS we will follow an algorithm due to Floyd [Floyd62], which is considered as one of the two most efficient algorithms for sequential computers. It is well suited for parallel implementation. On sequential computers a computation time proportional to n^3 is required, where n is the number of vertices. On a parallel computer with n PEs it should be possible to perform the algorithm in a time proportional to n^2.

The algorithm works as follows. Starting with the original n by n matrix D of direct distances, n different matrices D1, D2,..., Dn are constructed sequentially. Matrix Dk is obtained from matrix Dk-1 by inserting vertex k in a path wherever this results in a shorter path.

On a parallel computer with n PEs, an entire column of the matrix can be updated simultaneously. In the k:th iteration, column p of Dk is obtained in the following way (using Pascal/L notation for matrix elements):

    Dk(*,p) := min[ Dk-1(*,p), Dk-1(*,k) + Dk-1(k,p) ]
A Pascal/L program for the entire algorithm reads as follows:

Program FLOYD;
const noofvertices = 128;
var Dmatrix: parallel array [1..noofvertices,1..noofvertices] of integer(8);
    k,p: integer;
begin
  for k := 1 to noofvertices do
    for p := 1 to noofvertices do
      where (Dmatrix[*,k]+Dmatrix[k,p]) < Dmatrix[*,p] do
        Dmatrix[*,p] := Dmatrix[*,k]+Dmatrix[k,p];
end.
It is easily seen that the execution time of this program is proportional to n^2. The task that is performed n^2 times is an "add fields" instruction followed by a "field larger than field" instruction and a tagmasked "move field". These are all proportional to the field length.

The algorithm requires a representation for an infinite value. We choose a number that is a little smaller than half the largest value that is possible to represent in the given field length. In the worst case, two such numbers are added. This will give no overflow.
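As a sketch of the same computation, the following Python function mirrors the Pascal/L program above. The choice inf = 63, a little smaller than half of 127, is one possible instance of the representation just described, assuming a signed 8-bit field:

INF = 63   # a little smaller than half of 127 (signed 8-bit assumption)

def floyd(D):
    # D is an n x n list of lists; D[i][j] is the direct distance and
    # INF marks a missing edge, so INF + INF cannot overflow the field
    n = len(D)
    for k in range(n):
        for p in range(n):
            # on LUCAS the loop over rows below is one parallel column
            # update; here it is spelled out sequentially
            for i in range(n):
                if D[i][k] + D[k][p] < D[i][p]:
                    D[i][p] = D[i][k] + D[k][p]
    return D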
7.4.3 Minimal Spanning Tree

The minimal spanning tree (MST) of a weighted, bi-directional graph is defined as that subset of the edges of the graph that connects all vertices with minimal total edge weight. As an example of a context where the problem of finding the MST arises, consider a telecommunications system. The problem of connecting a set of cities to each other using minimal wire length is exactly the problem of finding the MST (provided that only wires from one city to another are allowed).

An efficient algorithm for finding the MST of a graph is due to Prim [Prim57]. It was improved and implemented on computer by Dijkstra [Dijkstra59] and is normally called the Prim-Dijkstra algorithm. On sequential computers it requires a processing time proportional to n^2 on an n-vertex graph.

The algorithm works by successively expanding a subtree (called a fragment) until, eventually, a spanning tree is obtained. The initial fragment consists of a single vertex, which may be chosen arbitrarily. The fragment is then expanded at each stage by adding to it the nearest neighbour of the fragment, i.e. that vertex not in the fragment with minimal distance to the fragment. Ties are resolved arbitrarily. After n-1 stages the MST has been constructed.

As an example, consider the graph shown in Figure 7.18. Starting with vertex B in the fragment, edges are added to the subtree in the following order: B-D, D-A, B-C, C-E, E-F.

Figure 7.18 Weighted, bi-directional graph (left) and its minimal spanning tree (right).

To implement the algorithm on an n-processor array, we use the same representation of the graph in the associative memory as in the all-to-all shortest path problem above, i.e. a distance matrix. The distance matrix of our example graph is shown in Figure 7.19.
         A    B    C    D    E    F
    A    0    2    if   1    if   if
    B    2    0    2    1    4    if
    C    if   2    0    if   3    if
    D    1    1    if   0    4    4
    E    if   4    3    4    0    3
    F    if   if   if   4    3    0

Figure 7.19 Distance matrix of the graph shown in Figure 7.18.
In order to determine which vertex to add to the fragment, Prim's original algorithm keeps track of the "nearest nonfragment neighbour" of every fragment vertex. The algorithm then requires a running time proportional to n^3. Dijkstra's improvement resulted from using another strategy: keeping track of the "nearest fragment neighbour" of each nonfragment vertex. This gives O(n^2) processing time.
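The whole scheme with this bookkeeping can be sketched in sequential Python. The sketch assumes a connected graph given as a distance matrix indexed from 0, with a large value inf for missing edges; nothing in it is LUCAS-specific, and on LUCAS the two inner loops over vertices are performed in parallel by the PEs:

def prim_dijkstra(dist, start, inf=10**6):
    n = len(dist)
    in_fragment = [False] * n
    in_fragment[start] = True
    nn = [start] * n                 # nearest fragment neighbour (NN)
    d = dist[start][:]               # distance column (D)
    edges = []
    for _ in range(n - 1):
        # search for minimum among the nonfragment vertices;
        # min() takes the first one found, so ties resolve arbitrarily
        v = min((x for x in range(n) if not in_fragment[x]),
                key=lambda x: d[x])
        edges.append((nn[v], v))
        in_fragment[v] = True
        # merge column v of the matrix into D, "smallest value wins"
        for x in range(n):
            if not in_fragment[x] and dist[v][x] < d[x]:
                d[x] = dist[v][x]
                nn[x] = v
    return edges

Started with vertex B on the matrix of Figure 7.19, the sketch outputs the edges B-D, D-A, B-C, C-E, E-F, in agreement with the example above.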
Dijkstra's strategy turns out to be the most favourable also on a parallel processor like LUCAS. A distance table is needed in each stage. It contains, for each nonfragment vertex, the name of its nearest neighbour in the fragment and the distance to it. For the example, after stage 2, when the fragment consists of vertices B, D and A, the distance table has the following contents:

    nonfragment vertex    Nearest neighbour in fragment    Distance
    C                     B                                2
    E                     B                                4
    F                     D                                4
C is chosen as the new fragment member and the table gets the following contents:

    nonfragment vertex    Nearest neighbour in fragment    Distance
    E                     C                                3
    F                     D                                4

When implementing the algorithm on LUCAS we must make sure that all required information passing can be done. Figure 7.20 shows the contents of the distance table after each stage. After the search for minimum value in the distance column (D), with a SELECT FIRST to resolve ties, information about which vertex was chosen must be passed to the Address Processor. This information is used by the Address Processor to get the address of the column of the distance matrix that should be merged into the D column of the distance table on the basis of "smallest value wins".
Figure 7.20 The successive contents of distance table and Tags.
The "Vertex label" field is necessary for this passing of addresses. It contains, for each vertex, the address of its column in the distance matrix. (Actually the field may contain only the binary representation of the word number. This number is then passed to the Address Processor where it is shifted left a few positions and added to an offset address.) The transfer of a data value from the associative memory to the Address Processor is made via the I/O Register and the I/O Buffer Register.
At each stage of the algorithm the following is done:

* A search for the minimum value in the selected words.
* Output of the pair of the selected word.
* Transfer of the contents in the "Vertex label" field of the selected word to the Address Processor, and spreading this label to a scratch pad field.
* Merging a new column from the distance matrix into the D-column on the basis of "smallest value wins".
* Merging the scratch pad field into the NN-field using the merging mask determined above.

All these tasks take a time that is independent of the number of vertices, but proportional to the lengths of the fields they work on, in most cases the number of bits in the distance values.

The number of stages in the algorithm is n-1. Thus, we conclude that the task of finding the minimum spanning tree on LUCAS grows only linearly with the number of vertices, provided that this number is smaller than or equal to the number of processing elements.

In fact, linear time is optimal if the list of edges is to be output in series. This is because each new edge means adding a new vertex to the subtree, and the number of vertices to be added is n-1. We have shown that the decision which new vertex to choose can be taken in constant time.
7.4.4 Discussion

We have demonstrated how three frequently encountered graph-theoretic problems can be solved efficiently on LUCAS. The latter two show an improvement with a factor n compared to sequential execution. As for the first one, an algorithm due to Dijkstra [Dijkstra59] solves the problem in a time proportional to n^2 on a sequential computer. The time on LUCAS, using an entirely different algorithm, is proportional to n*l, where l is the length of the shortest path. Preliminary studies at hand give that Dijkstra's algorithm can be followed also in an implementation on LUCAS. It would have some characteristics in common with the MST algorithm. The algorithm presented here in part 7.4.1 is simpler to follow and program but probably less efficient if the number of vertices is large and the shortest path between the specified vertices is long.

The parallel implementation of Floyd's shortest path algorithm is very straightforward. The MST algorithm, however, is more tricky. We have not seen any reports on the solution of the MST problem on an n-processor computer. Bentley [Bentley79] describes an implementation of the Prim-Dijkstra algorithm on an n/log2(n)-processor tree-structured system. The execution time is O(n*log2(n)), which means that he too is able to use the full parallelism. In [Deo and Yoo 81] an implementation of the same algorithm on an n^0.5-processor with an execution time that is O(n^1.5) is reported.

Graph theory is an area with many problems open for parallel solution. With the examples given we have indicated that parallel processors with architectures similar to LUCAS bear good promise to be useful for these purposes.

Chapter 8
LUCAS AS A BACKEND PROCESSOR FOR RELATIONAL DATABASE PROCESSING
8.1 INTRODUCTION

Though today the term database is familiar to most people, mostly in some intuitive interpretation, it appeared in the literature on information processing for the first time as late as in 1964 [McGee81]. A database is stored in the memory of a computer, and to handle it a new type of software, a database management system, DBMS, evolved. The practical need for more efficient systems for managing the information in the database soon gave rise to new methodologies, new programming languages, new algorithms and also new hardware techniques. A new field of human enterprise, database technology, emerged and its importance is still growing.

The software of present database management systems is very complex and its efficiency is not always adequate to the needs of its users. The reason is that a DBMS is usually implemented on a conventional general-purpose computer which was designed for other kinds of applications.

The von Neumann computer model, developed in the mid 1940s, was intended to be employed in numerical applications, where basic operations are addition, subtraction, multiplication, etc., and basic data types are numbers stored in a memory and addressed by locations. The proper use of this type of computer is sequential calculations in loops.

The purpose of a DBMS is not calculation but rather manipulation of large volumes of data. Basic operations required are retrieval and updating of data, basic data types are records which are identified by contents rather than by locations. None of those features are supported by the hardware of general-purpose computers. Furthermore, the DBMS offers a great natural potential for parallel execution which is impossible to exploit in a conventional computer.

This disharmony between means and goals became more and more apparent in the late 1960s as databases grew larger, and more sophisticated functions were being incorporated into DBMSs to satisfy growing user demands. In the early 1970s, people at a number of universities in the USA initiated pioneering research projects in the area of special purpose computers for database management. Since then, this area has become one of the most dynamic research fields in the domain of computer architecture. In an ever increasing stream, numerous papers are published each year, dealing with description, analysis and discussion of new designs and new concepts of database computers.

There are a number of ways to organize the information in a database. One particularly important approach is the logical organization of the data in the form of tables called relations. This approach has many advantages and many advocates [Codd82]. Its main disadvantage is commonly taken to be the fact that a table is a two-dimensional structure and therefore must be translated into a one-dimensional string of data if it is to be sequentially processed in a conventional computer. This implies a need for an elaborate software interface between the logical data model seen by the user and the physical storage structure. Even if the relational database management system can be implemented efficiently, this necessary interface is responsible for a costly overhead and for the large complexity of the system.

This disadvantage, however, can be turned into an advantage since the simplicity of the two-dimensional table gives an opportunity to exploit new forms of computer organizations. The most natural way to store and to process tables would be in a hardware structure which also looks like a table and where a one-to-one correspondence between the logical and the physical data organization can be achieved. Furthermore, in this table-like hardware structure a table containing data is a unity, and the natural way to process it would be in parallel, by operations having tables as operands. An Associative Array is such a structure and we believe that it can make the management of a relational database simple and efficient. The research presented in this chapter deals with exploring its possibilities. For further details see [Kruzela85, Lindh et al. 84].

This chapter is organized as follows:

Section 8.2 gives a brief description of the implementation of relational algebra operations on LUCAS, when it is assumed that the sizes of the operand relations are such that they can be stored in the Associative Array. Furthermore, this section contains a discussion of the performance of very large Associative Arrays.

Section 8.3 goes a step beyond the material in Section 8.2. A simple but powerful method of evaluating queries to a database stored in the Associative Array is demonstrated.

Section 8.4 is a discussion of the performance of a database computer built with an Associative Array.

Section 8.5 studies the usefulness of an Associative Array for evaluation of the Join operation on very large relations.

Section 8.6 is a discussion of results. Some topics for further research are also suggested.
8.2 RELATIONAL ALGEBRA ON LUCAS

8.2.1 Introduction

This section presents the implementation of relational algebra operations on LUCAS. Some of the results of this section may be found in [Kruzela and Svensson 81].

We will demonstrate algorithms and give their approximate timing equations. The timing equations are helpful in analyzing the performance of the Associative Array. The equations will express the total execution time in terms of the number of clock cycles consumed by the execution. Parameters in the timing equations will be the sizes of tuples or attributes and the cardinality of the involved relations. We assume that the size of each relation is less than the size of the Associative Array.

The operations are implemented by microprograms which are initiated by the Master Processor. Prior to any operation, the Master Processor must send all the necessary parameters, e.g. the addresses of attributes or the sizes of tuples, to the Control Unit of the Associative Array. The parameters are stored in the registers of the Address Processor.

To facilitate the understanding of some of the operations, we will give simple diagrams showing the state of the Associative Array before and after the operation and in some cases also during execution of the operation. The diagrams are based on the schematic picture of the Associative Array shown in Figure 8.1. Only the particular section of the Associative Array which is relevant for the operation will be displayed.
Figure 8.1 Schematic picture of the Associative Array.
In the diagrams we will use the letters S and D above a box representing the Memory Array to indicate the source and destination of data; 0 and 1 denote a value of a bit; x is an unspecified value of a bit; A, B, E stand for a value of a byte; and X (in the Memory Array) is an unspecified value of a byte.
8.2.2 Representation of a relation in the Associative Array

There is an obvious mapping between the logical structure of a relation and its physical representation in the Memory Array. The relation is a table, and the Memory Array is also a table, and the most straightforward way to store the relation in the Memory Array is to allocate each tuple to one memory word, so that the attributes occupy vertical fields in the array.

Figure 8.2 shows, as an example, a relation consisting of four attributes stored in the Memory Array.
    S1    SMITH    20    LONDON
    S2    JONES    10    PARIS
    S3    BLAKE    30    PARIS
    S4    CLARK    20    LONDON
    S5    ADAMS    30    ATHENS

Figure 8.2 Relation in the Memory Array.
A relation in the Memory Array is identified by two sets of parameters:

* Information about which memory words hold its tuples.
* The sizes and addresses of its attributes.
The information about which memory words hold tuples of the relation is stored in the Memory Array together with the relation. With each relation in the array there is a unique byteslice called a Workfield at an address assigned by the Master Processor. One particular bitslice in the Workfield indicates by a 1 in its bit pattern that the corresponding memory word holds a tuple of the relation. This bitslice is called a Markbitslice of the relation. The other bitslices of the Workfield are used as a scratch pad during execution. The content of the Workfield is invisible to the Master Processor. Since all operations on data in the Memory Array are always performed in parallel and data are accessed associatively, there is no reason why the outside world should know in which memory words the relation is stored. Before operating on the relation the Markbitslice is usually loaded into the Tags.

The address of an attribute of the relation in the Memory Array is a 12-bit bitaddress (0..4095) to its rightmost bitslice. The addresses of relations currently in the Memory Array are maintained by the Master Processor. They are assigned to the relation when it is loaded into the Memory Array or when it is created as a result of some operation on relations already in the Memory Array. The Master Processor keeps track of a pool of free space in the Memory Array. Before a new relation is to be loaded into the array, or before a new relation is created from relations already existing in the array, the Master Processor checks the size of the attributes (number of bytes) and assigns proper addresses to them. It also assigns an address to the Workfield.

Since addressing of bitslices in the Memory Array is made by random access, any two bitslices may be logical neighbours. The attributes of the relation do not need to occupy a contiguous field in a memory word. The order between the attributes is arbitrary. Furthermore, attributes of different relations may be freely intermingled in one memory word. Figure 8.3 shows two relations,
S and J, residing simultaneously in the Memory Array. The relation S has three attributes with addresses SA1, SA2, SA3 and a Workfield at address SWF. J has four attributes with addresses JA1, JA2, JA3, JA4 and a Workfield at address JWF. The figure displays only the content of the Markbitslices.

Figure 8.3 Interleaved relations in the Memory Array.
One item of data physically represented in the Memory Array can belong to many different relations. This is frequently the case when a query to a database is evaluated inside the Memory Array, as we will see in Section 8.3. In Figure 8.4 we can see four different relations S, T, Q and R. S is the original relation loaded into the Memory Array. T is the same as S with the only difference that the values of the fourth attribute of T consist of only three letters. Q consists of the tuples of S whose fourth attribute has the value Paris. R is the result of the Projection on S over the fourth attribute.

Figure 8.4 Subsets of a relation in the Memory Array.
8.2.3 Some basic operations in the Associative Array

Algorithms operating on relations in the Associative Array can be naturally decomposed into a repeating sequence of basic operations. In this section, we will give examples of the implementation of some of the basic operations and we will also give their timings.

Load bitslice operation, see Fig 8.5.

One bitslice from the Memory Array, at an address supplied by the Address Processor, is loaded into the R Registers. The state of the T Registers (Tags) is used for selective control of the execution of the operation. The execution will be performed only in those processors in the Associative Array where the Tags are set to one.

Figure 8.5 Load R Register.

This operation is executed in one clock cycle.
Store bitslice operation, see Fig 8.6.

The contents of the R Registers are stored in the Memory Array at an address supplied by the Address Processor. The values of the R Registers are stored only in the memory words where the Tag is set to one.

Figure 8.6 Store R Register operation.
Logical AND operation on bitslices, see Fig 8.7.

The AND is executed in 3 clock cycles. In the first clock cycle, a bitslice (S1) is loaded into the R Registers from the Memory Array. In the second clock cycle, AND is performed between the R Registers and another bitslice (S2) from the Memory Array, with the result loaded into the R Registers. In the third clock cycle, the contents of the R Registers are stored into the Memory Array (D). Addresses to bitslices are supplied by the Address Processor.

Figure 8.7 Logical AND operation.
Select the next valid word, see Fig 8.8.

In some algorithms tuples of a relation must be processed sequentially. A mechanism for selecting the first tuple according to the information in the Markbitslice, and also for resetting the corresponding bit in the Markbitslice to indicate that the tuple was chosen (removed from the list), is the Select first and remove operation.

The execution proceeds in the following way: in the first clock cycle the Markbitslice is loaded into the Tags. In the second clock cycle, the operation Select first is performed on the Tags, resetting all Tags except the first. In the third clock cycle, the NONE signal, indicating whether any of the Tags are set to one, is tested by the Control Unit. If none of the Tags are set to one the operation is aborted, otherwise the contents of the Tags are copied into the R Registers. In the fourth clock cycle the logical operation XOR is performed between the R Registers and the Markbitslice in the Memory Array with the result saved in the R Registers. Finally, in the fifth clock cycle, the R Registers are stored in the Memory Array. The effect of this operation is that the first Tag according to the Markbitslice is set to one and the Markbitslice is updated.

Figure 8.8 Select first and remove operation.
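In sequential Python the effect of the operation can be sketched as follows, with a Markbitslice modelled as a list of 0/1 values:

def select_first_and_remove(markbitslice):
    first = next((i for i, m in enumerate(markbitslice) if m), None)
    if first is None:                 # NONE signal: operation aborted
        return None, markbitslice
    tags = [1 if i == first else 0 for i in range(len(markbitslice))]
    # XOR with the Markbitslice clears the chosen tuple's mark bit
    markbitslice = [m ^ t for m, t in zip(markbitslice, tags)]
    return tags, markbitslice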
Spread byte operation, see Fig 8.9.

This operation transfers one byte from an address in a selected memory word to a byteslice at another address in all the memory words. The execution proceeds in three steps. First, the byteslice including the byte to be spread is copied into the I/O Registers in 9 clock cycles. Then, the selected byte is spread to all I/O Registers in 2 clock cycles. Finally, the contents of the I/O Registers are stored into the Memory Array in 8 clock cycles.

Figure 8.9 Spread byte operation.

The execution of the Spread byte operation takes 19 clock cycles.
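A sketch of the three steps in Python, with memory words modelled as lists of byte values and byteslices as column indices (the names are illustrative only):

def spread_byte(memory, src_word, src_col, dst_col):
    io_regs = [word[src_col] for word in memory]   # step 1: copy byteslice
    io_regs = [io_regs[src_word]] * len(memory)    # step 2: spread the byte
    for word, value in zip(memory, io_regs):       # step 3: store byteslice
        word[dst_col] = value
    return memory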
Store Comparand operation, see Fig 8.10.

This operation transfers a word in the Comparand Register into all selected memory words.

Figure 8.10 Store Comparand operation.

The execution of this operation takes 18*b clock cycles, where b is the length of the word.
Compare operations.

Many different types of basic operations for comparing data in the Associative Array can be implemented. We can classify them according to two criteria:

* The location of compared data in the Associative Array. In one group of operations, a word in the Comparand Register is compared to a field of words in the Memory Array. This is the classical "one to many" comparison, typical for associative memories. The execution time for the operation is independent of the number of words in the field. In another group of operations two fields of data in the Memory Array are compared with each other in parallel. In each memory word two data words are compared.

* The type of comparison. There are many properties of data that can be used for comparison. The simplest type of comparison is the exact match. In an operation executing the exact match, all corresponding bits are tested for equality in all pairs of words. More complex comparisons are common in cases where the data are interpreted as numbers. Comparisons can then be of the type: greater than, less than, etc.

Sometimes, only part of a field of data in the Memory Array is to be interrogated during a compare operation. This feature can be implemented in two ways:

* Using the Mask Register. The content of the Mask Register at the current bitaddress indicates to the Control Unit that the execution of an operation must be disabled in those bitslices.

* Using the Address Processor. The Address Processor, when generating the sequence of bitaddresses to the field, skips the parts of the field that should not be compared.
shown in Figure 8.11.
each bit of a word in the Comparand Register is compared to the
corresponding bit in all memory words and the Tags are reset to zero if the bits are not equal.
I
Comparand
J
MA
XXX
ABE AEB
....
ABE
.....
I
T
After
Before
Figure 8.11 Exact match to comparand operation.
The execution t i m e for this operation is lO•b
clock cycles,
b is the length of a word.
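The bit-serial nature of the comparison can be sketched in Python; each memory word is modelled as a list of bits and the Tags as a list of 0/1 values:

def exact_match_to_comparand(memory_bits, comparand_bits, tags):
    # one bitslice is visited per iteration; a Tag is cleared as soon
    # as a mismatching bit is seen in its word
    for pos, cbit in enumerate(comparand_bits):
        tags = [t and (word[pos] == cbit)
                for t, word in zip(tags, memory_bits)]
    return tags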
8.2.4 Internal algorithms for algebraic operations

The result of an algebraic operation, taking one or two relations as its arguments, is a new relation. Depending on where the resulting relation is located, relative to the argument relations, the algorithms for operations can be divided into two groups:

* Algorithms where the result relation is a physical subset, c.f. Fig. 8.4, of one of the argument relations. The algorithms determine which pieces of data in the Memory Array belong to the result by creating the Markbitslice of the result relation. In this group we find the algorithms for: Selection, Intersection, Difference, Semi-join, Projection and Division.

* Algorithms which assemble the result relation in some new area in the Memory Array from pieces of data of the argument relations. In this group are the algorithms for: Union, Product and Join.

To simplify our exposition, we limit ourselves to relations having only one or two attributes.
Selection

The result of the Selection operation is a relation whose tuples are a subset of those of the argument relation. During execution of the operation a datum in the Comparand Register is compared in parallel with the values of an attribute in all tuples of the relation in the Memory Array. The result consists of the tuples satisfying the criterion for comparison. As there can be many different conditions for comparison, there can be many different Selection operations. We will show the implementation when the criterion for comparison is equality.

Figure 8.12 illustrates the algorithm for the Selection operation. The execution proceeds in the following steps: First, the Markbitslice of the argument relation is loaded into the Tags. Then, Exact Match To Comparand is performed, with the outcome in the Tags. Finally, the Tags are stored in the Markbitslice of the result relation.

Figure 8.12 Selection operation.

The execution of this operation takes

    T_selection = 10*b clock cycles.

b is the size of the attribute.
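Using the exact match sketch from Section 8.2.3 above, the three steps translate directly into Python:

def selection(memory_bits, comparand_bits, arg_markbitslice):
    tags = arg_markbitslice[:]        # load the Markbitslice into the Tags
    tags = exact_match_to_comparand(memory_bits, comparand_bits, tags)
    return tags                       # store as the result Markbitslice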
Intersection

The Intersection operation takes two relations as its arguments. The result relation is a subset of one of the argument relations, consisting of tuples which belong to both relations.

The principle behind the implementation of the Intersection is that the result relation is determined by a successive identification of its tuples in its "mother" relation. The tuples from one relation are transferred into the Comparand Register above the other relation, and compared with it in parallel for an exact match. If the tuple in the Comparand Register is identical with some tuple in the relation below, its original in the first relation is added to the result relation by updating its Markbitslice.

Figure 8.13 illustrates the implementation of the Intersection operation.

Figure 8.13 Intersection.
The execution time of the Intersection operation is

    T_intersection = 30*b*N1 clock cycles.

N1 is the cardinality of the first argument relation and b is the size of the tuple.
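A sequential Python sketch of the successive identification; on LUCAS the inner test against the second relation is one parallel exact match:

def intersection(tuples, mark1, mark2):
    # tuples[i] is the content of memory word i; mark1 and mark2 are
    # the Markbitslices of the two argument relations
    result = [0] * len(tuples)
    for i in range(len(tuples)):
        if not mark1[i]:
            continue
        comparand = tuples[i]         # tuple moved to the Comparand Register
        if any(mark2[j] and tuples[j] == comparand
               for j in range(len(tuples))):
            result[i] = 1             # result Markbitslice updated
    return result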
Difference

The Difference operation has two relations as its arguments. The result relation is a subset of one of the argument relations, consisting of tuples which do not belong to the second relation.

At the start it is assumed that the result relation is equal to the first argument relation. Then, using the Comparand Register, it is successively tested whether the tuples of this relation belong to the second relation or not. If they do, they are eliminated from the result by resetting the result Markbitslice.

Figure 8.14 illustrates the implementation of the Difference operation.

Figure 8.14 Difference.

The execution time of the Difference operation is

    T_difference = 30*b*N1 clock cycles.

N1 is the cardinality of the first argument relation and b is the size of the tuple.
Semi-join

The Semi-join operation has two relations as its arguments. The first argument relation has two attributes, A1 and A2, and the second has one attribute, B. Values of A2 and B are drawn from the same domain. The result relation is a subset of the first argument relation, consisting of tuples where the value of the attribute A2 is the same as some value of the attribute B of the second relation. The implementation is similar to that of the Intersection operation in that the tuples of the result are successively identified and added to the result relation.

Figure 8.15 illustrates the implementation of the Semi-join operation.

Figure 8.15 Semi-join.

The execution time of the Semi-join operation is

    T_semi-join = 30*b*N2 clock cycles.

N2 is the cardinality of the second relation and b is the size of the tuple of the second relation.
Projection

The result of the Projection operation is a relation which is both a "vertical" and a "horizontal" subset of the argument relation. Producing the vertical subset is simple, the Master Processor just records which attributes of the argument relation belong to the result relation in the directory with information about relations loaded in the Associative Array. Producing the horizontal subset is computationally more difficult. After an attribute is eliminated from the argument relation, the tuples with remaining data must be compared with each other. If there are some identical tuples, only one must be chosen to belong to the result relation.

There are two variants of the Projection operation, called Projection1 and Projection2, both illustrated in Figure 8.16.

Projection1: If a key attribute in the argument relation, uniquely identifying each tuple, is still present in the result relation, then there is no need for checking the redundancy in the result. The existence of the key guarantees that there are no identical tuples in the relation. The execution of Projection1 in the Associative Array consists of copying the Markbitslice of the argument relation into the Markbitslice of the result relation.

Projection2: If none of the attributes of the argument relation going over to the result relation (or their combination) is a key, the candidates for tuples of the result must be checked for redundancy. The idea behind the implementation is that the nonredundant tuples of the argument relation are successively selected and added into the result relation.

Figure 8.16 Projection.

The execution time of the Projection1 operation is

    T_projection1 = 2 clock cycles

and the execution of the Projection2 takes

    T_projection2 = 30*b*NR2 clock cycles.

b is the size of the attribute on which the argument relation is projected and NR2 is the cardinality of the result relation.
Division

The result of the Division operation is a "horizontal" and "vertical" subset of the dividend relation. We assume that the dividend has two attributes, A1 and A2, and that the divisor has one, B. Values of A2 and B must be drawn from the same domain.

The implementation of the Division operation is rather complicated. Basically it proceeds in two phases. In the first phase, tuples from the divisor are successively compared to the attribute A2 of the dividend and a field with a partial result, consisting of bitslices with the results of each comparison, is created. In the second phase, the information from the partial result field is used in identifying those values of the attribute A1 of the dividend that will become the tuples of the result relation.

Figure 8.17 illustrates the implementation of the Division operation. We give the state of the Memory Array before the computation, after its first phase, and when it is completed.

Figure 8.17 Division.

The approximate execution time of the Division operation is

    T_division = 30*b*(NA1 + N2) + 4*N2*NA1 clock cycles.

b is the size of the attributes, N2 is the cardinality of the divisor and NA1 is the number of different values of the first attribute of the dividend.
Union

The Union of two argument relations gives as a result a relation assembled from all tuples of the first relation and those tuples of the second relation which are not already in the first one.

Figure 8.18 illustrates the implementation of the Union operation.

Figure 8.18 Union.

The execution time for the Union operation is

    T_union = 30*b*N2 + 18*b*p*N2 clock cycles.

b is the size of the tuple, N2 is the cardinality of the second argument relation, and p is the ratio between the number of tuples of the second relation added to the result relation and N2.
Product

The Product operation has two relations as its arguments. The result is the concatenation of all combinations of tuples of the argument relations.

Figure 8.19 illustrates the implementation of the Product operation.

Figure 8.19 Product.

The execution time of the Product operation is

    T_product = 22*b*N1*N2 clock cycles.

b is the size of a tuple, N1 is the cardinality of the first and N2 of the second relation.
Join

Among different variants of Join operations we will demonstrate the Join where the result, obtained by concatenating tuples of two argument relations satisfying some specified condition, does not contain the joining attributes. We assume that both the first and the second argument relation have two attributes, called A1 and A2, and B1 and B2, respectively. The joining attributes are A2 and B2, and the condition for joining is their equality.

The idea behind the implementation is that values from the attribute A2 of the first relation are successively transferred to the Comparand Register and compared with A2 and B2. Thus selected tuples from the first and the second relation are subsequently concatenated by the Product operation into the tuples of the result relation.

Figure 8.20 illustrates the execution.

Figure 8.20 Join.

The execution time of the Join operation is

    T_join2 = 48*b*NA2 + NA2*(1-p)*TP clock cycles.

b is the size of the join attribute, NA2 is the number of different values in the join attribute of the first relation, p is the ratio of values of the join attribute of the first relation which do not match any value in the second relation, and TP is the average time for the Product operation.
8.2.5 Performance analysis

One of the fundamental results of Computer Science is the insight that any hardware structure (with some necessary minimal capabilities) can, in principle, compute any computable function. Consequently, the sole fact that operations of relational algebra can be implemented on LUCAS is hardly surprising. Rather more interesting is the question whether they can be implemented efficiently or not.

The Associative Array of LUCAS may serve as a model of a special hardware component of a database computer, maybe larger than LUCAS but with similar properties, and the timing equations can be used in a quantitative analysis of the feasibility of using this component.

The analysis of the timing equations is complicated by the fact that the execution time is dependent not only on the size of argument relations but also on their contents. For example, the execution time of the Projection operation is proportional to the cardinality of the result relation. In order to be able to determine the very important average behaviour of the algorithms, it is in some cases necessary to know the statistical properties of a database, e.g. the expected number of matching tuples.

We assume a clock frequency of 5 MHz. The size of the Associative Array will vary from 128 processors (the size of LUCAS today) to 32k processors (the size of the Associative Array which we believe can be built in the future).

Table 8.1 summarizes the approximate timing equations which give the number of clock cycles for each operation. The size of the Associative Array is not a parameter in the timing equations. This means that the execution time for an operation on data loaded in an Associative Array with a given size is the same as the execution time on any larger Associative Array. Hence, the relations will always be assumed to be of the largest possible cardinality for the given size of the Associative Array.
    Selection       10*b
    Intersection    30*b*N1
    Difference      30*b*N1
    Semi-join       30*b*N2
    Projection1     2
    Projection2     30*b*NR2
    Division        30*b*(N/N2 + N2) + 4*N
    Union           30*b*N2 + 18*b*p*N2
    Product         22*b*N1*N2
    Join2           48*b*NA2 + NA2*(1-p)*TP

Table 8.1 Approximate execution times of algebraic operations on LUCAS
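The equations are simple enough to evaluate mechanically. A small Python helper, using the 5 MHz clock assumed above (only three of the rows are transcribed here):

CLOCK_HZ = 5_000_000                   # the 5 MHz clock assumed above

def seconds(cycles):
    return cycles / CLOCK_HZ

def t_selection(b):                    # Table 8.1: 10*b
    return 10 * b

def t_intersection(b, n1):             # Table 8.1: 30*b*N1
    return 30 * b * n1

def t_product(b, n1, n2):              # Table 8.1: 22*b*N1*N2
    return 22 * b * n1 * n2

print(seconds(t_selection(64)))        # 0.000128 s = 128 us, cf. Table 8.3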
We will compare the performance of the Associative Array with the performance of a conventional sequential computer. To make this comparison meaningful it is very important to carefully determine what performance measure is to be used. The problem is that we want to compare the execution time of a well defined system with the performance of a computer about which only one general property, its sequential mode of operation, is assumed.
We will base our comparison method on the following observation: the implementation of algebraic operations on a sequential computer is typically based on sorting and merging of tuples of operand relations. For example the Join consists of sorting the operands and merging the sorted relations. Sorting and merging methods are essentially based on comparisons [Knuth73]. Thus, we assume that the primitive operation of a sequential evaluation of algebraic operations is the comparison of two tuples. We will estimate the number of necessary comparisons for an operation, assuming a "good" sequential algorithm. (By "good" we mean an algorithm using a minimum number of comparisons, even though this is not the only way of measuring the effectiveness of sorting and merging.) Then, if we divide the (known) execution time for an operation in the Associative Array by the number of comparisons and by the number of bytes in a tuple (to compensate for different word-lengths) we obtain a measure which we call an Equivalent Sequential Compare time, ESC. (The influence of the length of a tuple in defining ESC could be eliminated because all execution times, cf. Table 8.1, are linear in b.)

ESC is a rather crude measure, but it is useful for drawing some general conclusions. For example, given an Associative Array, if we know the value of ESC for some algebraic operation then we can conclude that e.g. a 16 bit computer must be able to perform fetch and comparison on two 16 bit words in a time less than twice this value, a 32 bit computer in less than four times this value, etc., in order to be able to achieve the same total execution time for this operation as the Array. ESC will be quite unfavourable to the Associative Array. ESC reflects only comparisons in the CPU, but sorting or merging involve time consuming data movements, housekeeping operations, etc., as well.

Table 8.2 summarizes the number of necessary comparisons in a sequential implementation of algebraic operations. We are using results from Knuth: to sort n elements at most (n*log n) comparisons are required, to sort n elements having m different values (n*log m) comparisons are required, and two sorted tables with n elements each can be merged using at most 2*n comparisons.
    Selection      N
    Intersection   2*N*logN + 2*N              (2 x sort + merge)
    Difference     2*N*logN + 2*N              (2 x sort + merge)
    Semi-join      2*N*logN + 2*N              (2 x sort + merge)
    Projection2    N*logNR2                    (sort)
    Division       N2*logN2 + 2*N*logN + 2*N   (3 x sort + merge)
    Union          N*log(N/2) + N              (2 x sort (of N/2) + merge)
    Product        N1*N2
    Join2          2*N1*logN1 + 2*N1           (2 x sort + merge)

Table 8.2 Number of comparisons in a sequential implementation
Intersection, Difference, Semi-join

The timing equations for the Intersection, Difference and Semi-join are identical (cf. Table 8.1). For the Intersection and Difference operations, the parameter b is the size of a tuple, and for the Semi-join it is the size of a joining attribute. Figure 8.21 shows the execution time for two typical sizes (16 and 64) of b.
Figure 8.21 Execution times for Intersection, Difference and Semi-join (T[s] versus N, 128 to 32k, for b=16 and b=64).

For example, computing the intersection of two relations with a cardinality of 1k and a tuple size of 64 bytes takes 0.3 seconds. On twice as large relations it takes 0.6 seconds, etc. The maximum possible performance increases linearly with the size (cost) of the Associative Array. This seems to be a very good result, but because we live in the world of sequential computers we must also look at the ESC. Figure 8.22 shows ESC (in nanoseconds).

Figure 8.22 ESC for Intersection, Difference, Semi-join (ESC[ns] versus N, 128 to 32k).
By examining Figure 8.22, two observations can be made. The first is that if a sequential computer is to outperform LUCAS (128 processors), it must make a comparison of two bytes in at most 375 nanoseconds, which is very fast even for minicomputers. The second observation is the following. If we increase the size of the Associative Array, e.g. from 128 to 32k, it will be possible to process 256 times larger relations (in 256 times longer time). But the same execution time on a sequential computer can be achieved by decreasing the time which is necessary for a byte comparison to only a half, which amounts to e.g. a doubling of the clock frequency.
Selection

Table 8.3 gives the execution time, in microseconds, for an attribute size of 16 and 64 bytes. It also gives ESC, in nanoseconds (!). Since the Selection takes 10*b clock cycles regardless of N, the execution time at 5 MHz is 2*b microseconds, and ESC = 2*b microseconds/(N*b) = 2000/N nanoseconds.

b \ N    128    1k     8k     32k
16       32     32     32     32      (microseconds)
64       128    128    128    128     (microseconds)
ESC      15     1.9    0.24   0.061   (nanoseconds)

Table 8.3 Execution times for Selection

No matter how small or how large a relation is, as long as it fits in the Associative Array the execution time of the Selection operation is the same. Notice the impressive speed of Selection on larger arrays.
Projection

NR2 in the timing equation of the Projection can be expressed as p*N, where p is the probability that a given tuple will belong to the result, and N is the cardinality of the operand relation. Figure 8.23 shows the execution time for p=0.1, p=0.5 and p=1 for the tuple size b=64. The cardinality of the operand relation is the same as the number of processors in the Associative Array.
Figure 8.23 Execution time of Projection (T[s] versus N, 128 to 32k, for p=0.1, p=0.5 and p=1; b=64).

Figure 8.24 shows ESC of the Projection operation.

Figure 8.24 ESC for Projection (ESC[microseconds] versus N, 128 to 32k, for p=0.1, p=0.5 and p=1).
Figure 8.24 indicates that for small p, an Associative Array performs much more cost effectively than sequential computers. For example, if a sequential computer is to perform the Projection in the situations where p=0.1 as fast as an Associative Array with 1k processors, it must be able to compare 2 bytes more than 10^7 times/second.
Division

Figure 8.25 shows the execution time, and Figure 8.26 shows ESC of the Division operation for b=16 bytes and for N2=0.5*N and N2=0.1*N.

Figure 8.25 Execution time of Division (T[s] versus N, 128 to 32k; b=16, N2=0.5*N and N2=0.1*N).

Figure 8.26 ESC for Division (ESC[microseconds] versus N, 128 to 32k; b=16, N2=0.1*N and N2=0.5*N).
Join

We will discuss the Join operation in Section 8.5.
Union

Both the timing equation and ESC are similar to those of the Projection.
Product

The advantage of using an Associative Array is that it can execute operations in parallel, e.g. by making comparisons one to many. But the Product operation requires transfer of tuples, and this transfer must be performed serially. In Table 8.2 we can see that the transfer of one byte calls for 22 clock cycles, thus with a 5 MHz clock frequency it takes approximately 4 microseconds. This is quite a slow rate when compared to execution on a sequential computer. It does not mean, however, that it is not a useful operation. In an Associative Array, the operands to be concatenated may be a result of some previous operation and they are identified associatively, by their Markbitslices. Also, the destination of the result of the Product operation, i.e. those memory words where the result tuples will be assembled, can be determined associatively.
8.2.6 Comparison of LUCAS with alternative designs

There are only a few Associative Arrays of the same type as LUCAS on which timing data for relational operations are reported. We will compare LUCAS only with STARAN, RELACS [Oliver79], and with RDB (Relational Database Machine) [Shaw79].

A search operation in STARAN, which is equivalent to the Selection operation in LUCAS, takes 1 + 0.2*n microseconds, where n is the number of bits in the argument [Berra and Oliver 79]. This is roughly equal to 1.6 microseconds/byte. In LUCAS the operation takes 10 clock cycles per byte; with a clock frequency of 5 MHz we obtain 2 microseconds/byte. STARAN is slightly faster.

We can also compare the reading times. It takes 16 microseconds to read one 256-bit word from the array of STARAN, which is 0.5 microseconds/byte. In LUCAS, to copy one byte from a memory word into an I/O Register takes 8 clock cycles, which is 1.6 microseconds. STARAN is 3 times faster. If a whole 128-bytes byteslice is to be read from LUCAS it takes 128 + 8 clock cycles, which is 0.2 microseconds/byte, and this makes LUCAS twice as fast as STARAN. The maximum size of the attribute on STARAN is 256 bits; on LUCAS it is over 500 bytes.

This relative similarity of performance between STARAN and LUCAS with respect to I/O and search processing makes it possible to apply many conclusions about the usefulness of STARAN to LUCAS as well. For example, Berra and Oliver [Berra and Oliver 79] have convincingly demonstrated the great potential of STARAN in a database management environment.

In RELACS, which is a paper machine, an Associative Unit has an assumed size of between 1k and 100k words and a width of 1k bits. The search operation has an assumed speed of half that of STARAN, 3.2 microseconds/byte, 1.6 times that of LUCAS. More complex operations, such as Join, are implemented in RELACS with the help of a Comparand Register array, providing for parallel comparison many to many. In principle, this feature can be implemented [Digby75], but because of the complexity of connections we believe it will never be practically feasible.

RDB is also a paper machine. Its central part is a Primary Associative Memory, PAM, which could be realized with a large-scale distributed logic memory, or with a bit-serial or word-serial design. It has a capacity of between 10k and 1M bytes (the capacity of LUCAS is 128*512=64k bytes). The time for the Selection operation is more than 0.8 microseconds/byte.

The capabilities of PAM are satisfied by LUCAS. For example, a command of PAM such as "parallel set in all of with" corresponds to the Selection operation in LUCAS. A control structure "for each with set and do <statement>" of PAM can be implemented with the help of the Select first and remove operation in LUCAS. This makes RDB algorithms comparable to algorithms on LUCAS.

For example, in LUCAS, the number of searches needed for the Project operation is equal to the cardinality of the result relation. In PAM, the number of searches is twice the cardinality of the result relation. This is because one of the two searches in the main loop of the RDB algorithm is executed as a simple Select first and remove operation on LUCAS. This makes our algorithm faster. Similar observations can be made when analyzing algorithms for the Intersection and Difference operations. While the RDB algorithm for Join makes it necessary to perform search three times for each tuple of one of the operand relations, our algorithm makes it necessary to do search only twice for each unique value of a joining attribute of one of the relations! The number of searches is thus considerably smaller on LUCAS than on PAM.
8.3 INTERNAL QUERY EVALUATION IN A SIMPLE DATABASE COMPUTER

8.3.1 Introduction

In this section we will demonstrate a method for evaluating queries in a simple database computer which is equipped with an Associative Array. The method is based on decomposition of a query into a sequence of algebraic operations which are serially executed on relations in the Associative Array.

As a general environment for query evaluation we will assume the system configuration shown in Figure 8.27. This configuration forms a simple database computer consisting of: a Master Processor, a Disk memory, a Console and an Associative Array. We assume that the cardinality of relations necessary for answering a query is such that they fit inside the Associative Array.
Figure 8.27 A simple database computer (Console, Master Processor, Associative Array, Disk Memory).
The database computer operates in the following way. A query to the database is parsed and translated by the Master Processor into a sequence of algebraic operations. The Master Processor determines which relations are needed for answering the query and, if they are not present in the Associative Array, it allocates a free area in the Associative Array and loads these relations from the disk memory. The Master Processor maintains two directories, a disk directory with information about relations on the disk and an array directory with information about relations currently in the Associative Array. The array directory provides the Master Processor with information about addresses of attributes and Workfields, types and sizes of attributes, etc. The execution of a query is governed by a sequence of instructions issued by the Master Processor. Before each instruction, some necessary parameters are sent to the Control Unit of the Associative Array. The result of an instruction is always a new relation created in the Associative Array. The final relation in this sequence is the answer to the query. This relation can subsequently be saved on the disk and added to the database or submitted to some application program (computing e.g. averages) or it can simply be displayed on the Console.

The database computer must perform many other functions in addition to those described above, but we will concentrate our interest only on the use of the Associative Array in answering queries.
8.3.2 Database

We demonstrate the evaluation method by using the probably best known relational database in the world, the Suppliers-Parts-Projects (S-P-P) database which is described in [Date81].

The S-P-P database, used by a hypothetical multinational corporation producing computer hardware, contains the information concerning a number of projects in which the corporation is involved. All the information available to users is represented in four relations S, P, J and SPJ.

Relation S, one instance of which is shown in Figure 8.28, contains the information about suppliers of different parts to current projects. It has four attributes: S# the unique supplier number, SNAME the suppliers' names, STATUS with the integers giving the status of each supplier, and CITY with names of cities. The intended interpretation of the attribute CITY is the location of the suppliers. The key of this relation is the attribute S#.
S   S#   SNAME   STAT   CITY
    S1   SMITH   20     LONDON
    S2   JONES   10     PARIS
    S3   BLAKE   30     PARIS
    S4   CLARK   20     LONDON
    S5   ADAMS   30     ATHENS

Figure 8.28 Relation S.
Relation P, shown in Figure 8.29, contains the information about parts supplied to different projects. It has five attributes: P# the unique part number, PNAME the names of parts, COLOR the colours of parts, WEIGHT the integers giving the weights of parts, and CITY names of cities. The intended interpretation of the attribute CITY is the location where the parts are stored. The key of this relation is the attribute P#.
P   P#   PNAME   COLOR   WEIGHT   CITY
    P1   NUT     RED     12       LONDON
    P2   BOLT    GREEN   17       PARIS
    P3   SCREW   BLUE    17       ROME
    P4   SCREW   RED     14       LONDON
    P5   CAM     BLUE    12       PARIS
    P6   COG     RED     19       LONDON

Figure 8.29 Relation P.
Relation J, shown in Figure 8.30, contains the information about current projects. It has three attributes: J# the unique project numbers of each project, JNAME the names (probably covert) of projects, and CITY the names of cities. The intended interpretation of the attribute CITY is the location of plants where the projects are developed. The key of this relation is the attribute J#.
J   J#   JNAME      CITY
    J1   SORTER     PARIS
    J2   PUNCH      ROME
    J3   READER     ATHENS
    J4   CONSOLE    ATHENS
    J5   COLLATOR   LONDON
    J6   TERMINAL   OSLO
    J7   TAPE       LONDON

Figure 8.30 Relation J.
Relation SPJ, shown in Figure 8.31, connects the information about the specified suppliers supplying the specified parts to the specified projects in the specified quantity. It has four attributes: S# the supplier numbers (same as in relation S), P# the part numbers (same as in relation P), J# the project numbers (same as in relation J), and QTY the integers standing for the delivered quantity. The key of this relation is the combination of the S#, P# and J# attributes.
SPJ   S#   P#   J#   QTY
      S1   P1   J1    200
      S1   P1   J4    700
      S2   P3   J1    400
      S2   P3   J2    200
      S2   P3   J3    200
      S2   P3   J4    500
      S2   P3   J5    600
      S2   P3   J6    400
      S2   P3   J7    800
      S2   P5   J2    100
      S3   P3   J1    200
      S3   P4   J2    500
      S4   P6   J3    300
      S4   P6   J7    300
      S5   P2   J2    200
      S5   P2   J4    100
      S5   P5   J5    500
      S5   P5   J7    100
      S5   P6   J2    200
      S5   P1   J4   1000
      S5   P3   J4   1200
      S5   P4   J4    800
      S5   P5   J4    400
      S5   P6   J4    500

Figure 8.31 Relation SPJ.
As an example of how to interpret a tuple of the SPJ relation we can look at the first tuple, which is <S1, P1, J1, 200>. It says that the supplier S1 has delivered 200 units of part P1 to project J1.

The connection between different relations in the S-P-P database is mediated by the fact that the attributes S#, P# and J# in the relation SPJ have the same domains as the key attributes in the relations S, P and J, and also that the attributes called CITY in the S, P and J relations have values drawn from the same domain. The fact that the attributes have the same name in different relations is just incidental.
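For readers who want to experiment with the evaluation method described next, the following sketch (ours, not from the original text) transcribes the instance of the S-P-P database from Figures 8.28-8.31 into Python lists of dictionaries; it is used by the small simulations later in this section. The attribute names SNO, PNO and JNO stand in for S#, P# and J#, which are not legal Python identifiers.

S = [dict(SNO="S1", SNAME="SMITH", STATUS=20, CITY="LONDON"),
     dict(SNO="S2", SNAME="JONES", STATUS=10, CITY="PARIS"),
     dict(SNO="S3", SNAME="BLAKE", STATUS=30, CITY="PARIS"),
     dict(SNO="S4", SNAME="CLARK", STATUS=20, CITY="LONDON"),
     dict(SNO="S5", SNAME="ADAMS", STATUS=30, CITY="ATHENS")]

P = [dict(PNO="P1", PNAME="NUT",   COLOR="RED",   WEIGHT=12, CITY="LONDON"),
     dict(PNO="P2", PNAME="BOLT",  COLOR="GREEN", WEIGHT=17, CITY="PARIS"),
     dict(PNO="P3", PNAME="SCREW", COLOR="BLUE",  WEIGHT=17, CITY="ROME"),
     dict(PNO="P4", PNAME="SCREW", COLOR="RED",   WEIGHT=14, CITY="LONDON"),
     dict(PNO="P5", PNAME="CAM",   COLOR="BLUE",  WEIGHT=12, CITY="PARIS"),
     dict(PNO="P6", PNAME="COG",   COLOR="RED",   WEIGHT=19, CITY="LONDON")]

J = [dict(JNO="J1", JNAME="SORTER",   CITY="PARIS"),
     dict(JNO="J2", JNAME="PUNCH",    CITY="ROME"),
     dict(JNO="J3", JNAME="READER",   CITY="ATHENS"),
     dict(JNO="J4", JNAME="CONSOLE",  CITY="ATHENS"),
     dict(JNO="J5", JNAME="COLLATOR", CITY="LONDON"),
     dict(JNO="J6", JNAME="TERMINAL", CITY="OSLO"),
     dict(JNO="J7", JNAME="TAPE",     CITY="LONDON")]

# The (S#, P#, J#, QTY) tuples of Figure 8.31.
SPJ = [dict(SNO=s, PNO=p, JNO=j, QTY=q) for s, p, j, q in [
    ("S1","P1","J1",200), ("S1","P1","J4",700), ("S2","P3","J1",400),
    ("S2","P3","J2",200), ("S2","P3","J3",200), ("S2","P3","J4",500),
    ("S2","P3","J5",600), ("S2","P3","J6",400), ("S2","P3","J7",800),
    ("S2","P5","J2",100), ("S3","P3","J1",200), ("S3","P4","J2",500),
    ("S4","P6","J3",300), ("S4","P6","J7",300), ("S5","P2","J2",200),
    ("S5","P2","J4",100), ("S5","P5","J5",500), ("S5","P5","J7",100),
    ("S5","P6","J2",200), ("S5","P1","J4",1000), ("S5","P3","J4",1200),
    ("S5","P4","J4",800), ("S5","P5","J4",400), ("S5","P6","J4",500)]]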
8.3.3 Evaluation of a query

A query is executed as a series of operations on relations in the Associative Array. Each operation creates a new, temporary, relation from one or two old relations. The final relation in this sequence is the result of the query. The execution is guided by the Master Processor issuing instructions and their parameters to the Control Unit of the Associative Array. In the case of the Selection operation, a value to be compared with the contents of some attribute of a relation in the Associative Array is loaded directly into the Comparand Register.

We demonstrate our method in two examples, showing steps in which queries to the S-P-P database are executed. An accompanying figure will display all relevant original relations and all temporary relations created during the execution. The Markbitslices will be given to the right of the relations, whereas their actual position in the Associative Array is determined by the Master Processor.

To understand how the information in the figures should be interpreted we can study Figure 8.32. It shows data of three relations J, T1 and T2 in the Associative Array together with their associated Markbitslices. Above the relations we indicate which attributes belong to which relations. Relations J and T1 have three attributes (J#, JNAME, CITY) each, and relation T2 has one attribute (JNAME). Relation J is one of the original relations of the S-P-J database, T1 is a derived relation obtained by the operation Selection on J where the value of the attribute CITY is LONDON, and T2 is the result of the Projection of T1 on the attribute JNAME; it only consists of two tuples with the values TAPE and COLLATOR. The information about the names of the relations, the addresses of attributes and Workfields is maintained by the Supporting Processor.

Figure 8.32 Relations J, T1, and T2 (relation J with the Markbitslices of T1 and T2 shown to the right).
QUERY 1: Get names of projects supplied by supplier S1.

There is no relation connecting the supplier numbers with the names of the projects which they supply. Hence, the information from two relations, SPJ connecting the supplier numbers with the project numbers and J connecting the project numbers with the project names, is needed. To answer the query, the numbers of the projects supplied by supplier S1 must be extracted from SPJ and used in J to look up the names of those projects.

The execution of the query in the Associative Array consists of the following four steps:

1) T1:= SPJ WHERE S#='S1'
2) T2:= T1[J#]
3) T3:= J SEMIJOIN T2 ON J#
4) T4:= T3[JNAME]
Figure 8.33 illustrates the execution. In step 1, the value S1 is loaded into the Comparand Register above the attribute S# of SPJ and the Selection is performed, creating the relation T1. Relation T1 consists of the tuples of SPJ pointed out by the Markbitslice of T1. In step 2, the Projection2 of T1 on the attribute J# is performed, creating the relation T2. In step 3, relation T3 is created by the Semi-join between J and T2 on the attribute J#, with a domain common to both relations. Data of relation T3 are physically a subset of J. In step 4, the result of the query, the relation T4, is produced by the Projection2 of T3 on the attribute JNAME.
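The following Python sketch (ours, not part of the original text; it reuses the J and SPJ lists defined in Section 8.3.2 above) mimics the four steps. A relation is modelled as a base table plus a Markbitslice, i.e. a list of booleans, which is exactly how temporary relations live in the Associative Array:

def selection(table, attr, value, marks=None):
    # Mark every tuple whose attribute equals the comparand value.
    if marks is None:
        marks = [True] * len(table)
    return [m and t[attr] == value for t, m in zip(table, marks)]

def projection2(table, attr, marks):
    # Projection with duplicate elimination: collect the distinct
    # values of attr among the marked tuples.
    return sorted({t[attr] for t, m in zip(table, marks) if m})

def semijoin(table, attr, values):
    # Mark the tuples of table whose attr value occurs in values.
    return [t[attr] in values for t in table]

t1 = selection(SPJ, "SNO", "S1")     # step 1: T1 := SPJ WHERE S#='S1'
t2 = projection2(SPJ, "JNO", t1)     # step 2: T2 := T1[J#]
t3 = semijoin(J, "JNO", t2)          # step 3: T3 := J SEMIJOIN T2 ON J#
t4 = projection2(J, "JNAME", t3)     # step 4: T4 := T3[JNAME]
print(t4)                            # ['CONSOLE', 'SORTER']

As in LUCAS, the intermediate results T1 and T3 are only mark bits over existing data; no tuples are copied.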
Figure 8.33 Execution of Query 1 (relations SPJ and J with the Markbitslices of T1, T2, T3 and T4 shown to the right).
The result of Query 1 is the relation T4 with one attribute, JNAME, consisting of two tuples, <SORTER> and <CONSOLE>.

The execution of Query 1 called for creating four temporary relations from two original relations. But the only new space in the Associative Array used during the execution was the space used by the four Workfields of the temporary relations. Four algebraic operations, one Selection, two Projections and one Semi-join, were executed.
QUERY 2: Get J# values for projects not supplied with any red part by any London supplier.

The information from all four relations in the database is needed to answer this query. In S the information about the supplier numbers and the names of the cities is given, in P the part numbers and the part colour are located, J gives the project numbers of all current projects and, finally, SPJ connects the part numbers, the project numbers and the supplier numbers.

The execution consists of the following nine steps:

1) T1:= S WHERE CITY='LONDON'
2) T2:= T1[S#]                   ;key
3) T3:= P WHERE COLOR='RED'
4) T4:= T3[P#]                   ;key
5) T5:= SPJ SEMIJOIN T2 ON S#
6) T6:= T5 SEMIJOIN T4 ON P#
7) T7:= T6[J#]
8) T8:= J[J#]                    ;key
9) T9:= T8 DIFFERENCE T7
Figure 8.34 illustrates the execution. Steps 1 and 2 produce from S the relation T2, consisting of the supplier numbers of the suppliers in London. Steps 3 and 4 produce T4, consisting of the part numbers of the red parts. Steps 5 and 6 produce T6 with tuples from SPJ including the information about the London suppliers of red parts. The Projection2 of T6 on the attribute J# in step 7 gives T7, consisting of the project numbers of the projects supplied by the London suppliers with red parts. Step 8 gives T8 from J by the Projection1 on the attribute J#. T8 contains the project numbers of all current projects. Finally, in step 9 the Difference between T8 and T7 creates the result relation T9.
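Under the same assumptions as for Query 1, the nine steps can be traced with the simulator sketched above (again our illustration, not the book's code; Projection1 differs from Projection2 only in that no duplicate elimination is needed when projecting on a key):

t1 = selection(S, "CITY", "LONDON")
t2 = projection2(S, "SNO", t1)         # key projection, no duplicates occur
t3 = selection(P, "COLOR", "RED")
t4 = projection2(P, "PNO", t3)         # key projection
t5 = semijoin(SPJ, "SNO", t2)
t6 = [m and t["PNO"] in t4             # semijoin restricted to the
      for t, m in zip(SPJ, t5)]        # tuples already marked in T5
t7 = projection2(SPJ, "JNO", t6)
t8 = [t["JNO"] for t in J]             # Projection1 on the key J#
t9 = sorted(set(t8) - set(t7))
print(t9)                              # ['J2', 'J5', 'J6']

The London suppliers are S1 and S4 and the red parts are P1, P4 and P6, so T7 = {J1, J3, J4, J7} and the result consists of J2, J5 and J6.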
Figure 8.34 Execution of Query 2 (relations S, P, J and SPJ with the Markbitslices of T1-T9 shown to the right).
8.3.4 Discussion

The method which we have developed is very simple, straightforward, and has a number of advantages. One advantage is the speed of evaluation of a query. The Associative Array can be seen as a high-level language architecture. A number of powerful set-oriented operations are implemented directly in the hardware and their execution is very fast because the algorithms take advantage of the parallelism in the Associative Array. Another factor speeding up the evaluation is the one-to-one correspondence between the commands defining the evaluation of the query and the operations of the Associative Array. Because there is no need for layers of software making tests, translation, interpretation, iteration counts, etc., there is no software overhead slowing down the execution.

Another advantage is the opportunity for simple implementation of the user's views and access privileges. The only way to identify a tuple of a relation in the Associative Array is by using the Markbitslice. A number of users may use the same relations but, having been assigned different Markbitslices, they have access to only a subset of the tuples.

A very advantageous feature of our method is that it saves space in the Associative Array. Except for the Join operation, operations executed during the evaluation of a query do not create new data. The result of an operation is just a new Markbitslice, and even the little space it occupies can be released after the Markbitslice is used by some subsequent operation and no longer is needed.

The main disadvantage of our method is the processing of Joins on relations. The execution of a Join operation is not only slow but it might be the case that its result is larger than both the source relations, with the unhappy consequence that the result cannot be stored in the Associative Array. Fortunately, there is a large class of queries which can be handled without requiring the Join operation, where the much faster Semi-join can be used instead [Bernstein and Chiu 81].
8.4 COMPARATIVE PERFORMANCE EVALUATION OF DATABASE COMPUTERS

8.4.1 Introduction

In this section we will compare the performance of a backend database computer which contains an Associative Array with a number of other well known database computer designs.

The assumed system, shown in Figure 8.35, consists of a host computer, a disk memory and an Associative Array with its Master Processor. In the following, we will refer to the combination of the Associative Array and the disk as LUCAS.

Figure 8.35 LUCAS Database computer (host, Associative Array, disk).
We will determine the response time of LUCAS to three benchmark retrieval queries. The times will be compared with the response times reported in [Hawthorn and DeWitt 82], a performance analysis study of alternative database computers.

Hawthorn and DeWitt analyzed the performance of several database computers with respect to typical queries to a real database. Three classes of relational queries were identified: overhead-intensive, data-intensive, and multirelational queries. From each category one "average query" was selected and it was submitted to the following database computers: Associative Disks [Slotnick70, Langdon78], RAP (Relational Associative Processor) [Ozkarahan et al.75, Ozkarahan and Sevcik 77, Schuster et al.78], CASSM (Context Addressable Segment Sequential Memory) [Su and Lipovsky75, Lipovski and Su78, Hong and Su 81], DBC (Data Base Computer) [Banerjee et al.78, Banerjee et al.79], CAFS (Content Addressable File Store) [Maller79], and DIRECT [DeWitt79], and also to a conventional computer system with the INGRES relational database management system [Held et al.75, Stonebraker et al.76].

All database computers were assumed to function as backends to a host, a PDP 11/70. Each backend is a cellular system: data are stored in cells, with one processor per cell. Operations on the cells take place in parallel. The backends rely on the host to format the results for printing and to move the results to the user's terminal. Those backends which are not able to carry out arithmetic operations (including LUCAS) rely on the host to perform the arithmetic functions as well.
8.4.2 Specification of characteristics of database machines

Since all of the designs, except CAFS, are paper machines or rudimentary prototypes, Hawthorn and DeWitt made certain assumptions about their characteristics in order to make the performance comparisons fair and meaningful. We will make corresponding assumptions about LUCAS characteristics.

The data storage medium of all designs is assumed to be moving-head Ampex 9200 disk drives. Table 8.4 summarizes its parameters.

PARAMETER   MEANING                  VALUE
BSIZE       block size               512 bytes
BTRACK      blocks/track             22 blocks
DROT        disk rotation time       16.7 ms
DAVAC       average access time      30 ms
DREAD       read time                0.8 ms/block
DCYL        blocks/cylinder          418 blocks
DTRACK      data tracks/cylinder     19

Table 8.4 Disk parameters
Associative Disks, CASSM, DBC, and CAFS are assumed to have cell processors associated with the read/write heads of the disks; RAP, DIRECT, and LUCAS are caching systems to which data must be loaded from the disks.

The sizes of the database computers were assumed to be the following: Associative Disks, CASSM, CAFS, and DBC contain 19 cell processors (one processor/track), RAP contains 16 cells with a capacity of 16k bytes each, and DIRECT contains 8 processors and 16 data cells with a capacity of 16k bytes each. The sizes indicate that we are discussing rather large systems.

Our assumption about LUCAS is that the size of the Associative Array is SIZE=2k processors and the clock frequency is 5 MHz. The width of a memory word of a processor is WIDTH=512 bytes. All other properties are the same as those of the real LUCAS. We will see that it is more than enough for storing the data necessary for answering our particular queries. Since data are read from the disk at 1.5 microseconds/byte, we assume that the time for loading data into the Associative Array is the same as the time for reading them from the disk. When tuples are serially read out from the Associative Array, we assume that it takes DO=2 microseconds/byte (10 clock cycles = 8 for shifting a byte into the I/O Register + 2 for copying it into the I/O Buffer Register). A byte in the Comparand Register can be compared with a byteslice in all words in the Associative Array in CPBC=2 microseconds (10 clock cycles), a byte in a selected memory word can be compared with a byteslice in CPB=6 microseconds, and a byte in a selected word can be transferred into all other words in TB=4 microseconds. The LUCAS parameters are summarized in Table 8.5.
PARAMETER   MEANING                   VALUE
f           clock frequency           5 MHz
WIDTH       bytes in memory word      512 bytes
SIZE        processors                2048 processors
DI          data rate in              0.0015 ms/byte
DO          data rate out             0.002 ms/byte
CPBC        compare byte to CR        0.002 ms
CPB         compare byte in word      0.006 ms
TB          transfer byte             0.004 ms

Table 8.5 LUCAS parameters
There is one further parameter to take into account in the timing equations for the total response time to a query. This is the host overhead time, HOV, which is due to query compilation and communication with the backend. HOV was carefully analyzed by Hawthorn and DeWitt, who estimate that it is 0.042 s in the best case and 0.22 s in the worst case.
8.4.3 Database and queries

The database for the three queries is the University of California at Berkeley Department of Electrical Engineering and Computer Science's course and room scheduling database. This database contains 24704 pages of data (12.6 Mbytes) in 102 relations. The data are information about the courses taught: instructor's name, course name, room number, type of course, etc. The queries are actual queries.
Query Q1 is representative for a class of overhead-intensive queries.

Q1: retrieve(QTRCOURSE.day, QTRCOURSE.hour)
    where QTRCOURSE.instructor="despain,a.m."

The relation QTRCOURSE contains 1110 tuples. Each tuple has 24 attributes and is 127 bytes long. The relation is stored as a heap in 274 pages (blocks on disks). The attribute "day" is a character field, 7 bytes long; "hour" is 14 bytes long; the size of "instructor" is not specified in the paper, we assume that it is 30 bytes long. In the test run at Berkeley, three tuples satisfied this query.
It shall be assumed that the following algorithm is used in processing query Q1 in LUCAS:

1) The relation QTRCOURSE is loaded into the Associative Array. The size of the relation is such that the whole relation can be loaded into the Array.

2) The values of the attribute "instructor" are compared in parallel with "despain,a.m." in the Comparand Register. As a result, three tuples are selected.

3) The values of "day" and "hour" of the selected tuples are output to the host.

Since obviously each combination (day,hour,instructor) is unique, it is not necessary to check the result for redundancy.
Query Q2 is representative for a class of data-intensive multirelational queries.

Q2: retrieve(ROOMS.building, ROOMS.roomnum, ROOMS.capacity,
             COURSE.day, COURSE.hour)
    where ROOMS.roomnum=COURSE.roomnum
    and ROOMS.building=COURSE.building
    and ROOMS.type="lab"
The relation COURSE contains 11436 tuples in 2858 pages (1.4 Mbytes) with information about all the courses taught in the last four years. It requires 130 tracks (7 cylinders) of disk space. The relation ROOMS contains 282 tuples in 29 pages with information about every room that the EECS Department can use for teaching courses. The (roomnum,building) attribute pair is 20 bytes long; the sizes of the attributes "capacity" and "type" are not specified, we assume they are 5 and 3 bytes long respectively.

There are 22 labs, and they were used 422 times in total. The result of this query is a list which contains the building, room number, capacity, day, and hour of use of any lab for the last four years.
The algorithm used by LUCAS is the following:

1) The relation ROOMS is loaded into the Associative Array.

2) The 22 tuples with the information about labs are selected.

3) Cylinders of pages of the COURSE relation (1634 tuples each) are successively loaded from the disks into the Associative Array, joined with the 22 tuples of the ROOMS relation, and the result is output to the host. (This type of external Join evaluation will be discussed in Section 8.5.)
Query Q3 is representative for a class of data-intensive queries on a single relation. It includes an aggregate function.

Q3: retrieve(GMASTER.acct, GMASTER.fund,
             encumb=sum(GMASTER.encumb by GMASTER.acct, GMASTER.fund))
The relation GMASTER contains 194 tuples in 97 pages. The sizes of the attributes are not specified, we assume that they are each 10 bytes long. There are 17 unique values for the (acct,fund) pair. The query returns to the user the 17 unique (acct,fund) pairs along with their associated sums of values of the attribute "encumb".

This query can be executed in LUCAS in the following steps:

1) The relation GMASTER is loaded into the Associative Array.

2) By an operation similar to Projection, all tuples with identical (acct,fund) pairs are selected. Then, for each such partition, one (acct,fund) pair is output to the host together with the "encumb" values of all tuples.

3) The host accumulates the sum of the "encumb" values.
8.4.4 Response times of LUCAS

The response time to a query is the sum of the time spent in all components of the machine that cannot be overlapped.

Q1 - Short query

The response time of LUCAS to query Q1, AAWORK, is given by

AAWORK = HOV + DAVAC + 274*DREAD + AAPROC + AAOUT

where HOV is the host overhead time, DAVAC is the access time to the first track on the disks with data of QTRCOURSE, 274*DREAD is the time for reading the 274 pages of the QTRCOURSE relation from the disk, AAPROC is the time spent on processing in the Associative Array, and AAOUT is the time spent on returning the result to the host.

Since the size of the attribute "instructor" is 30 bytes, AAPROC is equal to

AAPROC = 30*CPBC = 0.060 ms.

Since the attributes "day" and "hour" are 21 bytes long together and since 3 tuples are read out,

AAOUT = 3*21*DO = 0.126 ms.

The worst case value of AAWORK is then

AAWORK = 0.22 + 0.03 + 274*0.0008 + 0.000060 + 0.000126 = 0.46 s

Notice that the time spent on processing data in the Associative Array is negligible when compared to the overhead time or to the data transfer time.

The best case value of AAWORK is obtained if the Associative Array already holds the relation QTRCOURSE at the time when the query is issued and the loading time is 0:

AAWORK = 0.042 + 0.000060 + 0.000126 = 0.042 s.

The situation that the relation is already in the Associative Array when a query is issued is in fact quite realistic. It is often the case that the same set of data is interrogated over and over again.

Q2 - Multirelation query

The response time to query Q2 is given by:

AAWORK = 2*HOV                 ;host overhead, 2*HOV because two relations will be accessed
       + DAVAC + 29*DREAD      ;time for loading ROOMS
       + 7*(DAVAC + 412*DREAD) ;time for loading 7 cylinders of COURSE
       + 3*CPBC                ;selection on ROOMS
       + 7*(22*20*CPB)         ;7 times join of 22 tuples of ROOMS with one cylinder load
       + 7*(22*5*TB)           ;7 times transfer of 22 values of "capacity"
       + 7*(60*50*DO)          ;7 times output of result tuples
       = (2.71-3.07) s         ;best case and worst case times, depending on HOV

The last term gives the time spent on outputting the 422 tuples of the result. We have assumed that they are uniformly distributed in the 7 COURSE loads, thus giving 60 tuples/load.
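As a cross-check of the two equations above, the following sketch (ours; the parameter values are taken directly from Tables 8.4 and 8.5 and from the HOV estimates) recomputes the response times in seconds:

HOV_BEST, HOV_WORST = 0.042, 0.22            # host overhead [s]
DAVAC, DREAD = 0.030, 0.0008                 # disk access [s], read time [s/block]
DO, CPBC, CPB, TB = 2e-6, 2e-6, 6e-6, 4e-6   # LUCAS rates [s/byte, s]

def q1(hov):
    return hov + DAVAC + 274 * DREAD + 30 * CPBC + 3 * 21 * DO

def q2(hov):
    return (2 * hov + DAVAC + 29 * DREAD     # load ROOMS
            + 7 * (DAVAC + 412 * DREAD)      # load 7 cylinders of COURSE
            + 3 * CPBC                       # select the labs
            + 7 * (22 * 20 * CPB)            # join against each load
            + 7 * (22 * 5 * TB)              # transfer "capacity" values
            + 7 * (60 * 50 * DO))            # output 60 result tuples/load

print(round(q1(HOV_WORST), 2))               # 0.47
print(round(q2(HOV_BEST), 2), round(q2(HOV_WORST), 2))   # 2.72 3.07

The small differences against the 0.46 s and 2.71 s quoted in the text come from the book rounding intermediate terms.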
Q3 - Aggregate functions

The response time to query Q3 is given by:

AAWORK = HOV + DAVAC + 97*DREAD   ;time for loading GMASTER
       + 17*20*CPB                ;time to select 17 partitions
       + 17*20*DO                 ;time to output the (acct,fund) pairs
       + 194*10*DO                ;time to output the encumb values
       + HDP                      ;time for computing 17 sums in the host
       = (0.16-0.34) s + HDP

We estimate, rather conservatively, that HDP, which is the time to compute 17 times the sum of 11 numbers, is 0.01 s. Hence,

AAWORK = (0.17-0.34) s
8.4.5 Performance comparisons

Response times to Q1, Q2, and Q3 for each system studied by Hawthorn and DeWitt and for LUCAS are plotted in Figures 8.36, 8.37, and 8.38.

The systems are ordered along the horizontal axis on the basis of "increasing complexity". LUCAS has the largest number of processors - therefore it can be considered to be more complex than the other designs. The processors consist of RAM memory with simple bit-serial processing elements. They are much simpler than the cell processors in the other designs. The whole Associative Array may be implemented with a few chips (cf. Chapter 10) - therefore it can be argued that LUCAS is less complex than the other designs. Thus, somewhat arbitrarily, we place LUCAS between CAFS and CASSM.

Figure 8.36 Query Q1 (best and worst case response times, 0-0.5 s, for INGRES, AD, CAFS, LUCAS, CASSM, RAP, DBC and DIRECT).

Figure 8.36 shows that LUCAS exhibits the shortest best case time of all studied systems. It is also 3 times faster than INGRES. In the worst case the performance of LUCAS, though approximately the same as that of the other systems, is actually worse than the performance of a conventional database management system. In processing query Q1, INGRES uses the fact that QTRCOURSE is hashed on the instructor's name. It is apparent that for simple queries of this type, none of the specialized database computers gives any significant increase in performance.

Figure 8.37 Query Q2 (best and worst case response times, 0-40 s, for INGRES, AD, CAFS, LUCAS, CASSM, RAP, DBC and DIRECT).

LUCAS shows the best performance to query Q2 of all the studied systems. If compared with INGRES, it is also 12 times faster in the worst case and 11 times faster in the best case. It is also 1.3 times faster than the second best design, which is the much more complex DIRECT.

The poor performance of RAP, CASSM and CAFS is caused by their inability to perform a Join operation. The host had to decompose query Q2 into a series of 22 subqueries.

Figure 8.38 Query Q3 (best and worst case response times, 0-3 s, for INGRES, AD, CAFS, LUCAS, CASSM, RAP, DBC and DIRECT).

LUCAS is the fastest of all designs in executing Query Q3. Also, it is 5 times faster than INGRES and 1.5 times faster than DIRECT.
8.4.6 Influence of the size of the Associative Array

In previous sections it was assumed that the size of the Associative Array in LUCAS is 2k processors. We will now examine how the response times to queries Q1, Q2, and Q3 are influenced by an increase or decrease of this size.

Figure 8.39 shows the worst case response times for LUCAS with 0.5k, 1k, 2k, and 4k processors. The timing data were obtained by a simple modification of the equations in Section 8.4.4. For example, in AAWORK for query Q2 in the 1k-case, we load a half-cylinder 14 times (14 times DAVAC), instead of loading a cylinder 7 times (7 times DAVAC), etc.

Figure 8.39 Response times for different Array sizes (worst case response times of Q1, Q2 and Q3 versus Array size, 0.5k to 4k).

If we inspect Figure 8.39, we can make the following observations:

1) Even with a smaller array the performance with respect to queries Q2 and Q3 is still very good. As a matter of fact it continues to be better than that of the other designs.

2) The increase of the size of the Associative Array from 2k to 4k for query Q1, and from 0.5k to 4k for query Q3, does not lead to an improvement of performance. The reason is that the interrogated relation in query Q1 can be stored in the Associative Array of the size 2k, and the relation used in query Q3 can be stored in the Associative Array of the size 0.5k. The computing potential of the excess processors in the Associative Array does not contribute in the computation.

3) There is no apparent improvement in the response time to query Q2 if the number of processors in the Associative Array is increased from 2k to 4k. The reason is not the idleness of processors - they are all utilized - but simply that the Associative Array operates too fast compared to the I/O time. The Associative Array of size 2k can store one cylinder load, the Array of 4k stores 2 cylinder loads. Because the main component of the execution time of this query is the time to load data from the disks, a decrease of the total processing time in the Associative Array in the 4k-case (as compared to the 2k-case) is negligible when compared to the I/O time.
8.4.7 Conclusions

Our results indicate that in an environment of overhead-intensive queries, using an Associative Array does not give any advantage over a conventional database management system. On the contrary, the conventional system, which uses standard techniques of hashing and indexing and thus can access only those pages on the disk which contain the result, in fact performs much faster. The Associative Array must search through the whole relation indiscriminately. All its pages must be loaded, at the speed determined by the speed of the disk, and even if they are then searched in an infinitesimally short time, the damage is already done and the execution time is too large.

The potential of the Associative Array can best be exploited in an environment of data-intensive multirelational queries, where its performance is better by an order of magnitude than that of a conventional system.

The performance of the Associative Array, if compared to INGRES and to the other designs of database computers we have discussed, would probably be even much more impressive if we had used queries involving more than two relations. We have assumed an array width of 512 bytes, which is the current size of LUCAS. It was much more than was necessary and it would even allow for having more than two pages in the Array at the same time. We had no use for this feature when we evaluated queries Q1, Q2, and Q3. But in the case that a query involves more than two relations, or in the case that in the process of evaluation a series of algebraic operations creating intermediate relations must be performed, then the large width of the Array could be efficiently utilized.
8.5 EXTERNAL EVALUATION OF JOIN

8.5.1 Introduction

In Section 8.4, we evaluated the performance of a database computer containing an Associative Array of LUCAS type in a real world situation. We assumed a system consisting of an Associative Array connected to an Ampex 9200 disk drive. This system performed remarkably well in comparison to other designs.

If we analyze the different components of the response times, we can see that the times are determined largely by the properties of the disk drive: average access time, block read time, number of blocks/track, and number of tracks/cylinder. If, in our experiment, the Associative Array could process data ten times faster, it would have only a negligible influence on the total response time. Also, if the Associative Array were ten times slower (assuming constant I/O time), it would not have any significant influence on the execution time either. This observation indicates that the system is not properly balanced.

In this section, we will study a similar system as the one in Section 8.4, but now we will assume that the properties of the disk are perfectly matched to the properties of the Associative Array. The rationale is that if we are going to design a database computer with such a powerful processing component as the Associative Array, then we will surely not want to rely on the properties of a standard disk drive primarily aimed to be used with a sequential computer. We will instead modify the disk drive to achieve the highest system efficiency.

This "ideal" Disk-Associative Array combination can be analyzed from many points of view. We will restrict our investigation to a study of the cost efficiency of the execution of one of the most important operations in query processing, the join operation.
there are two basic approaches to evaluate the join [Yao79].
1) Use of a nested loop algorithm.
Tuples from the two argument relations are compared
and
This
matching
tuples
are output.
algorithm
is very inefficient 9
since for
two
relations of size N 9 it gives execution t i m e proportional to N ' N . 2) Use
of
sorting
and merging,
cf.
Section
8,29
when
the number of
necessary
comparisons of tuples is t h e o r e t i c a l l y proportional to only N*iogN, The
second
method
is
seemingly a much
faster
approach than
the
first
one.
The
algorithms with the complexity growing as N*logN are in general considered to be good algorithms [Knuth73].
But in reality~
in the case of very large relations where data are
stored on the disk and must be brought into the CPU in pages~ a l g o r i t h m is inherently stow due to large overheads. in [DeWitt
For exampl%
even the sort-merge DeWitt and Hawthorn
and Hawthorn 81] analyze the execution of the join of two relations,
30000 and 3000 tuptes respectively~
on a V A X
using sophisticated merge and sort algorithms.
tl/780
with
with
an IBM 3330 disk drive~
Examining the timing equation for e.g.
the merge phase shows that 55 percent of the execution time is due to loading of pages to
the
main
memory
and only
45 percent
is due to
proper
"merging"
in the CPU.
Merging of two pages in the CPU is a f f l i c t e d by further overheads and all this together gives inherently long t o t a l execution times. This is an unsatisfactory situation. greatly
improved by parallel join
Fortunately,
the performance of the join can be
processors and many proposals can be found in the
231
literature [Tong and Yao 82]. We will not try to prove that our configuration is better; we simply investigate some consequences of our design choices.

We will determine the execution time and also whether there is an optimal size of the Associative Array for given statistical properties of relations.

We assume that the join will be performed on very large relations which typically contain 10^4-10^6 tuples. We must stress that what is considered to be a large relation is a function of time. Large relations in the future will be much larger than large relations today.
in the reasonably near future
Associative Arrays
that
it w i l l be economically
are able to store large reiations~
we w i l l
assume t h a t the size of the A r r a y is less than one tenth of the cardinality of the larger of the two a r g u m e n t relations.
8.5.2 System description

We will assume the system configuration shown in Figure 8.40. The operand relations are stored on the disk. They are partitioned into pages of equal size. The disk will function as a large addressable memory where the basic unit of data which is accessed is a page. In commercial disk systems, the average access time is not much larger than a block read/write time. Since, as we will see, processing data in the Associative Array and returning the result to the host takes a much longer time than loading data, we will assume that the time to locate pages of data can be overlapped with the processing time. The Associative Array has the same capabilities as LUCAS. For example, to compare one byte in a memory word with a byteslice in all memory words takes 48 clock cycles.

Figure 8.40 Database computer (Associative Array, disk, host).

This architecture is similar to the LUCAS system in Section 8.4. The main differences between these two systems are:

1) The data transfer rate, r, is now determined by the speed with which data can be input into the Associative Array - in Section 8.4 we only made sure that data coming from a disk could be loaded into the Array.

2) The size of a page is determined by the size, c, of the Associative Array: a large Array means large pages, a small Array means small pages - in Section 8.4, the size of a page was equal to the size of a block on the disk.

First, we will determine r. We assume that the page is stored on the disk as a sequence of columns of bytes. When read from the disk, the first byte of the first tuple comes first and is loaded into the first I/O Register, the first byte of the second tuple is loaded into the second I/O Register, and so on, until the first byte of the last, c-th, tuple is loaded into the last I/O Register. After every c bytes, when the whole column is loaded into the I/O Registers, it is shifted in 8 clock cycles into the proper bitaddress in the Associative Array. This procedure is repeated for each column. Thus, in the stream of bytes coming from the disk, which takes c clock cycles per column, there is a period of 8 clock cycles spent on shifting data from the I/O Registers into the memory words. During this period the Associative Array cannot receive any new data. For a large Associative Array, r is 1 byte/clock cycle. For a very small Associative Array, we will assume that we have more complex, "double", I/O Registers working in a complementary fashion: when one set is filled with data from the disk, the content of the other is shifted into the Array. Thus, we assume that r is 1 byte/clock cycle; the necessary buffering of 8 bytes presents no practical difficulty. Since shifting a column takes 8 clock cycles while loading one takes c clock cycles, this also sets the limit on the minimal possible size of the Associative Array, which is 8 processors.

Next, we look at the value of s. It is the rate with which the bytes of the result tuples are returned to the host. We will assume that it is the same as in Section 8.4, 10 clock cycles/byte (8 clock cycles for shifting + 2 for copying to the I/O Buffer Register). The difference in speed between loading and outputting of data is due to the different modes of operation. Data are loaded as pages, but they are output as tuples which are selected according to their content.
8.5.3 Algorithm and timing equations

The algorithm which we will use is based on a tuple substitution algorithm [Wong and Youssefi 76], where each tuple from one relation is compared with all tuples of the other, and matching tuples are concatenated. The advantage of using parallel hardware is that in one operation, a tuple from the first relation can be compared with a whole page of tuples of the second relation. In principle, a speed-up equal to the parallelism in the hardware could be achieved.
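The following Python sketch (ours, not the book's code) shows the page-wise tuple substitution scheme under the stated assumptions: both operand relations are streamed from the disk in pages of c tuples, and for each pair of pages every tuple of one page is compared, conceptually in parallel, against the whole other page:

def pages(relation, c):
    # Partition a relation (a list of (key, payload) tuples) into pages.
    return [relation[i:i + c] for i in range(0, len(relation), c)]

def join(r1, r2, c):
    result = []
    for page1 in pages(r1, c):            # successively load page pairs
        for page2 in pages(r2, c):        # into areas I and II
            for key1, payload1 in page1:
                # In LUCAS the loop below is a single parallel search
                # over all c memory words holding page2.
                for key2, payload2 in page2:
                    if key1 == key2:      # matching joining attributes
                        result.append((key1, payload1, payload2))
    return result

r1 = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
r2 = [("b", 10), ("d", 20), ("e", 30), ("b", 40)]
print(join(r1, r2, c=2))   # [('b', 2, 10), ('b', 2, 40), ('d', 4, 20)]

The innermost loop is what the Associative Array collapses into one broadcast comparison; the sketch only reproduces the order in which page pairs are staged.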
We assume that the two operand relations are of the same cardinality, N tuples in P pages each, and that the tuples have 2 attributes of b bytes each. The result relation has three attributes.
three attributes. The execution proceeds in the following way,
illustrated by Figure 8.41.
All pairs of pages of the operand relations are successively loaded into the Associative Array into areas I and II.
For each pair,
in the way described in Section 8.3.
the values of joining a t t r i b u t e s are c o m p a r e d
Matching tuples are s e l e c t e d and serially output to
the host.
DISK
MEMORY 2*b >
ASSOCIATIVE ARRAY
,
I I I
I c I
ct 1 [ ....
V
V
i
II
I
l p
I I I V
1 I t I
1 c
I v
N I
I l I
P
v
Figure 8.41 Execution of Join.
We will characterize the content of the relations by a selectivity factor g, which can be defined as the ratio between the cardinality of the result of a join of two relations and the cardinality of their product. Intuitively, g is the probability that two tuples, randomly selected from the two operand relations, will match. We will assume that the values of the joining attributes are evenly distributed.

Depending on the size of g, various proposals for Join processors behave differently. There are two interesting situations which we will treat separately.

1) The case when the value of g is comparatively large. The volume of data in the result relation is then very much larger than the volume of data in the operand relations. This case is assumed in a comparative study of different hardware approaches to Join processors [Tong and Yao 82]. The study assumes a typical g being 0.5. It means that e.g. the Join of two relations of 10^5 tuples each is a huge relation with a cardinality of 5*10^9 tuples.

2) The case when the value of g is small. [DeWitt and Hawthorn 81] assume g=0.0001, 0.001 and 0.01, which gives considerably smaller result relations than in the case studied by Tong and Yao.

We will analyze the performance of the Associative Array with respect to both cases.

The execution time of Join can be divided into three components: TJoin = TL + TP + TR, where

TL is the loading time. It is the time spent on loading data from the disk into the Associative Array.

TP is the processing time. It is the time spent on processing data in the Associative Array.

TR is the output time. It is the time spent sending the concatenated result tuples to the host.

Loading time

The relations have P=N/c pages each. There are P^2 pairs of pages. To load one page takes 2*c*b clock cycles. Hence, the total loading time is

TL = 2*P^2*c*b

Outputting time

P^2 pairs of pages are compared. Each comparison of c tuples from the first relation with c tuples from the second produces on average c^2*g result tuples. One result tuple is 3*b bytes long. To output one byte takes 10 clock cycles, hence

TR = 30*P^2*b*c^2*g

The overhead associated with selecting the "next selected tuple" for outputting is about 5 clock cycles/tuple and can be ignored.

Processing time

a) Large g

For a large selectivity factor (g >> 1/N), if g > 1/c then one page with a sample of c tuples contains as many different values of the joining attribute as the whole relation. Hence, to compare two pages of the operand relations, only 1/g tuple comparisons must be made. Since the comparison takes 48 clock cycles/byte (cf. Section 8.2) we get

TPG = 48*P^2*b/g

b) Small g

For a small selectivity factor, we may assume that a sample of c tuples in one page contains c different values of the joining attribute. To compare a pair of pages, a tuple comparison must be made c times. Hence,

TPg = 48*P^2*b*c

The timing equations are summarized in Table 8.6. We have substituted N/c for P.

TL  = 2*N^2*b/c
TR  = 30*N^2*b*g
TPG = 48*N^2*b/(c^2*g)
TPg = 48*N^2*b/c

Table 8.6 Timing equations for Join
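The discussion that follows asks at which Array size c the output term TR starts to dominate. The small sketch below (ours; it simply evaluates the Table 8.6 expressions normalized by N^2*b) reproduces the cut points derived in the text for both regimes:

def t_load(c):            # TL / (N^2 * b)
    return 2 / c

def t_out(g):             # TR / (N^2 * b)
    return 30 * g

def t_proc_large(c, g):   # TPG / (N^2 * b), large g
    return 48 / (c * c * g)

# Large g (g = 0.5): smallest c where output dominates 10:1.
c = 8
while t_out(0.5) < 10 * (t_load(c) + t_proc_large(c, 0.5)):
    c += 1
print("large-g cut point:", c)   # about 8-9 processors

# Small g: the loading term is negligible (TL = TPg/24), and
# 30*g = 10*48/c gives the closed form c = 16/g of Table 8.7.
for g in (0.01, 0.001, 0.0001):
    print(g, "->", int(16 / g), "processors")

Such a numerical check is useful because the constants (10 clock cycles/byte out, 48 clock cycles/byte compare) drive the conclusions more than the asymptotics do.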
8.5.4 Discussion

The timing equations have the same general form as the timing equations of other designs of specialized Join processors. The dependency on the cardinality of the operand relations and the dependency on the number of processors in the Associative Array can be expressed as

T = N^2*A1, and T = A2/c + A3

For example, the timing equations of the Two-dimensional Join Processor Array of c processors of [Tong and Yao 81] have the form

T = N^2*B1, and T = B2/c + B3/sqrt(c)

and the timing equations of a Join Processor of [Menon and Hsiao 81] have the form

T = N^2*C1 + N*C2, and T = C3/c + C4

We believe that - unless there will be some unexpected technological breakthrough - for very large relations, as long as data must be staged into Join processors, the timing equations for the total execution time of the Join on any hardware will always look approximately the same.

The timing equations in Table 8.6 show that the execution time decreases with increased size of the Associative Array. But we can see that the total execution time contains a term which is independent of this size, the time for outputting the selected tuples, TR. Obviously, above some size of the Associative Array it will dominate, which means that further increasing the size of the Associative Array does not pay. We assume that a cut point of usefulness of the size of the Array occurs when the time to output the result is 10 times larger than the sum of the time to load data and process them in the Array. We will analyze the timing equations to see at which sizes of the Array the cut point occurs for different values of g.

Large g

To simplify our discussion, we will make the same assumption as the one made in [Tong and Yao 82], i.e. g=0.5. Hence, the timing equation is

TJoinL = N^2*b*(96/c^2 + 15 + 2/c)

The following formula defines the cut point:

96/c^2 + 2/c = 1.5

We can see that the cut point occurs at c=8, and a further increase of the Array size above 8 processors gives only marginal improvement of performance.

We have assumed that to output a byte from the Associative Array takes ten times longer than to load it (ten clock cycles versus one clock cycle), and we might suspect that this is the reason why the cut point value is so small. Let us assume that we could somehow speed up the time for outputting data from the Array, e.g. to 1 clock cycle/byte. The result would not be much better, as the cut point is then achieved already at 40 processors.

Since the idea of using the Associative Array assumes a large number of processors, our results clearly demonstrate that a bit-serially operating Associative Array is not a viable alternative as a candidate for a Join processor in the case of large values of g.
we get the f o r m u l a for the size of the cut point
= t6/g
Table
8.76
gives
the
g
size
of
the
cut
point
for
different
values
of
g.
C
0.01 O.OOi O,O001
~.6-I03 1.6-~04 i,&*iO S
Table 8.7 Cut points for small g
We can draw the conclusion that for small values of g,
indeed suitable. optimum
Furthermore,
we
the large Associative A r r a y is
can with a large degree
of confidence
predict its
size.
We could see t h a t in an analysis of the f e a s i b i l i t y of the use of the Associative A r r a y it was of outermost For large g,
importance
i,e.
to make proper assumptions about the properties of data.
any t w o tuples w i l l match with a high p r o b a b i l i t y ,
the Associative
A r r a y is obviously useless. [Tong
and Yao
82],
analyzing the
g=O.5 as being a worst case value, their
method
the
Associative A r r a y
result
would
be
performance
of d i f f e r e n t
hardware solutions assume
if an Associative A r r a y was investigated according to very
unfavorable
for
the
should certainty not be used in such a case.
simple
reason
However,
that
the
f o r small g~
the result would indeed be d i f f e r e n t .
8.6 C O N C L U S I O N S The r e l a t i o n a l data model offers many advantages. as seen by the user. preferred foundation.
format
for
It
is s y m m e t r i c a l
a question to
with
To mention a few: It is very simple
respect to queries,
a database.
It
so that
there
is no
has also a very strong t h e o r e t i c a l
238
There is only one problem which has been called the greatest open research question of the relational data model: can it be implemented efficiently [Chamberlin76]? to King,
[KingS0],
definite yes.
the answer,
According
based on the experience with the IBMs System R,
King goes even one step further,
is a
claiming not only that relational systems
can be implemented with reasonable performance but also that "exotic hardware (like associative memories,
etc)" is not required.
about and is probably right. improve performance,
King certainly knows what he is talking
But there is another question: could this exotic hardware
if it was available anyway?
The purpose of the research presented in this section was to study the applicability of an Associative Array
in the design of a backend database computer whose function is to
support a large database which utilizes the relational data model. Our approach is exactly the one which was criticized by DeWitt and Hawthorn in [DeWitt and Hawthorn 81] who give it the name "architecture directed" research.
database machine designers usually begin by designing what they consider to be a good architecture which they feet win efficiently execute one or two database operations.
Afterwards they develop the algorithms to support
all
operations
the
required
database
using
the
basic
primitives
of
their
architecture . They advocate instead an approach where an architect of a database machine should start by
first
developing algorithms and then
extract
necessary for their efficient implementation.
the
understood,
should one attempt to design a machine.
We believe
that
DeWitt
and Hawthorn
primitive
operations which
are
Only after these primitives are known and
confuse
the
different
roles
of
research
and
development.
How can an architect extract the p r i m i t i v e operations without (at least in
the
his
back
solutions?
of
head) knowing
Furthermore,
anything
about the
potential
of
different
hardware
how could it be possible to know anything about the potential
of exotic hardware if there were nobody exploring it. The Associative Arrays do not enjoy widespread use at present,
and so the research and
development of algorithms utilizing this device is rather limited. described the implementation of algorithms for
In Section 8.2,
operations of relational algebra.
we The
implementation consisted mainly of transferring selected tuples of operand relations into the Comparand Register, Array,
comparing their values with the contents of the Associative
and processing bitslices of information.
The implementation of the Division
operation is an example of utilizing the possibilities offered by the bitsliee processing. The execution time of an operation was typically proportional to the size of the tuple and to the cardinaIity of one of the argument relations.
We have extrapolated our results and
analyzed the performance of very large Associative Arrays,
The absolute execution times,
239
assuming a moderate 5 MHz clock frequency,
were very good,
and probably out of
reach for present day sequential computers. We have also discovered that the increase in performance gained by increasing the size of the Associative Array by e.g.
a f a c t o r of 256 can be achieved on a sequential computer
by doubting its processing speed.
This is a rather negative result but it would be wrong
to use it indiscriminately for refuting the Associative Arrays. by
the
following
observations.
We
were
comparing
Its impact can be softened
"ordinary"
algorithms
on
the
Associative Array with only one property of the best sequential algorithms known. large volumes of data in the argument relations, associated with comparisons.
"preparing"
For
it might be the case that overheads
data for sorting are much target than the t i m e for proper
Furthermore,
sorting and merging of data has a disadvantage of consuming
v e r y large space in the memory of the sequential computer.
We conclude this defence of
an Associative Array by recalling that Date [Date85] claims that a typical software DBMS uses on the average ten machine instructions
per byte in evaluating a given selection
condition on a given record. In
Section
8.5,
we
demonstrated how algorithms f r o m
Section 8.2 can be used for
evaluating complex queries e n t i r e l y within the Associative Array.
When large Associative
Arrays w i l l become available than for a r e l a t i v e l y stable set of queries (referring to the same relations over and over again),
this method would be advantageous.
One very interesting avenue of research,
which we have only very briefly discussed is
the area of o p t i m i z a t i o n of query evaluation.
In our examples,
operations leading to the answer was stated ad hoc. efficient How
ways to do it.
much
from
the
What is the best strategy? research
on
optimization
evaluation according to our method.
on
What "tricks"
the sequence of algebraic
But obviously,
there can be more
How to identify bad strategies? sequential
computers
applies
to
can be used to avoid the negative
effects of the 3Din operation? [n Section 8.4,
we assumed a backend database computer consisting of an Associative
A r r a y connected to a disk. welt-documented
designs
of
We compared its performance with the performance of other database computers
conventional database management system. types of queries.
and
also with
the
performance
of
a
We determined the response times to three
We could see t h a t in an environment of overhead-intensive queries the
Associative A r r a y does not give any advantage over a conventional database management system. that
the
Comparison of response times to data-intensive or m u l t i r e i a t i o n a l queries gave Associative
Array
is
more
than
an
order
of
magnitude
conventional system and also b e t t e r than all the other studied designs. to overstate the importance of this finding, can be the case t h a t
other tests
better
than
i t was only one benchmark experiment,
could give d i f f e r e n t
results.
the
We do not want
But still,
it
since the
Associative A r r a y performed so remarkably welt the obvious implications should not be ignored.
240
In Section 8.5, that
the
we went beyond the basic assumption of Sections 8.Z and 8.3,
Associative Array
is larger
than
the
operand relations.
external evaluation of algebraic operations on very large relations. chose the Join operation.
In a way,
which was
We dealt with
the
As an example we
we were trying to answer the following question:
Why is the performance of the Associative A r r a y in Section 8.4 so good?
We identified
two cases which depend on the content of the database= 1) large s e l e c t i v i t y where the Associative A r r a y is found to be usetes%
factor
-
2) small s e l e c t i v i t y f a c t o r (which is
the case of queries G2 and Q3 in Section 8.4) - where its use can be advantageous. We have started by designing a model Associative Array,
LUCAS,
investigate its f e a s i b i l i t y in database management applications. being an e x p l o r a t o r y basic research. Associative Array
and one goat was to
We consider our research
We do not claim that we have proved t h a t the
is a viable component of future database computers.
that people who dismiss it, are c e r t a i n l y mistaken.
But we claim
because they believe t h a t its only capability is searching,
Chapter 9 LUCAS AS A DEDICATED PROCESSOR FOR IMAGE PROCESSING
Our concern in this chapter is to investigate the usefulness of LLICAS - and LUCAS-type of processors - in image processing. computer for image processing.
We first
state some demands that are put on a
We then briefly r e v i e w earlier attempts to meet these
demands through the use of unconventional computer architectures.
We give arguments
that the kind of parallelism offered by LUCAS is the one that is most useful for some v e r y i m p o r t a n t image operations.
N e x t we discuss d i f f e r e n t organizations of a processor
array giving arguments for and against d i f f e r e n t structures. how to
best map an image onto a certain
constitutes the main part of the chapter, on LUCAS
are
treated.
The
Section 9.4,
which
describes how image operations are performed
organized as a linear array of processing elements.
operations that can be performed by local processing, transforms
We also t r e a t the question of
processor structure.
investigation
The main concern are
but also measurements and global
is carried
out
in
the
form
of
examp]es.
Timings are given and comparisons are made both with a conventional computer and with special purpose image processors.
9.1 C O M P U T A T I O N A L DEMANDS IN IMAGE PROCESSING linage processing
is characterized by large amounts of low precision data.
image size is 512 by 512 picture elements (pixeis) of 8-bit data, million bits.
On the other hand~
A typical
i.e approximately 2
image processing offers possibiJities to t r e a t very many
data items in parallel. The image processing area is usually divided into image anaiysis and enhancement on the one hand and image coding on the other.
As for the first issue,
a pattern recognition
task calls for analysis of a picture leading to a description of it in terms of features. The
computation
transformations,
of
features
normally
involves
a
variety
of
picture
to
picture
Many of these transformations are useful in their own right as image
enhancement operations.
242
The purpose of possible.
A
image
key
coding is to
notion
is
compress the
tranformation
represented by less correlated data.
of
information
pictures
to
reversible linear transforms.
in
which
In this representation less significant
removed w i t h o u t too much distortion being introduced.
Howeyer,
in an image as much
a form
they
as are
data may be
The major tools in this area are
LUCAS and similar processors are useful also for this task.
we w i l l not further consider the image coding area in this t e x t .
image enhancement and image analysis often takes place in cooperation between a human operator and a computer. takes
the
decisions
calculations. typically,
This
This is frequent in e.g.
which kind
of
be
medical applications.
operations
should
interactive
use calls for
one second in order not to be disturbing.
performed,
the
The operator
machine
processing times
not
does
the
longer than,
For most operations these demands
are not met by ordinary computers. Also in c o m p l e t e l y automatic image analysis - w i t h o u t human interaction - strong demands are often materials
put it
on the processing time.
When analysing medical samples or samples of
is desirable to be able to analyse as many samples as possible in as short
t i m e as possible.
N o r m a l l y the desire is to increase the throughput of data compared to
human analysis. Furthermore,
there
is
a
constantly
information
in a u t o m a t i c
control
man
be in
such
would
many
increasing
interest
and manufacturing.
activities
if
input picture.
constraints on the t i m e
100 ms for a r e l a t i v e l y
using
pictures
as
input
If we consider how handicapped a
he was not
allowed
understand that this is an a t t r a c t i v e path of development. process puts certain
in
to
use his
eyes,
we
The dynamics in the controlled
available for the necessary analysis of the
advanced analysing task may serve as a t y p i c a l
example. Much image processing has been done on conventional computers. special purpose designs have been made, tasks.
Thus,
processor, of
desirable
experience
and the effects features
of
is quite
large
of d i f f e r e n t a computer
Furthermore,
on the desirable
characteristics
of
an image
approaches to performance enhancement. designed
for
image
processing
may
* It should be able to handle e f f i c i e n t l y
many d i f f e r e n t
image t r a n s f o r m a t i o n
ranging from simple operations on binary pictures through gray scale
modifications
and
threshoiding
to
complex
* It should be able to handle e f f i c i e n t l y
filters
different
and
global
transforms.
kinds of feature e x t r a c t i o n
tasks. * It should provide very high e f f i c i e n c y in those tasks that are i d e n t i f i e d as the most frequent ones.
A list
include
f o l l o w i n g points:
tasks,
many
each aimed at speeding up specific processing
the
243
* It should be able to cope with widely varying image sizes and number of bits per pixel.
Desirable is that the same program can be used for different
values of these parameters. * It shauld have an efficient input/output f a c i l i t y . should
stand
in
reasonable
proportion
The time for input/output the
to
computation
time.
9.2 DIFFERENT ATTEMPTS TO MEET THE DEMANDS Due to the constantly increasing importance of the image processing field we see a growing number of special purpose processors built to meet the computational demands. The first proposal for a special purpose computer for image processing is due to Linger [Unger58].
However,
actually implemented. categories, found
not
until
the
late
sixties and early
seventies machines were
We w i l l divide some of the implemented machines into different
based on the principles of organization.
A more elaborate overview can be
in [Danielsson and Levialdi 81] and [Reeves 84].
Many of the processors are
described by their designers in [Duff and Levialdi 81].
9.2.1 Fast neighbourhood access The importance of local operations in image processing was early understood,
as was the
discrepancy between the picture geometry and the linear memory space of a conventional computer.
Computing the addresses of neighbouring pixels took too much time.
Picap I [Kruse73],
one of the first picture processors to be built,
uses two 61-stage
shift registers in order to provide parallel access to the complete 3 x 3 neighbourhood of each pixei.
The pixets are operated on sequentially and the picture size is fixed to 64 x
64 /4-bit pixels.
Picap I has two special purpose processors,
one for linear filters.
one for logic operations and
The "Cytocomputer" [Sternberg79] is another,
more recent,
design
utilizing this scheme for fast neighbourhood access.
9.2.2 A small number of special purpose processors Some designs use a small number of processors working in parallel.
al.80] developed at Linkoping University, larger system s
Picap H,
identical,
carefully designed,
special purpose
An example of this }s the f i l t e r processor FIP [Kruse et Sweden.
The FIP processor is incorporated in a
containing many special purpose processors.
F]P is used for the
244
l o ~ level image to image transformations.
Other processors serve e.g.
input/output and
image management functions. Another system designed at Linkoping University incorporates another processor of this eategory~
the
subprocessors.
QOP processor [Granlund81]. The subprocessors,
in turn,
GOP and FIP
use pipelining.
each
use four
parallel
The organization of the GOP
subprocessors is strongly adapted to the nature of a certain general operator type.
An
important feature of the FIP processor is the ability to reach the elements of an almost arbitrary sized neighbourhood very fast.
This is accomplished through a quite large cache
memory (32 kbyte) holding a portion of the picture. In both these machines al| subprocessors perform the same operations,
thus operating in
an SIMD manner. Another
processor of
this
developed in Karlsruhe,
type
also called FIP
-
is included in the FLIP system
West Germany [Gemmar et al.81].
The 16 identical processors in
the FLIP-FIP can work in either MIMD mode or SIMD mode. program memory and instruction decoding circuitry.
Each processor has its own
The processors may be arranged
according to the topology of the processing task. The Picap-FIP,
GOP and FLIP-FIP implementations show that carefully designed special
purpose processors can give considerable prestanda although very few processors work in
parallel.
9.2.3 A large number of conventional microprocessors During
the
last
years
designs
using
microprocessors have been proposed. PASM [Siegel81] systems.
a
large
number
of
standard,
conventional
Among these are the ZMOB [Rieger et a|.80] and
The number of processors used in these systems is in the order
of the squareroot of the number of pixels in a picture. With
a large
number of processors,
the design of the interconnection structure
communication between them becomes a critical issue. different
structures
microprocessors,
for
interpracessor
communication.
uses what is called a conveyor belt.
for
ZMOB and PASM use radically ZMOB,
with
This is a 257 stage,
its
256
Z80
ring formed
8-bit wide shift register with one stage ( ' m a i l b o x ' ) for each processor and one for the host computer. equipped with
The PASM system with 2n processors (typically 102/4) is planned to be an n-stage interconnection network.
Considered in particular are "the
generalized cube ~T and "the augmented data manipulator T~. An important feature of these networks is their partitionabiJity. as several program,
subsystems of
This means that the system can be configured to work
parallel
machines,
each
subsystem controlled
by
its own
245
9.2Y4 A v e r y large array of simple processors A
major
part
of
the
computational
burden
in
transformations using operations of local nature. independently of each other,
image
processing
is image
to
image
Al| new pixel values can be calculated
using only the old pixe[ values in a small neighbourhood as
arguments. Considering this f a c t i t becomes a t t r a c t i v e to arrange a large number of processors in a two-dimensional structure.
Two special purpose processors for image processing designed
along these lines are CLIP/4 designed at University College, designed
at
Goodyear
Aerospace,
Ohio
under
contract
London [Duff79] and MPP
from
intended to process satellite imagery at high speed [BatcherS0]. also used in the general purpose processor array DAP,
NASA
and p a r t i c u l a r l y
The same structure is
a commercial product from ICL~
England [Fianders et a1.77]. An o v e r v i e w of these designs was given in Chapter 4. of bit-serial processors implemented in LSI. processors.
All three machines use an array
The size of the CLIPZ~ array is 96 x 96
MPP has a 128 x 128 array while DAP has been implemented in both 32 x 32
and 64 x 64 array versions.
The processors are controlled by a central control unit which
provides identical control signals and memory addresses to all PE's.
The control unit in
turn gets its instructions from a master computer of conventional type. Each
processor
south,
east,
in
DAP
and MPP
and west).
is connected to
its
four
nearest neighbours (north,
In CLIP4 also diagonal neighbours are connected resulting in
eight d i r e c t l y connected processors. The size of the data memory associated with each processor is for MPP t y p i c a l l y 1 kbit and for
DAP
typically
/4kbit.
CLIP4
processor - only 32 bits - which
has a very small memory associated with
is a severe l i m i t a t i o n ,
each
especially in grey scale and
colour image processing.
9.2.5 LUCAS compared to other machines LUCAS is a SIMD computer composed of bit serial processing elements like MPP, and
DAP.
The
number
of
processing
elements
is,
however~
magnitude less than that of the above mentioned machines. the way pictures
are best stored and manipulated.
This,
about two
CLIP4
orders of
This has consequences for in turn~
determines which
interconnection structure between individual processors is the most suitable. The
number
of
processing
elements
is more
like
that
of
PASM
and ZMOB
but
the
processors are very different.
The processors in LUCAS have no instruction decoding and
sequencing
can
circuitry.
They
only
work
in
SIMD
mode.
Using
fully
equipped
microprocessors like in PASM and ZMOB is of course a t t r a c t i v e considering the very low cost to which these can now be achieved.
However,
if they are p r i m a r i l y intended far
246
SIMD use - which also PASM and ZMOB apparently are - too much redundant c i r c u i t r y will
be present
integration
of
in the system.
Taking
several processors into
into
consideration a future
a single chip,
the use of
w i t h o u t redundancy appears to be a b e t t e r way to follow. integration technology is increasing tremendously,
development with
processing elements
A v a i l a b i l i t y of large scale
which makes this an i m p o r t a n t aspect.
9.2.6 The advantages of image paralietism In
[Danietsson
and
Leviatdi
81],
the
different
dimensions
of
parallelism
open
for
u t i l i z a t i o n in image to image transformations of tocat nature (neighbourhood operations) are identified.
The four possibilities are=
Operator parallelism= The sequence of
operations to be performed on the
image according to a chosen algorithm is performed in parallel in a pipetined fashion. * Image parallelism: Several pixels of the image are treated in parallel using multiple processing units. Neighbourhood parallelism: The processor has access to all neighbourhood pixel values simultaneously, * Pixel bit para]lelism: The bits in a pixet are treated in parallel. only
dimension
of
parallelism
utilized
in
a
This is the
conventional
computer.
The range of parallelism in each of the four dimensions is between one and in the order of a hundred for all dimensions but the image coordinate dimension. from a few thousand up to several million image points. can almost always be utilized. According to this, image operations.
This is not the case with the other types.
a processor of LUCAS type has potential to be e f f i c i e n t
for local
The main concern of this chapter is to investigate this by programming
and t i m i n g several algorithms on LUCAS.
First,
processing
images to
discussed.
Here the range is
Investments in image parallelism
elements
and
the
mapping
of
however, the
the interconnection of the processor
structure
wit[
be
247
9.3 ORGANIZATION OF PROCESSOR ARRAYS FOR iMAGE PROCESSING
9.3.1 introduction Processor arrays designed to utilize image parallelism can be configured in many d i f f e r e n t ways. the
Besides d i f f e r e n t arrangements of the interconnection between processing elements
mapping of
interconnection
images to this structure,
structure
one
mapping
another mapping for other operations. is the
most favourable for
hardware for input/output.
can be made in different may
be
favourable
for
ways.
Given an
certain
operations,
A question of great concern is also which mapping
input/output.
Often,
different
mappings require d i f f e r e n t
Therefore the set of mappings available on a machine may be
restricted. When Unger
first
proposed u t i l i z a t i o n
of
image parallelism [Unger58] he used a two-
dimensionally connected array of processing elements and mapped only one pixel onto each element.
This mapping has obvious advantages.
m a x i m a l degree.
Secondly,
program straightforward and uncomplicated.
However,
it utilizes parallelism to a
it is very simple since the arrangement of the processors is
the same as the arrangement of the pixels.
the processing elements,
Firstly,
This makes the step from
The control unit,
algorithm to
broadcasting directives to
can also be kept f a i r l y simple.
not even with the largest arrays implemented can we count on having as many
processing elements as pixels.
A very common image size,
512 times 512 pixel%
is
sixteen times the size of the largest array implemented - the 128 times 128 array of MPP.
This
means
that
with
the
one-pixel-per-processing-element mapping,
obliged to resort to dividing large pictures into smaller parts, cumbersome rand effects. accessible neighbourhoods.
Furthermore,
we
are
something that often gives
the mapping also results in very small directly
Given a two-dimensional configuration of the processors,
a
mapping giving one subimage per processor is probably preferrable [Danielsson and Levialdi 81]. For
arrays
with
a
very
large
number
of
processing
organization is very natural and probably the best. configurations may be equally favourabte. some other interconnection scheme,
elements
A linear organization,
is one example.
a
two-
For smaller arrays,
dimensional
however,
other
often combined with
Having chosen a linear organization,
the mapping of an image to the array can still be made in different ways.
248
9.3.2 Two-dimensionally organized arrays
Among processor arrays implemented and used for examples of two-dimensional organization, implementations
of
image
Goetcherian 80] (CLIP4),
processing
image processing there are three
namely CLIP4,
operations
[Batcher80] (MPP),
on
DAP,
machines
[Kushner et al.81] (MPP),
show different mappings of images to processor arrays. MPP use the one-pixel-per-PE mapping,
MPP and DAP.
these
Marks,
Reported
([Fountain
and
[Marks80] (DAP))
While the users of CLIP4 and
in his implementation on the 3Z x 32 PE
loads a 6 x 6 pixels subimage into the memory of each processing element when
processing 192 x 192 pixels pictures. Users of CLIP4 have no true possibility of choosing other mappings since the machine was built for one-pixel-per-PE. TV pictures.
The input/output system is made for 96*96 pixels frames from
Furthermore,
the processing elements of CLIP4 are equipped with strictly
combinatorial parts that allow a signal to flow through a series of elements during one clock cycle. However, Finally,
With it
the very slow clock rate of CLIP4
loses sense
if
other
this is an important feature.
storage methods than
one-pixet-per-PE are used.
the very small memories of the processing elements do not allow many pixels to
be stored in one processing element. DAP~
on the other hand~ was not built for image processing.
system is built into the machine. single clock cycle.
Therefore,
No image input/output
A signal is not allowed to pass through many PEs in a Marks as a user is free to adopt any storage method.
The one he finds best is to store a square subimage in each PE,
although i t gives some
problems with irregular addressing schemes. On
MPP~
with
its
fully
reformatting hardware, possible.
However~
synchronous
processing elements
and powerful
input/output
use of other storage schemes than one-pixel-per-PE should be in [Kushner et al.81] the analysis is limited to 128"128 image%
because the memory of each PE is considered too small for working on a larger subimage than 3*3 pixels. its
immediate
The reason for this is that, neighbourhood are
stored
in
evidently,
not onty the subimage but also
the
memory.
same
This
seems to
be
unnecessary.
9.3.3 Linearly organized arrays
Like
DAP,
processing~
the STAR,AN computer [Batcher7&] was not primarily designed for image but
its use in this field of application has been thoroughly investigated
[Goodyear76],
[Potter78].
STARAN uses a linear ordering of the 256 processing eIements
of an array.
In addition to this the so called flip network provides further possibilities
for communication between processors. The image to processor array mappings used with STARAN follow roughly the approach ane-pixel-tine-per-processing element.
When the number of lines exceeds 256,
two or
249
mere lines are stored in each processor's memory as shown for a 512 x 512 pixels image in Figure 9.1. image,
Lines stored in the same memory word w i l l thus be 256 lines apart in the
Two adjacent lines along the cut in the image w i l l reside in the memories of the
b o t t o m and top processor respectively, used this should give no access problems. wilt be d i f f e r e n t for these lines.
If a "wrap-around" neighbour communication is However~
the addresses to neighbouring pixets
This may very welt double the computation t i m e for
local operations, 512
A
image 512
512 x 512 8-bit pixels
V////////J, "'i _
1
PEs
Storage. 256 words, 8 kbits each Figure 9.1 Storage
of
a
512
x
512
x
8
image
in
a
256
PE
STAR, AN.
Another approach to storing a 512 lines picture in a 256 processors array is to store two adjacent lines in the same memory word, larger
The major advantage with this method is the
i m m e d i a t e l y accessible neighbourhood that a u t o m a t i c a l l y follows,
The method is
illustrated in Figure 9.2.
A 5 x 5 neighbourhood is directly accessible only w i t h up/down
communication.
a 5 x 512 neighbourhood).
(In f a c t
Of
course~
the method can be
generalized to any other ratio between image height and number of processing elements. The larger the rati% be.
the larger w i l l the size of the i m m e d i a t e l y accessible neighbourhood
250
b0 [ ~ / / ' / / / / / / .
~//#//>!
1
b l i~////~
i
i
a2 "./ / _ / / _ / / / / Image
a254 b254 a255 b255
,,222779T~ [~'///¢'/A i
aO
a,
"-'.d _/_./_JK_/.,.,'
a2
~2Y2227,,;';
bO b1
I
• •
1 I I
a255
I
':///////~"
b2
I
__J PEs
Storage
Figure 9.2 A l t e r n a t i v e method for storing a 512 by 512 pixets image in a 256 PE array.
9.4 IMAGE
OPERATIONS
ON
LUCAS
ORGANIZED
AS
A
LINEAR
ARRAY
OF
PROCESSING ELEMENTS
9.4.1 Introduction In
this
section
we
wilt
use LUCA5
as a
mode[
machine
in order
to
examine the
applicability of bit-serial SIMD machines w i t h up to a few hundred processors in the field of image processing .
Algorithms are programmed and analysed with regard to execution
t i m e and possible changes in the hardware that would make execution faster. We do not claim interest
centers
any particular novelty for on
the
techniques
of
the algorithms presented,
implementation
and
the
level
but rather our of
performance
achievable using this specific type of hardware. The
interprocessor
communication
structure
d i f f e r e n t organizations can be tested,
of
However~
LUCAS
is
reconfigurabie,
Therefore
we w i l l assume a linear organization of
251
the processor array with communication one and two steps up and down in addition to a perfect
shuffle/exchange network (Figure 9.3),
Furthermore,
we w i l l assume that the
size of the image side agrees w i t h the number of processing elements so that one line of the image e x a c t l y occupies a field of the m e m o r y as shown in Figure 9,4,
Only in the
last subsection (9.4.8) wilt we depart f r o m this assumption and discuss the consequences for neighbourhood size and input/output. The main concern in our investigation is the set of operations that takes images into images. Secondly,
Firstly~
this is the kind of operations for which a processor array is best suited.
they are normally the most t i m e consuming operations on ordinary computers.
But we w i l l also briefly examine the use of LUCAS for extraction of picture properties of d i f f e r e n t kinds. fn the investigations to f o l l o w we w i l l group operations according to the characteristics of t h e i r execution on LUCAS,
In association with each operation treated~
in what c o n t e x t of image processing it is normally used.
we wilt indicate
In most cases similar operations
are described in [Rosenfeld and Kak 76] where more background material can be found. Unless otherwise stated, on
LUCAS.
Timings
microprograms are, precision.
the presented algorithms have been microprogrammed and tested presented
are
made
with a few exceptions,
using
a
clock
cycle
of
200
ns.
The
general with regard to image width and pixel
These are specified as input parameters to the microprograms.
In connection with the presentations of the algorithms we w i l l sometimes indicate changes in the hardware of LUCAS that would improve the performance.
MEMORY ARRAY
Figure 9.3 Available
data
inputs to a PE
252
V7
t.__t
p°°Fp°l p°3p°4 \
~3o P20 PIO PO0 P31 P21 P11 P01 P32P22 P12 P02
/
~ 3P23 P13 P03 ~4
PlmlP04
/ /
b-bit field
Figure 9.4 Storage of an image in the m e m o r y of LUCAS.
9.4.2 Genuinely laeal operations.
Small neiqhbaurhood sizes
We call an o p e r a t i o n genuinely local if the new value of a pixe] (x~y)~ the operation~
depends only on the pixet vatues in a small neighbourhoed around (x~y).
O is an o p e r a t o r that consists of a sequence of such operations~ tocal~
as the result of if
O is no tonger genuinely
since the new value of a pixel at one side of the picture may very welt depend on
the old value of
a pixel at the o t h e r side.
This kind of operations are t r e a t e d
in a
253
separate section below. In this section we t r e a t local operations w i t h neighbourhood sizes smaller than or equal to 5 in one d i r e c t i o n , is a p p r o x i m a t e l y
a r b i t r a r i l y large in the o t h e r d i r e c t i o n .
quadratic).
This means t h a t
(Usually,
the neighbourhood
all pixel values of the neighbourhood are
i m m e d i a t e l y a v a i l a b l e over the i n t e r c o n n e c t i o n n e t w o r k (Figure 9.3).
9./4.2.1
Binary images
E X A M P L E 1: Salt-and-pepper noise r e m o v a l A binary p i c t u r e obtained f r o m a grey scale p i c t u r e by thresholding o f t e n has s c a t t e r e d w h i t e points in black regions and s c a t t e r e d black points in w h i t e regions as a result of noise in the original picture. counting itself.
the If
number
of
This is called salt-and-pepper noise and can be detected by
neighbours
this number is large,
of
a pixel
that
differ
from
the value of the pixel
the value of the p i x e l is changed.
changes a p i x e l value if i t differs
from
An a l g o r i t h m t h a t
seven or more of its eight nearest neighbours
proceeds on L U C A S as follows. The
image
is swept
over pixeleolumn-wise~
instead of non-matches. each pixel column,
right
to left.
Matches are counted
f i e l d is reserved for a counter in each word.
the counter is f i r s t
point t h a t matches, 10,
A two bit
from
i n i t i a t e d to zero.
the counter is i n c r e m e n t e d .
the most s i g n i f i c a n t b i t is locked to one.
After
Then,
the counter has reached binary
This bit w i l l serve as a m a r k bit for noise
points.
Finally,
image~
and the scan proceeds to the next pixel column.
the value of these points are changed and w r i t t e n into a separate result
The e x e c u t i o n t i m e fur an image w i d t h of w pixel columns is 71w + 7. w i d t h 128 pixels is t r e a t e d in 9095 clock cycles, A dramatic
For
for each neighbouring
time
i.e.
An image w i t h
1.82 ms.
gain would of course be achieved if a counter were included in each
processing e l e m e n t .
A counter that could be i n c r e m e n t e d and tested in one clock cycle
would save 56 percent of the processing t i m e . For the task at hand we can also manage very w e l l w i t h o u t a c t u a l l y counting the matches or mismatches,
instead of adding the m i s m a t c h i n d i c a t o r (1 or 0) to the c o u n t e r field for
each neighbour
checked,
we can just save the i n d i c a t o r
in an 8 - b i t
which can be analyzed a f t e r the whole neighbourhood is gone through. vector
contains
t w o or more zeroes,
analysis can be done in 24 clock cycles. the
above solution,
we
have reached
decreased by a l m o s t 35 percent.
the point
mismatch
vector,
If the mismatch
is not considered a noise point.
This
C o m p a r e d to the 48 cycles used for counting in a significant
improvement.
The
total
time
is
254
EXAMPLE 2; Border finding A point in a binary image is catted a border point if it has the value "1" and is adjacent to a point with the value "O'. (see Figure
Depending on whether we use 8-adjacency or 4-adjacency
9.5) we get stightiy different borders (Figure 9.6).
[•
X
(a)
(b)
Figure 9.5 a) the points 4-adjacent to x b) the points 8-adjacent to x.
(a)
(b)
(c)
Figure 9.6 a) object b) border using 4-adjacency c) border using 8-adjacency. A microprogram to mark border points is very straightforward. column-wise,
For each column the logical product of the neighbourhood (/4- or 8-) of
each pixel is formed, border point, only
The image is swept aver
If the product is zero and the pixet value is one the pixe[ is a
The logical product of a 3 times 3 pixels neighbourhood can be farmed in
four AND-operations,
two "horizontal" and two " v e r t i c a l ' .
Therefore,
the 8-
adjacency case gives only slightly longer execution time than the 4-adjacency case: 9w+6 and 8w+6,
respectively,
tn time this means 0.23 and 0.21 ms,
respectively,
for w=128.
EXAMPLE 3: Shrinking and expanding The
operations of
[P, osenfetd
shrinking
and Kak
expansions w i l l
dean
76].
and expanding in One
or
up "ragged"
a few
binary
shrinks
images have
many applications
followed by the same number of
borders and delete small
objects.
Shrinking
and
expending can also be used to obtain the skeleton of an object or to detect clusters of points.
255
Shrinking is the same as deleting border points. similar to the border finding program. 8-adjacency shrink Bw+6.
Thus,
The times are a l i t t l e shorter than for border finding becaus%
[n the logical expression for the new v a l u e of a point~ role as its neighbours.
the microprogram becomes very
The t i m e for 4-adjacency shrink is 7w+6 and for
the point itself plays the same
This is not the case for border finding.
The times for expansion
are the same as for shrinking.
4: Gap
EXAMPLE
fitting
As an example of
an operation with
f o l l o w i n g one [Iliffe82],
a larger neighbourhood than 3 x 3 we use the
useful for f i l l i n g in gaps in thin curves: let point "Z" have the
If some point in X and some point in Y have the value "1", value "1".
(See Figure 9.7).
x z Y
x
z
Y
x
I~
z
Y
I1
1i
Y
Figure 9.7 Mask configuration for gap filling.
A straightforward approach to solve this on LUCAS is to take the four cases one a f t e r the other, Start
OR-ing the results together:
by
clearing
a
scratch
pad
bit-slice
OR-sum of
X-field
-> R ->
OR-sum of
Y-field
-> R
SP.
For
each
case:
T
R Ab,D T -> R R OR
Then:
SP
->
SP
SP OR Z - > R e s u l t
image
The t i m e for the execution of the microprogram is 45w+6 cycles,
9.4.2.2
w=128 gives 1.15 ms.
Grey scale images
Many local operations on grey scale images include additions and subtractions of whole images.
Often,
the two images in such an operation are identical except for one of
256
them being shifted one step, to name two examples. operations.
This is the case in averaging and differentiating operations~
Therefore~
our first examples w i l l be these~
generally usefu[~
Later examples w i l l combine these to compound operations.
EXAMPLE 5: Addition/subtraction of images Addition
bitslice
and four
additional cycles for each pixel column (for test and reloading of bit counter).
(or
subtraction)
of
two
pictures
takes
three
cycles
per
Including
initial parameter loading this makes w(3b+4)+6 cycles~ where w is the image width in pixels and b is the number of bits per pixel~ specified through parameters.
both
(The additional /4. cycles per pixel column can be reduced
to one if the operation is made as a single w*b bits wide addition with markers in the mask register for pixet-slice limits). Addition of two 128 pixet wide images with 8 bit data takes 3590 clock cycles~
i.e.
0.72 ms.
EXAMPLE 6: Point by point maximum of images A point by point maximum operation on two images A and B replaces all pixels of A~ that are smaller than the corresponding pixets of B~
with the B-pixels.
The operation
proceeds in two phases" In the first phase a pixel slice of B is subtracted from the corresponding pixel slice of A without storing the result. moved to the Tag registers. the A-image slice.
The signs of the result is
In the second phase the B-slice is written tag-masked into
The time for subtraction is two cycles per bitstice (two reads).
time for move is also two cycles per bit-slice (one ready
one write).
The
The t o t a l time is
w(4b+10)+4 cycles. w=128 and b=8 gives 5380 cycles,
or 1.08 ms.
EXAMPLE 7: Thresholding The most common operation used for segmentation of grey scale images is thresholding. It produces a binary image that has ones in those coordinates of the original picture where the value exceeds a certain threshold. cycles are used per bit slice. memory word.
In the implementation on LUCAS~
two
This is because the threshold value is stored in each
An alternative is to store the threshold in the Common Register.
Since
257
the ALU has one input from Common and one from the memory word, could then be made faster. signals as the memory. Register.
However~
the comparison
the Common Register receives the same address
Hence the threshold must be stored repeatedly along the Common
A base register for
Common Register addressing would be a good thing to
include in the Control Unit. The i m p l e m e n t a t i o n of thresholding made on LUCAS takes w(2b+5)÷7
cycles.
w=128 and b=8 gives 2695 cycles,
or 0.54 ms.
E X A M P L E 8: Roberts" cross-difference operator Difference operators are widely used for the detection of edges. called
Roberts"
operator
[Roberts65]
has
the
following
One v a r i a t i o n of the so
form
(j
and
k
are
image
two
formations of
coordinates): R(j,k) = m a x ( t I ( j , k ) - l ( j + l , k + l ) I The microprogram absolutes,
for
,
tI(j,k+l)-I(j+l,k)l
this can be divided
and one maximum,
into
) two
subtractions,
Subtraction and maximum were treated above,
value f o r m a t i o n on a 128 x 128 8-bit image takes 2820 cycles,
or 0.56 ms.
Absolute Thus,
the
execution t i m e for Roberts" cross-difference operator is
2 subtractions:
2 x 0.72
ms
2 absolutes:
2 x 0.56
ms
1 maximum:
1 x 1.08
ms
3.64
ms
Total
execution
time:
An example showing the e f f e c t of the operator is given in Photo 9.4 under example 16 below.
EXAMPLE 9: The Laplacian operator The d e r i v a t i v e of an image in the x - d i r e c t i o n can be approximated by the expression
df/dx = f ( x + l , y ) - f(xw)
The second order derivative,
then~
becomes
258
d 2 f / d x 2 = [ f ( x + l , y ) - f(x,y)] - [f(x,y) - f ( x - l , y ) ] -= f ( x + l w ) + f ( x - l , y ) - 2f(x,y)
The Laplacian o p e r a t o r is defined as L(f) = d 2 f / d x 2 + d2f/dy 2 = = f ( x + l , y ) + f ( x - l , y ) + f ( x , y + l ) + f(x,y-1) - 4f(x,y)
The a p p l i c a t i o n of the Laplacian to a pixel whose four neighbours all have the same value as the pixel itself, but none has larger,
gives a zero result.
If some of the neighbours have smaller values
L(f) w i l t be negative.
of an edge in the image,
At
value than the c e n t e r pixel~
This is the case,
the o t h e r side of the edge r
but none has smaller.
for example,
at one side
some values have a larger
There the result wilt be positive.
The function
f(x,y) - L(f(x,y)) = 5f(x,y) - [ f ( x + l , y ) + f ( x - l , y ) + f ( x , y + l ) + f ( x , y - 1 ) ] will
take
on
the
value
f(x,y)
at
all
points
where
neighbourhood is the same as the c e n t e r pixei value. f(xw)
if f(x W) is smatter than that
f(x W) is larger
than
mean value r
the
mean
value
of
the
4-pixel
It wilt take on a value s m a l l e r than and it w i l l take on a larger value if
the mean value of the neighbourhood.
Thus,
a p p l i c a t i o n of this
f u n c t i o n has the e f f e c t of increasing the contrast in the image. The Laptacian o p e r a t o r applied in the way described above can be used for enhancement of blurred pictures and also for d e t e c t i o n of edges, "Laplacians"
can be defined by using d i f f e r e n t
lines and spots.
neighbourhoods~
Alternative digital
or by using a weighted
average over the neighbourhood. We w i l l consider the i m p l e m e n t a t i o n on L U C A S of t w o operations of this kind, L 8.
L4
is the
neighbourhood.
inverse
of
L
as defined
above.
L 8 uses all
pixel
values of
L 4 and L 8 are l i n e a r f i l t e r s t h a t can be described as in Figure 9.8.
L 4 and a 3x3
259
[
i -1
-1
-I
-t
-t
-1
8
-1
-1
-1
-1
L4
L8
Figure 9,8 The operators L 4 and L 8,
Photo 9,1 i|lustrates the e f f e c t of using L 4 on an image. original
image,
applying L 4.
reproduced with
only 4-bit
grey scale.
Negative values are put to zero,
image gives the result shown lower left. original image.
The upper l e f t image shows the Upper right
is the result of
Addition of the result to the original
The contrast has increased compared to the
Subtraction instead of addition gives the result shown lower right.
Here
the edges have been blurred.
Photo 9.1 Illustration of the e f f e c t
of the Laplacian operator L 4.
Upper right: L 4 applied.
Upper heft: original,
L o w e r left: L 4 added to original.
subtracted f r o m the original.
Lower right: L 4
Negative values have been set to zero.
4-bit
grey scale is used. Computation of L 4 is done by first subtracting
the
values
of
the
multiplying all pixet values by the f a c t o r 4~
four
M u l t i p l i c a t i o n by 4 takes no t i m e at all, bits of the pixel are fetched, can reduce this
to
the time
Thus,
neighbouring
pixels,
one
after
the
then other.
since it only means changing the address when
the t o t a l t i m e is the t i m e of four subtractions.
of two additions and one subtraction with
We
the following
260
method= First,
the sum of each pixet and its upper right neighbour is formed and stored
in a temporary area.
Then,
each pixe] in this area is added to its upper left neighbour.
The result obtained is finally subtracted from the original image value immediately to the right multiplied by 4. i f the full dynamics of the b-bit fields is to be used~ extension of the field length to b+2 bits must be made.
Thus,
the time to compute L 4 is the following (time for
addition and subtraction was given in example 5 above): T(L 4) = 5*[w(5(b+2)+4)+6] = w(9b+50)+18 cycles. w=128 and b=8 gives T(L 4) = 1507/4 clock cycles, To compute L8,
2.61 ms.
the sum of all nine elements of the neighbourhood is first computed,
which can be done in just four additions. element,
i.e.
First,
the neighbour above is added to each
then the neighbour below in the original image is added to this sum.
new image obtained in this way the process is then repeated, neighbours,
respectively,
To give L8,
[n the
using the left and right
instead of those above and below.
the sum of all nine neighbourhood pixels is to be subtracted from the value
of the center pixet multiplied by 9.
Multiplication by 9 is performed as an addition of
the pixe] value with itself shifted left three positions.
The total sum for the computation
of L 8 becomes the sum of: a) Vertical addition: 1 addition using b bits, b)
Horizontal
addition;
e) Multiplication by 9 : 1
1
addition
1 using b+l bits using
b+2
bits,
addition using b+5 bits
d) Subtraction: 1 subtraction using b+5 bits The computation times are: a) w(6b+11)+12 b) w(6b+25)+12 c) w(Sb+15)+12 d) w(Sb+15)+12 The total time is T(L 8) = w(18b+60)+48 cycles. w=128 and b=8 gives T(L 8) = 26160 clock cycles,
i.e.
5.25 ms.
one
using
b+5
bits
261
E X A M P L E 10: Mean vahJe f i l t e r i n g in some pictures,
replacing the value of each paint with the average pixel value in some
neighbourhood of the point (including noise.
the point itself)
may be a useful way to reduce
This is called local averaging or mean value f i l t e r i n g .
Mean value f i l t e r i n g division
by
the
neighbourhoods
means addition of
size of
of
size
the 5x5
all
pixels
neighbourhood. and
5x5.
in the neighbourhood followed by a
We
consider
Division
by
9
mean value f i l t e r i n g
can
be
approximated
over by
a
m u l t i p l i c a t i o n by 7/64 with an error of about 1,5% only.
(Anyhow,
length,
m u l t i p l i c a t i o n by 5/I28 is an
division by 9 cannot be done exactly).
Similarly,
approximation of division by 25 with an error of 2.3%. can
be
implemented
subtraction
of
as a m u l t i p l i c a t i o n
the original value.
by
Division
8 (which
with l i m i t e d data
M u l t i p l i c a t i o n of a value by 7 takes no time)
by 64 takes no time,
division by 9 can in fact be done as a single subtraction.
followed
which
Similarly,
by
a
means that
division by 5 w i l l
be a single addition. The sum of all pixeis in a 5x3 neighbourhood is obtained by four additions, in example 10 above.
The t i m e for this is w(12b+54)+24.
which is realized as a single subtraction, is postponed t i l l a f t e r the division.
as described
The subsequent division by 9,
takes w(3b+13)+6 cycles if truncation to b bits
In t o t a l ,
the t i m e to compute the mean value of
each 3x3 neighbourhood in an image is T(M 9) = w(15b+47)+30 cycles. w=128 and b=8 gives T(M 9) = 21406 clock cycles, In the case of a 5x5 neighbourhood, obtained in six
additions - three
i.e.
4.28 ms.
the sum of all elements in the neighbourhood can be vertical
and three horizontal.
The three horizontal
additions can be reduced to two using the fact that the image is swept over columnwise from
right
to l e f t
during computation.
pixel can be computed f r o m
The sum over the neighbourhood of a specific
the one obtained for a pixel in the preceding column by
subtracting the rightmost contribution and adding a new contribution from the left. The t i m e
to
compute
w(15b+62)+30 cycles. bits data.
the sum of
the
neighbourhood for
each pixel of
the
image is
The final m u l t i p l i c a t i o n by 5 is done as a single addition on b+5
This takes w(3b+20)+6 cycles.
Thus,
the t o t a l t i m e required to compute the
average over 5x5 neighbourhoods is T(M25) = w(18b+82)+36 cycles. w=128 and b=8 gives T(M25) = 28954 c[ock cycles,
i.e 5.79 ms.
The t i m e to do averaging over a 5x5 neighbourhaod is only 35% longer than the t i m e required for a 3x3 neighbourhood.
262
E X A M P L E 11: Median f i l t e r i n g For suppression of noise in images~
the use of non-linear filters~
is considered to have many advantages over linear filters~
like the median fiiter~
e.g.
taking the average.
Perhaps the most important is the a b i l i t y to preserve sharp edges [Justusson80]. Median f i l t e r i n g
means replacing
a pixeI
Danielsson has devised an algorithm [Danielsson81].
that
value by
the
median of the
neighbourhood.
utilizes bit-serial scanning of the arguments
The algorithm has been implemented on LUCAS for a 3*3 neighbourhood.
It starts by analyzing the set of mast significant bits of the neighbourhood points. there are more zeroes than ones in this set, has a zero as its most significant bit.
If
i t can be concluded that the median value
It proceeds with the following bits~
successively
refining the hypothesis. When
traversing the neighbourhood~
conditions have the e f f e c t
scanning a bit
slice of
the arguments~
that a counter of each point is incremented~
other conditions have the e f f e c t that the counter is decremented. the counter is done bit-serially on LUCAS~ execution t i m e - around 70 percent.
it
certain
while certain
Since this operation on
takes a considerable part of the t o t a l
The execution time for a w-column picture with
b-bit pixel values is
w(154+324b) c y c l e s . Our example w=128,
b=8 yields 351,488 cycles,
i.e 70 ms.
If the processing elements
were provided with counters that could be incremented or decremented in one cycl%
the
t i m e would decrease to around 20 ms.
9.4.3 Genuinely local operations.
Larger neighbourhood sizes
The communication network on LUCAS permits a processing element to access data from words one or two
steps up or down.
distance from the PK~ the distance is,
it must be t e m p o r a r i l y loaded in words in between.
The larger
the larger number of temporary storage steps are needed,
distance is very large - 20 or more, perfect
When data is needed from words at a larger
When the
a p p r o x i m a t e l y - it may be favourable to use the
shuffle/exchange network to route data to the desired PEs.
Shift of data an
a r b i t r a r y number of steps up or down can be made in log2N passes through the network, where N is the number of PEs [Lawrie75],
Such long distances hardly occur in local
operations. In
this
section
we
will
give
one
example
neighbourhood than is directly accessible.
of
a local
operation that
uses
a larger
The example given concerns f i l t e r i n g of a
grey-scale imagel according to [Kruse77] the need for larger neighbourhoods is stronger on
263
grey-scale
images
than
on
binary
images.
The
calculations
involve
multiplications,
therefore the computation t i m e w i l l dominate strongly over the t i m e to fetch data to the correct PEs.
E X A M P L E 12: Linear f i l t e r i n g The mean value f i l t e r described in example 10 above is an example of a convolution of the image m a t r i x with values equal to 1.
a smaller m a t r i x ,
in that
case a 5x5 or 5x5 m a t r i x with all
The convolution was followed by a division by the total weight of the
convolution m a t r i x in order to keep the overall grey-scale level in the image unchanged. Linear f i l t e r s are often specified as convolution matrices of larger size than this.
Cross
c o r r e l a t i n g the image w i t h small template images also involves the same computations. As our example we w i l l take a linear f i l t e r specified by a convolution m a t r i x of size 9x9. b bit data are used,
both for image pixel values and f i l t e r constants.
The convolution is computed as iterations over (a) the pixel-cotumns of the image and (b) the 81 values of the convolution m a t r i x . loop variabl%
Depending on which one is chosen as the outer
two d i f f e r e n t computation strategies are obtained=
(t): For each of the 81 values of the convolution m a t r i x , entire image by the value.
to the result of another point, corresponds to matrix.
do the following= M u l t i p l y the
The product obtained in a specific point,
p,
is to contribute
located at a certain distance from p.
where the currently
used f i l t e r
constant
This distance
is located in the
convolution
In 56 of the 81 cases (see Figure 9.9) the transfer of the product must be done
in two steps because of the l i m i t a t i o n s of the interconnection network. overhead from
this
can
be
reduced
if
destination already when they are formed.
the
products
In fact,
are
stored
closer
However, to
the
the final
this reduces the overhead to one
clock cycle for each bit to be transferred. (2): For each of the pixel calumns: M u l t i p l y the column and its 8 neighbouring columns (4 on each side) by 9 values each.
(In t o t a l 81 multiplications).
transfer the result to the PE that needs it and accumulate it. l i m i t a t i o n s in interconnections is the same as above.
A f t e r each multiplication, The overhead due to the
264
Figure 9.9 The
area
whose
position.
The
values shaded
are
needed
area
shows
Measuring only pure computation time,
to
compute
which
the
values
result
are
at
the
directly
the two approaches are equivalent.
center
accessible.
An advantage of method (2) is that computation can start as soon as a few columns of the image are input, and output can start as soon as the result of one column is obtained. Thus, computation and input/output can be overlapped.

In the following, we will use method (1), but the timing for the other method will be the same. In each of the 81 iterations, the entire image is multiplied by a scalar. Preferably, the scalar is coded in canonical signed digit code (see Section 3.4.2). The time for the multiplication is, on the average, equal to b^2 + 4b - 3 cycles. The products obtained are accumulated using an increasing number of bits, on the average 2b+5 bits. This takes 3(2b+5)+4 = 6b+19 cycles. Thus, disregarding overhead, the treatment of a pixel slice takes b^2 + 10b + 16 cycles per iteration. This is done for all w columns in each of 81 iterations. The total time, then, becomes
81w(b^2 + 10b + 16) cycles. With the values w=128 and b=8 this makes 1,658,880 cycles. To this value we should add the extra time required to pass data between the memory and the PEs due to the limitations of the interconnection network. This is the case in 36 of the 81 iterations, and the extra time is one clock cycle per bit to be transferred. This makes a total of 36 x w x 2b cycles, which for w=128 and b=8 is 73728 cycles. This is small compared to the computation time. The total execution time for a 9x9 linear filter on a 128x128 image of 8-bit data amounts to 1,732,608 cycles, on the average, i.e. 0.35 seconds. There is some
uncertainty in this value, since this algorithm has not been programmed on LUCAS.
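As an illustration, the following C fragment (ours, not part of the LUCAS software) recomputes the cycle estimate derived above; the 5 MHz clock is the figure used throughout this book.

    #include <stdio.h>

    /* Cycle-count estimate for the 9x9 linear filter (Example 12),
       using the formulas from the text. */
    long filter_cycles(long w, long b)
    {
        long per_slice = b*b + 10*b + 16;    /* multiply + accumulate, per iteration */
        long compute   = 81 * w * per_slice; /* 81 convolution values, w pixel columns */
        long transfer  = 36 * w * 2*b;       /* two-step transfers in 36 of 81 cases */
        return compute + transfer;
    }

    int main(void)
    {
        long cycles = filter_cycles(128, 8);
        printf("%ld cycles = %.2f s at 5 MHz\n", cycles, cycles / 5e6);
        return 0;
    }

For w=128 and b=8 this prints 1732608 cycles, i.e. 0.35 seconds.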
9.4.4 Semi-local operations

Operations consisting of repeated applications of local operations until some specific criterion is reached form an important class of image processing algorithms. Although such operations are made up of local operations they can not be called genuinely local, since the result at some point in the image may depend on pixel values at a large distance from the point. We use the term "semi-local operations" to describe these operations.

Semi-local operations can be used for such tasks as counting the number of objects in an image, labeling objects or following curves and borders. We will take four examples, all on binary images.
EXAMPLE 13: Connectivity preserving shrinking to a point

A method for counting the connected components (the objects) of a binary image is to shrink every object to a point and then count the number of "1"s in the image. We assume that the 4-adjacency relationship is used to define connectedness. We further assume that the components are without holes, otherwise the algorithm that we present here will not shrink them to a point. (There are algorithms [Rao et al.76] that also shrink objects with holes to single points).

The shrinking process is not allowed to disconnect any object. Therefore, the simple shrinking operator used in Example 3 cannot be used. It would delete thin parts of the objects and thereby disconnect them. Instead, the operators shown in Figure 9.10, found in [Danielsson82], will be used. The operators change the center pixel (underlined) from 1 to 0 if the neighbourhood is as specified. Repeated application of the operators will shrink objects to single points.
[The ten 3x3 operator patterns, labeled A - J, are not reproduced here.]

Figure 9.10 Connectivity preserving shrinking operators.
As usual, the image will be swept over columnwise. All the operators are applied to a column before stepping to the next one. We will use what is often called "sequential" or "recursive" operating mode, meaning that the very input image is changed as the result of an operator. Thus, when applying the next operator on the same slice, the slice may have changed. This also motivates repeated application of the same operator on a slice before taking the next operator.

The algorithm that we use works as follows (a sketch of the control flow is given after the list):

Scan the image from left to right. For each pixel column:
1) Apply operator A.
2) Apply operators B, C and D in sequence.
3) Apply operator E, repeat until no more changes occur.
4) Apply operator J, repeat until no more changes occur.
5) Repeat steps 2, 3 and 4 until no more changes occur.

Then, scan the image from right to left. For each pixel column:
6) Apply operator F.
7) Apply operators G, H and I in sequence.
8) Apply operator J, repeat until no more changes occur.
9) Apply operator E, repeat until no more changes occur.
10) Repeat steps 7, 8 and 9 until no more changes occur.

Repeat the scanning procedures until a whole scan is made without any changes.
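The following C fragment is a minimal sketch (in our own notation, not LUCAS microcode) of the control flow of the left-to-right sweep; apply_op and column_has_ones are hypothetical helpers standing in for the corresponding LUCAS column operations.

    #include <stdbool.h>

    /* Hypothetical helpers: apply_op(op, col) applies one shrinking
       operator to pixel column col and returns true if any pixel
       changed; column_has_ones(col) tests whether the column holds
       any "1" at all. */
    bool apply_op(char op, int col);
    bool column_has_ones(int col);

    /* One left-to-right sweep (steps 1-5 above); returns true if the
       image changed.  The right-to-left sweep is analogous, with
       operators F, G, H, I, then J and E. */
    bool sweep_left_to_right(int width)
    {
        bool changed = false;
        for (int col = 0; col < width; col++) {
            if (!column_has_ones(col)) continue;      /* skip blank columns */
            changed |= apply_op('A', col);            /* step 1: A only once */
            bool outer;
            do {                                      /* step 5: repeat 2-4 */
                outer  = apply_op('B', col);          /* step 2: B, C, D */
                outer |= apply_op('C', col);
                outer |= apply_op('D', col);
                while (apply_op('E', col)) outer = true;  /* step 3 */
                while (apply_op('J', col)) outer = true;  /* step 4 */
                changed |= outer;
            } while (outer);
        }
        return changed;
    }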
Before a new pixel column is treated, a test to see if there are any "1"s at all is performed. If there are no "1"s in the column, applying the operators is a waste of time. Also, with the application of the operators A, B, F or G, a column may become blank. Therefore, after each of these operators, the test is performed.

The order between the individual operators is crucial. The operators A and F have the potential to delete a whole string of "1"s in one application. Therefore, they are used first on each column. Once applied on a column, there is no sense in applying them once more, since no result from the other operators can make them applicable on new pixels.

The execution time for the procedure strongly depends on the characteristics of the objects in the image.
On some images, one pass over the image is sufficient to delete all pixels but one per object. Normally, however, two passes or more are needed. Objects formed like spirals are the most difficult to shrink and require many passes. An example where three passes are needed is shown in Figure 9.11. Pixels deleted in the first pass are marked by a "1", those deleted in the second pass by a "2", etc. In the fourth pass no deletions are made. When this is discovered the procedure ends.
Figure 9.11 An object that is shrunk to a point in three passes.
Photos 9.2 (a) - (c) show the shrinking of 13 objects in a 128 x 128 binary picture. (a) shows the original image, (b) shows the image after one sweep from left to right and (c) shows the final 13 points. The total execution time is 10 ms.
Photo 9.2 Connectivity preserving shrinking to points. (a) original image, (b) after one sweep from left to right, (c) final result
EXAMPLE 14: Finding the outer perimeters of objects

A method for finding the outer perimeter of each connected component (object) in a binary image is to propagate a marker from a point at the edge of the picture over the image area, until the objects are reached. When this procedure is completed, those pixels of the objects that have markers as neighbours are marked as outer perimeter pixels. As a matter of fact, the procedure can equally well be used for finding the holes or inner contours of objects. Hole points are those "background" pixels that have not been marked, and inner contours are those object pixels that have a hole point as a neighbour.
Different strategies can be used in order to spread the marker over the background as fast as possible. Figure 9.12 illustrates two approaches. We assume that 4-adjacency is used to define connectedness of objects. In both strategies the image is scanned back and forth. For each column, a background pixel is marked if it has any 8-neighbour that has been marked. In (a) a new column is taken all the time, whereas in (b) a column is not left until no more pixels can be marked. The starting point at the edge is marked by a *. The numbers at the pixels show in which step of the procedure the particular image points are marked. In (a) the last pixel is marked in step no. 32, in (b) after 15 steps. Furthermore, many of the steps in (b) can be shortened: when acting on a certain bit-slice, the horizontal and diagonal neighbours need be considered only in the first step. During the following steps, only the neighbours below and above can affect the result. Thus, it seems that the strategy that spreads the marker vertically to a maximal degree before continuing in the horizontal direction is the best one.
Figure 9.12 Different strategies for propagation of the marker.
It can be noted that propagations like this are very efficiently performed on the CLIP4 processor. The reason is that CLIP4 is equipped with a propagation function that is entirely combinatorial. Thus, the entire propagation is achieved by a single instruction. On LUCAS, we could imagine a similar function, working only in one dimension - vertical. It would be easy to implement.

Strictly synchronous twodimensional arrays that store one pixel per PE (as, to our knowledge, MPP is) will of course perform well on this operation, however not as many times better than a linear array as could be expected. On the example of Figure 9.12 a twodimensional array of 64 processors would need 9 steps to reach the last pixel, compared to 15 steps needed by a linear array of 8 processors. Each step requires looking at all eight neighbours. Larger examples that we have studied show that this tendency holds - the increase in performance falls far below the increase in amount of
hardware.

A microprogram on LUCAS for finding the outer perimeters of objects [Svensson85a] was applied to the image in Photo 9.5 (left). The processing of the 128 x 128 image shown took 1.4 ms. In Chapter 6 the algorithm is specified in the notation of Pascal/L.
Photo 9.5 Finding the outer perimeter of objects
EXAMPLE 15: Component labeling

Component labeling in binary pictures is the process of assigning different labels to the different components of the image; in other words, for any component C, we want all points of C to have the same value, and no point not in C to have that value. The best method for doing this operation on LUCAS is probably to start off with a connectivity preserving shrinking, as suggested in [Danielsson and Ericsson 82]. After the shrinking process, the image is scanned once more. For each pixel with the value "1" a new label is stored in the result image - the labeled image - at the corresponding position. Now, the labels of the points are propagated to all points belonging to the same object in the original image. This is a process similar to the propagation described in the previous example. However, in this case not only a marker bit is propagated, but also - in a different memory area - a multi-bit label.
As with the previous examples, the processing time strongly depends on the shape of the objects in the image. The operation has not been programmed on LUCAS. A qualified guess is that the time for propagating labels is approximately the same as for connectivity preserving shrinking, assuming at most 64 different labels.
EXAMPLE 16: Tracking

For the detection of edges in an image some kind of gradient operator (e.g. Roberts' cross difference operator, described in Example 8) is often applied. (Possibly, some kind of preprocessing, e.g. median filtering, is first done in order to suppress the influence of noise). The derived picture is then typically thresholded at some appropriate level. A too high level will lead to some edge points being missed, while a too low threshold will give many "false" edges. A method that can be used to remove these drawbacks is "tracking". We then start with the "safe" edge points obtained by thresholding with a high threshold value (see image A in Figure 9.13). Then we propagate these points along connected edges in a picture that has been obtained by thresholding at a lower level (image B) and obtain an image of true edges (image C).
Figure 9.13 Result of thresholding at high level (A) and at low level (B). Result of tracking the "1"s of A in B is shown in C.
The technique used for propagation is exactly the same as in Example 15. An example is shown in Photo 9.4 (a) ... (e). (a) shows the original image, (b) shows the result of Roberts' gradient operator (see Example 8) applied to the image, (c) and (d) show the result of thresholding (b) using two different levels. (e), finally, shows the result obtained when the points in (c) are tracked along the edges of (d). The processing times to reach the different results are
b) 3.64 ms (Roberts')
c) 0.54 ms (Threshold)
d) 0.54 ms (Threshold)
e) 0.77 ms (Tracking)

The image size is 128 x 128 8-bit pixels. However, in the illustration in Photo 9.4 (a) and (b) only 4-bit grey scale is used.
Photo 9.4 (a) original, (b) result of Roberts' cross difference operator applied to (a), (c) result of thresholding (b) at level 10, (d) result of thresholding (b) at level 4, (e) result of tracking the points in (c) along the points in (d).
9.4.5 Measurements

The operations that we have looked at so far have all been of the kind that transforms images to images. Often, we instead want to measure things in the image, e.g. count the number of objects, determine the area or the perimeter of an object, etc. This is also known as feature extraction.

Looking closer at such measurements, one finds that many rely upon a count of the number of "1"s in a binary image, e.g. those mentioned above. Another measurement that is sometimes useful for pattern analysis is the following: Apply a shrinking operator to a binary picture repeatedly. After each step, count the number of remaining "1"s. The successive counts will form a "feature vector" that will have quite different characteristics if the objects are small or large, elongated or not, etc.

As a matter of fact, a quantitative measure of the "elongatedness" of an object can be obtained through a study of shrinking. Thus, we first measure the area of the object, i.e. we count the number of "1"s. Then we shrink the object until it vanishes and count the number of shrinking steps required. These two measures can then be used to get a value of the elongatedness of the object: Let A be the true area of the object, and let t be the number of shrinking steps required in order to erase the object totally. This means that the width of the object is 2t. Now, if the object has a quadratic form, the area is 4t^2. The quotient A/(4t^2) is then a measure of the elongatedness of the object.

Counting the number of "1"s in a binary picture is one example that we will consider in this section. Another is finding the maximum pixel value of an image, together with the coordinates of that pixel. The third example that we will treat is histogram collection.
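As a small illustration of the elongatedness measure above (our code, not LUCAS software):

    /* Elongatedness from the two measures in the text: area is the
       pixel count of the object, shrink_steps the number of steps
       until it vanishes.  A value near 1 indicates a compact object;
       larger values indicate elongated objects. */
    double elongatedness(long area, long shrink_steps)
    {
        double side = 2.0 * shrink_steps;    /* object width is about 2t */
        return (double)area / (side * side); /* A / (4 t^2) */
    }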
EXAMPLE 17: Counting the number of ones

We will discuss two methods for counting the number of "1"s in a binary picture. The second method assumes additional hardware, not implemented on LUCAS.

Method 1.

The first step in this method is the summing of each row of the image separately. This is of course done in parallel for all rows (words). The fastest way is the following (assume a 128x128 binary picture): First, sum pixels pairwise so that 64 sums, each with a value between 0 and 2, are formed. Then sum these sums pairwise, giving 32 sums with values between 0 and 4, etc. The reason for this method being efficient is the circumstance that the initial additions, although many, use very few bits, and that the longer additions towards the end of the procedure are very few. In total, a little less than one thousand clock cycles are needed to sum over the rows. Now, the row sums can be added fast over the perfect shuffle/exchange network. Seven addition steps are required to add all 128 row sums. The number of bits increases from 8 to 14 during the process, which requires approximately 300 clock cycles. In total, then, 1300 clock cycles are needed to count the number of "1"s in a binary 128x128 picture. With a cycle time of 200 ns, this takes 260 microseconds.
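The pairwise summing scheme can be sketched as follows (plain sequential C for one row, our code; on LUCAS the same scheme runs bit-serially and for all 128 rows in parallel):

    /* Sum neighbouring partial sums pairwise until a single row sum
       remains: 128 one-bit values become 64 two-bit sums, then 32,
       and so on down to one 8-bit sum. */
    int row_sum(const unsigned char bits[128])
    {
        int sums[128];
        for (int i = 0; i < 128; i++)
            sums[i] = bits[i];               /* 128 one-bit values */
        for (int n = 64; n >= 1; n /= 2)     /* 64, 32, ..., 1 partial sums */
            for (int i = 0; i < n; i++)
                sums[i] = sums[2*i] + sums[2*i + 1];
        return sums[0];                      /* number of "1"s in the row */
    }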
Method 2.

If LUCAS is equipped with special purpose hardware to count the number of responders (number of Tag registers with value one), the number of "1"s in a picture can of course be obtained faster. An adder tree according to Figure 9.14 can serve this purpose. Using standard PROMs and adder circuits the summing time is 160 ns, thus smaller than the clock cycle time. The values of consecutive counts are accumulated in the "Count Accumulator", a register that can be read from the Master Processor.

The time needed to count the number of "1"s in a binary image is then equal to the time needed to put the bit-slices in the Tag flip-flops. This can be done at the speed of one slice per cycle. Thus, the total count time for a 128x128 image becomes 128 cycles, i.e. 25.6 microseconds using a 5 MHz clock. This is ten times faster than by method 1.
Figure 9.14 Part of an adder tree to count the number of responders. A=Adder, P=PROM. The total number of PROMs required is 16, the total number of 4-bit adders is 34.
EXAMPLE 18: Maximum value of image

An algorithm to locate the maximum-valued element in a matrix was described in Section 3.3.4. It starts off by finding the maximum element of the first column, then examines the next column to see if there are larger elements. If there are, the largest one is taken as a new candidate, etc. A bit-slice in the associative array and a register in the Address Processor are constantly updated to keep track of where the maximum value so far can be found.

The computation time is data dependent. One search for "larger than Common" is needed for each pixel column. A search for the maximum value of a column is needed for some columns. Also, data has to be moved from the array to the Common Register. In the worst case, all of this is needed for all pixel columns. Assuming 128 columns of 8-bit pixels, the worst case takes 128(12 + 29 + 16) = 7296 cycles, i.e. approximately 1.5 ms, using a 5 MHz clock.
EXAMPLE 19: Grey level histogram

Collecting the histogram of a grey level picture means counting the number of occurrences of each of the possible grey levels. A straightforward method is the following: For each of the grey levels, search the entire picture and produce a binary picture with "1"s in those points where the specific grey level occurs. Then count the number of "1"s in the binary picture. Assuming a 128 by 128 image with 256 grey levels, the search will take approximately 2500 cycles and the count 1300 cycles (see Example 17, method 1), i.e. 3800 cycles per grey level. Thus, the total histogram is collected in 256 x 3800 = 972 800 cycles, i.e. 195 ms using a 5 MHz clock. This is quite a long time; in fact the Master microcomputer could gather the histogram in a time that is close to this.

There are ways to shorten the time. First, using the count responders network described in Example 17 will decrease the time to 155 ms. Second, the search can be made faster at the cost of having to reserve some scratch pad area for intermediate search results. The following is one possibility: Divide the grey values into four classes based on the two most significant bits. Create binary maps showing which pixels belong to each of these classes. Twelve cycles per pixel slice are needed to create these maps, i.e. 1536 cycles in total. A similar division of grey values is made based on the next two bits, etc. This gives 16 maps in total, created in 4 x 1536 = 6144 cycles. Now, the points having a certain grey value can be obtained through logical AND between four maps. This takes four cycles per bit-slice, i.e. 512 in total for each grey value. The result is obtained in the Tags, and the number of ones can be calculated in the adder tree at once. Thus the total time for histogram collection using this method will be 6144 + 256 x 512 = 137 216 cycles, i.e. 27 ms.

Histogram collection is not one of those tasks that an array of this kind performs best. With increased capabilities of the processing elements, that allow them to perform one histogram collection each on the pixels stored in their respective memories, it is possible to do well also on this task, as shown in [Danielsson and Ericsson 82]. We can also choose the possibility to compute the histogram outside the array. A fairly simple device "listening" to the input or output stream of pixels can be designed for this task. Each pixel value that passes the device is used as an address pointer to a memory, and the corresponding memory word is incremented by one. The maximum I/O rate with LUCAS is one 8-bit pixel every 200 nanoseconds. A histogram collection device following this rate is realistic, and would collect a histogram for a 128 x 128 x 8 image in 3.3 ms.
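To make the map-based search described above concrete, the following C sketch (ours; plain sequential code, one pixel at a time, whereas on LUCAS the maps are bit-slices produced column-parallel) builds the 16 class maps and counts one grey level by ANDing four of them:

    #include <stdint.h>

    /* Each 8-bit grey value is split into four 2-bit groups; for each
       group, four binary "class maps" record which pixels fall in
       which class.  A grey level is then located by ANDing one map
       from each group. */
    #define NPIX (128 * 128)

    static uint8_t map[4][4][NPIX];   /* map[group][class][pixel] = 0 or 1 */

    void build_maps(const uint8_t img[NPIX])
    {
        for (int p = 0; p < NPIX; p++)
            for (int g = 0; g < 4; g++) {
                int cls = (img[p] >> (6 - 2*g)) & 3;  /* 2 bits per group */
                map[g][cls][p] = 1;
            }
    }

    long count_level(uint8_t level)
    {
        long n = 0;
        for (int p = 0; p < NPIX; p++)
            n += map[0][(level >> 6) & 3][p] & map[1][(level >> 4) & 3][p]
               & map[2][(level >> 2) & 3][p] & map[3][level & 3][p];
        return n;   /* on LUCAS this count comes from the adder tree */
    }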
9.4.6 Global transforms

There are many two-dimensional global transforms that are used in image processing, primarily for the purpose of image enhancement and restoration and image encoding. In this study we will restrict ourselves to a brief discussion of how the two-dimensional Fourier transform can be calculated on LUCAS and the implications of this for the Walsh-Hadamard transform.
EXAMPLE 20: Two-dimensional FFT

In Section 7.3 we studied the implementation of the one-dimensional discrete Fourier transform using the FFT algorithm. The two-dimensional discrete Fourier transform of an image stored in LUCAS, I, can be obtained in the following way [Nussbaumer81]: First, transform each row of I to produce an intermediate matrix, G, then transform the columns of G to produce the final result, F.

With the image stored one row per Processing Element, transforming the rows means making an entire FFT within one Processing Element. Assuming a 128 x 128 image, 128 such computations are done simultaneously. The time for the row transforms will be the same as for the column transforms. This is because the same number of arithmetic operations are performed in the two cases, the only difference being the way data is accessed. During the row transforms, data is accessed by means of "butterfly addressing" within the Memory Module. In the column transforms, the shuffle/exchange network automatically provides the correct data.

The result matrix is obtained with its rows bit-reversed. When the image is output this is corrected through the use of a bit-reversed address buffer, as described in Section 7.3. The total time for a Fourier transform of a 128 x 128 x 8 bit picture will be 256 times the required time for a 128 point one-dimensional FFT. Since this was 1.1 ms, the total time will be around 300 ms.

Another transform frequently used in image processing is the Walsh-Hadamard transform (WHT). The principle for the computation of the FFT can be applied also to the WHT [Gonzalez and Wintz 77]. The difference is that the trigonometric functions are reduced to plus one and minus one. This reduces the computation time by approximately 90 percent. Thus, a two-dimensional WHT could be performed on a 128 x 128 x 8 picture in 30 ms.
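The row-column decomposition used in Example 20 can be sketched as follows (our C code; a naive O(n^2) DFT stands in for the 128-point FFT used on LUCAS):

    #include <complex.h>
    #include <math.h>

    /* Row-column 2D DFT: a 1D transform over every row of I gives G,
       a 1D transform over every column of G gives F. */
    #define N 128

    static void dft1d(double complex *x, int stride)
    {
        double complex y[N];
        for (int k = 0; k < N; k++) {
            y[k] = 0;
            for (int n = 0; n < N; n++)
                y[k] += x[n * stride] * cexp(-2.0 * I * acos(-1.0) * k * n / N);
        }
        for (int k = 0; k < N; k++)
            x[k * stride] = y[k];
    }

    void dft2d(double complex img[N][N])
    {
        for (int r = 0; r < N; r++)
            dft1d(&img[r][0], 1);   /* row transforms: within one PE on LUCAS */
        for (int c = 0; c < N; c++)
            dft1d(&img[0][c], N);   /* column transforms: over the shuffle net */
    }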
9.4.7 Input/output

Finally, we want to investigate how long time is needed for input/output of images. The I/O rate of the Processor Array itself is very high. The bottleneck is outside the array. Data is transferred between the I/O data registers and the Memory Modules at a rate of 128 bits per clock cycle, i.e. 128 x 5 x 10^6 = 640 Mbits/second. However, data can not be written into or read from the I/O data registers at that speed. This is what puts the limit on I/O data speed: how fast can the rest of the system communicate with the I/O data registers?

When no special purpose I/O processor is used, the fastest way for the Master Processor to communicate with the I/O data registers is through the use of the system's DMA unit. The total time required for input/output of a 128x128 matrix of 8-bit data by this method is 19.9 ms. A binary image requires 1/8 of this time, i.e. 2.5 ms.

The I/O processor is capable of writing or reading an I/O register with maximal speed, i.e. 5 MHz, and can thus fill the 128 I/O registers in 25.6 microseconds. The time to transfer the contents of the I/O registers to the Memory Array is 2.2 microseconds. The time required for input or output of a 128 x 128 x 8 image is then 128(25.6 + 2.2) microseconds = 3.6 ms. A binary image requires 0.45 ms.

One further comment on the input/output time should be made: Filling (or reading) the I/O data registers from the Master Processor or I/O processor can be done at the same time as computations take place in the array. Thus, for tasks that are computation bound, the effective input/output time is in fact 2.2 microseconds per 8-bit slice, i.e. 282 microseconds for a whole 128 x 128 x 8 bit image.
9.4.8 Larger images

Throughout Section 9.4 we have assumed that the size of the image side agrees with the number of Processing Elements, so that one line of the image exactly occupies a field of the memory.

When the number of pixels per line in the image is greater than the number of PEs we propose that each PE takes care of more than one column of the image. For example, a 512 x 512 pixels image is stored with four columns per Memory Module. We propose neighbouring columns because this is advantageous from a neighbourhood access point of view. (Larger accessible neighbourhood). Each memory module would receive 512 x 4 = 2048 pixels. Since the MMs are only 4096 bits wide, only two bits per pixel can be stored. This means that LUCAS is not large enough to hold larger pictures than that. To equip this kind of machine with larger memories is one of the easiest things to do and we feel it is highly recommendable if the machine is to be used for image processing.
We disregard the memory length problem for a while and concentrate on how the pixels should be individually ordered within the MM. As an example we take a 16 x 16 pixels image to be stored in a 4 PE machine. Figure 9.15 shows which pixels of the image are stored in each Memory Module. We propose a storage ordering according to Figure 9.16. It makes the addressing required to access neighbouring pixels simple. For each pixel we have that its eight nearest neighbours are stored at the pixel places with addresses -16 ± 1, 0 ± 1 and +16 ± 1 relative to the pixel's own address, taken modulo 64. Some neighbours are in the same MM, others in a neighbouring one. (A sketch of this address arithmetic is given after Figure 9.16.)
[Figure 9.15 shows the 16x16 pixel matrix P00 ... P15,15 with columns 0-3 stored in MM0, columns 4-7 in MM1, columns 8-11 in MM2 and columns 12-15 in MM3.]

Figure 9.15 Division of a 16x16 image on four Memory Modules.

MM0: P00 P10 P20 P30 ... P01 P11 P21 P31 ... P02 P12 P22 P32 ... P03 P13 P23 P33 ...
MM1: P04 P14 P24 P34 ... P05 P15 P25 P35 ... P06 P16 P26 P36 ... P07 P17 P27 P37 ...
MM2: P08 P18 P28 P38 ... P09 P19 P29 P39 ...
MM3: P0,12 ...

Figure 9.16 Storage order of pixels within Memory Modules.
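A minimal sketch (our notation) of the neighbour addressing that this storage order gives: within a Memory Module a pixel's address is 16*column + row, so the eight neighbours lie at the offsets listed above.

    /* Address of a neighbour at row offset drow and column offset dcol,
       each in {-1, 0, +1}.  The result wraps modulo 64 within the
       module; offsets that step past the module's four columns land at
       the corresponding address in the neighbouring Memory Module. */
    int neighbour_address(int own, int drow, int dcol)
    {
        int offset = 16 * dcol + drow;
        return ((own + offset) % 64 + 64) % 64;
    }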
Input/output according to these principles is not without problems. When the number of pixels per line agrees with the number of PEs, pixels arriving one by one in TV scan mode are just written into the I/O registers in the order of arrival. Now, every fourth pixel only - no. 0, 4, 8 and 12 - are to be put in the registers. When these have been input to the array, pixels no. 1, 5, 9 and 13 are treated in the same way, etc. The procedure is repeated for each line.
The procedure is repeated for each line. What is needed is a device with
enough storage to store a line and with addressing
hardware that can read out the contents in another order than the one in which i t was stored.
In the case that served as an examp1%
the address bits are merely shifted two
steps to the left~
giving the sequence 0~4~8~... when the two rightmost bits are 00~
sequence 1,5~9~...
when they are 01~
and so on.
the
Thus~ this can be a very simple
device. The
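Read literally, a pure left shift of the address bits would lose the two high bits; the sequences listed above are obtained if the bits shifted out are reinserted at the right, i.e. a two-step rotation. A sketch for a 16-pixel line (our code):

    /* Reordering address generator: the 4-bit read counter is rotated
       left two steps, so counters 0,1,2,3,... read out stored pixels
       0,4,8,12,1,5,9,13,... */
    int read_address(int counter)
    {
        return ((counter << 2) | (counter >> 2)) & 0xF;  /* 4-bit rotate */
    }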
implemented
generator. size.
I/O
Processor
can
be
described
as a
micraprogrammable address
This makes it able to handle different ratios between image size and array
Different
microprograms~
giving different address sequences~
can be initiated
depending on the ratio at hand.
9.4.9 Comparison of execution times

Some of the tasks described have also been programmed on a conventional VAX 11/780 computer and measures of execution times have been made. For some of the special purpose image processing machines that we have mentioned in this chapter, results from implemented image operations have been reported. We will take a few such examples. Since we want to use the results to make comparisons with the processing times on LUCAS, we have only chosen such results that can be put directly in relation to LUCAS performance results.
9.4.9.1 VAX 11/780

The programs were written in VAX assembly language. The comparison is summarized in Table 9.1.

As can be expected, the greatest difference in time is found for binary images. The simple shrinking operation takes 650 times longer time on the VAX computer than on LUCAS. With 8-bit data the VAX computer is better off, but LUCAS is still nearly two orders of magnitude faster. 16-bit pixel values are not very common in image processing. Comparison between 8- and 16-bit processing times shows that an ordinary computer like the VAX cannot take advantage of the fact that image data have low precision - the processing times for 8- and 16-bit data are nearly identical.
                                Time on VAX (ms)   Time on LUCAS (ms)   Ratio
    Binary image
      Shrink                          130                 0.2            650
      Border                           -                   -              -
    8-bit pixel values
      Laplace L4                      197                 2.61            75
      Laplace L8                      290                 5.23            55
      Roberts' cross-difference       218                 3.64            60
      Mean value 3x3                  335                 4.28            78
    16-bit pixel values
      Laplace L4                      203                 4.46            46
      Laplace L8                      296                 8.92            33
      Roberts' cross-difference       218                 6.80            32
      Mean value 3x3                  374                 7.35            51

Table 9.1 Compared processing times for VAX 11/780 and LUCAS. Image size is 128 x 128 pixels.

9.4.9.2 DAP
In [Marks80] the time needed for collection of histogram on the pilot DAP with 32 x 32 PEs and 200 ns cycle time is given. The histogram of a 192 x 192 pixels image with 6-bit grey scale is obtained in 17.25 ms.

To get a comparative measure for LUCAS, we imagine a 128 x 256 pixels image with 6-bit pixel values. This is very close in size to the one Marks uses. LUCAS would require 10.8 ms to collect the histogram, provided it was equipped with an adder tree. Without an adder tree the time would be 44 ms. DAP has eight times as many processors and the same clock rate as LUCAS.

Marks further reports processing time for the following operation on an image of the same size: The image is differentiated in two directions, the absolute values are formed, thresholding performed, and logical OR between the results is taken. The time for this is reported to be 2.9 ms. On LUCAS, the same operation on a 128 x 256 x 6 bit image would take 5.8 ms, i.e. twice as long time. LUCAS has more powerful instructions in the PEs, which probably in part accounts for the ratio being smaller than eight, which is the ratio between the numbers of processors in the two machines. Also, the addressing of neighbouring pixels within the PEs causes some overhead in DAP.

9.4.9.3 CLIP4

In [Fountain and Goetcherian 80] execution times for a couple of algorithms implemented on CLIP4 are reported. Addition of two images, 96 x 96 x 16 each, takes 450 microseconds on CLIP4. LUCAS adds two 128 x 128 x 16 images in 1332 microseconds. The time per pixel is 49 ns for CLIP4 and 81 ns for LUCAS. Thus CLIP4, with its 72 times as many processing elements, is only 40 % faster per pixel.

Binary shrinking of a 128 x 128 image on LUCAS takes 180 microseconds. An edge detection algorithm for binary 96 x 96 pictures, similar in complexity to a shrinking operation, is reported to take 25 microseconds on CLIP4. This is 2.7 ns/pixel for CLIP4 and 11 ns/pixel for LUCAS. Thus, in this case CLIP4 can be considered 4 times as fast.

9.4.9.4 Picap-FIP

The main features of the Picap-FIP processor are the use of four special purpose processors operating in parallel and the utilization of a fast cache memory to hold that portion of the image that is currently treated. In [Kruse et al.80] the execution time for Roberts' cross difference operator performed on Picap-FIP is given. The time required is 100 ns/pixel (8-bit data). On a 128 x 128 pixels image, this makes 1.6 ms. The
comparative time for LUCAS is 3.64 ms.

9.4.9.5 FLIP-FIP

The FLIP-FIP, using 16 identical processors, is reported to perform median filtering over a 3x3 neighbourhood in 1 second for a 512 x 512 pixels image [Gemmar et al.81]. This makes 3.8 microseconds/pixel. On LUCAS, the same operation is performed on a 128 x 128 image in 70 ms, which makes 4.2 microseconds/pixel.
Laplace-filtering using a 3x3 window is reported to take 0.2 seconds for a 512 x 512 image on FLIP-FIP. This makes 0.76 microseconds/pixel. On LUCAS, a 128 x 128 image is treated in 2.61 ms, which makes only 0.15 microseconds/pixel.

9.4.9.6 Conclusion
We note that the processing times presented for LUCAS and those for the other machines are of the same order of magnitude. The comparisons with VAX show that the times are about two orders of magnitude shorter than the times on a sequential computer. We take these figures as an indication that LUCAS has the potential to be a useful tool in image processing.
9.5 CONCLUSIONS

As we noted at the beginning of this chapter, image processing is a large computational area with many different demands. The processing examples that we have treated in this chapter by necessity cover but a small part of the types of computations that an image processing system should be able to perform efficiently. The presented operations are all examples of tasks that require very long execution times when performed on conventional computers. We have shown that they can be solved on LUCAS with a considerable speed-up compared to sequential execution [Svensson83b].
up compared to sequential execution [Svensson83b]. More i m p o r t a n t than the usefulness of the physical machine is the usefulness of the kind of
architecture
that
it represents.
We feel
bit-serial processor arrays in image processing. than
DAP,
squareroot of
CLIP4 the
and MPP,
with
quite convinced that there is a need for LUCAS represents another kind of array
a number
of
PEs that
is in the
order of
image size instead of in the order of the image size itself.
the Our
experience is t h a t using a number of PEs that is equal to the image side and organizing the
PEs
in
one
dimension only,
give
very
straightforward
programming
and simple
If varying image sizes are used~ this organization may have some drawbacks~
and it may
input/output.
be favourable to use a two-dimensional organization as is proposed for LIPP [Danielsson and Ericsson 82],
The two-dimensional organization gives a more intricate neighbourhood
addressing scheme and thus puts stronger demands on the address generating control unit.
Part 4
EPILOGUE
Chapter 10 CONCLUSIONS AND CONTINUED RESEARCH

10.1 GENERAL
The bit-serial, word-parallel working mode is the prime characteristic of the LUCAS processor. We have found that great flexibility and generality is offered by the use of bit-serial processing elements. Treating many bits in parallel in each PE would of course give faster processing in many cases, but often that kind of parallelism could not be utilized. The instruction set would be more complex for the PEs in the bit-parallel case.

The Processing Elements have been found to have the necessary facilities for most tasks, with respect to both the number of flip-flops and the available functions. Sometimes - but surprisingly seldom - the processing would have been faster if more boolean functions had been available.

A minor change that would have improved the performance on some tasks is the following (see Figure 2.7): If the Direct input (D) and the Common input (COM) were interchanged it would still be possible to input one bit from each source simultaneously. But it would also be possible to input one bit on D and at the same time one bit on, say, the "Above" input, which would make vertical differentiation faster.

To increase the processing speed of a processor array there are two ways to follow. One is to increase the number of processors. The other is to make the processors more powerful, which can be done without abandoning the bit-serial working mode. There are application areas where the first approach is advantageous. Data base processing is probably such an area [Lindh et al.84]. However, signal and image processing may benefit more from improving the power of the processors. As we noted in some of the examples on image processing, a counter that could be incremented and decremented in one clock cycle would add significantly to the performance. The counter function could be integrated with an index register function. The latter would be useful to "shift" data a different number of bits in different memory words - necessary e.g. in floating point operations - and also for table look-up. One more function often needed is multiplication. In MPP [Batcher82] and PROPAL 2 [Cimsa79] it is speeded up through the use of a shift register to hold the partial products in the processing elements.
10.2 A PROPOSAL FOR A MORE POWERFUL PE ARCHITECTURE
10.2.1 The New Design

In [Ohlsson84a, Ohlsson84b] a new PE architecture to suit signal processing applications is proposed. In these applications the operation being the prime candidate for PE support is multiplication. Multiplication on LUCAS of b-bit operands requires approximately 3*b^2 bit-slices to be sent between the memory and the processors. This is quite a lot compared to the 4*b memory-processor transfers required just to read the operands and to store the result. The use of shift registers in the PEs to hold the partial products makes the constant of proportionality drop from 3 to slightly above 1, but execution time is still quadratic with respect to the number of bits, which seems to be a fairly small pay-off.

Ohlsson's approach is to add some extra logic to the shift registers to make them bit-serial multipliers. A bit-serial multiplier is a cellular structure with bit-serial input and output; it uses the principle of carry save addition to compute the sum of partial
products. The proposed multiplier is shown in Figure 10.1. It is based on a carry-save adder shown in [Gosling80] for multiplication of unsigned integers, modified for two's complement represented numbers. One array of flip-flops, the M flip-flops, is used to hold the bits of the multiplicand. The partial product is contained in the S and C flip-flops. The S flip-flop of one cell holds the sum bit generated by the full adder in that cell and the C flip-flop holds the carry bit. The sum bit is propagated to the neighbouring cell to the left, whereas the carry bit is fed back into the same cell. It is operated by first shifting in the multiplicand, most significant bit first, into the array of M flip-flops. The bits of the multiplier are then successively applied to the input, least significant bit first, and the product bits appear at the output, also least significant bit first. This mode of operation, the bits of the multiplicand being applied in reversed order compared to those of the multiplier and the product, is sometimes considered unfavourable. But we assume that the address processor can deliver bit-slice addresses in arbitrary order, why this argument is of no concern.
Figure 10.1. The bit-serial multiplier.
The function of the bit-serial multiplier can be described as follows: Let the cells be numbered from zero to n-1 from left to right. At time t=0 the least significant bit of the multiplier is applied at the input. The full adder function (sum and carry) computed by cell number i at time t is:

    FA(i,t) = S(i+1,t-1) + C(i,t-1) + a(i)*b(t)

The bit produced at the output at time t is thus S(0,t), which is the t:th bit of the product. Refer to [Gosling80] for a more detailed description.
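For concreteness, the cell recurrence can be simulated in C as follows (our sketch, for unsigned operands only; the sign extension described below is omitted):

    #include <stdio.h>

    #define NCELLS 8   /* number of multiplier cells */

    static int M[NCELLS], S[NCELLS + 1], C[NCELLS];  /* S[NCELLS] stays 0 */

    /* One clock step of the carry-save array: each cell i computes
       FA(i,t) = S(i+1,t-1) + C(i,t-1) + a_i * b_t.  The sum bit moves
       one cell to the left, the carry stays in its cell; the returned
       value, the new S(0), is the product bit emitted at this step. */
    static int step(int b_t)
    {
        int newS[NCELLS], newC[NCELLS];
        for (int i = 0; i < NCELLS; i++) {
            int sum = S[i + 1] + C[i] + (M[i] & b_t);
            newS[i] = sum & 1;
            newC[i] = sum >> 1;
        }
        for (int i = 0; i < NCELLS; i++) { S[i] = newS[i]; C[i] = newC[i]; }
        return S[0];
    }

    int main(void)
    {
        unsigned a = 13, b = 11, product = 0;

        for (int i = 0; i < NCELLS; i++)        /* load the multiplicand;  */
            M[i] = (a >> i) & 1;                /* on LUCAS this is the    */
                                                /* MSB-first SHFTM loading */
        for (int t = 0; t < 2 * NCELLS; t++) {  /* feed multiplier bits    */
            int b_t = (t < NCELLS) ? (b >> t) & 1 : 0;  /* LSB first, then 0 */
            product |= (unsigned)step(b_t) << t;        /* collect output    */
        }
        printf("%u * %u = %u\n", a, b, product);        /* prints 13 * 11 = 143 */
        return 0;
    }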
If a is represented with n bits and b is represented with m bits, the time required for multiplication is n clock cycles to load a, plus m clock cycles to apply each of the bits of b, plus (n+m-1) clock cycles to store the bits of the product. The execution time thus equals the number of required memory accesses.
equals t h e n u m b e r of required memory accesses. Sign extension of the partial product is accomplished by letting the sign bit be fed back to one of the inputs to the (n-1):th full adder. the figure)
to
all
the d-elements of
multiplicand is provided.
By having a broadcast line (not shown in
the multiplicand register sign extension of the
The sign bit of the multiplier is extended by letting it remain
on the input while the most significant bits of the produet are shifted out. accomplished with an external register. the functions listed in Table 10,1,
This can be
The operation of the multiplier is controlled by
288
    Mnemonic   Function
    NOOP       No change
    CLRP       All S and C flip-flops are set to zero
    INITM      All M flip-flops are set to the value on the M-input
    SHFTM      The contents of the M flip-flops are shifted
    SHFTP      The S and C flip-flops are loaded from their inputs

Table 10.1. Multiplier functions
A new PE design with hardware enough only to sufficiently support common operations in signal processing applications is also suggested. The architecture of the new PE is shown in Figure 10.2.

Figure 10.2. The proposed new processing element.
The ALU is smaller than the one in LUCAS. It has three inputs (A, X and D) and two outputs (A and X), which is a minimum since it must be able to perform a full adder function efficiently. The A flip-flop serves as an accumulator register by holding one of the operands (except when the multiplier is used) and storing one of the result bits. To support multiply-and-accumulate operations one of the inputs to the ALU can be taken from the output of the multiplier instead of from the A-register. The X flip-flop is an auxiliary register. In arithmetic operations it holds the carry. The third operand comes from the output of a data selector which serves as the interface to the interconnection network, which will be discussed later. One of the inputs to the data selector comes
from the internal one-bit data bus that is connected to the PE's memory module and the I/O-register. The width of the I/O-register should match the width of the external I/O-channel. The bus can also be supplied with data from the A-register and from the output of the multiplier. Another input to the data selector comes from the B-register, a general purpose register. The primary use of this register is to hold the sign bit of the multiplier when the most significant product bits are being computed. The S flip-flop is the Select register. It is used to control the interconnection network and will be described below.

The PE instruction set contains the multiplier instructions described above, plus the ALU functions given in Table 10.2, plus the instructions in Table 10.3.
    Mnemonic   A            X
    NOP        A            X
    LDA        D            X
    LDAX       DX v AX'     X
    CLRX       A            0
    SETX       A            1
    LDX        A            D
    ADD        S(A,D,X)     C(A,D,X)
    SUB        S(A,D',X)    C(A,D',X)

Table 10.2 ALU-functions
S(x,y,z) denotes the sum function (x+y+z) modulo two, and C(x,y,z) is the carry function (x+y+z) integer divided by two. In the three last functions the A-input can be taken from the multiplier instead of from the A-register. This is denoted by adding a "P" to the mnemonic, e.g. ADDP.

The remaining PE instructions are listed in Table 10.3. They all require only one parameter. The instructions LDB and LDS can have either "AREG" (A-register), "IOREG" (I/O-register) or a memory address as parameter. The instructions STA and STP can have either "IOREG" or a memory address, and IN and OUT can only have a memory address as parameter.
    Mnemonic   Function
    LDB        Load the B-register
    LDS        Load the S-register
    STA        Store the content of the A-register
    STP        Store the output of the multiplier
    OUT        One bit is shifted into the I/O-register
    IN         One bit is shifted out of the I/O-register

Table 10.3. Other PE instructions
We give a few examples of microprograms to illustrate the use of the bit-serial multiplier.

The first microprogram loads one of the multiplication operands, the multiplicand, into the multiplier. The sign bit is first copied into all positions with the INITM operation. Then the remaining bits are shifted in, most significant bit first. NoOfBits is assumed to be less than or equal to the number of cells in the multiplier.

    Microprogram LoadMultiplicand(Source, NoOfBits);
    begin
      Source := Source + NoOfBits - 1;
      INITM(Source, Direct);
      Source := Source - 1;
      iterate NoOfBits-1 times
      begin
        SHFTM(Source, Direct);
        Source := Source - 1;
      end;
    end;
When the multiplicand has been loaded into the multiplier, the actual multiplication can take place. The bits of the other operand are successively applied at the input of the multiplier. The product bits then appear at the output. To avoid transferring the sign bit of the operand several times from memory it is saved in the B-register, and is from there applied to the multiplier input when the most significant bits of the product are shifted out.

    Microprogram IntegerMultiply(Source, Dest, NoOfBits);
    begin
      CLRP;
      iterate NoOfBits-1 times
      begin
        SHFTP(Source, Direct);
        Source := Source + 1;
        STP(Dest);
        Dest := Dest + 1;
      end;
      SHFTP(Source, Direct);
      LDB(Source);
      iterate NoOfBits-1 times
      begin
        STP(Dest);
        Dest := Dest + 1;
        SHFTP(Dummy, B);
      end;
      STP(Dest);
    end;
The last example demonstrates how multiplication of a field can be combined with addition to another field.

    Microprogram FixMultiplyAdd(MulSource, AddSource, Dest, NoOfBits);
    begin
      CLRP;
      CLRX;
      iterate NoOfBits-1 times
      begin
        SHFTP(MulSource, Direct);
        MulSource := MulSource + 1;
      end;
      SHFTP(MulSource, Direct);
      LDB(MulSource);
      ADDP(AddSource, Direct);
      AddSource := AddSource + 1;
      LDA(One, Direct);
      SHFTP(Dummy, B);
      iterate NoOfBits-1 times
      begin
        STA(Dest);
        Dest := Dest + 1;
        SHFTP(Dummy, B);
        ADDP(AddSource, Direct);
        AddSource := AddSource + 1;
      end;
      STA(Dest);
    end;
10.2.2 Execution times with the new design

With the new design, application programs involving multiplications are executed significantly faster. The execution time for one iteration of an n-point FFT on an n/2 PE array is now 26*b clock cycles (b is the number of data bits), and the total execution time is thus 26*b*log2(n) cycles [Ohlsson84a]. On a 128 PE array the execution time for a 256-point FFT with 16-bit data becomes 0.66 ms assuming a 5 MHz clock. The time on the existing LUCAS with 5 MHz clock is 9.1 ms.

Multiplication of two 128 by 128 element matrices of b-bit data on the new architecture takes 2^14 * 4b clock cycles, compared to 2^14 * (b^2 + 10b) on the existing LUCAS. For b=8 the time is reduced from 0.5 to 0.1 seconds. For b=16 the time is reduced from 1.4 seconds to 0.2 seconds.
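The quoted times can be recomputed as follows (our C sketch, 5 MHz clock assumed):

    #include <stdio.h>
    #include <math.h>

    /* fft cycles follow the 26*b*log2(n) figure from [Ohlsson84a];
       matrix multiplication cycles are 2^14 * 4b on the new design
       versus 2^14 * (b^2 + 10b) on the existing LUCAS. */
    int main(void)
    {
        double clk = 5e6;
        double fft = 26.0 * 16 * log2(256);   /* 256-point FFT, 16-bit data */
        printf("FFT: %.0f cycles = %.2f ms\n", fft, 1e3 * fft / clk);

        for (int b = 8; b <= 16; b += 8) {
            double new_c = 16384.0 * 4 * b;
            double old_c = 16384.0 * (b * b + 10 * b);
            printf("matmul b=%2d: %.2f s (new) vs %.2f s (LUCAS)\n",
                   b, new_c / clk, old_c / clk);
        }
        return 0;
    }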
In [Ohlsson84a] the execution times for FFT, convolution and matrix multiplication on both the existing LUCAS and the proposed new architecture are compared with those of a pipelined sequential processor capable of performing one multiplication and one addition/subtraction on 16-bit data words every clock cycle. Provided that the problem size is large enough, a parallel processor can of course be made faster than the sequential processor by equipping it with sufficiently many processing elements. Table 10.4 shows the number of PEs required to make the parallel processor - the LUCAS architecture or the proposed new one - as fast as the sequential processor when the precision is 16 bits.
                            Improved architecture   LUCAS   Ratio
    FFT                              69              1135    16.3
    FIR-filter                       48                -       -
    IIR-filter                      128                -       -
    Matrix mult. p x p               64               416     6.5
    Matrix mult. √p x √p            192               944     4.9

Table 10.4 The number of PEs required to make the parallel processor as fast as a pipelined sequential processor with the same clock rate
It should be noted that the comparison is coarse: the wordlength is chosen to fit the pipelined processor and the problem size is chosen to fit the parallel architecture. However, it can be concluded that, in spite of its bit-serial working mode, the kind of processing architecture that we discuss in this book is competitive also in signal processing applications of moderate size if special care is taken to make multiplication faster.
10.3 VLSI IMPLEMENTATION OF THE PROCESSOR ARRAY

An important feature of today's technology as compared to the technology at hand when the von Neumann computer model was suggested is that memory and processing logic are now made using the same technique. Therefore there is no reason to distinctly separate memory from logic. In other words, from a pure technological point of view the use of the memory for data processing is reasonable. In a processor of the LUCAS kind the distinction between memory and processing logic is not as distinct as in sequential computers. This suggests that the use of large scale integration technology has extraordinary advantages in such processors.

Due to its regular structure the kind of processor that we discuss in this book is very well suited for VLSI implementation. As part of a multi-project chip the logic of one Processing Element (excluding memory) was in fact implemented in CMOS/SOS by the project group in 1981.

We will investigate what the consequences of integrating many processing elements on one chip would be in terms of number of gate functions and number of pins per chip.
10.3.1 Off-chip memory

We first consider the consequences of using ordinary read/write memory chips for the memory modules. We further assume that we want exactly the facilities now implemented in LUCAS. This means that for the interconnection each processing element needs one input from above, one from below, one for shuffle and one for shuffle+exchange. We further assume that the control signals for I/O and data registers, multiplexer and ALU functions are gathered to a single "instruction code", k bits wide. In LUCAS, 8 bits would be appropriate to implement the present possibilities. Finally, we assume b bits wide I/O data registers. In LUCAS, b is 8.

Table 10.5 describes the pin functions needed on a chip comprising n processing elements. Table 10.6 lists the number of pins for different values of n for two different combinations of b and k. The first combination, b=8 and k=8, represents what is implemented on the current LUCAS. b=16 means improving the I/O rate by a factor of two. k=12 means increasing the number of ALU functions significantly.
    Specification                   pins
    I/O data bus                    b
    I/O write                       1
    I/O data register address       log2n
    Chip select                     1
    Select First chain              2
    Data in/out                     n
    Common Register output          1
    Shuffle input                   n
    Above/Below                     2
    Instruction code                k
    Power, Ground, Clock            3

    Sum:                            b+k+10+2n+log2n

Table 10.5 The number of pins of an n processor chip
            (I) b=8, k=8    (II) b=16, k=12
    n=1          28               40
    n=2          31               43
    n=4          36               48
    n=8          45               57
    n=16         62               74
    n=32         95              107
    n=64        160              172

Table 10.6 The number of pins for different values of n assuming (I) 8-bit data and 8-bit instruction code and (II) 16-bit data and 12-bit instruction code
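Table 10.6 follows directly from the pin budget of Table 10.5; a small C sketch (ours) that reproduces it:

    #include <stdio.h>

    /* Pin count of an n-PE chip: b + k + 10 + 2n + log2(n), with b the
       I/O data width and k the instruction code width. */
    int pins(int n, int b, int k)
    {
        int log2n = 0;
        while ((1 << log2n) < n) log2n++;   /* log2(n) address pins */
        return b + k + 10 + 2 * n + log2n;
    }

    int main(void)
    {
        for (int n = 1; n <= 64; n *= 2)
            printf("n=%2d: %3d pins (b=8,k=8)  %3d pins (b=16,k=12)\n",
                   n, pins(n, 8, 8), pins(n, 16, 12));
        return 0;
    }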
The number of gate functions needed to implement the processors is totally dominated by the logic required to implement the arithmetic/logic unit. Assuming that k1 bits, out of the k instruction bits, are needed to specify the function, and assuming f flip-flops in each PE, one bit from Memory and one from Common,

    G1 = f * 2^(f+k1+1+1)

memory cells are needed to implement the ALU of one PE as a ROM. (This is an upper limit, since the number can be reduced considerably if a PLA structure is used instead of a ROM.) In LUCAS, we have f=4 and k1=5, which gives
GI(LUCAS)
= 4 * 211
In an n processor chip,
= 213
nG 1 memory cells are required.
Table 10.7 lists nG 1 for
different values of f and k 1. Using these tables we can now choose parameters and a value of n that give values of cell count and number of pins within the l i m i t of available technology, f=2
f=4
f=4
k1=4
k1=5
k1=7
n=l
29
213
215
n:2 n:4
210
214
216
211
215
217
n:8
212
216
218
n=t6
213
217
2t9
n:32
214
218
220
n=64
215
219
221
Table 10.7 The number of memory ceils required for an n-PE chip with f flip-flops per PE and 2kl functions
For example
b=8, k=8, f=4, k1=5 as in LUCAS,
would make i t possible to put 32 PEs in a 95 (or maybe 96) pin chip
comprising 218 (=256k) celts which is possible with current VLSI technology. We assumed the memory modules were outside these chips. word length of 64 kbits.
Suppose we want a memory
We could then use memory chips of 6/4 K 8-bit words.
these would be required to support one 32-PE chip of the above kind. with 80 chips would then have 512 Processing Elements,
Four of
A circuit board
each with 64 kbits of memory.
Some additional chips for I/O address decoding and buffering would be needed on each board. A particular problem appears with the perfect shuffle/exchange network - if this is the one chosen.
We would like to be able to use many of the 512-PE boards together,
be able to perform shuffle permutation on the total of PEs.
and
Can this be implemented
without rewiring the whole network when new boards are added? The answer is yes - with some loss in efficiency. an internal perfect
shuffle/exchange network,
If each 512-PE board is equipped with a 102/4 PE shuffle permutation can be
performed in twice the time if the two boards can exchange data over a 512 bit bus.
In
296
general,
if m boards are used,
a 512 x m shuffle can be made in a t i m e of m shuffles.
(It is assumed that an individual PE can choose the shuffle or the shuffle/exchange input based on e.g.
the tag contents).
10.3.2 On-chip memory

We next consider the case of including the memory modules in the PE chips. If we want to equip the PEs with index registers for addresses, this alternative should probably be chosen. Otherwise, the PEs must output the memory addresses.

To provide an address for the bit slice of the memory, address pins are needed. We assume 2^m-bit memory words, requiring m address bits. Furthermore, a write control signal is needed. Thus, the pin count exceeds what we had in Table 10.6 with m+1. With m=16, the 16 processors/chip case would require 79 pins/chip. A board with 64 such chips would thus have 1024 processors in total.

The number of gates needed for the memory modules would dominate the gate count in such a chip. Assuming n processors with 2^m bits of memory each, n*2^m memory cells are needed. For n=16 and m=16, this makes 2^20, which is one million. The cell count for the PE part, according to Table 10.7, ranges between 2^13 and 2^19, depending on complexity.

We conclude that, with memory on the chip, it is probably the number of gate functions that puts the limit on how many processors can be implemented on one chip.

Before leaving this example we also point to the attractive possibility of using read/write memory to implement the arithmetic/logic unit. Loading of the ALU memory can e.g. be done using the memory address pins and the I/O data pins.
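As a check on the figures above, a few lines of Python (ours) recompute the 79-pin and one-million-cell numbers. The closed form b + k + 10 + 2n + log2(n) is our inference from the values in Table 10.6, since Table 10.5 itself is not reproduced here:

from math import log2

def pins_with_network(n, b=8, k=8):
    # Pin counts of Table 10.6, column (I); the tabulated values
    # fit b + k + 10 + 2n + log2(n) (inferred, see above).
    return b + k + 10 + 2 * n + int(log2(n))

m = 16                                   # 2^16-bit memory words
print(pins_with_network(16) + m + 1)     # 79 pins for 16 PEs/chip
print(16 * 2 ** m)                       # 2^20, one million cells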
10.3.3 No interconnection network

In the cases considered above we have assumed that communication is needed between processing elements. In some applications this is not required. Relational data base management is the prime example. We end up by considering the consequences of this for VLSI implementation.

The number of pins required for an n-PE chip will be (cf. Table 10.5)

b + k + 10 + log2(n) + m + 1

where b is the I/O data bus width, k is the instruction code length, and m is the memory address length. (We assume memory on chip.) Table 10.8 lists this number for different values of the parameters. We can see that the pin count is very low, even with very many PEs on the chip.

          b=8, k=8, m=12    b=16, k=8, m=16    b=32, k=8, m=16
n=1       39                51                 67
n=4       41                53                 69
n=16      43                55                 71
n=64      45                57                 73
n=256     47                59                 75
n=1024    49                61                 77

Table 10.8  The number of pins of an n-processor chip without external interconnection. b is the I/O data width, k is the length of the instruction code, and m is the length of the memory address.
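Table 10.8 follows directly from the pin formula; the following Python lines (our addition) regenerate it:

from math import log2

def pins_no_network(n, b, k, m):
    # The pin formula above: b + k + 10 + log2(n) + m + 1.
    return b + k + 10 + int(log2(n)) + m + 1

# One row per n, one column per parameter set of Table 10.8.
for n in (1, 4, 16, 64, 256, 1024):
    print(n, pins_no_network(n, 8, 8, 12),
             pins_no_network(n, 16, 8, 16),
             pins_no_network(n, 32, 8, 16))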
Clearly, it is the number of gate functions that puts the limit on what is implementable in this case. The number of memory cells per processor is 2^m for the memory words and f * 2^(f+k1+1) for the PEs. Assuming f=4, i.e. each PE having four flip-flops, the two counts have the same value if m=k1+7. Assuming f=2, the two counts are equal if m=k1+4. k1=4 is probably sufficient in this case. Thus, we conclude that the number of gate functions needed to implement the memory modules will again be totally dominating.

Assume that we can have 2^20 (1 million) memory cells on a chip, and that the word length is chosen to be 2^12 bits, which is probably sufficient for data base processing. Then 256 processors could be implemented on a single chip. A data base processor with thousands
of processing elements would be easily implemented with these circuits.
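The balance between PE logic and memory claimed above is easy to verify; the following Python assertions (ours, with k1=5 chosen only as a sample value) check the m=k1+7 and m=k1+4 conditions and the 256-processor figure:

def pe_cells(f, k1):
    # ALU cells per PE in this case: f * 2^(f+k1+1).
    return f * 2 ** (f + k1 + 1)

# PE logic matches the 2^m-bit memory word when m = k1+7 (f=4)
# and when m = k1+4 (f=2), as stated above:
assert pe_cells(4, 5) == 2 ** (5 + 7)
assert pe_cells(2, 5) == 2 ** (5 + 4)

# With 2^20 cells per chip and 2^12-bit words, memory alone
# admits 2^20 / 2^12 processors on one chip:
print(2 ** 20 // 2 ** 12)    # 256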
10.4 FINAL WORDS

The research presented in this book is intended to explore the possibilities offered by the concept of an Associative Array in different application areas. We feel that it has been a great advantage to have available a working associative array computer. In the course of the project we have found that the range of applicability is wider than we first expected. We have found effective solutions to many problems that were not considered when the architecture was decided. Surprisingly, it has often been very easy to map the problems on the machine, contradicting the opinion that a parallel architecture with a parallel structure of the interconnection scheme is effective only on a very limited range of applications.
Appendix 1 ALU Functions

The ALU implements 32 different functions. Some functions are listed more than once, under different mnemonics, since they naturally belong to more than one function group. Registers not mentioned are left intact. X0=M leaves the X-register intact if, by the Data Select code, M is chosen to be X. + is mod 2 addition, v is OR.

No operation

NOP      No operation (=LXMA)                    X0=M

Set, Clear, Complement

SETT     Set tags                                T0=1    X0=T
COT      Complement tags                         T0=T'   X0=M
SCA      Set C, all                              C0=1    X0=M
CCA      Clear C, all                            C0=0    X0=M
CRA      Clear R, all                            R0=0    X0=M
CORA     Complement R, all                       R0=R'   X0=M
CORT     Complement R, tagmasked (=XORRTA)       Where T=1 do R0=R' elsewhere R0=R   X0=M
Load Registers

LRMA     Load R from M, all                                    R0=M   X0=R
LRMT     Load R from M, tagmasked                              Where T=1 do R0=M elsewhere R0=R   X0=M
LRCA     Load R from C, all                                    R0=C   X0=M
LRTA     Load R from T, all                                    R0=T   X0=M
LTMA     Load T from M, all                                    T0=M   X0=T
LTMT     Load T from M, tagmasked (=CMOT =ANDTMA)              Where T=1 do T0=M elsewhere T0=T=0   X0=T
LTRA     Load T from R, all                                    T0=R   X0=M
LTRT     Load T from R, tagmasked (=CROT =ANDTRA)              Where T=1 do T0=R elsewhere T0=T=0   X0=M
LTMIT    Load T from M inverted, tagmasked (=CMZT =ANDTMIA)    Where T=1 do T0=M' elsewhere T0=T=0   X0=T
LTRIT    Load T from R inverted, tagmasked (=CRZT =ANDTRIA)    Where T=1 do T0=R' elsewhere T0=T=0   X0=M
LCRA     Load C from R, all                                    C0=R   X0=M
LXMA     Load X from M, all (=NOP)                             X0=M
XRT      Exchange R and T                                      T0=R   R0=T   X0=M
Compare (Result in T)

CRZT     Compare R to Zero, tagmasked (=LTRIT =ANDTRIA)    Where T=1 and R=0 do T0=1 elsewhere T0=0   X0=M
CROT     Compare R to One, tagmasked (=LTRT =ANDTRA)       Where T=1 and R=1 do T0=1 elsewhere T0=0   X0=M
CRMT     Compare R to M, tagmasked                         Where T=1 and R=M do T0=1 elsewhere T0=0   X0=T
CRCT     Compare R to COM, tagmasked                       Where T=1 and R=COM do T0=1 elsewhere T0=0   X0=M
CMZT     Compare M to Zero, tagmasked (=LTMIT =ANDTMIA)    Where T=1 and M=0 do T0=1 elsewhere T0=0   X0=T
CMOT     Compare M to One, tagmasked (=LTMT =ANDTMA)       Where T=1 and M=1 do T0=1 elsewhere T0=0   X0=T
CMCT     Compare M to COM, tagmasked                       Where T=1 and M=COM do T0=1 elsewhere T0=0   X0=T
Logical

ANDTRA   AND T with R, all (=LTRT =CROT)              T0 = T AND R    X0=M
ANDTMA   AND T with M, all (=LTMT =CMOT)              T0 = T AND M    X0=T
ANDTMIA  AND T with M inverted, all (=LTMIT =CMZT)    T0 = T AND M'   X0=T
ANDTRIA  AND T with R inverted, all (=LTRIT =CRZT)    T0 = T AND R'   X0=M
ANDRMA   AND R with M, all                            R0 = R AND M    X0=R
ORRMA    OR R with M, all                             R0 = R OR M     X0=R
XORRMA   XOR R with M, all                            R0 = R XOR M    X0=R
XORRTA   XOR R with T, all                            R0 = R XOR T    X0=M
Arithmetic

ADMA     Add M to R with carry, all
         R0=M+R+C   C0=MRvMCvRC   X0=Overflow = MR(R0)' v M'R'(R0)

ADMIA    Add M inverted to R with carry, all
         R0=M'+R+C   C0=M'RvM'CvRC   X0=Overflow = M'R(R0)' v MR'(R0)

ASMT     Add/sub M to/from R with carry where T=1/0
         Where T=1: same as ADMA
         Where T=0: R0=M+R+C   C0=R'MvC(R+M)'   X0=Overflow = R'M(R0) v RM'(R0)'

ACMA     Add COM to M with carry, all
         R0=COM+M+C   C0=(COM)Mv(COM)CvMC   X0=Overflow = M(COM)(R0)' v M'(COM)'(R0)

ACIMA    Add COM inverted to M with carry, all
         R0=(COM)'+M+C   C0=(COM)'Mv(COM)'CvMC   X0=Overflow = M(COM)'(R0)' v M'(COM)(R0)

ACMIA    Add COM to M inverted with carry, all
         R0=COM+M'+C   C0=(COM)M'v(COM)CvM'C   X0=Overflow = M'(COM)(R0)' v M(COM)'(R0)
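To clarify the bit-serial use of these equations, here is a small Python model (our illustration, not LUCAS microcode) of one ADMA step in a single PE; note that the overflow output X is only meaningful after the last (sign) bit has been processed:

def adma_step(M, R, C):
    # One bit-serial step of ADMA, directly from the equations
    # above. Arguments are 0/1; "+" in the book is mod-2
    # addition, i.e. XOR.
    R0 = M ^ R ^ C                                        # sum bit
    C0 = (M & R) | (M & C) | (R & C)                      # carry out
    X0 = (M & R & (1 - R0)) | ((1 - M) & (1 - R) & R0)    # overflow
    return R0, C0, X0

# Adding the LSB-first bit streams of 3 (binary 11) and 1 (binary 01):
C = 0
for M, R in ((1, 1), (1, 0)):
    R0, C, X = adma_step(M, R, C)
    print(R0, C, X)    # sum bits 0, 0 with final carry 1: 3 + 1 = 100b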
Appendix 2 LUCAS Microprogramming Language

COMPILATION UNITS, MODULES AND MICROPROGRAMS

<compilation unit> ::= <module> .

<module> ::= module <...> ; <declaration part> <submodule part> endmod

<declaration part> ::= <empty> | <...> | <...> | <subroutine declaration> | <...> | <...> <subroutine declaration> | <...> <subroutine declaration> | <...> <...> <subroutine declaration>

<submodule part> ::= <empty> | <module> /;<module>/ | <microprogram> /;<microprogram>/
<microprogram> ::= <microprogram heading> <statement part>

<microprogram heading> ::= microprogram <...> ; | microprogram <...> ( <microprogram parameter list> ) ;

<microprogram parameter list> ::= <microprogram parameter> /,<microprogram parameter>/

<microprogram parameter> ::= <empty> | <...>

<...> ::= <...> = <sign> <...> ;

<...> ::= var <...> /,<...>/ ;

<subroutine declaration> ::= <subroutine heading> <statement part>

<subroutine heading> ::= subroutine <...> ; | subroutine <...> ( <...> ) ;

<...> ::= <...> /,<...>/

<...> ::= <empty> | <...> | <...>
STATEMENT PART

<statement part> ::= begin <statement list> end

<statement list> ::= <statement> /;<statement>/

<statement> ::= <empty> | <subroutine call> | <...> | <stack operation> | <...> | begin <statement list> end | <exit statement> | <...> | <...>

<...> ::= <...> | <...> ( <...> ) | <...> ( <...> , <...> )

<...> ::= DIRECT | SHUFFLE | NSHUFFLE | ABOVE | BELOW

<subroutine call> ::= call <...> | call <...> ( <...> )

<...> ::= <...> /,<...>/

<...> ::= <...> | <sign> <...>

<...> ::= <...> := <sign> <...> | <...> := <...> | <variable1> := <variable1> <arithmetic operator> <...> | <...> := <variable1> <arithmetic operator> <...>

<stack operation> ::= SPUSH ( <...> ) | SPOP ( <...> )

<...> ::= if <...> then <statement> | if <...> then <statement> else <statement>

<exit statement> ::= exit | exit ( <...> )

<...> ::= <while statement> | <...> | <...>

<...> ::= <empty> | <...>

<while statement> ::= while <...> do <statement>

<...> ::= repeat <statement list> until <...>

<...> ::= iterate <...> times <statement>

<...> ::= <...> | 0 | TRUE | FALSE | SOME | NONE | ZMASK(<...>) | NZMASK(<...>)

<...> ::= <...> | <...>

<sign> ::= <empty> | -

<arithmetic operator> ::= + | -

<...> ::= = | <...>
PE INSTRUCTION SET

The PE instructions embrace operations performed on the registers and on the memory in the Processing Elements. The instructions are of three kinds:

Without parameters. These instructions use the PE registers as operands and leave the result in the registers.

With one parameter. These instructions either use the Common Register as an operand or store the R register in the PE memory. The parameter specifies the PE memory address.

With two parameters. These are instructions where one of the operands comes from the interconnection network. The first parameter gives the PE memory address of the source bit. The second parameter specifies the permutation of data over the network.

The PE instruction set may be changed by reprogramming the ALU PROMs. The instruction list below describes the current instruction set. In the list the following conventions are used:
Several of the instructions exist in two versions: a tag-masked instruction (the instruction name ends with the letter "T"), which affects only the selected PEs, and a non-tag-masked instruction (the name ends with the letter "A", for "all"), which affects all PEs.

Arithmetic operations: R receives the result, C the carry, and X the overflow (used only when the last bit has been processed). The previous value of the C register is used as the incoming carry.

The results of the compare instructions affect the Tag in the following way: a Tag which has the value "zero" is not affected. A Tag which is "one" gets the value "zero" if the compare fails.
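As an illustration of this tag behaviour (ours, not from the original text), a compare such as CRZT can be modelled over the whole PE array in a few lines of Python:

def crzt(T, R):
    # CRZT over the array: the tag stays 1 only where it was
    # already 1 and the compared bit R is 0; tags that are 0
    # are never set back to 1.
    return [t & (1 - r) for t, r in zip(T, R)]

print(crzt([1, 1, 0, 0], [0, 1, 0, 1]))    # [1, 0, 0, 0]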
PE INSTRUCTIONS WITHOUT PARAMETERS

Load/Exchange Register

LTRA     Load T from R
LTRT     Load T from R
LTRIT    Load T from R' (R' stands for R-inverted)
LTXA     Load T from X                               T to X
LTXT     Load T from X                               T to X
LTXIT    Load T from X'                              T to X
LRTA     Load R from T
LRCA     Load R from C
LRXA     Load R from X                               R to X
LRXT     Load R from X                               R to X
LCRA     Load C from R
XRT      Exchange R and T

Set/Reset/Complement Register

STA      Set T
SCA      Set C
CCA      Clear C
CRA      Clear R
COTA     Complement T
CORA     Complement R
CORT     Complement R
SELF     SELECT FIRST. Clear T in all PEs with number > i, where PE no. i is the first PE where T is One

Compare

CRZT     Compare R to Zero
CROT     Compare R to One
CRXT     Compare R to X                              T to X
CXOT     Compare X to One                            T to X
CXZT     Compare X to Zero                           T to X

Logical

ANDTRA   T AND R to T
ANDTRIA  T AND R' to T
ANDRXA   R AND X to R                                R to X
ANDTXA   T AND X to T                                T to X
ANDTXIA  T AND X' to T                               T to X
ORRXA    R OR X to R                                 R to X
XORRTA   R XOR T to R
XORRXA   R XOR X to R                                R to X

Arithmetic

ADXA     Add X to R
ADXIA    Add X' to R
ASXT     Add/Subtr X To/From R where T=1/0
SUXA     Subtr X from R
PE INSTRUCTIONS WITH ONE PARAMETER

The parameter specifies a bit address to the PEs.

WRRA     Write R into PE memory
WRRT     Write R into PE memory
CRCT     Compare R to Common
CXCT     Compare X to Common                         T to X
ACXA     Add X to Common
SCXA     Subtr Common from X
PE INSTRUCTIONS WITH TWO PARAMETERS

The first parameter specifies a bit address to the PEs. The second parameter specifies a permutation of the bit-slice, which is performed before the data enters the PE ALU. In the instruction list below, "M" is used to denote incoming data from the Interconnection Network. Possible permutations are:

DIRECT     No permutation
SHUFFLE    The bit-slice is shuffled (see section 1.6.2)
NSHUFFLE   The bit-slice is shuffled, then exchanged
ABOVE      The bit-slice is rotated one step down. Data to PE no. i comes from PE no. (i-1) mod 128
BELOW      The bit-slice is rotated one step up
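For concreteness, the two rotations can be modelled in Python as follows (our sketch; the real permutation is performed by the interconnection network hardware, and the (i+1) direction for BELOW is our inference from symmetry, since the text only gives the formula for ABOVE):

def above(bit_slice):
    # ABOVE: rotate one step down; PE i receives the bit
    # from PE (i-1) mod len(bit_slice).
    return bit_slice[-1:] + bit_slice[:-1]

def below(bit_slice):
    # BELOW: rotate one step up; PE i receives the bit
    # from PE (i+1) mod len(bit_slice) (inferred direction).
    return bit_slice[1:] + bit_slice[:1]

print(above([1, 2, 3, 4]))    # [4, 1, 2, 3]
print(below([1, 2, 3, 4]))    # [2, 3, 4, 1]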
Load Register

LRMA     Load R from M                               R to X
LRMT     Load R from M                               R to X
LTMA     Load T from M                               T to X
LTMT     Load T from M                               T to X
LTMIT    Load T from M'                              T to X
LXMA     Load X from M
Compare

CMOT     Compare M to One                            T to X
CMZT     Compare M to Zero                           T to X
CMCT     Compare M to Common                         T to X
CRMT     Compare R to M                              T to X

Logical

ANDRMA   R AND M to R                                R to X
ANDTMA   T AND M to T                                T to X
ANDTMIA  T AND M' to T                               T to X
ORRMA    R OR M to R                                 R to X
XORRMA   R XOR M to R                                R to X
Arithmetic

ADMA     Add M to R
ADMIA    Add M' to R
ACMA     Add M to Common
ASMT     Add/Subtr M To/From R where T=1/0
ACIMA    Add Common' to M
ACMIA    Add Common to M'
Appendix 3 Pascal/L - SYNTAX IN BNF

DATA DECLARATIONS

<selector type> ::= selector [<...>] | selector [<...>] := <...>

<...> ::= <...> .. <...>

<...> ::= <...> => <...>

<...> ::= <...> | <...> .. <...> | <...> .. <...> step <...>

<...> ::= parallel array [<...>] of <parallel type> | parallel array [<...>,...] of <parallel type>

<parallel type> ::= <parallel type identifier> | <parallel standard type> | record <parallel field list> end

<parallel type identifier> ::= <...>

<parallel standard type> ::= integer(<...>) | unsigned integer(<...>) | Boolean | fixed(<...>,<...>) | char | string(<...>)

<parallel field list> ::= <parallel record section> /;<parallel record section>/

<parallel record section> ::= <...> /,<...>/ : <parallel standard type>

MICROPROGRAM DECLARATION

<microprogram declaration> ::= microprogram <...> <microprogram parameter list> ; external ;

<microprogram parameter list> ::= <empty> | ( <microprogram parameter> /,<microprogram parameter>/ )

<microprogram parameter> ::= <empty> | <...>

INDEXING

<...> ::= <parallel variable identifier> | <parallel variable identifier> [ <...> ] | <parallel variable identifier> [ <...> , <expression> ]

<parallel variable identifier> ::= <...>

<...> ::= * | <...> | <...> .. <...> | <selector expression>

STATEMENTS

<where statement> ::= where <selector expression> do <statement> | where <selector expression> do <statement> elsewhere <statement>

<parallel case statement> ::= case where <parallel expression> of <...> /;<...>/ ; <...> end

<...> ::= <...> /,<...>/ : <statement>

<...> ::= <empty> | others : <statement>

<while and where statement> ::= while and where <selector expression> do <statement>