SIGNALS AND COMMUNICATION TECHNOLOGY
For other titles published in this series, go to http://www.springer.com/series/4748
Keith Jones
The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments
Dr. Keith Jones, L-3 Communications TRL Technology, Shannon Way, Ashchurch, Tewkesbury, Gloucestershire, GL20 8ND, U.K.
ISBN 978-90-481-3916-3
e-ISBN 978-90-481-3917-0
DOI 10.1007/978-90-481-3917-0
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number: 2009944070
© Springer Science+Business Media B.V. 2010
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Cover design: WMXDesign GmbH
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Most real-world spectrum analysis problems involve the computation of the real-data discrete Fourier transform (DFT), a unitary transform that maps elements of the linear space of real-valued N-tuples, R^N, to elements of its complex-valued counterpart, C^N, and when carried out in hardware it is conventionally achieved via a real-from-complex strategy using a complex-data version of the fast Fourier transform (FFT), the generic name given to the class of fast algorithms used for the efficient computation of the DFT. Such algorithms are typically derived by exploiting the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well. In order to make effective use of a complex-data FFT, however, via the chosen real-from-complex strategy, the input data to the DFT must first be converted from elements of R^N to elements of C^N.

The reason for choosing the computational domain of real-data problems such as this to be C^N, rather than R^N, is due in part to the fact that computing equipment manufacturers have invested so heavily in producing digital signal processing (DSP) devices built around the design of the complex-data fast multiplier and accumulator (MAC), an arithmetic unit ideally suited to the implementation of the complex-data radix-2 butterfly, the computational unit used by the familiar class of recursive radix-2 FFT algorithms. The net result is that the problem of the real-data DFT is effectively being modified so as to match an existing complex-data solution rather than a solution being sought that matches the actual problem. The increasingly powerful field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) technologies are now giving DSP design engineers far greater control, however, over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate algorithmically-specialized hardware solutions to the real-data DFT may be actively sought and exploited to some advantage with these technologies.

The first part of this monograph thus concerns itself with the design of a new and highly-parallel formulation of the fast Hartley transform (FHT) which is to be used, in turn, for the efficient computation of the DFT. The FHT is the generic name given to the class of fast algorithms used for the efficient computation of the discrete Hartley transform (DHT) – a unitary (and, in fact, orthogonal) transform and close relative of the DFT possessing many of the same properties – which,
for the processing of real-valued data, has attractions over the complex-data FFT in terms of reduced arithmetic complexity and reduced memory requirement. Its bilateral or reversal property also means that it may be straightforwardly applied to the transformation from Hartley space to data space as well as from data space to Hartley space, making it thus equally applicable to the computation of both the DFT and its inverse. A drawback, however, of conventional FHT algorithms lies in the loss of regularity (as relates to the algorithm structure) arising from the need for two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations, where the regularity equates to the amount of repetition and symmetry present in the design. A generic version of the double butterfly, referred to as the "GD-BFLY" for economy of words, is therefore developed for the radix-4 FHT that overcomes the problem in an elegant fashion. The resulting single-design solution, referred to as the regularized radix-4 FHT and abbreviated to "R24 FHT", lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with parallel computing technology.

A partitioned-memory architecture for the parallel computation of the GD-BFLY and the resulting R24 FHT is next developed and discussed in some detail, this exploiting a single locally-pipelined high-performance processing element (PE) that yields an attractive solution, particularly when implemented with parallel computing technology, that is both area-efficient and scalable in terms of transform length. High performance is achieved by having the PE able to process the input/output data sets to the GD-BFLY in parallel, this in turn implying the need to be able to access simultaneously, and without conflict, both multiple data and multiple twiddle factors, or trigonometric coefficients, from their respective memories. A number of pipelined versions of the PE are described using both fast fixed-point multipliers and phase rotators – where the phase rotation operation is carried out in optimal fashion with hardware-efficient Co-Ordinate Rotation DIgital Computer (CORDIC) arithmetic – which enable arithmetic complexity to be traded off against memory requirement. The result is a set of scalable designs based upon the partitioned-memory single-PE computing architecture which each yield a hardware-efficient solution with universal application, such that each new application necessitates minimal re-design cost, as well as solutions amenable to efficient implementation with the silicon-based technologies. The resulting area-efficient and scalable single-PE architecture is shown to yield solutions to the real-data radix-4 FFT that are capable of achieving the computational density – that is, the throughput per unit area of silicon – of the most advanced commercially-available complex-data solutions for just a fraction of the silicon resources.

Consideration is given to the fact that when producing electronic equipment, whether for commercial or military use, great emphasis is inevitably placed upon minimizing the unit cost so that one is seldom blessed with the option of using the latest state-of-the-art device technology.
The most common situation encountered is one where the expectation is to use the smallest (and thus the least expensive) device that is capable of yielding solutions able to meet the performance objectives, which often means using devices that are one, two or even three generations behind the latest specification. As a result, there are situations where there would be great merit
in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible to lend themselves to implementation in silicon even when constrained by the limited availability of embedded resources. The designs are thus required to be able to cater for a range of resource-constrained environments where the particular resources being consumed and traded off, one against another, include the programmable logic, the power and the time (update time or latency), as well as the embedded resources already discussed. The choice of which particular FPGA device to use throughout the monograph for comparative analysis of the various designs is not considered to be of relevance to the results obtained, as the intention is that the attractions of the solutions developed should be valid regardless of the specific device onto which they are mapped – that is, a "good" design should be device-independent. The author is well aware, however, that the intellectual investment made in achieving such a design may seem to fly in the face of current wisdom whereby the need for good engineering design and practice is avoided through the adoption of ever more powerful (and power consuming) computing devices – no apologies offered.

The monograph, which is based on the fruits of 3 years of applied industrial research in the U.K., is aimed at both practicing DSP engineers with an interest in the efficient hardware implementation of the real-data FFT and academics/researchers/students from engineering, computer science and mathematics backgrounds with an interest in the design and implementation of sequential and parallel FFT algorithms. It is intended to provide the reader with the tools necessary to both understand the new formulation and to implement simple design variations that offer clear implementational advantages, both theoretical and practical, over more conventional complex-data solutions to the problem. The highly-parallel formulation of the real-data FFT described in the monograph will be shown to lead to scalable and device-independent solutions to the latency-constrained version of the problem which are able to optimize the use of the available silicon resources, and thus to maximize the achievable computational density, thereby making the solution a genuine advance in the design and implementation of high-performance parallel FFT algorithms.

L-3 Communications TRL Technology,
Shannon Way, Ashchurch, Tewkesbury,
Gloucestershire, GL20 8ND, U.K.
Dr. Keith Jones
Acknowledgements
Firstly, and most importantly, the author wishes to thank his wife and partner in crime, Deborah, for her continued support for the project which has occupied most of his free time over the past 12 months or so, time that would otherwise have been spent together doing more enjoyable things. Secondly, given his own background as an industrial mathematician, the author gratefully acknowledges the assistance of Andy Beard of TRL Technology, who has painstakingly gone through the manuscript clarifying those technology-based aspects of the research least familiar to the author, namely those relating to the ever-changing world of the FPGA, thereby enabling the author to provide a more comprehensible interpretation of certain aspects of the results. Finally, the author wishes to thank Mark de Jongh, the Senior Publishing Editor in Electrical Engineering at Springer, together with his management colleagues at Springer, for seeing the potential merit in the research and providing the opportunity of sharing the results with you in this monograph.
Contents

1 Background to Research
   1.1 Introduction
   1.2 The DFT and Its Efficient Computation
   1.3 Twentieth Century Developments of the FFT
   1.4 The DHT and Its Relation to the DFT
   1.5 Attractions of Computing the Real-Data DFT via the FHT
   1.6 Modern Hardware-Based Parallel Computing Technologies
   1.7 Hardware-Based Arithmetic Units
   1.8 Performance Metrics
   1.9 Basic Definitions
   1.10 Organization of the Monograph
   References

2 Fast Solutions to Real-Data Discrete Fourier Transform
   2.1 Introduction
   2.2 Real-Data FFT Algorithms
      2.2.1 The Bergland Algorithm
      2.2.2 The Bruun Algorithm
   2.3 Real-From-Complex Strategies
      2.3.1 Computing One Real-Data DFT via One Full-Length Complex-Data FFT
      2.3.2 Computing Two Real-Data DFTs via One Full-Length Complex-Data FFT
      2.3.3 Computing One Real-Data DFT via One Half-Length Complex-Data FFT
   2.4 Data Re-ordering
   2.5 Discussion
   References

3 The Discrete Hartley Transform
   3.1 Introduction
   3.2 Normalization of DHT Outputs
   3.3 Decomposition into Even and Odd Components
   3.4 Connecting Relations Between DFT and DHT
      3.4.1 Real-Data DFT
      3.4.2 Complex-Data DFT
   3.5 Fundamental Theorems for DFT and DHT
      3.5.1 Reversal Theorem
      3.5.2 Addition Theorem
      3.5.3 Shift Theorem
      3.5.4 Convolution Theorem
      3.5.5 Product Theorem
      3.5.6 Autocorrelation Theorem
      3.5.7 First Derivative Theorem
      3.5.8 Second Derivative Theorem
      3.5.9 Summary of Theorems
   3.6 Fast Solutions to DHT
   3.7 Accuracy Considerations
   3.8 Discussion
   References

4 Derivation of the Regularized Fast Hartley Transform
   4.1 Introduction
   4.2 Derivation of the Conventional Radix-4 Butterfly Equations
   4.3 Single-to-Double Conversion of the Radix-4 Butterfly Equations
   4.4 Radix-4 Factorization of the FHT
   4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly
      4.5.1 Twelve-Multiplier Version of Generic Double Butterfly
      4.5.2 Nine-Multiplier Version of Generic Double Butterfly
   4.6 Trigonometric Coefficient Storage, Accession and Generation
      4.6.1 Minimum-Arithmetic Addressing Scheme
      4.6.2 Minimum-Memory Addressing Scheme
      4.6.3 Trigonometric Coefficient Generation via Trigonometric Identities
   4.7 Comparative Complexity Analysis with Existing FFT Designs
   4.8 Scaling Considerations for Fixed-Point Implementation
   4.9 Discussion
   References

5 Algorithm Design for Hardware-Based Computing Technologies
   5.1 Introduction
   5.2 The Fundamental Properties of FPGA and ASIC Devices
   5.3 Low-Power Design Techniques
      5.3.1 Clock Frequency
      5.3.2 Silicon Area
      5.3.3 Switching Frequency
   5.4 Proposed Hardware Design Strategy
      5.4.1 Scalability of Design
      5.4.2 Partitioned-Memory Processing
      5.4.3 Flexibility of Design
   5.5 Constraints on Available Resources
   5.6 Assessing the Resource Requirements
   5.7 Discussion
   References

6 Derivation of Area-Efficient and Scalable Parallel Architecture
   6.1 Introduction
   6.2 Single-PE Versus Multi-PE Architectures
   6.3 Conflict-Free Parallel Memory Addressing Schemes
      6.3.1 Data Storage and Accession
      6.3.2 Trigonometric Coefficient Storage, Accession and Generation
   6.4 Design of Pipelined PE for Single-PE Architecture
      6.4.1 Internal Pipelining of Generic Double Butterfly
      6.4.2 Space Complexity Considerations
      6.4.3 Time Complexity Considerations
   6.5 Performance and Requirements Analysis of FPGA Implementation
   6.6 Constraining Latency Versus Minimizing Update-Time
   6.7 Discussion
   References

7 Design of Arithmetic Unit for Resource-Constrained Solution
   7.1 Introduction
   7.2 Accuracy Considerations
   7.3 Fast Multiplier Approach
   7.4 CORDIC Approach
      7.4.1 CORDIC Formulation of Complex Multiplier
      7.4.2 Parallel Formulation of CORDIC-Based PE
      7.4.3 Discussion of CORDIC-Based Solution
      7.4.4 Logic Requirement of CORDIC-Based PE
   7.5 Comparative Analysis of PE Designs
   7.6 Discussion
   References

8 Computation of 2^n-Point Real-Data Discrete Fourier Transform
   8.1 Introduction
   8.2 Computing One DFT via Two Half-Length Regularized FHTs
      8.2.1 Derivation of 2^n-Point Real-Data FFT Algorithm
      8.2.2 Implementational Considerations
   8.3 Computing One DFT via One Double-Length Regularized FHT
      8.3.1 Derivation of 2^n-Point Real-Data FFT Algorithm
      8.3.2 Implementational Considerations
   8.4 Discussion
   References

9 Applications of Regularized Fast Hartley Transform
   9.1 Introduction
   9.2 Fast Transform-Space Convolution and Correlation
   9.3 Up-Sampling and Differentiation of Real-Valued Signal
      9.3.1 Up-Sampling via Hartley Space
      9.3.2 Differentiation via Hartley Space
      9.3.3 Combined Up-Sampling and Differentiation
   9.4 Correlation of Two Arbitrary Signals
      9.4.1 Computation of Complex-Data Correlation via Real-Data Correlation
      9.4.2 Cross-Correlation of Two Finite-Length Data Sets
      9.4.3 Auto-Correlation: Finite-Length Against Infinite-Length Data Sets
      9.4.4 Cross-Correlation: Infinite-Length Against Infinite-Length Data Sets
      9.4.5 Combining Functions in Hartley Space
   9.5 Channelization of Real-Valued Signal
      9.5.1 Single Channel: Fast Hartley-Space Convolution
      9.5.2 Multiple Channels: Conventional Polyphase DFT Filter Bank
   9.6 Discussion
   References

10 Summary and Conclusions
   10.1 Outline of Problem Addressed
   10.2 Summary of Results
   10.3 Conclusions

Appendix A Computer Program for Regularized Fast Hartley Transform
   A.1 Introduction
   A.2 Description of Functions
      A.2.1 Control Routine
      A.2.2 Generic Double Butterfly Routines
      A.2.3 Address Generation and Data Re-ordering Routines
      A.2.4 Data Memory Accession and Updating Routines
      A.2.5 Trigonometric Coefficient Generation Routines
      A.2.6 Look-Up-Table Generation Routines
      A.2.7 FHT-to-FFT Conversion Routines
   A.3 Brief Guide to Running the Program
   A.4 Available Scaling Strategies

Appendix B Source Code Listings for Regularized Fast Hartley Transform
   B.1 Listings for Main Program and Signal Generation Routine
   B.2 Listings for Pre-processing Functions
   B.3 Listings for Processing Functions

Glossary

Index
Biography
Keith John Jones is a Chartered Mathematician (C.Math.) and Fellow of the Institute of Mathematics & Its Applications (F.I.M.A.), UK, having obtained a B.Sc. Honours degree in Mathematics from the University of London in 1974 as an external student, an M.Sc. in Applicable Mathematics from Cranfield Institute of Technology in 1977, and a Ph.D. in Computer Science from Birkbeck College, University of London, in 1992, again as an external student. The Ph.D. was awarded primarily for research into the design of novel systolic processor array architectures for the parallel computation of the DFT. Dr. Jones currently runs a mathematical/software consultancy in Weymouth, Dorset, with his wife Deborah, as well as being employed as a part-time consultant with TRL Technology in Tewkesbury, Gloucestershire, where he is engaged in the design and implementation of high-performance digital signal processing algorithms and systems for wireless communications. Dr. Jones has published widely in the signal processing and sensor array processing fields, having a particular interest in the application of number theory, algebra, and nonstandard arithmetic techniques to the design of low-complexity algorithms and circuits for efficient implementation with suitably defined parallel computing architectures. Dr. Jones also holds a number of patents in these fields. Dr. Jones has been named in both “Who’s Who in Science and Engineering” and the “Dictionary of International Biography” since 2008.
Chapter 1
Background to Research
Abstract This chapter provides the background to the research results discussed in the monograph that relate to the design and implementation of the regularized FHT. Following a short historical account of the role of the DFT in modern science, a case is made for the need for highly-parallel FFT algorithms geared specifically to the processing of real-valued data for use in the type of resource-constrained (both silicon and power) environments encountered in mobile communications. The relation of the DHT to the DFT is given and the possible benefits of using a highly-parallel formulation of the FHT for solving the real-data DFT problem are discussed. This is followed by an account of the parallel computing technologies now available via the FPGA and the ASIC with which such a formulation of the problem might be efficiently implemented. A hardware-efficient arithmetic unit is also discussed which can yield a flexible-precision solution whilst minimizing the memory requirement. A discussion of performance metrics for various computing architectures and technologies is then given, followed by an outline of the organization of the monograph.
1.1 Introduction

The subject of spectrum or harmonic analysis started in earnest with the work of Joseph Fourier (1768–1830), who asserted and proved that an arbitrary function could be represented via a suitable transformation as a sum of trigonometric functions [6]. It seems likely, however, that such ideas were already common knowledge amongst European mathematicians by the time Fourier appeared on the scene, mainly through the earlier work of Joseph Louis Lagrange (1736–1813) and Leonhard Euler (1707–1783), with the first appearance of the discrete version of this transformation, the discrete Fourier transform (DFT) [36, 39], dating back to Euler's investigations of sound propagation in elastic media in 1750 and to the astronomical work of Alexis Claude Clairaut (1713–1765) in 1754 [24]. The DFT is now widely used in many branches of science, playing in particular a central role in the field of digital signal processing (DSP) [36, 39], enabling digital signals – namely those that have been both sampled and quantized – to be viewed in the frequency domain
where, compared to the time domain, the information contained in the signal may often be more easily extracted and/or displayed, or where many common DSP functions, such as that of the finite impulse response (FIR) filter or the matched filter [36, 39], may be more easily or efficiently carried out. The monograph is essentially concerned with the problem of computing the DFT, via the application of various factorization techniques, using silicon-based parallel computing equipment – as typified by field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) technologies [31] – bearing in mind the size and power constraints relevant to the particular field of interest, namely that of mobile communications, where a small battery may be the only source of power supply for long periods of time. The monograph looks also to exploit the fact that the measurement data, as with many real-world problems, is real valued in nature, with each sample of data thus belonging to R, the field of real numbers [4], although the restriction to fixed-point implementations limits the range of interest still further to that of Z, the commutative ring of integers [4].
1.2 The DFT and Its Efficient Computation

Turning firstly to its definition, the DFT is a unitary transform [17] which, for the case of N input/output samples, may be expressed in normalized form via the equation

$$X^{(F)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n] \cdot W_N^{nk}, \qquad k = 0, 1, \ldots, N-1, \qquad (1.1)$$

where the input/output data vectors belong to C^N, the linear space of complex-valued N-tuples [4], and the transform kernel – also known as the Fourier Matrix and which is, as one would expect, a function of both the input and output data indices – derives from the term

$$W_N = \exp(-i 2\pi / N), \qquad i = \sqrt{-1}, \qquad (1.2)$$

the primitive Nth complex root of unity [4, 32, 34]. The unitary nature of the DFT means that the inverse of the Fourier Matrix is equal to its conjugate-transpose, whilst its columns form an orthogonal basis [6, 7, 17] – similarly, a transform is said to be orthogonal when the inverse of the transform matrix is equal simply to its transpose, as is the case with any real-valued kernel. Note that the multiplication of any power of the term W_N by any number belonging to C, the field of complex numbers [4], simply results in a phase shift of that number – the amplitude or magnitude remains unchanged.

The direct computation of the N-point DFT as defined above involves O(N^2) arithmetic operations, so that many of the early scientific problems involving the DFT could not be seriously attacked without access to fast algorithms for its efficient solution, where the key to the design of such algorithms is the identification and exploitation of the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well.
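As a concrete point of reference for the fast algorithms discussed below, the definition of Eqs. (1.1) and (1.2) may be evaluated directly. The sketch below is illustrative only – the function name and the use of double-precision floating-point arithmetic are editorial choices for clarity, not taken from the monograph's own (fixed-point) code – but it makes the O(N^2) cost of the direct approach explicit: N complex MAC operations for each of the N outputs.

```c
/* Minimal sketch of the direct evaluation of Eqs. (1.1)-(1.2); names and
 * floating-point types are illustrative, not the monograph's own code. */
#include <complex.h>
#include <math.h>

void dft_direct(const double complex *x, double complex *X, unsigned N)
{
    for (unsigned k = 0; k < N; k++) {      /* N outputs ...           */
        double complex acc = 0.0;
        for (unsigned n = 0; n < N; n++) {  /* ... each needing N MACs */
            /* The kernel is periodic, so W_N^{nk} = W_N^{nk mod N}. */
            unsigned long long nk = (unsigned long long)n * k % N;
            acc += x[n] * cexp(-I * 2.0 * M_PI * (double)nk / (double)N);
        }
        X[k] = acc / sqrt((double)N);       /* 1/sqrt(N) scaling of Eq. (1.1) */
    }
}
```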
One early area of activity with such transforms involved astronomical calculations, and in the early part of the nineteenth century the great Carl Friedrich Gauss (1777–1855) used the DFT for the interpolation of asteroidal orbits from a finite set of equally-spaced observations [24]. He developed a fast two-factor algorithm for its computation that was identical to that described in 1965 by James Cooley and John Tukey [12] – as with many of Gauss's greatest ideas, however, the algorithm was never published outside of his collected works and only then in an obscure Latin form. This algorithm, which for a transform length of N = N1 × N2 involves just O((N1 + N2) × N) arithmetic operations, was probably the first member of the class of algorithms now commonly referred to as the fast Fourier transform (FFT) [5, 6, 9, 12, 17, 35], which is unquestionably the most ubiquitous algorithm in use today for the analysis or manipulation of digital data. In fact, Gauss is known to have first used the above-mentioned two-factor FFT algorithm for the solution of the DFT as far back as 1805, the same year that Admiral Nelson routed the French fleet at the Battle of Trafalgar – interestingly, Fourier served in Napoleon Bonaparte's army from 1798 to 1801, during its invasion of Egypt, acting as scientific advisor.

Although the DFT, as defined above, allows for both the input and output data sets to be complex valued (possessing both amplitude and phase), many real-world spectrum analysis problems, including those addressed by Gauss, involve only real-valued (possessing amplitude only) input data, so that there is a genuine need for the identification of a subset of the class of FFTs that are able to exploit this fact – bearing in mind that real-valued data leads to a Hermitian-symmetric (or conjugate-symmetric) frequency spectrum:

complex-data FFT ⇒ exploitation of kernel symmetry;

whilst

real-data FFT ⇒ exploitation of kernel & spectral symmetries;

with the exploitation of symmetry in the transform kernel being typically achieved by invoking the property of periodicity and of the Shift Theorem, as will be discussed later in the monograph.

There is a requirement, in particular, for the development of real-data FFT algorithms which retain the regularity – as relates to the algorithm structure – of their complex-data counterparts as regular algorithms lend themselves more naturally to an efficient implementation. Regularity, which equates to the amount of repetition and symmetry present in the design, is most straightforwardly achieved through the adoption of fixed-radix formulations, such as with the familiar radix-2 and radix-4 algorithms [9, 11], as this essentially reduces the FFT design to that of a single fixed-radix butterfly, the computational engine used for carrying out the repetitive arithmetic operations. Note that with such a formulation the radix actually corresponds to the size of the resulting butterfly, although in Chapter 8 it is seen how a DFT, whose length is a power of two but not a power of four, may be solved by means of a highly-optimized radix-4 butterfly.
An additional attraction of fixed-radix FFT formulations, which for an arbitrary radix "R" decompose an N-point DFT into log_R N temporal stages each comprising N/R radix-R butterflies, is that they lend themselves naturally to a parallel solution. Such decompositions may be defined over the dimensions of either space – facilitating its mapping onto a single-instruction multiple-data (SIMD) architecture [1] – or time – facilitating its mapping, via the technique of pipelining, onto a systolic architecture [1, 29] – that enables them to be efficiently mapped onto one of the increasingly more accessible/affordable parallel computing technologies. With the systolic solution, each stage of the computational pipeline – referred to hereafter as a computational stage (CS) – corresponds to that of a single temporal stage. A parallel solution may also be defined over both space and time dimensions which would involve a computational pipeline where each CS of the pipeline involves the parallel execution of the associated butterflies via SIMD-based processing – such an architecture being often referred to in the computing literature as parallel-pipelined.
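To make the structure just described concrete, the sketch below shows a textbook radix-2 (R = 2) decimation-in-time FFT: log2 N temporal stages, each comprising N/2 two-point butterflies, with the twiddle factor applied at the butterfly input. It is a generic illustration of the fixed-radix loop structure only, not the monograph's regularized radix-4 FHT.

```c
/* Schematic fixed-radix FFT for R = 2 (decimation-in-time): after a
 * bit-reversal re-ordering, log2(N) temporal stages are executed, each
 * comprising N/2 radix-2 butterflies. Generic illustration only. */
#include <complex.h>
#include <math.h>

void fft_radix2_dit(double complex *x, unsigned N)  /* N a power of two */
{
    /* Bit-reversal permutation of the input, as required by the DIT form. */
    for (unsigned i = 1, j = 0; i < N; i++) {
        unsigned bit = N >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    /* One pass per temporal stage: the butterfly span doubles each stage. */
    for (unsigned len = 2; len <= N; len <<= 1) {
        double complex wlen = cexp(-2.0 * M_PI * I / (double)len);
        for (unsigned i = 0; i < N; i += len) {
            double complex w = 1.0;                     /* running twiddle factor */
            for (unsigned k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = w * x[i + k + len / 2];  /* twiddle at input */
                x[i + k]           = u + v;             /* radix-2 butterfly */
                x[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```

Each iteration of the outer loop corresponds to one temporal stage (and hence, in a systolic mapping, to one CS of the pipeline), whilst the two inner loops together execute the N/2 butterflies of that stage – it is exactly these butterflies that a SIMD mapping would execute in parallel.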
1.3 Twentieth Century Developments of the FFT

As far as modern-day developments in FFT design are concerned it is the names of Cooley and Tukey that are always mentioned in any historical account, but this does not really do justice to the many contributors from the first half of the twentieth century whose work was simply not picked up on, or appreciated, at the time of publication. The prime reason for such a situation was the lack of a suitable technology for their efficient implementation, this remaining the case until the advent of the semiconductor technology of the 1960s. Early pioneering work was carried out by the German mathematician Carl Runge [40], who in 1903 recognized that the periodicity of the DFT kernel could be exploited to enable the computation of a 2N-point DFT to be expressed in terms of the computation of two N-point DFTs, this factorization technique being subsequently referred to as the "doubling algorithm". The Cooley–Tukey algorithm, which does not rely on any specific factorization of the transform length, may thus be viewed as a simple generalization of this algorithm, as successive application of the doubling algorithm leads straightforwardly to the radix-2 version of the Cooley–Tukey algorithm. Runge's influential work was subsequently picked up and popularized in publications by Karl Stumpff [45] in 1939 and Gordon Danielson and Cornelius Lanczos [13] in 1942, each in turn making contributions of their own to the subject. Danielson and Lanczos, for example, produced reduced-complexity solutions to the DFT through the exploitation of symmetries in the transform kernel, whilst Stumpff discussed versions of both the "doubling algorithm" and the "tripling algorithm".

All of the techniques developed, including those of more recent origin such as the "nesting algorithm" of Schmuel Winograd [49] and the "split-radix algorithm" of Pierre Duhamel [15], rely upon the divide-and-conquer [28] principle, whereby the computation of a composite length DFT is broken down into that of a number of smaller DFTs where the small-DFT lengths correspond to the factors
of the original transform length. Depending upon the particular factorization of the transform length, this process may be repeated in a recursive fashion on the increasingly smaller DFTs. When the lengths of the small DFTs have common factors, as encountered with the familiar fixed-radix formulations, then between successive stages of small DFTs there will be a need for the intermediate results to be modified by elements of the Fourier Matrix, these terms being commonly referred to in the FFT literature as twiddle factors. When the algorithm in question is a fixed-radix algorithm of the decimation-in-time (DIT) type [9], whereby the sequence of data space samples is decomposed into successively smaller sub-sequences, then the twiddle factors are applied to the inputs of the butterflies, whereas when the fixed-radix algorithm is of the decimation-in-frequency (DIF) type [9], whereby the sequence of transform space samples is decomposed into successively smaller sub-sequences, then the twiddle factors are applied to the outputs of the butterflies. Note, however, that when the lengths of the small DFTs have no common factors at all – that is, when they are relatively prime [4, 32] – then the need for the twiddle factor application disappears as each factor becomes equal to one. This particular result was made possible through the development of a new number-theoretic data re-ordering scheme in 1958 by the statistician Jack Good [20], the scheme being based upon the ubiquitous Chinese Remainder Theorem (CRT) [32, 34, 35] – which for the interest of those readers of a more mathematical disposition provides a means of obtaining a unique solution to a set of simultaneous linear congruences – whose origins supposedly date back to the first century A.D. [14]. Note also that in the FFT literature, the class of fast algorithms based upon the decomposition of a composite length DFT into smaller DFTs whose lengths have common factors – such as the Cooley–Tukey algorithm – is often referred to as the Common Factor Algorithm (CFA), whereas the class of fast algorithms based upon the decomposition of a composite length DFT into smaller DFTs whose lengths are relatively prime is often referred to as the Prime Factor Algorithm (PFA).

Before moving on from this brief historical discussion, it is worth returning to the last name mentioned, namely that of Jack Good, as his background is a particularly interesting one for anyone with an interest in the history of computing. During World War Two Good served at Bletchley Park in Buckinghamshire, England, working alongside Alan Turing [25] on, amongst other things, the decryption of messages produced by the Enigma machine [19] – as used by the German armed forces. At the same time, and on the same site, a team of engineers under the leadership of Tom Flowers [19] – all seconded from the Post Office Research Establishment at Dollis Hill in North London – were, unbeknown to the outside world, developing the world's first electronic computer, the Colossus [19], under the supervision of Turing and Cambridge mathematician Max Newman. The Colossus was built primarily to automate various essential code breaking tasks such as the cracking of the Lorenz code used by Adolf Hitler to communicate with his generals and was the first serious device – albeit a very large and a very specialized one – on the path towards our current state of technology whereby entire signal processing systems may be mapped onto a single silicon chip.
1.4 The DHT and Its Relation to the DFT

A close relative of the Fourier Transform is that of the Hartley Transform, as introduced by Ralph Hartley (1890–1970) in 1942 for the analysis of transient and steady state transmission problems [23]. The discrete version of this unitary transform is referred to as the discrete Hartley transform (DHT) [8], which for the case of N input/output samples may be expressed in normalized form via the equation

$$X^{(H)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n] \cdot \mathrm{cas}(2\pi nk/N), \qquad k = 0, 1, \ldots, N-1, \qquad (1.3)$$

where the input/output data vectors belong to R^N, the linear space of real-valued N-tuples, and the transform kernel – also known as the Hartley Matrix and which is, as one would expect, a function of both the input and output data indices – is as given by

$$\mathrm{cas}(2\pi nk/N) \equiv \cos(2\pi nk/N) + \sin(2\pi nk/N). \qquad (1.4)$$

Note that as the elements of the Hartley Matrix – as given by the "cas" function – are all real valued, the DHT is orthogonal, as well as unitary, with the columns of the matrix forming an orthogonal basis. Unlike the DFT, the DHT has no natural interpretation as a frequency spectrum, its most natural use being as a means for computing the DFT and as such, fast solutions to the DHT, which are referred to generically as the fast Hartley transform (FHT) [7, 8, 43], have become increasingly popular as an alternative to the FFT for the efficient computation of the DFT. The FHT is particularly attractive for the case of real-valued data, its applicability being made possible by the fact that all of the familiar properties associated with the DFT, such as the Circular Convolution Theorem and the Shift Theorem, are also applicable to the DHT, and that the complex-valued DFT output set and real-valued DHT output set may each be simply obtained, one from the other. To see the truth of this, note that the equality

$$\mathrm{cas}(2\pi nk/N) = \mathrm{Re}\left(W_N^{nk}\right) - \mathrm{Im}\left(W_N^{nk}\right) \qquad (1.5)$$

(where "Re" stands for the real component and "Im" for the imaginary component) relates the kernels of the two transformations, both of which are periodic with period 2π. As a result,

$$X^{(H)}[k] = \mathrm{Re}\left(X^{(F)}[k]\right) - \mathrm{Im}\left(X^{(F)}[k]\right), \qquad (1.6)$$

which expresses the DHT output in terms of the DFT output, whilst

$$\mathrm{Re}\left(X^{(F)}[k]\right) = \frac{1}{2}\left(X^{(H)}[N-k] + X^{(H)}[k]\right) \qquad (1.7)$$

and

$$\mathrm{Im}\left(X^{(F)}[k]\right) = \frac{1}{2}\left(X^{(H)}[N-k] - X^{(H)}[k]\right), \qquad (1.8)$$

which express the real and imaginary components of the DFT output, respectively, in terms of the DHT output.
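Equations (1.6)–(1.8) translate directly into code. The sketch below recovers the DFT output from the DHT output via Eqs. (1.7) and (1.8); the function name and double-precision types are illustrative choices, and indices are taken modulo N so that X^(H)[N − 0] wraps to X^(H)[0].

```c
/* Minimal sketch of the DHT-to-DFT conversion of Eqs. (1.7)-(1.8);
 * names and floating-point types are illustrative only. */
#include <stddef.h>

void dht_to_dft(const double *Xh,         /* N real-valued DHT outputs   */
                double *Xre, double *Xim, /* real/imaginary DFT outputs  */
                size_t N)
{
    for (size_t k = 0; k < N; k++) {
        size_t nk = (N - k) % N;          /* index N - k, taken modulo N */
        Xre[k] = 0.5 * (Xh[nk] + Xh[k]);  /* Eq. (1.7) */
        Xim[k] = 0.5 * (Xh[nk] - Xh[k]);  /* Eq. (1.8) */
    }
}
```

The reverse direction is just as cheap: by Eq. (1.6) the DHT output is simply the difference of the real and imaginary parts of the DFT output, which is why the two output sets may each be obtained from the other at a cost of only O(N) additions.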
1.5 Attractions of Computing the Real-Data DFT via the FHT

Although applicable to the computation of the DFT for both real-valued and complex-valued data, the major computational advantage of the FHT over the FFT, as implied above, lies in the processing of real-valued data. As most real-world spectrum analysis problems involve only real-valued data, significant performance gains may be obtained by using the FHT without any great loss of generality. This is evidenced by the fact that if one computes the complex-data FFT of an N-point real-valued data sequence, the result will be 2N real-valued (or N complex-valued) samples, one half of which are redundant. The FHT, on the other hand, will produce just N real-valued outputs, thereby requiring only one half as many arithmetic operations and one half the memory requirement for storage of the input/output data. The reduced memory requirement is particularly relevant when the transform length is large and the available resources are limited, as may well be the case with the application area of interest, namely that of mobile communications.

The traditional approach to the DFT problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion of real-valued data to complex-valued data via a wideband digital down conversion (DDC) process or through the adoption of a real-from-complex strategy whereby two real-data FFTs are computed simultaneously via one full-length complex-data FFT [42] or where one real-data FFT is computed via one half-length complex-data FFT [11, 42]. Each of the real-from-complex solutions, however, involves a computational overhead when compared to the more direct approach of a real-data FFT in terms of increased memory, increased processing delay to allow for the possible acquisition/processing of pairs of data sets, and additional packing/unpacking complexity. With the DDC approach, the information content of short-duration signals may also be compromised through the introduction of unnecessary filtering operations. The reason for such a situation is due in part to the fact that computing equipment manufacturers have invested so heavily in producing DSP devices built around the fast multiplier and accumulator (MAC), an arithmetic unit ideally suited to the implementation of the complex-data radix-2 butterfly, the computational unit used by the familiar class of recursive radix-2 FFT algorithms. The net result is that the problem of the real-data DFT is effectively being modified so as to match an existing complex-data solution rather than a solution being sought that matches the actual problem.
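For concreteness, the first of the real-from-complex strategies mentioned above – two real-data DFTs via one full-length complex-data FFT – may be sketched as follows. The packing/unpacking steps and their reliance on the Hermitian symmetry of real-data spectra are exactly the overheads referred to in the text; the function names are illustrative, and any complex-data FFT (such as the fft_radix2_dit() sketch of Section 1.2) may be substituted.

```c
/* Hedged sketch of one real-from-complex strategy: two length-N real-data
 * DFTs obtained from a single length-N complex-data FFT. Illustrative only. */
#include <complex.h>

void fft_radix2_dit(double complex *x, unsigned N);  /* any complex FFT */

void two_real_dfts(const double *a, const double *b,
                   double complex *A, double complex *B, unsigned N)
{
    double complex z[N];                  /* C99 VLA, for brevity of the sketch */
    for (unsigned n = 0; n < N; n++)
        z[n] = a[n] + I * b[n];           /* pack both real signals into one  */
    fft_radix2_dit(z, N);                 /* one full-length complex-data FFT */
    for (unsigned k = 0; k < N; k++) {
        double complex zc = conj(z[(N - k) % N]);  /* Hermitian partner */
        A[k] =  0.5 * (z[k] + zc);        /* spectrum of a[] */
        B[k] = -0.5 * I * (z[k] - zc);    /* spectrum of b[] */
    }
}
```

The unpacking works because the spectrum of a real-valued sequence is conjugate-symmetric: the even (Hermitian) part of Z[k] belongs to a[] and the odd part to b[]. The need to buffer and process a pair of data sets before the transform can run is precisely the increased processing delay noted above.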
It should be noted that specialized FFT algorithms [2, 10, 15, 16, 18, 30, 33, 44, 46] do however exist for dealing with the case of real-valued data. Such algorithms compare favourably, in terms of arithmetic complexity and memory requirement, with those of the FHT, but suffer in terms of a loss of regularity and reduced flexibility in that different algorithms are typically required for the computation of the DFT and its inverse. Clearly, in applications requiring transform-space processing followed by a return to the data space, this could prove something of a disadvantage, particularly when compared to the adoption of a bilateral transform, such as the DHT, which may be straightforwardly applied to the transformation from Hartley space to data space as well as from data space to Hartley space, making it thus equally applicable to the computation of both the DFT and its inverse – this is as a result of the fact that its definitions for the two directions, up to a scaling factor, are identical.

A drawback of conventional FHT algorithms [7, 8, 43], however, lies in the need for two sizes – and thus two separate designs – of butterfly for fixed-radix formulations, where a single-sized radix-R butterfly produces R outputs from R inputs and a double-sized radix-R butterfly produces 2R outputs from 2R inputs. A generic version of the double-sized butterfly, referred to as the generic double butterfly and abbreviated hereafter to "GD-BFLY", is therefore developed in this monograph for the radix-4 FHT which overcomes the problem in an elegant fashion. The resulting single-design radix-4 solution, referred to as the regularized FHT [26] and abbreviated hereafter to "R24 FHT", lends itself naturally to parallelization [1, 3, 21] and to mapping onto a regular computational structure for implementation with parallel computing technology.
1.6 Modern Hardware-Based Parallel Computing Technologies

The type of high-performance parallel computing equipment referred to above is typified by the increasingly attractive FPGA and ASIC technologies which now give design engineers far greater flexibility and control over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate hardware solutions to the real-data FFT may be actively sought and exploited to some advantage with these silicon-based technologies. With such technologies, however, it is no longer adequate to view the complexity of the FFT purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – particularly fast multipliers – and multiple blocks or banks of fast memory in order to enhance the FFT performance via its parallel computation. As a result, a whole new set of constraints has arisen relating to the design of efficient FFT algorithms. With the recent and explosive growth of wireless technology, and in particular that of mobile communications, algorithms are now being designed subject to new and often conflicting performance criteria where the ideal is either to maximize the throughput (that is, to minimize the update time) or satisfy some constraint on the latency, whilst at the same time minimizing the required
silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available budget. Note, however, that the throughput is also constrained by the input–output (I/O) speed, as the algorithm cannot process the data faster than it can access it. Such trade-offs are considered in some considerable detail for the hardware solution to the R24 FHT, with the aim, bearing in mind the target application area of mobile communications, of achieving a power-efficient solution. The adoption of the FHT for wireless communications technology seems particularly apt, given the contribution made by the originator of the Hartley Transform (albeit the continuous version) to the foundation of information theory, where the Shannon–Hartley Theorem [37] helped to establish Shannon’s idea of channel capacity [37, 41]. The theorem simply states that if the amount of digital data or information transmitted over a communication channel is less than the channel capacity, then error-free communication may be achieved, whereas if it exceeds that capacity, then errors in transmission will always occur no matter how well the equipment is designed.
1.7 Hardware-Based Arithmetic Units

When producing electronic equipment, whether for commercial or military use, great emphasis is inevitably placed upon minimizing the unit cost so that one is seldom blessed with the option of using the latest state-of-the-art device technology. The most common situation encountered is one where the expectation is to use the smallest (and thus the least expensive) device that is capable of yielding solutions able to meet the performance objectives, which often means using devices that are one, two or even three generations behind the latest specification. As a result, there are situations where there would be great merit in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible to lend themselves to implementation in silicon even when constrained by the limited availability of embedded resources. One way of achieving such flexibility with the R24 FHT would be through the design of a processing element (PE) that minimizes or perhaps even avoids the need for fast multipliers, or fast memory, or both, according to the availability of the resources on the target device.

Despite the increased use of the hardware-based computing technologies, however, there is still a strong reliance upon the use of software-based techniques for the design of the arithmetic unit. These techniques, as typified by the familiar fast multiplier, are relatively inflexible in terms of the precision they offer and, although increasingly more power efficient, tend to be expensive in terms of silicon resources. There are a number of hardware-based arithmetic techniques available, however, such as the shift-and-add techniques, as typified by the Co-Ordinate Rotation
DIgital Computer (CORDIC) arithmetic [47] unit, and the look-up table (LUT) techniques, as typified by the Distributed Arithmetic (DA) [48] unit, that date back to the DSP revolution of the mid-twentieth century but nevertheless still offer great attractions for use with the new hardware-based technologies. The CORDIC arithmetic unit, for example, which may be used to carry out in an optimal fashion the operation of phase rotation – the key operation for the computation of the DFT – may be implemented by means of a computational structure whose form may range from fully-sequential to fully-parallel, with the latency of the CORDIC operation increasing linearly with increasing parallelism. The application of the CORDIC technique to the computation of the R24 FHT is considered in this monograph for its ability both to minimize the memory requirement and to yield a flexible-precision solution to the real-data DFT problem.
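To make the shift-and-add principle concrete, the following C fragment (an illustrative sketch only, assuming a Q16.16 fixed-point format and hypothetical names, and in no way representing the PE designs developed later in the monograph) rotates the vector (x, y) through an angle z using only shifts, additions/subtractions and a small table of arctangents:

    #include <stdint.h>

    #define CORDIC_ITERS 16
    #define CORDIC_K     39797   /* (1/1.64676)*2^16: removes the CORDIC gain */

    /* arctan(2^-i) in Q16.16 radians, for i = 0,...,15 */
    static const int32_t atan_tab[CORDIC_ITERS] = {
        51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
          256,   128,    64,   32,   16,    8,    4,   2
    };

    /* Rotate (x,y) through the angle z (Q16.16 radians, |z| < pi/2). */
    void cordic_rotate(int32_t *x, int32_t *y, int32_t z)
    {
        /* Pre-scale to compensate for the growth of the iteration. */
        int32_t xi = (int32_t)(((int64_t)(*x) * CORDIC_K) >> 16);
        int32_t yi = (int32_t)(((int64_t)(*y) * CORDIC_K) >> 16);

        for (int i = 0; i < CORDIC_ITERS; i++) {
            int32_t xt;
            if (z >= 0) {            /* sub-rotation by +arctan(2^-i) */
                xt = xi - (yi >> i);
                yi = yi + (xi >> i);
                z -= atan_tab[i];
            } else {                 /* sub-rotation by -arctan(2^-i) */
                xt = xi + (yi >> i);
                yi = yi - (xi >> i);
                z += atan_tab[i];
            }
            xi = xt;
        }
        *x = xi;
        *y = yi;
    }

A fully-parallel realization would simply unroll the loop into a pipeline of identical stages, which is the structural flexibility referred to above.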
1.8 Performance Metrics

Having introduced and defined the algorithms of interest in this introductory chapter, namely the DFT and its close relation the DHT, as well as discussing very briefly the various types of computing architecture and technology available for the implementation of their fast solutions, via the FFT and the FHT, respectively, it is now worth devoting a little time to considering the type of performance metrics most appropriate to each. For the mapping of such algorithms onto a uni-processor computing device, for example, the performance would typically be assessed according to the following:

Performance Metric for Uni-processor Computing Device: An operation-efficient solution to a discrete unitary or orthogonal transform, when executed on a (Von Neumann) uni-processor sequential computing device, is one which exploits the symmetry of the transform kernel such that the transform may be computed with fewer operations than by direct implementation of its definition.
As is clear from this definition, the idea of identifying and exploiting the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well, is central to the problem of designing fast algorithms for the efficient computation of discrete unitary or orthogonal transforms. For the mapping of such algorithms onto a multi-processor computing device, on the other hand, the performance would typically be assessed according to the following: Performance Metric for Multi-processor Computing Device: A time-efficient solution to a discrete unitary or orthogonal transform, when executed on a multi-processor parallel computing device, is one which facilitates the execution of many of the operations simultaneously, or in parallel, such that the transform may be computed in less time than by its sequential implementation.
With the availability of multiple processors the idealized objective of a parallel solution is to obtain a linear speed-up in performance which is directly proportional to
the number of processors used, although in reality, with most multi-processor applications, such a speed-up is rarely achievable. The main problem relates to the communication complexity arising from the need to move potentially large quantities of data between the processors. Finally, for the mapping of such algorithms onto a silicon-based parallel computing device, the performance would typically be assessed according to the following:

Performance Metric for Silicon-Based Parallel Computing Device: A hardware-efficient solution to a discrete unitary or orthogonal transform, when executed on a silicon-based parallel computing device, is one which facilitates the execution of many of the operations simultaneously, or in parallel, such that the transform throughput per unit area of silicon is maximized.
Although other metrics could of course be used for this definition, this particular metric of throughput per unit area of silicon – referred to hereafter as the computational density – is targeted specifically at the type of power-constrained environment that one would expect to encounter with mobile communications, as it is assumed that a solution that yields a high computational density will be attractive in terms of both power efficiency and hardware efficiency, given the known influence of silicon area – to be discussed in Chapter 5 – on the power consumption.
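Expressed symbolically, with the units left deliberately generic, this metric may be written as

$$\text{computational density} = \frac{\text{transform throughput}}{\text{unit area of silicon}},$$

so that, for example, transforms per second per slice (for an FPGA) or per square millimetre (for an ASIC) would be valid instantiations.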
1.9 Basic Definitions

To clarify the use of a few basic terms, note that the input data to unitary transforms, such as the DFT and the DHT, may be said to belong to the data space which, as already stated, is C^N for the case of the DFT and R^N for the case of the DHT. Analogously, the output data from such transforms may be said to belong to the transform space which for the case of the DFT is referred to as Fourier space and for the case of the DHT is referred to as Hartley space. As already implied, all vectors with an attached superscript of "(F)" will be assumed to reside within Fourier space whilst all those with an attached superscript of "(H)" will be assumed to reside within Hartley space. These definitions will be used throughout the monograph, where appropriate, in order to simplify or clarify the exposition.

Also, it has already been stated that for the case of a fixed-radix FFT, the trigonometric elements of the Fourier Matrix, as applied to the appropriate butterfly inputs/outputs, are generally referred to as twiddle factors. However, for consistency, the elements of both the Fourier and Hartley Matrices, as required for the butterflies of their respective decompositions, will be referred to hereafter as the trigonometric coefficients – for the fast solution to both transform types the elements are generally decomposed into pairs of real numbers for efficient application. Finally, note that the curly brackets "{ }" will be used throughout the monograph to denote a finite set or sequence of digital samples, as required, for example, for expressing the input–output relationship for both the DFT and the DHT. Also, the indexing convention generally adopted when using such sequences is that the
elements of a sequence in data space – typically denoted with a lower-case character – are indexed by means of the letter “n”, whereas the elements of a sequence in transform space – typically denoted with an upper-case character – are indexed by means of the letter “k”.
1.10 Organization of the Monograph

The first section of the monograph provides the background information necessary for a better understanding of both the problem being addressed, namely that of the real-data DFT, and of the resulting solution described in the research results that follow. This involves, in Chapter 1, an outline of the problem set in a historical context, followed in Chapter 2 by an account of the real-data DFT and of the fast algorithms and techniques conventionally used for its solution, and in Chapter 3 by a detailed account of the DHT and the class of FHT algorithms used for its fast solution, and of those properties of the DHT that make the FHT of particular interest with regard to the fast solution of the real-data DFT.

The middle section of the monograph deals with the novel solution proposed for dealing with the real-data DFT problem. This involves, in Chapter 4, a detailed account of the design and efficient computation of a solution to the DHT based upon the GD-BFLY, namely the regularized FHT or R24 FHT, which lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with silicon-based parallel computing technology. Design constraints and considerations for such technologies are then discussed in Chapter 5 prior to the consideration, in Chapter 6, of different architectures for the mapping of the R24 FHT onto such hardware. A partitioned-memory architecture exploiting a single high-performance PE is identified for the parallel computation of the GD-BFLY and of the resulting R24 FHT [27], whereby both the data and the trigonometric coefficients are partitioned or distributed across multiple banks of memory, referred to hereafter as the data memory (DM) and the trigonometric coefficient memory (CM), respectively. Following this, in Chapter 7, it is seen how the fast multipliers used by the GD-BFLY might in certain circumstances be beneficially replaced by a hardware-based parallel arithmetic unit – based upon CORDIC arithmetic – that is able to yield a flexible-precision solution, without need of trigonometric coefficient memory, when implemented with the proposed hardware-based technology.

The final section of the monograph deals with applications of the resulting solution to the real-data DFT problem. This involves, in Chapter 8, an account of how the application of the R24 FHT may be extended to the efficient parallel computation of the real-data DFT whose length is a power of two, but not a power of four, this being followed by its application, in Chapter 9, to the computation of some of the more familiar and computationally-intensive DSP-based functions, such as those of correlation – both auto-correlation and cross-correlation – and of the wideband channelization of real-valued radio frequency (RF) data via the polyphase DFT filter bank [22]. With each such function, which might typically be encountered in that
increasingly important area of wireless communications relating to the geolocation [38] of signal emitters, the adoption of the R24 FHT may potentially result in both conceptually and computationally simplified solutions. The monograph concludes with two appendices which provide both a detailed description and a listing of computer source code, written in the “C” programming language, for all the functions of the proposed partitioned-memory single-PE solution to the R24 FHT, this code being used for proving the mathematical/logical correctness of its operation. The computer program provides the user with various choices of PE design and of storage/accession scheme for the trigonometric coefficients, helping the user to identify how the algorithm might be efficiently mapped onto suitable parallel computing equipment following translation of the sequential “C” code to parallel code as produced by a suitably chosen hardware description language (HDL). The computer code for the complete solution is also to be found on the compact disc (CD) accompanying the monograph. Finally, note that pseudo-code, loosely based on the “C” programming language, will be used throughout the monograph, where appropriate, to illustrate the operation of the R24 FHT and of the individual functions of which the R24 FHT is comprised.
References

1. S.G. Akl, The Design and Analysis of Parallel Algorithms (Prentice Hall, Upper Saddle River, NJ, 1989)
2. G.D. Bergland, A fast Fourier transform algorithm for real-valued series. Comm. ACM 11(10) (1968)
3. A.W. Biermann, Great Ideas in Computer Science (MIT Press, Cambridge, MA, 1995)
4. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, New York, 1977)
5. R. Blahut, Fast Algorithms for Digital Signal Processing (Addison Wesley, Boston, MA, 1985)
6. R.N. Bracewell, The Fourier Transform and Its Applications (McGraw Hill, New York, 1978)
7. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (1984)
8. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986)
9. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
10. G. Bruun, z-Transform DFT filters and FFTs. IEEE Trans. ASSP 26(1) (1978)
11. J.W. Cooley, P.A.W. Lewis, P.D. Welch, The fast Fourier transform algorithm and its applications. Technical Report RC-1743, IBM (1967)
12. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(4), 297–301 (1965)
13. G.C. Danielson, C. Lanczos, Some improvements in practical Fourier series and their application to x-ray scattering from liquids. J. Franklin Inst. 233, 365–380, 435–452 (1942)
14. C. Ding, D. Pei, A. Salomaa, Chinese Remainder Theorem: Applications in Computing, Coding, Cryptography (World Scientific, Singapore, 1996)
15. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986)
16. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987)
17. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982)
18. O. Ersoy, Real discrete Fourier transform. IEEE Trans. ASSP 33(4) (1985)
19. P. Gannon, Colossus: Bletchley Park's Greatest Secret (Atlantic Books, London, 2006)
20. I.J. Good, The interaction algorithm and practical Fourier series. J. Roy. Stat. Soc. Ser. B 20, 361–372 (1958)
21. D. Harel, Algorithmics: The Spirit of Computing (Addison Wesley, Reading, MA, 1997)
22. F.J. Harris, Multirate Signal Processing for Communication Systems (Prentice Hall, Upper Saddle River, NJ, 2004)
23. R.V.L. Hartley, A more symmetrical Fourier analysis applied to transmission problems. Proc. IRE 30 (1942)
24. M.T. Heideman, D.H. Johnson, C.S. Burrus, Gauss and the history of the fast Fourier transform. IEEE ASSP Mag. 1, 14–21 (1984)
25. A. Hodges, Alan Turing: The Enigma (Vintage, London, 1992)
26. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vision Image Signal Process. 153(1), 70–78 (February 2006)
27. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
28. L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms (Wiley, New York, 1985)
29. S.Y. Kung, VLSI Array Processors (Prentice Hall, Englewood Cliffs, NJ, 1988)
30. J.B. Marten, Discrete Fourier transform algorithms for real valued sequences. IEEE Trans. ASSP 32(2) (1984)
31. C. Maxfield, The Design Warrior's Guide to FPGAs (Newnes (Elsevier), Burlington, MA, 2004)
32. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979)
33. H. Murakami, Real-valued fast discrete Fourier transform and decimation-in-frequency algorithms. IEEE Trans. Circuits Syst. II: Analog Dig. Signal Process. 41(12), 808–816 (1994)
34. I. Niven, H.S. Zuckerman, An Introduction to the Theory of Numbers (Wiley, New York, 1980)
35. H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer, Berlin, 1981)
36. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1989)
37. J.R. Pierce, An Introduction to Information Theory: Symbols, Signals and Noise (Dover, New York, 1980)
38. R.A. Poisel, Electronic Warfare: Target Location Methods (Artech House, Boston, MA, 2005)
39. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975)
40. C. Runge, Über die Zerlegung empirisch periodischer Funktionen in Sinus-Wellen. Zeit. für Math. und Physik 48, 443–456 (1903)
41. C.E. Shannon, A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656 (1948)
42. G.R.L. Sohie, W. Chen, Implementation of Fast Fourier Transforms on Motorola's Digital Signal Processors. Downloadable document from website: www.Motorola.com
43. H.V. Sorensen, D.L. Jones, C.S. Burrus, M.T. Heideman, On computing the discrete Hartley transform. IEEE Trans. ASSP 33, 1231–1238 (1985)
44. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (1987)
45. K. Stumpff, Tafeln und Aufgaben zur Harmonischen Analyse und Periodogrammrechnung (Julius Springer, Berlin, 1939)
46. P.R. Uniyal, Transforming real-valued sequences: Fast Fourier versus fast Hartley transform algorithms. IEEE Trans. Signal Process. 42(11) (1994)
47. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Electron. Comput. EC-8(3), 330–334 (1959)
48. S.A. White, Application of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Mag. 6(3), 4–19 (1989)
49. S. Winograd, Arithmetic Complexity of Computations (SIAM, Philadelphia, PA, 1980)
Chapter 2
Fast Solutions to Real-Data Discrete Fourier Transform
Abstract This chapter discusses the two approaches conventionally adopted for dealing with the real-data DFT problem. The first approach involves the design of specialized fast algorithms, such as those due to Bergland and Bruun, which are geared specifically to addressing real-data applications and therefore able to exploit, in a direct way, the real-valued nature of the data – which is known to result in a Hermitian-symmetric frequency spectrum. The second approach, which is the most commonly adopted, particularly for applications requiring a hardware solution, involves re-structuring the data so as to use an existing complex-data FFT algorithm, possibly coupled with pre-FFT and/or post-FFT stages, to produce the DFT of either one or two (produced simultaneously) real-valued data sets – such solutions thus said to be obtained via a “real-from-complex” strategy. A discussion is finally provided relating to the results obtained in the chapter.
2.1 Introduction

Since the original developments of spectrum analysis in the eighteenth century, the vast majority of real-world applications have been concerned with the processing of real-valued data, where the data generally corresponds to amplitude measurements of some signal of interest. As a result, there has always been a genuine practical need for fast solutions to the real-data DFT, with two quite distinct approaches evolving over this period to address the problem. The first and more intellectually challenging approach involves trying to design specialized algorithms which are geared specifically to real-data applications and therefore able to exploit, in a direct way, the real-valued nature of the data which is known to result in a Hermitian-symmetric frequency spectrum, whereby for the case of an N-point transform
$$\mathrm{Re}\,X^{(F)}[k] = \mathrm{Re}\,X^{(F)}[N-k] \qquad (2.1)$$

and

$$\mathrm{Im}\,X^{(F)}[k] = -\mathrm{Im}\,X^{(F)}[N-k], \qquad (2.2)$$
so that one half of the DFT outputs are in fact redundant. Such solutions, as typified by the Bergland [1] and Bruun [3, 13] algorithms, only need therefore to produce one half of the DFT outputs. The second and less demanding approach – but also the most commonly adopted, particularly for applications requiring a hardware solution – involves re-structuring the data so as to use an existing complex-data FFT algorithm, possibly coupled with pre-FFT and/or post-FFT stages, to produce the DFT of either one or two (produced simultaneously) real-valued data sets – such solutions thus said to be obtained via a real-from-complex strategy [17]. Both of these approaches are now discussed in some detail prior to a summary of their relative merits and drawbacks.
2.2 Real-Data FFT Algorithms

Since the re-emergence of computationally-efficient FFT algorithms, as initiated by the published work of James Cooley and John Tukey in the mid-1960s [5], a number of attempts have been made [1, 3, 6, 7, 9, 10, 12, 18, 19] at producing fast algorithms that are able to directly exploit the spectral symmetry that arises from the processing of real-valued data. Two such algorithms are those due to Glenn Bergland (1968) and Georg Bruun (1978), and these are now briefly discussed so as to give a flavour of the type of algorithmic structures that can result from pursuing such an approach. The Bergland algorithm effectively modifies the DIT version of the familiar Cooley–Tukey radix-2 algorithm [2] to account for the fact that only one half of the DFT outputs need to be computed, whilst the Bruun algorithm adopts an unusual recursive polynomial-factorization approach – note that the DIF version of the Cooley–Tukey fixed-radix algorithm, referred to as the Sande–Tukey algorithm [2], may also be expressed in such a form – which involves only real-valued polynomial coefficients until the last stage of the computation, making it thus particularly suited to the computation of the real-data DFT. Examples of the signal flow graphs (SFGs) for both DIT and DIF versions of the radix-2 FFT algorithm are as given below in Figs. 2.1 and 2.2, respectively.
2.2.1 The Bergland Algorithm

The Bergland algorithm is a real-data FFT algorithm based upon the observation that the frequency spectrum arising from the processing of real-valued data is Hermitian-symmetric, so that only one half of the DFT outputs needs to be computed. Starting with the DIT version of the familiar complex-data Cooley–Tukey radix-2 FFT algorithm, if the input data is real valued, then at each of the $\log_2 N$ temporal stages of the algorithm the computation involves the repeated combination of two transforms to yield one longer double-length transform. From this, Bergland observed that the property of Hermitian symmetry may actually be exploited at each of the
[Fig. 2.1: Signal flow graph for DIT decomposition of eight-point DFT]
[Fig. 2.2: Signal flow graph for DIF decomposition of eight-point DFT]
$\log_2 N$ stages of the algorithm. Thus, as all the odd-addressed output samples for each such double-length transform form the second half of the frequency spectrum, which can in turn be straightforwardly obtained from the property of spectral symmetry, Bergland's algorithm instead uses those memory locations for storing the imaginary components of the data. Thus, with Bergland's algorithm, given that the input data sequence is real valued, all the intermediate results may be stored in just N memory locations – each location thus corresponding to just one word of memory. The computation can also be carried out in an in-place fashion – whereby the outputs from each butterfly are stored in the same set of memory locations as used by the inputs – although the indices of the set of butterfly outputs are not in bit-reversed order, as they are with the Cooley–Tukey algorithm, being instead ordered according to the Bergland
ordering [1], as also are the indices of the twiddle factors or trigonometric coefficients. However, the normal ordering of the twiddle factors may, with due care, be converted to the Bergland ordering and the Bergland ordering of the FFT outputs subsequently converted to the normal ordering, as required for an efficient in-place solution [1, 17]. Thus, the result of the above modifications is an FFT algorithm with an arithmetic complexity of $O(N \log_2 N)$ which yields a factor-of-two saving, compared to the conventional zero-padded complex-data FFT solution – to be described in Section 2.3.1 – in terms of both arithmetic complexity and memory requirement.
2.2.2 The Bruun Algorithm

The Bruun algorithm is a real-data FFT algorithm based upon an unusual recursive polynomial-factorization approach, proposed initially for the case of N input samples where N is a power of two, but subsequently generalized to arbitrary even-number transform sizes by Hideo Murakami in 1996 [12]. Recall firstly, from Chapter 1, that the N-point DFT can be written in normalized form as

$$X^{(F)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n]\,W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (2.3)$$

where the transform kernel is derived from the term

$$W_N = \exp(-i2\pi/N), \qquad (2.4)$$

the primitive Nth complex root of unity. Then, by defining the polynomial x(z) whose coefficients are those elements of the sequence {x[n]}, such that

$$x(z) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n]\,z^n, \qquad (2.5)$$

it is possible to view the DFT as a reduction of this polynomial [11], so that

$$X^{(F)}[k] = x\!\left(W_N^k\right) = x(z) \bmod \left(z - W_N^k\right) \qquad (2.6)$$

where "mod" stands for the modulo operation [11] which denotes the polynomial remainder upon division of x(z) by $\left(z - W_N^k\right)$. The key to fast execution of the Bruun algorithm stems from being able to perform this set of N polynomial remainder operations in a recursive fashion. To compute the DFT involves evaluating the remainder of x(z) modulo some polynomial of degree one, more commonly referred to as a monomial, a total of N times, as suggested by Equations 2.5 and 2.6. To do this efficiently, one can combine the remainders recursively in the following way. Suppose it is required to evaluate
x(z) modulo U(z) as well as x(z) modulo V(z). Then, by first evaluating x(z) modulo the polynomial product, U(z)·V(z), the degree of the polynomial x(z) is reduced, thereby making subsequent modulo operations less computationally expensive. Now the product of all of the monomials, $\left(z - W_N^k\right)$, for values of k from 0 up to N−1, is simply $\left(z^N - 1\right)$, whose roots are clearly the N complex roots of unity. A recursive factorization of $\left(z^N - 1\right)$ is therefore required which breaks it down into polynomials of smaller and smaller degree, with each possessing as few non-zero coefficients as possible. To compute the DFT, one takes x(z) modulo each level of this factorization in turn, recursively, until one arrives at the monomials and the final result. If each level of the factorization splits every polynomial into an O(1) number of smaller polynomials, each with an O(1) number of non-zero coefficients, then the modulo operations for that level will take O(N) arithmetic operations, thus leading to a total arithmetic complexity, for all $\log_2 N$ levels, of $O(N \log_2 N)$, as obtained with the standard Cooley–Tukey radix-2 algorithm.

Note that for N a power of two, the Bruun algorithm factorizes the polynomial $\left(z^N - 1\right)$ recursively via the rules

$$z^{2M} - 1 = \left(z^M - 1\right)\left(z^M + 1\right) \qquad (2.7)$$

and

$$z^{4M} + a\,z^{2M} + 1 = \left(z^{2M} + \sqrt{2-a}\,z^M + 1\right)\left(z^{2M} - \sqrt{2-a}\,z^M + 1\right), \qquad (2.8)$$

where "a" is a real constant such that |a| ≤ 2. On completion of the recursion, when M = 1, there remain polynomials of degree two that can each be evaluated modulo the two roots of the form $\left(z - W_N^k\right)$ belonging to each polynomial. Thus, at each recursive stage, all of the polynomials may be factorized into two parts, each of half the degree and possessing at most three non-zero coefficients, leading to an FFT algorithm with an $O(N \log_2 N)$ arithmetic complexity. Moreover, since all the polynomials have purely real coefficients, at least until the last stage, they quite naturally exploit the special case where the input data is real valued, thereby yielding a factor-of-two saving, compared to the conventional zero-padded complex-data FFT solution to be discussed in Section 2.3.1, in terms of both arithmetic complexity and memory requirement.
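As a concrete illustration of the recursion of Equations 2.7 and 2.8 (a standard worked example, for the case N = 8), the polynomial $z^8 - 1$ factorizes, with purely real coefficients throughout, as

$$z^8 - 1 = \left(z^4 - 1\right)\left(z^4 + 1\right) = \left(z^2 - 1\right)\left(z^2 + 1\right)\left(z^2 + \sqrt{2}\,z + 1\right)\left(z^2 - \sqrt{2}\,z + 1\right),$$

each quadratic factor then being reduced modulo its pair of roots at the final stage.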
2.3 Real-From-Complex Strategies

By far the most common approach to solving the real-data DFT problem is that based upon the use of an existing complex-data FFT algorithm, as it simplifies the problem, at worst, to one of designing pre-FFT and/or post-FFT stages for the packing of the real-valued data into the correct format required for input to the FFT algorithm and for the subsequent unpacking of the FFT output data to obtain the spectrum (or spectra) of the original real-valued data set (or sets). Note that any fast
algorithm may be used for carrying out the complex-data FFT, so that both DIT and DIF versions of fixed-radix FFTs, as already discussed, as well as more sophisticated FFT designs such as those corresponding to the mixed-radix, split-radix, prime factor, prime-length and Winograd’s nested algorithms [8, 11], for example, might be used.
2.3.1 Computing One Real-Data DFT via One Full-Length Complex-Data FFT

The most straightforward approach to the problem involves first packing the real-valued data into the real component of a complex-valued data sequence, padding the imaginary component with zeros – this action more commonly referred to as zero padding – and then feeding the resulting complex-valued data set into a complex-data FFT. The arithmetic complexity of such an approach is clearly identical to that obtained when a standard complex-data FFT is applied to genuine complex-valued data, so that no computational benefits stemming from the nature of the data are achieved with such an approach. On the contrary, computational resources are wasted with such an approach, as excessive arithmetic operations are performed for the computation of the required outputs and twice the required amount of memory is used for the storage of the input/output data.
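In code, the pre-FFT packing stage of this approach amounts to nothing more than the following C sketch (hypothetical names; any standard complex-data FFT may then be applied to the two arrays):

    /* Illustrative sketch only: zero-padding pre-FFT packing stage of the
     * approach described above. The real-valued input fills the real
     * channel; the imaginary channel is filled with zeros. */
    void pack_zero_padded(const double *x, double *re, double *im, int N)
    {
        for (int n = 0; n < N; n++) {
            re[n] = x[n];   /* real component: the data itself   */
            im[n] = 0.0;    /* imaginary component: zero padding */
        }
    }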
2.3.2 Computing Two Real-Data DFTs via One Full-Length Complex-Data FFT

The next approach to the problem involves computing two N-point real-data DFTs, simultaneously, by means of one N-point complex-data FFT. This is achieved by packing the first real-valued data sequence into the real component of a complex-valued data sequence and the second real-valued data sequence into its imaginary component. Thus, given two real-valued data sequences, {g[n]} and {h[n]}, a complex-valued data sequence, {x[n]}, may be simply obtained by setting

$$x[n] = g[n] + i\,h[n], \qquad (2.9)$$

with the kth DFT output of the resulting data sequence being written in normalized form, in terms of the DFTs of {g[n]} and {h[n]}, as

$$\begin{aligned} X^{(F)}[k] &= \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n]\,W_N^{nk} \\ &= \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} g[n]\,W_N^{nk} + i\,\frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} h[n]\,W_N^{nk} \\ &= G[k] + i\,H[k] = \left(G_R[k] - H_I[k]\right) + i\left(G_I[k] + H_R[k]\right), \end{aligned} \qquad (2.10)$$

where $G_R[k]$ and $G_I[k]$ are the real and imaginary components, respectively, of G[k] – the same applying to $H_R[k]$ and $H_I[k]$ with respect to H[k]. Similarly, the (N−k)th DFT output may be written in normalized form as

$$\begin{aligned} X^{(F)}[N-k] &= \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n]\,W_N^{n(N-k)} \\ &= \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} g[n]\,W_N^{-nk} + i\,\frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} h[n]\,W_N^{-nk} \\ &= G^{*}[k] + i\,H^{*}[k] = \left(G_R[k] + H_I[k]\right) + i\left(H_R[k] - G_I[k]\right) \end{aligned} \qquad (2.11)$$

where the superscript "*" stands for the operation of complex conjugation, so that upon combining the expressions of Equations 2.10 and 2.11, the DFT outputs G[k] and H[k] may be written, in terms of the DFT outputs $X^{(F)}[k]$ and $X^{(F)}[N-k]$, as

$$G[k] = G_R[k] + i\,G_I[k] = \frac{1}{2}\left[\mathrm{Re}\left(X^{(F)}[k] + X^{(F)}[N-k]\right) + i\,\mathrm{Im}\left(X^{(F)}[k] - X^{(F)}[N-k]\right)\right] \qquad (2.12)$$

and

$$H[k] = H_R[k] + i\,H_I[k] = \frac{1}{2}\left[\mathrm{Im}\left(X^{(F)}[k] + X^{(F)}[N-k]\right) - i\,\mathrm{Re}\left(X^{(F)}[k] - X^{(F)}[N-k]\right)\right] \qquad (2.13)$$

where the terms $\mathrm{Re}\,X^{(F)}[k]$ and $\mathrm{Im}\,X^{(F)}[k]$ denote the real and imaginary components, respectively, of $X^{(F)}[k]$.

Thus, it is evident that the DFTs of the two real-valued data sequences, {g[n]} and {h[n]}, may be computed simultaneously, via one full-length complex-data FFT algorithm, with the DFT of the sequence {g[n]} being as given by Equation 2.12 and that of the sequence {h[n]} by Equation 2.13. The pre-FFT data packing stage is quite straightforward in that it simply involves the assignment of one real-valued data sequence to the real component of the complex-valued data array and one real-valued data sequence to its imaginary component. The post-FFT data unpacking stage simply involves separating out the two spectra from the complex-valued FFT output data, this involving two real additions/subtractions for each real-data DFT output together with two scaling operations, each by a factor of 2 (which in fixed-point hardware reduces to a simple right-shift operation).
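The post-FFT unpacking stage may be illustrated in C as follows (a sketch only, with hypothetical function and array names, assuming the complex FFT output is held in separate real and imaginary arrays Xr and Xi of length N; the outputs for k > N/2 then follow from Hermitian symmetry):

    /* Recover the spectra G[k] and H[k] of the two packed real-valued
     * sequences via Equations 2.12 and 2.13, for k = 0,...,N/2. The k = 0
     * and k = N/2 terms are real-valued and follow directly. */
    void unpack_two_real_dfts(const double *Xr, const double *Xi, int N,
                              double *Gr, double *Gi, double *Hr, double *Hi)
    {
        Gr[0] = Xr[0];      Gi[0] = 0.0;    /* zero-frequency terms    */
        Hr[0] = Xi[0];      Hi[0] = 0.0;
        Gr[N/2] = Xr[N/2];  Gi[N/2] = 0.0;  /* Nyquist-frequency terms */
        Hr[N/2] = Xi[N/2];  Hi[N/2] = 0.0;

        for (int k = 1; k < N / 2; k++) {
            int nk = N - k;
            Gr[k] = 0.5 * (Xr[k] + Xr[nk]);   /* Equation 2.12 */
            Gi[k] = 0.5 * (Xi[k] - Xi[nk]);
            Hr[k] = 0.5 * (Xi[k] + Xi[nk]);   /* Equation 2.13 */
            Hi[k] = 0.5 * (Xr[nk] - Xr[k]);
        }
    }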
2.3.3 Computing One Real-Data DFT via One Half-Length Complex-Data FFT

Finally, the last approach to the problem involves showing how an N-point complex-data FFT may be used to carry out the computation of one 2N-point real-data DFT. The kth DFT output of the 2N-point real-valued data sequence {x[n]} may be written in normalized form as

$$\begin{aligned} X^{(F)}[k] &= \frac{1}{\sqrt{2N}} \sum_{n=0}^{2N-1} x[n]\,W_{2N}^{nk}, \qquad k = 0, 1, \ldots, N-1 \\ &= \frac{1}{\sqrt{2N}} \sum_{n=0}^{N-1} x[2n]\,W_N^{nk} + W_{2N}^{k}\,\frac{1}{\sqrt{2N}} \sum_{n=0}^{N-1} x[2n+1]\,W_N^{nk} \end{aligned} \qquad (2.14)$$

which, upon setting g[n] = x[2n] and h[n] = x[2n+1], becomes

$$\begin{aligned} X^{(F)}[k] &= \frac{1}{\sqrt{2N}} \sum_{n=0}^{N-1} g[n]\,W_N^{nk} + W_{2N}^{k}\,\frac{1}{\sqrt{2N}} \sum_{n=0}^{N-1} h[n]\,W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \\ &= G[k] + W_{2N}^{k}\,H[k]. \end{aligned} \qquad (2.15)$$
Therefore, by setting y[n] = g[n] + i·h[n] and exploiting the combined expressions of Equations 2.10 and 2.11, the DFT output Y[k] may be written as

$$Y[k] = \left(G_R[k] - H_I[k]\right) + i\left(G_I[k] + H_R[k]\right) \qquad (2.16)$$

and that for Y[N−k] as

$$Y[N-k] = \left(G_R[k] + H_I[k]\right) + i\left(H_R[k] - G_I[k]\right). \qquad (2.17)$$
Then, by combining the expressions of Equations 2.15–2.17, the real component of $X^{(F)}[k]$ may be written as

$$\begin{aligned} X_R^{(F)}[k] = \;& \frac{1}{2}\,\mathrm{Re}\left(Y[k] + Y[N-k]\right) \\ & + \frac{1}{2}\cos(\pi k/N)\,\mathrm{Im}\left(Y[k] + Y[N-k]\right) \\ & - \frac{1}{2}\sin(\pi k/N)\,\mathrm{Re}\left(Y[k] - Y[N-k]\right) \end{aligned} \qquad (2.18)$$

and the imaginary component as

$$\begin{aligned} X_I^{(F)}[k] = \;& \frac{1}{2}\,\mathrm{Im}\left(Y[k] - Y[N-k]\right) \\ & - \frac{1}{2}\sin(\pi k/N)\,\mathrm{Im}\left(Y[k] + Y[N-k]\right) \\ & - \frac{1}{2}\cos(\pi k/N)\,\mathrm{Re}\left(Y[k] - Y[N-k]\right). \end{aligned} \qquad (2.19)$$
Thus, it is evident that the DFT of one real-valued data sequence, {x[n]}, of length 2N, may be computed via one N-point complex-data FFT algorithm, with the real component of the DFT output being as given by Equation 2.18 and the imaginary component of the DFT output as given by Equation 2.19. The pre-FFT data packing stage is conceptually simple, but nonetheless burdensome, in that it involves the assignment of the even-addressed samples of the real-valued data sequence to the real component of the complex-valued data sequence and the odd-addressed samples to its imaginary component. The post-FFT data unpacking stage is considerably more complex than that required for the approach of Section 2.3.2, requiring the application of eight real additions/subtractions for each DFT output, together with two scaling operations, each by a factor of 2, and four real multiplications by pre-computed trigonometric coefficients.
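The corresponding unpacking stage is sketched below in C (again illustrative only, with hypothetical names, and assuming the signs of Equations 2.18 and 2.19 as given above):

    #include <math.h>

    /* Given the length-N complex FFT output Y = Yr + i.Yi of the packed
     * sequence y[n] = x[2n] + i.x[2n+1], recover the first N outputs of
     * the 2N-point real-data DFT of x[n] via Equations 2.18 and 2.19;
     * the remaining outputs follow from Hermitian symmetry. */
    void real_dft_from_half_length_fft(const double *Yr, const double *Yi,
                                       int N, double *XRout, double *XIout)
    {
        const double pi = 3.14159265358979323846;

        for (int k = 1; k < N; k++) {
            int nk = N - k;
            double re_sum  = Yr[k] + Yr[nk], re_diff = Yr[k] - Yr[nk];
            double im_sum  = Yi[k] + Yi[nk], im_diff = Yi[k] - Yi[nk];
            double c = cos(pi * k / N), s = sin(pi * k / N);

            XRout[k] = 0.5 * (re_sum + c * im_sum - s * re_diff);  /* Eq 2.18 */
            XIout[k] = 0.5 * (im_diff - s * im_sum - c * re_diff); /* Eq 2.19 */
        }
        /* Zero-frequency term: X[0] = G[0] + H[0], which is real-valued. */
        XRout[0] = Yr[0] + Yi[0];
        XIout[0] = 0.0;
    }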
2.4 Data Re-ordering

All of the fixed-radix formulations of the FFT, at least for the case where the transform length is a power of two, require that either the inputs to or the outputs from the transform be permuted according to the digit-reversal mapping [4]. In fact, it is possible to place the data re-ordering either before or after the transform for both the DIT and DIF formulations [4]. For the case of a radix-2 algorithm the data re-ordering is more commonly known as the bit-reversal mapping, being based upon the exchanging of single bits, whilst for the radix-4 case it is known as the di-bit reversal mapping, being based instead upon the exchanging of pairs of bits. Such data re-ordering, when mapped onto a uni-processor sequential computing device, may be carried out via the use of either:

1. An LUT, at the expense of additional memory; or
2. A fast algorithm using just shifts, additions/subtractions and memory exchanges; or
3. A fast algorithm that also makes use of a small LUT – containing the reflected bursts of ones that change on the lower end with incrementing – to try and optimize the speed at the cost of a slight increase in memory,

with the optimum choice being dependent upon the available resources and the time constraints of the application; a simple in-place version of the second of these options is sketched below. Alternatively, when the digit-reversal mapping is appropriately parallelized, it may be mapped onto a multi-processor parallel computing device, such as an FPGA, possessing multiple banks of fast memory, thus enabling the time-complexity to be greatly reduced – see the recent work of Ren et al. [14] and Seguel and Bollman [16].
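The following C fragment (an illustrative sketch only, with hypothetical names) shows the kind of LUT-free approach referred to in the second option above, using just comparisons, shifts, subtractions and memory exchanges to traverse the indices in bit-reversed order:

    /* In-place bit-reversal re-ordering of a length-N real-valued array,
     * N a power of two. The bit-reversed counterpart j of the index i is
     * maintained incrementally, so no look-up table is required. */
    void bit_reverse_reorder(double *data, int N)
    {
        int j = 0;
        for (int i = 0; i < N - 1; i++) {
            if (i < j) {                     /* swap each pair exactly once */
                double tmp = data[i];
                data[i] = data[j];
                data[j] = tmp;
            }
            /* Add 1 to j "from the top": clear leading ones, set next zero. */
            int m = N >> 1;
            while (m >= 1 && j >= m) {
                j -= m;
                m >>= 1;
            }
            j += m;
        }
    }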
The optimum approach to digit-reversal is dictated very much by the operation of the FFT, namely whether the FFT is of burst or streaming type, as discussed in Chapter 6.
2.5 Discussion

The aim of this chapter has been to highlight both the advantages and the disadvantages of the conventional approaches to the solution of the real-data DFT problem. As is evident from the examples discussed in Section 2.2, namely the Bergland and Bruun algorithms, the adoption of specialized real-data FFT algorithms may well yield solutions possessing attractive performance metrics in terms of arithmetic complexity and memory requirement, but generally this is only achieved at the expense of a more complex algorithmic structure when compared to those of the highly-regular fixed-radix designs. As a result, such algorithms would not seem to lend themselves particularly well to being mapped onto parallel computing equipment.

Similarly, from the examples of Section 2.3, namely the real-from-complex strategies, the regularity of the conventional fixed-radix designs may only be exploited at the expense of introducing additional processing modules, namely the pre-FFT and/or post-FFT stages for the packing of the real-valued data into the correct format required for input to the FFT algorithm and for the subsequent unpacking of the FFT output data to obtain the spectrum (or spectra) of the original real-valued data set (or sets). An additional set of problems associated with the real-from-complex strategies, at least when compared to the more direct approach of a real-data FFT, relates to the need for increased memory and increased processing delay to allow for the possible acquisition/processing of pairs of data sets.

It is worth noting that an alternative DSP-based approach to those discussed above is to first convert the real-valued data to complex-valued data by means of a wideband digital down-conversion (DDC) process, this followed by the application of a conventional complex-data FFT. Such an approach, however, introduces an additional function to be performed – typically an FIR filter with length dependent upon the performance requirements of the application – which also introduces an additional processing delay prior to the execution of the FFT. Drawing upon a philosophical analogy, namely the maxim of the fourteenth century Franciscan scholar, William of Occam, commonly known as "Occam's razor" [15]: why use two functions when just one will suffice? A related and potentially serious problem arises when there is limited information available on the signal under analysis, as such information might well be compromised via the filtering operation, particularly when the duration of the signal is short relative to that of the transient response of the filter – as might be encountered, for example, with problems relating to the detection of extremely short duration dual-tone multi-frequency (DTMF) signals.

Thus, there are clear drawbacks to all such approaches, particularly when the application requires a solution in hardware using parallel computing equipment, so
that the investment of searching for alternative solutions to the fast computation of the real-data DFT is still well merited. More specifically, solutions are required possessing both highly regular designs that lend themselves naturally to mapping onto parallel computing equipment and attractive performance metrics, in terms of both arithmetic complexity and memory requirement, but without requiring excessive packing/unpacking requirements and without incurring the latency problems (as arising from the increased processing delay) associated with the adoption of certain of the real-from-complex strategies.
References

1. G.D. Bergland, A fast Fourier transform algorithm for real-valued series. Comm. ACM 11(10) (1968)
2. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
3. G. Bruun, z-Transform DFT filters and FFTs. IEEE Trans. ASSP 26(1) (1978)
4. E. Chu, A. George, Inside the FFT Black Box (CRC Press, Boca Raton, FL, 2000)
5. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(4), 297–301 (1965)
6. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986)
7. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987)
8. P. Duhamel, M. Vetterli, Fast Fourier transforms: A tutorial review and a state of the art. Signal Process. 19, 259–299 (1990)
9. O. Ersoy, Real discrete Fourier transform. IEEE Trans. ASSP 33(4) (1985)
10. J.B. Marten, Discrete Fourier transform algorithms for real valued sequences. IEEE Trans. ASSP 32(2) (1984)
11. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979)
12. H. Murakami, Real-valued fast discrete Fourier transform and cyclic convolution algorithms of highly composite even length. Proc. ICASSP 3, 1311–1314 (1996)
13. H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer, Berlin, 1981)
14. G. Ren, P. Wu, D. Padua, Optimizing data permutations for SIMD devices. PLDI'06 (Ottawa, Ontario, Canada, 2006)
15. B. Russell, History of Western Philosophy (George Allen & Unwin, London, 1961)
16. J. Seguel, D. Bollman, A framework for the design and implementation of FFT permutation algorithms. IEEE Trans. Parallel Distrib. Syst. 11(7), 625–634 (2000)
17. G.R.L. Sohie, W. Chen, Implementation of Fast Fourier Transforms on Motorola's Digital Signal Processors. Downloadable document from website: www.Motorola.com
18. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (1987)
19. P.R. Uniyal, Transforming real-valued sequences: Fast Fourier versus fast Hartley transform algorithms. IEEE Trans. Signal Process. 42(11) (1994)
Chapter 3
The Discrete Hartley Transform
Abstract This chapter introduces the DHT and discusses those aspects of its solution, as obtained via the FHT, which make it an attractive choice for applying to the real-data DFT problem. This involves first showing how the DFT may be obtained from the DHT, and vice versa, followed by a discussion of those fundamental theorems, common to both the DFT and DHT algorithms, which enable the input data sets to be similarly related to their respective transforms and thus enable the DHT to be used for solving those DSP-based problems commonly addressed via the DFT, and vice versa. The limitations of existing FHT algorithms are then discussed bearing in mind the ultimate objective of mapping any subsequent solution onto silicon-based parallel computing equipment. A discussion is finally provided relating to the results obtained in the chapter.
3.1 Introduction

An algorithm that would appear to satisfy most if not all of the requirements laid down in Section 2.5 of the previous chapter is that of the DHT, as introduced in Chapter 1, a discrete unitary transform [6] that involves only real arithmetic (thus making it also orthogonal) and that is intimately related to the DFT, satisfying all those properties required of a unitary transform as well as possessing fast algorithms for its solution. Before delving into the details, however, it is perhaps worth restating the definition, namely that for the case of N input/output samples, the DHT may be expressed in normalized form via the equation

$$X^{(H)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n]\,\mathrm{cas}(2\pi nk/N), \qquad k = 0, 1, \ldots, N-1 \qquad (3.1)$$

where the input/output data vectors belong to R^N, the linear space of real-valued N-tuples, and the transform kernel is given by

$$\mathrm{cas}(2\pi nk/N) = \cos(2\pi nk/N) + \sin(2\pi nk/N), \qquad (3.2)$$
a periodic function with period 2π and possessing (amongst others) the following set of useful properties:

$$\begin{aligned} \mathrm{cas}(A+B) &= \cos A\,\mathrm{cas}\,B + \sin A\,\mathrm{cas}(-B) \\ \mathrm{cas}(A-B) &= \cos A\,\mathrm{cas}(-B) + \sin A\,\mathrm{cas}\,B \\ \mathrm{cas}\,A\,.\,\mathrm{cas}\,B &= \cos(A-B) + \sin(A+B) \\ \mathrm{cas}\,A + \mathrm{cas}\,B &= 2\,\mathrm{cas}\!\left(\tfrac{1}{2}(A+B)\right)\cos\!\left(\tfrac{1}{2}(A-B)\right) \\ \mathrm{cas}\,A - \mathrm{cas}\,B &= 2\,\mathrm{cas}\!\left(-\tfrac{1}{2}(A+B)\right)\sin\!\left(\tfrac{1}{2}(A-B)\right) \end{aligned} \qquad (3.3)$$
as will be exploited later for derivation of the FHT algorithm.
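As a point of reference, the definition of Equation 3.1 may be evaluated directly, as in the following C sketch (illustrative only, with hypothetical names, and of O(N²) complexity, so suited to verifying a fast solution rather than replacing one):

    #include <math.h>

    /* Direct O(N^2) evaluation of the normalized DHT of Equation 3.1.
     * Applying the routine twice in succession should reproduce the
     * original input, as discussed in Section 3.2 below. */
    void dht_direct(const double *x, double *X, int N)
    {
        const double two_pi = 6.28318530717958647692;
        for (int k = 0; k < N; k++) {
            double acc = 0.0;
            for (int n = 0; n < N; n++) {
                double arg = two_pi * (double)(((long long)n * k) % N) / N;
                acc += x[n] * (cos(arg) + sin(arg));   /* cas kernel */
            }
            X[k] = acc / sqrt((double)N);              /* 1/sqrt(N) scaling */
        }
    }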
3.2 Normalization of DHT Outputs

Suppose now that the DHT operation is applied twice, in succession, the first time to a real-valued data sequence, {x[n]}, and the second time to the output of the first operation. Then given that the DHT is bilateral and, like the DFT, a unitary transform, the output from the second operation, {y[n]}, can be expressed as

$$\{y[n]\} = \mathrm{DHT}\left(\mathrm{DHT}\left(\{x[n]\}\right)\right) \equiv \{x[n]\}, \qquad (3.4)$$

so that the output of the second operation is actually equivalent to the input of the first operation. However, it should be noted that without the presence of the scaling factor, $1/\sqrt{N}$, that has been included in the current definition of the DHT, as given by Equation 3.1 above, the magnitudes of the outputs of the second DHT would actually be equal to N times those of the inputs of the first DHT, so that the role of the scaling factor is to ensure that the magnitudes are preserved. It should be borne in mind, however, that the presence of a coherent signal in the input data will result in most of the growth in magnitude occurring in the forward transform, so that any future scaling strategy – as discussed in the following chapter – must reflect this fact. Note that a scaling factor of 1/N is often used for the forward definition of both the DFT and the DHT, the value of $1/\sqrt{N}$ being used instead here purely for mathematical elegance, as it reduces the definitions of the DHT for both the forward and the reverse directions to an identical form. The fundamental theorems discussed in Section 3.5 for both the DFT and the DHT, however, are valid regardless of the scaling factor used.
3.3 Decomposition into Even and Odd Components

The close relationship between the DFT and the DHT hinges upon symmetry considerations which may be best explained by considering the decomposition of the DHT into its "even" and "odd" components [2], denoted E[k] and O[k], respectively, and written as

$$X^{(H)}[k] = E[k] + O[k] \qquad (3.5)$$

where, for an N-point transform, E[k] is such that

$$E[N-k] = E[k] \qquad (3.6)$$

and O[k] is such that

$$O[N-k] = -O[k]. \qquad (3.7)$$

As a result, the even and odd components may each be expressed in terms of the DHT outputs via the expressions

$$E[k] = \frac{1}{2}\left(X^{(H)}[k] + X^{(H)}[N-k]\right) \qquad (3.8)$$

and

$$O[k] = \frac{1}{2}\left(X^{(H)}[k] - X^{(H)}[N-k]\right), \qquad (3.9)$$

respectively, from which the relationship between the DFT and DHT outputs may be straightforwardly obtained.
3.4 Connecting Relations Between DFT and DHT Firstly, from the equality nk cas.2 nk=N/ D Re Wnk N Im WN ;
(3.10)
which relates the kernels of the two transformations, the DFT outputs may be expressed as X.F/ Œk D EŒk iOŒk; (3.11) so that
and
1 .H/ X ŒN k C X.H/ Œk Re X.F/ Œk D 2
(3.12)
1 .H/ X ŒN k X.H/ Œk ; Im X.F/ Œk D 2
(3.13)
30
whilst
3 The Discrete Hartley Transform
X.H/ Œk D Re X.F/ Œk Im X.F/ Œk :
(3.14)
3.4.1 Real-Data DFT

Thus, from Equations 3.12 to 3.14, the complex-valued DFT output set and the real-valued DHT output set may now be simply obtained, one from the other, so that a fast algorithm for the solution of the DFT may also be used for the computation of the DHT whilst a fast algorithm for the solution of the DHT may similarly be used for the computation of the DFT. Note from the above equations that pairs of real-valued DHT outputs combine to give individual complex-valued DFT outputs, such that

$$X^{(H)}[k] \;\&\; X^{(H)}[N-k] \;\longleftrightarrow\; X^{(F)}[k] \qquad (3.15)$$

for k = 1, 2, ..., N/2−1, whilst the remaining two terms are such that

$$X^{(H)}[0] \;\longleftrightarrow\; X^{(F)}[0] \qquad (3.16)$$

and

$$X^{(H)}[N/2] \;\longleftrightarrow\; X^{(F)}[N/2]. \qquad (3.17)$$

With regard to the two trivial mappings provided above by Equations 3.16 and 3.17, it may also be noted from Equation 3.10 that when k = 0, we have

$$\mathrm{cas}(2\pi nk/N) = W_N^{nk} = 1, \qquad (3.18)$$

so that the zero-address component in Hartley space maps to the zero-address (or zero-frequency) component in Fourier space, and vice versa, as implied by Equation 3.16, whilst when k = N/2, we have

$$\mathrm{cas}(2\pi nk/N) = W_N^{nk} = (-1)^n, \qquad (3.19)$$

so that the Nyquist-address component in Hartley space similarly maps to the Nyquist-address (or Nyquist-frequency) component in Fourier space, and vice versa, as implied by Equation 3.17.
3.4.2 Complex-Data DFT

Now, having defined the relationship between the Fourier-space and Hartley-space representations of a real-valued data sequence, it is a simple task to extend the results to the case of a complex-valued data sequence. Given the linearity of the DFT – this property follows from the Addition Theorem to be discussed in the following section – the DFT of a complex-valued data sequence, $\{x_R[n] + i\,x_I[n]\}$, can be written as the sum of the DFTs of the individual real and imaginary components, so that

$$\mathrm{DFT}\left(\{x_R[n] + i\,x_I[n]\}\right) = \mathrm{DFT}\left(\{x_R[n]\}\right) + i\,\mathrm{DFT}\left(\{x_I[n]\}\right). \qquad (3.20)$$

Therefore, by first taking the DHT of the individual real and imaginary components of the complex-valued data sequence and then deriving the DFT of each such component by means of Equations 3.12 and 3.13, the real and imaginary components of the DFT of the complex-valued data sequence may be written in terms of the two DHTs as

$$\mathrm{Re}\,X^{(F)}[k] = \frac{1}{2}\left(X_R^{(H)}[N-k] + X_R^{(H)}[k]\right) - \frac{1}{2}\left(X_I^{(H)}[N-k] - X_I^{(H)}[k]\right) \qquad (3.21)$$

and

$$\mathrm{Im}\,X^{(F)}[k] = \frac{1}{2}\left(X_R^{(H)}[N-k] - X_R^{(H)}[k]\right) + \frac{1}{2}\left(X_I^{(H)}[N-k] + X_I^{(H)}[k]\right), \qquad (3.22)$$

respectively, so that it is now possible to compute the DFT of both real-valued and complex-valued data sequences by means of the DHT – pseudo-code is provided for both the real-valued data and complex-valued data cases in Figs. 3.1 and 3.2, respectively. The significance of the complex-to-real decomposition described here for the complex-data DFT is that it introduces an additional level of parallelism to the problem, as the resulting DHTs are independent and thus able to be computed simultaneously, or in parallel, when implemented with parallel computing technology – a subject to be introduced in Chapter 5. This is particularly relevant when the transform is long and the throughput requirement high.
3.5 Fundamental Theorems for DFT and DHT

As has already been stated, if the DFT and DHT algorithms are to be used interchangeably, for solving certain types of signal processing problem, then it is essential that there are corresponding theorems [2] for the two transforms which enable the input data sequences to be similarly related to their respective transforms. Suppose firstly that the sequences {x[n]} and $\{X^{(F)}[k]\}$ are related via the expression

$$\mathrm{DFT}\left(\{x[n]\}\right) = \left\{X^{(F)}[k]\right\}, \qquad (3.23)$$

so that {x[n]} is the input data sequence to the DFT and $\{X^{(F)}[k]\}$ the corresponding transform-space output, thus belonging to Fourier space, and that {x[n]} and $\{X^{(H)}[k]\}$ are similarly related via the expression
Description: The real and imaginary components of the real-data N-point DFT outputs are optimally stored in the following way:

    XRdata[0]      = zeroth frequency output
    XRdata[1]      = real component of 1st frequency output
    XRdata[N-1]    = imaginary component of 1st frequency output
    XRdata[2]      = real component of 2nd frequency output
    XRdata[N-2]    = imaginary component of 2nd frequency output
        ...
    XRdata[N/2-1]  = real component of (N/2-1)th frequency output
    XRdata[N/2+1]  = imaginary component of (N/2-1)th frequency output
    XRdata[N/2]    = (N/2)th frequency output

Note: The components XRdata[0] and XRdata[N/2] do not need to be modified to yield the zeroth and (N/2)th frequency outputs.

Pseudo-Code for DHT-to-DFT Conversion:

    k = N - 1;
    for (j = 1; j < (N/2); j = j + 1) {
        store     = XRdata[k] + XRdata[j];
        XRdata[k] = XRdata[k] - XRdata[j];
        XRdata[j] = store;
        XRdata[j] = XRdata[j] / 2;
        XRdata[k] = XRdata[k] / 2;
        k = k - 1;
    }

Fig. 3.1 Pseudo-code for computing real-data DFT from DHT outputs
$$\mathrm{DHT}\left(\{x[n]\}\right) = \left\{X^{(H)}[k]\right\}, \qquad (3.24)$$

so that {x[n]} is now the input data sequence to the DHT and $\{X^{(H)}[k]\}$ the corresponding transform-space output, thus belonging to Hartley space. Then using the normalized definition of the DHT as given by Equation 3.1 – with a similar scaling strategy assumed for the definition of the DFT, as given by Equation 1.1, and its inverse – the following commonly encountered theorems may be derived, each one carrying over from one transform space to the other. Note that the data sequence is assumed, in each case, to be of length N.
3.5.1 Reversal Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x[N-n]\}\right) = \left\{X^{(F)}[N-k]\right\}, \qquad (3.25)$$
Description: The complex-data N-point DFT outputs are optimally stored with array "XRdata" holding the real component of both the input and output data, whilst the array "XIdata" holds the imaginary component of both the input and output data.

Note: The components XRdata[0] and XRdata[N/2] do not need to be modified to yield the zeroth and (N/2)th frequency outputs.

Pseudo-Code for DHT-to-DFT Conversion:

    k = N - 1;
    for (j = 1; j < (N/2); j = j + 1) {
        // Real Data Channel.
        store     = XRdata[k] + XRdata[j];
        XRdata[k] = XRdata[k] - XRdata[j];
        XRdata[j] = store;
        XRdata[j] = XRdata[j] / 2;
        XRdata[k] = XRdata[k] / 2;
        // Imaginary Data Channel.
        store     = XIdata[k] + XIdata[j];
        XIdata[k] = XIdata[k] - XIdata[j];
        XIdata[j] = store;
        XIdata[j] = XIdata[j] / 2;
        XIdata[k] = XIdata[k] / 2;
        // Combine Outputs for Real and Imaginary Data Channels.
        store1    = XRdata[j] + XIdata[k];
        store2    = XRdata[j] - XIdata[k];
        store3    = XIdata[j] + XRdata[k];
        XIdata[k] = XIdata[j] - XRdata[k];
        XRdata[j] = store2;
        XRdata[k] = store1;
        XIdata[j] = store3;
        k = k - 1;
    }

Fig. 3.2 Pseudo-code for computing complex-data DFT from DHT outputs
with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x[N-n]\}\right) = \left\{X^{(H)}[N-k]\right\}. \qquad (3.26)$$
3.5.2 Addition Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x_1[n] + x_2[n]\}\right) = \mathrm{DFT}\left(\{x_1[n]\}\right) + \mathrm{DFT}\left(\{x_2[n]\}\right) = \left\{X_1^{(F)}[k] + X_2^{(F)}[k]\right\}, \qquad (3.27)$$
with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x_1[n] + x_2[n]\}\right) = \mathrm{DHT}\left(\{x_1[n]\}\right) + \mathrm{DHT}\left(\{x_2[n]\}\right) = \left\{X_1^{(H)}[k] + X_2^{(H)}[k]\right\}. \qquad (3.28)$$
3.5.3 Shift Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x[n-n_0]\}\right) = \left\{e^{-i2\pi n_0 k/N}\,X^{(F)}[k]\right\}, \qquad (3.29)$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x[n-n_0]\}\right) = \left\{\cos(2\pi n_0 k/N)\,X^{(H)}[k] + \sin(2\pi n_0 k/N)\,X^{(H)}[N-k]\right\}. \qquad (3.30)$$
3.5.4 Convolution Theorem

Denoting the operation of circular or cyclic convolution by means of the symbol "⊛", the DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x_1[n]\} \circledast \{x_2[n]\}\right) = \left\{X_1^{(F)}[k]\,X_2^{(F)}[k]\right\}, \qquad (3.31)$$

with the corresponding DHT-based relationship given by

$$\begin{aligned} \mathrm{DHT}\left(\{x_1[n]\} \circledast \{x_2[n]\}\right) = \frac{1}{2}\Big\{ & X_1^{(H)}[k]\,X_2^{(H)}[k] - X_1^{(H)}[N-k]\,X_2^{(H)}[N-k] \\ & + X_1^{(H)}[k]\,X_2^{(H)}[N-k] + X_1^{(H)}[N-k]\,X_2^{(H)}[k] \Big\}. \end{aligned} \qquad (3.32)$$
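By way of illustration, the following C fragment (a sketch only, with hypothetical names) applies Equation 3.32 point-by-point in Hartley space, so that, sandwiched between a forward and an inverse DHT, it effects the circular convolution of two real-valued sequences:

    /* H1 and H2 hold the DHTs of the two real inputs; Hout receives the
     * DHT of their circular convolution, per Equation 3.32. The k = 0 and
     * k = N/2 terms possess no dual and are treated separately, as noted
     * at the end of Section 3.5. */
    void dht_convolve(const double *H1, const double *H2, double *Hout, int N)
    {
        Hout[0]   = H1[0]   * H2[0];       /* zero-address term    */
        Hout[N/2] = H1[N/2] * H2[N/2];     /* Nyquist-address term */

        for (int k = 1; k < N / 2; k++) {
            int nk = N - k;
            double a = H1[k], b = H1[nk], c = H2[k], d = H2[nk];
            Hout[k]  = 0.5 * (a * c - b * d + a * d + b * c);  /* Eq. 3.32 */
            Hout[nk] = 0.5 * (b * d - a * c + b * c + a * d);  /* dual term */
        }
    }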
3.5.5 Product Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x_1[n]\,x_2[n]\}\right) = \left\{X_1^{(F)}[k] \circledast X_2^{(F)}[k]\right\}, \qquad (3.33)$$

with the corresponding DHT-based relationship given by

$$\begin{aligned} \mathrm{DHT}\left(\{x_1[n]\,x_2[n]\}\right) = \frac{1}{2}\Big\{ & X_1^{(H)}[k] \circledast X_2^{(H)}[k] - X_1^{(H)}[N-k] \circledast X_2^{(H)}[N-k] \\ & + X_1^{(H)}[k] \circledast X_2^{(H)}[N-k] + X_1^{(H)}[N-k] \circledast X_2^{(H)}[k] \Big\}. \end{aligned} \qquad (3.34)$$
3.5.6 Autocorrelation Theorem

Denoting the operation of circular or cyclic correlation by means of the symbol "⊗", the DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x[n]\} \otimes \{x[n]\}\right) = \left\{\left|X^{(F)}[k]\right|^2\right\}, \qquad (3.35)$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x[n]\} \otimes \{x[n]\}\right) = \left\{\frac{1}{2}\left(\left|X^{(H)}[k]\right|^2 + \left|X^{(H)}[N-k]\right|^2\right)\right\}. \qquad (3.36)$$
3.5.7 First Derivative Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x'[n]\}\right) = \left\{i2\pi k\,X^{(F)}[k]\right\}, \qquad (3.37)$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x'[n]\}\right) = \left\{-2\pi k\,X^{(H)}[N-k]\right\}. \qquad (3.38)$$
3.5.8 Second Derivative Theorem

The DFT-based relationship is given by

$$\mathrm{DFT}\left(\{x''[n]\}\right) = \left\{-4\pi^2 k^2\,X^{(F)}[k]\right\}, \qquad (3.39)$$

with the corresponding DHT-based relationship given by

$$\mathrm{DHT}\left(\{x''[n]\}\right) = \left\{-4\pi^2 k^2\,X^{(H)}[k]\right\}. \qquad (3.40)$$
3.5.9 Summary of Theorems

This section simply highlights the fact that for every fundamental theorem associated with the DFT, there is an analogous theorem for the DHT, which may be applied, in a straightforward fashion, so that the DHT may be used to address the same type of signal processing problems as the DFT, and vice versa. An important example is that of the digital filtering of an effectively infinite-length data sequence with a fixed-length FIR filter, more commonly referred to as continuous convolution, where the associated linear convolution is carried out via the piecewise application of the Circular Convolution Theorem using either the overlap-add or the overlap-save technique [3]. The role of the DHT, in this respect, is much like that of the number-theoretic transforms (NTTs) [7] – as typified by the Fermat number transform (FNT) and the Mersenne number transform (MNT) – which gained considerable popularity back in the 1970s amongst the academic community. These transforms, which are defined over finite or Galois fields [7] via the use of residue number arithmetic [7], exist purely for their ability to satisfy the Circular Convolution Theorem.

An additional and important result, arising from the Product Theorem of Equations 3.33 and 3.34, is that when the real-valued data sequences $\{x_1[n]\}$ and $\{x_2[n]\}$ are identical, we obtain Parseval's Theorem [3], as given by the equation

$$\sum_{n=0}^{N-1} \left|x[n]\right|^2 = \sum_{k=0}^{N-1} \left|X^{(F)}[k]\right|^2 = \sum_{k=0}^{N-1} \left|X^{(H)}[k]\right|^2, \qquad (3.41)$$
which simply states that the energy in the signal is preserved under both the DFT and the DHT (and, in fact, under any discrete unitary or orthogonal transformation), so that the energy measured in the data space is equal to that measured in the transform space. This theorem will be used later in Chapter 8, where it will be invoked to enable a fast radix-4 FHT algorithm to be applied to the fast computation of the real-data DFT whose transform length is a power of two, but not a power of four. Finally, note that whenever theorems involve dual Hartley-space terms in their expression – such as the terms X.H/ Œk and X.H/ [N–k], for example, in the
3.6 Fast Solutions to DHT
37
convolution and correlation theorems – that it is necessary that care be taken to treat the zero-address and Nyquist-address terms separately, as neither term possesses a dual.
3.6 Fast Solutions to DHT

Knowledge that the DHT is in possession of many of the same properties as the DFT is all very well, but to be of practical significance, it is also necessary that the DHT, like the DFT, possesses fast algorithms for its efficient computation. The first widely published work in this field is thought to be that due to Ronald Bracewell [1, 2], who produced both radix-2 and radix-4 versions of the DIT fixed-radix FHT algorithm. His work in this field was summarized in a short monograph [2] which has formed the inspiration for the work discussed here.

The solutions produced by Bracewell are attractive in that they achieve the desired performance metrics in terms of both arithmetic complexity and memory requirement – that is, compared to a conventional complex-data FFT, they require one half of the arithmetic operations and one half the memory requirement – but suffer from the fact that they need two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations. For the radix-4 algorithm, for example, a single-sized butterfly produces four outputs from four inputs, as shown in Fig. 3.3, whilst a double-sized butterfly produces eight outputs from eight inputs, as shown in Fig. 3.4, both of which will be developed in some detail from first principles in the following chapter. This lack of regularity makes an in-place solution somewhat difficult to achieve, necessitating the use of additional memory between the temporal stages, as well as making an efficient mapping onto parallel computing equipment less than straightforward.

Although other algorithmic variations for the efficient solution to the DHT have subsequently appeared [4, 5, 10], they all suffer, to varying extents, in terms of their lack of regularity, so that alternative solutions to the DHT are still sought that possess the regularity associated with the conventional complex-data fixed-radix FFT algorithms but without sacrificing the benefits of the existing FHT algorithms in terms of their reduced arithmetic complexity, reduced memory requirement and optimal latency. Various FHT designs could be studied, including versions of the popular radix-2 and split-radix [4] algorithms, but when transform lengths allow for comparison, the radix-4 FHT is more computationally efficient than the radix-2 FHT, its design more regular than that of the split-radix FHT, and it has the potential for an eightfold speed-up with parallel computing equipment over that achievable via a purely sequential solution, making it a good candidate to pursue for potential hardware implementation. The radix-4 version of the FHT has therefore been selected as the algorithm of choice in this monograph.
[Fig. 3.3 Signal flow graphs for single-sized butterfly for radix-4 FHT algorithm – zero-address and Nyquist-address versions, each producing four outputs X[0]–X[3] from four inputs, the Nyquist-address version involving scaling by √2]

[Fig. 3.4 Signal flow graph for double-sized butterfly for radix-4 FHT algorithm – eight outputs X[0]–X[7] produced from eight inputs together with the set of trigonometric coefficients]
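As a point of reference for the fast algorithms discussed in this section, a direct O(N²) evaluation of the DHT may be coded in a few lines and used to validate any fast implementation on short transform lengths. The fragment below is a hedged C sketch using the cas kernel, cas(θ) = cos θ + sin θ, with the scaling factor omitted and with illustrative names only.

#include <math.h>

/* Direct O(N^2) DHT: X[k] = sum over n of x[n].cas(2*pi*n*k/N). */
void dht_direct(const double *x, double *X, int N)
{
    for (int k = 0; k < N; k++) {
        double acc = 0.0;
        for (int n = 0; n < N; n++) {
            /* reduce the angle index modulo N before forming theta */
            double theta = (2.0 * M_PI / N) * (double)(((long)n * k) % N);
            acc += x[n] * (cos(theta) + sin(theta));
        }
        X[k] = acc;
    }
}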
3.7 Accuracy Considerations

When compared to a full-length FFT solution based upon one of the real-from-complex strategies, as discussed in Section 2.3 of the previous chapter, the FHT approach will involve approximately the same number of arithmetic operations (when the complex arithmetic operations of the FFT are reduced to equivalent real arithmetic operations) in order to obtain each real-data DFT output. The associated numerical errors may be due to both rounding, as introduced via the discarding of the lower-order bits from the fixed-point multiplier outputs, and truncation, as introduced via the discarding of the least significant bit from the adder outputs after an overflow has occurred. The underlying characteristics of such errors for the two approaches will also be very similar, however, due to the similarity of their butterfly structures, so that when compared to FFT-based solutions possessing comparable arithmetic complexity the errors will inevitably be very similar [8, 11].

This feature of the FHT will be particularly relevant when dealing with a fixed-point implementation, as is implied with any solution that is to be mapped onto an FPGA or ASIC device, where the combined effects of both truncation errors [9] and rounding errors [9] will need to be properly assessed and catered for through the optimum choice of word length and scaling strategy.
3.8 Discussion

When the DHT is applied to the computation of the DFT, as discussed in Section 3.4, a conversion routine is required to map the transform outputs from Hartley space to Fourier space. For the real-data case, as outlined in Fig. 3.1, the conversion process involves two real additions/subtractions for each DFT output together with two scaling operations, whilst for the complex-data case, as outlined in Fig. 3.2, this increases to four real additions/subtractions for each DFT output together with two scaling operations. All the scaling operations, however, are by a factor of 2 which in fixed-point arithmetic reduces to that of a simple right-shift operation.

Note that if the requirement is to use an FHT algorithm to compute the power spectral density (PSD) [3, 6], which is typically obtained from the squared magnitudes of the DFT outputs, then there is no need for the Hartley-space outputs to be first transformed to Fourier space, as the PSD may be computed directly from the Hartley-space outputs – an examination of Equations 3.12–3.14 should convince one of this. Also, it should be noted that with many of the specialized real-data FFT algorithms, apart from their lack of regularity, they also suffer from the fact that different algorithms are generally required for the fast computation of the DFT and its inverse. Clearly, in applications requiring transform-space processing followed by a return to the data space, as encountered for example with matched filtering, this could prove something of a disadvantage, particularly when compared to the adoption of a bilateral transform, such as the DHT, where the definitions of both the transform and its inverse, up to a scaling factor, are identical.
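As an illustration of the PSD remark above, the identity |X^(F)[k]|² = ((X^(H)[k])² + (X^(H)[N−k])²)/2 – consistent with the Fourier-space/Hartley-space relationships of Equations 3.12–3.14 – may be coded directly. The fragment below is a hedged C sketch with illustrative names, not the monograph's own implementation.

/* PSD taken directly from Hartley-space outputs, with no
   Hartley-to-Fourier conversion stage required. */
void psd_from_dht(const double *XH, double *psd, int N)
{
    for (int k = 0; k < N; k++) {
        int kd = (N - k) % N;                 /* dual-term address */
        psd[k] = 0.5 * (XH[k] * XH[k] + XH[kd] * XH[kd]);
    }
}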
References

1. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (1984)
2. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986)
3. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
4. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986)
5. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987)
6. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982)
7. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979)
8. J.B. Nitschke, G.A. Miller, Digital filtering in EEG/ERP analysis: Some technical and empirical comparisons. Behav. Res. Methods Instrum. Comput. 30(1), 54–67 (1998)
9. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975)
10. H.V. Sorensen, D.L. Jones, C.S. Burrus, M.T. Heideman, On computing the discrete Hartley transform. IEEE Trans. ASSP 33, 1231–1238 (1985)
11. A. Zakhor, A.V. Oppenheim, Quantization errors in the computation of the discrete Hartley transform. IEEE Trans. ASSP 35(11), 1592–1602 (1987)
Chapter 4
Derivation of the Regularized Fast Hartley Transform
Abstract This chapter discusses a new formulation of the FHT, referred to as the regularized FHT, which overcomes the limitations of existing FHT algorithms given the ultimate objective of mapping the solution onto silicon-based parallel computing equipment. A generic version of the double-sized butterfly, the GD-BFLY, is described which dispenses with the need for two sizes – and thus two separate designs – of butterfly as required via conventional fixed-radix formulations. Efficient schemes are also described for the storage, accession and generation of the trigonometric coefficients using suitably defined LUTs. A brief complexity analysis is then given in relation to existing FFT and FHT approaches to both the real-data and complex-data DFT problems. A discussion is finally provided relating to the results obtained in the chapter.
4.1 Introduction

A drawback of conventional FHT algorithms, as highlighted in the previous chapter, lies in the need for two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations. For the case of the radix-4 FHT to be discussed here, a single-sized butterfly, producing four outputs from four inputs, is required for both the zero-address and the Nyquist-address iterations of the relevant temporal stages, whilst a double-sized butterfly, producing eight outputs from eight inputs, is required for each of the remaining iterations. We look now at how this lack of regularity might be overcome, bearing in mind the desire, ultimately, to map the resulting algorithmic structure onto suitably defined parallel computing equipment:

Statement of Performance Objective No 1: The aim is to produce a design for a generic double-sized butterfly for use by a radix-4 version of the FHT which lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with parallel computing technology.
Note that the attraction of the radix-4 solution, rather than that of the more familiar radix-2 case, is its greater computational efficiency – in terms of both reduced arithmetic complexity and reduced memory access – and the potential for exploiting
greater parallelism, at the arithmetic level, via the larger sized butterfly, thereby offering the possibility of achieving a higher computational density when implemented in silicon – to be discussed in Chapter 6.
4.2 Derivation of the Conventional Radix-4 Butterfly Equations

The first step towards achieving this goal concerns the derivations of the two different sized butterflies – the single and the double – as required for efficient implementation of the radix-4 FHT. A DIT version is to be adopted given that the DIT algorithm is known to yield a slightly better signal-to-noise ratio (SNR) than the DIF algorithm when fixed-point processing is used [7, 10]. In fact, the noise variance of the DIF algorithm can be shown to be twice that of the DIT algorithm [10], so that the DIT algorithm offers the possibility of using shorter word lengths and ultimately less silicon for a given level of performance. The data re-ordering, in addition, is assumed to take place prior to the execution of the transform so that the data may be efficiently generated and stored in memory in the required di-bit reversed order directly from the output of the analog-to-digital conversion (ADC) unit at minimal expense.

Let us first decompose the basic DHT expression as given by Equation 3.1 from the previous chapter – although in this instance without the scaling factor and with output vector X^(H) now replaced simply by X for ease of exposition – into four partial summations, such that

X[k] = Σ_{n=0}^{N/4−1} x[4n]·cas(2π(4n)k/N)
     + Σ_{n=0}^{N/4−1} x[4n+1]·cas(2π(4n+1)k/N)
     + Σ_{n=0}^{N/4−1} x[4n+2]·cas(2π(4n+2)k/N)
     + Σ_{n=0}^{N/4−1} x[4n+3]·cas(2π(4n+3)k/N).  (4.1)
Suppose now that

x1[n] = x[4n], x2[n] = x[4n+1], x3[n] = x[4n+2] & x4[n] = x[4n+3]  (4.2)

and note from Equation 3.3 of the previous chapter that

cas(2π(4n + r)k/N) = cas(2πnk/(N/4) + 2πrk/N)
                   = cos(2πrk/N)·cas(2πnk/(N/4)) + sin(2πrk/N)·cas(−2πnk/(N/4))  (4.3)

and

cas(−2πnk/N) = cas(2πn(N − k)/N).  (4.4)
Then if the partial summations of Equation 4.1 are written as

X1[k] = Σ_{n=0}^{N/4−1} x1[n]·cas(2πnk/(N/4))  (4.5)

X2[k] = Σ_{n=0}^{N/4−1} x2[n]·cas(2πnk/(N/4))  (4.6)

X3[k] = Σ_{n=0}^{N/4−1} x3[n]·cas(2πnk/(N/4))  (4.7)

X4[k] = Σ_{n=0}^{N/4−1} x4[n]·cas(2πnk/(N/4)),  (4.8)

it enables the equation to be re-written as

X[k] = X1[k] + cos(2πk/N)·X2[k] + sin(2πk/N)·X2[N/4−k] + cos(4πk/N)·X3[k] + sin(4πk/N)·X3[N/4−k] + cos(6πk/N)·X4[k] + sin(6πk/N)·X4[N/4−k],  (4.9)

the first of the double-sized butterfly equations. Now, by exploiting the properties of Equations 4.3 and 4.4, the remaining double-sized butterfly equations may be written as

X[N/4−k] = X1[N/4−k] + sin(2πk/N)·X2[N/4−k] + cos(2πk/N)·X2[k] − cos(4πk/N)·X3[N/4−k] + sin(4πk/N)·X3[k] − sin(6πk/N)·X4[N/4−k] − cos(6πk/N)·X4[k]  (4.10)

X[k+N/4] = X1[k] − sin(2πk/N)·X2[k] + cos(2πk/N)·X2[N/4−k] − cos(4πk/N)·X3[k] − sin(4πk/N)·X3[N/4−k] + sin(6πk/N)·X4[k] − cos(6πk/N)·X4[N/4−k]  (4.11)

X[N/2−k] = X1[N/4−k] − cos(2πk/N)·X2[N/4−k] + sin(2πk/N)·X2[k] + cos(4πk/N)·X3[N/4−k] − sin(4πk/N)·X3[k] − cos(6πk/N)·X4[N/4−k] + sin(6πk/N)·X4[k]  (4.12)

X[k+N/2] = X1[k] − cos(2πk/N)·X2[k] − sin(2πk/N)·X2[N/4−k] + cos(4πk/N)·X3[k] + sin(4πk/N)·X3[N/4−k] − cos(6πk/N)·X4[k] − sin(6πk/N)·X4[N/4−k]  (4.13)

X[3N/4−k] = X1[N/4−k] − sin(2πk/N)·X2[N/4−k] − cos(2πk/N)·X2[k] − cos(4πk/N)·X3[N/4−k] + sin(4πk/N)·X3[k] + sin(6πk/N)·X4[N/4−k] + cos(6πk/N)·X4[k]  (4.14)

X[k+3N/4] = X1[k] + sin(2πk/N)·X2[k] − cos(2πk/N)·X2[N/4−k] − cos(4πk/N)·X3[k] − sin(4πk/N)·X3[N/4−k] − sin(6πk/N)·X4[k] + cos(6πk/N)·X4[N/4−k]  (4.15)

X[N−k] = X1[N/4−k] + cos(2πk/N)·X2[N/4−k] − sin(2πk/N)·X2[k] + cos(4πk/N)·X3[N/4−k] − sin(4πk/N)·X3[k] + cos(6πk/N)·X4[N/4−k] − sin(6πk/N)·X4[k],  (4.16)

where N/4 is the length of the DHT output sub-sequences, {X1[k]}, {X2[k]}, {X3[k]} and {X4[k]}, and the parameter “k” varies from 1 up to N/8−1.

When k = 0, which corresponds to the zero-address case, we obtain the single-sized butterfly equations

X[0] = X1[0] + X2[0] + X3[0] + X4[0]  (4.17)
X[N/4] = X1[0] + X2[0] − X3[0] − X4[0]  (4.18)
X[N/2] = X1[0] − X2[0] + X3[0] − X4[0]  (4.19)
X[3N/4] = X1[0] − X2[0] − X3[0] + X4[0],  (4.20)

and when k = N/8, which corresponds to the Nyquist-address case, we obtain the single-sized butterfly equations

X[N/8] = X1[N/8] + √2·X2[N/8] + X3[N/8]  (4.21)
X[3N/8] = X1[N/8] − X3[N/8] + √2·X4[N/8]  (4.22)
X[5N/8] = X1[N/8] − √2·X2[N/8] + X3[N/8]  (4.23)
X[7N/8] = X1[N/8] − X3[N/8] − √2·X4[N/8].  (4.24)
Thus, two different-sized butterflies are required for efficient computation of the DIT version of the radix-4 FHT, their SFGs being as given in Figs. 3.3 and 3.4 of the previous chapter. For the single-sized butterfly equations, the computation of each output involves the addition of at most four terms, whereas for the double-sized butterfly equations, the computation of each output involves the addition of seven terms. The resulting lack of regularity makes an attractive hardware implementation very difficult to achieve, therefore, without suitable reformulation of the associated equations.
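Before reformulating the equations, it may help to see Equations 4.9–4.16 written out operationally. The following hedged C sketch – illustrative names, and double-precision arithmetic rather than the fixed-point arithmetic ultimately intended – evaluates one instance of the double-sized butterfly, with (a, A) denoting X1[k] and X1[N/4−k], (b, B), (c, C), (d, D) the corresponding X2, X3, X4 pairs, and (c1, s1), (c2, s2), (c3, s3) the single-, double- and triple-angle trigonometric coefficient pairs.

/* One double-sized radix-4 butterfly: outputs in the order of
   Equations 4.9 to 4.16. */
void double_butterfly(double a, double A, double b, double B,
                      double c, double C, double d, double D,
                      double c1, double s1, double c2, double s2,
                      double c3, double s3, double out[8])
{
    out[0] = a + c1*b + s1*B + c2*c + s2*C + c3*d + s3*D;  /* X[k]      */
    out[1] = A + s1*B + c1*b - c2*C + s2*c - s3*D - c3*d;  /* X[N/4-k]  */
    out[2] = a - s1*b + c1*B - c2*c - s2*C + s3*d - c3*D;  /* X[k+N/4]  */
    out[3] = A - c1*B + s1*b + c2*C - s2*c - c3*D + s3*d;  /* X[N/2-k]  */
    out[4] = a - c1*b - s1*B + c2*c + s2*C - c3*d - s3*D;  /* X[k+N/2]  */
    out[5] = A - s1*B - c1*b - c2*C + s2*c + s3*D + c3*d;  /* X[3N/4-k] */
    out[6] = a + s1*b - c1*B - c2*c - s2*C - s3*d + c3*D;  /* X[k+3N/4] */
    out[7] = A + c1*B - s1*b + c2*C - s2*c + c3*D - s3*d;  /* X[N-k]    */
}

Counting the operations makes the irregularity explicit: each of these eight outputs is a seven-term sum, whereas each output of the single-sized butterflies of Equations 4.17–4.24 involves at most four terms.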
4.3 Single-to-Double Conversion of the Radix-4 Butterfly Equations

In order to derive a computationally-efficient single-design solution to the radix-4 FHT, it is therefore necessary to regularize the algorithm structure by replacing the single and double sized butterflies with a generic version of the double-sized butterfly. Before this can be achieved, however, it is first necessary to show how the single-sized butterfly equations may be converted to the same form as that of the double-sized butterfly.

When just the zero-address equations need to be carried out, it may be achieved via the interleaving of two sets, each of four equations, one set involving the consecutive samples {X1[0], X2[0], X3[0], X4[0]}, say, and the other set involving the consecutive samples {Y1[0], Y2[0], Y3[0], Y4[0]}, say. This yields the modified butterfly equations

X[0] = X1[0] + X2[0] + X3[0] + X4[0]  (4.25)
Y[0] = Y1[0] + Y2[0] + Y3[0] + Y4[0]  (4.26)
X[N/4] = X1[0] + X2[0] − X3[0] − X4[0]  (4.27)
Y[N/4] = Y1[0] + Y2[0] − Y3[0] − Y4[0]  (4.28)
X[N/2] = X1[0] − X2[0] + X3[0] − X4[0]  (4.29)
Y[N/2] = Y1[0] − Y2[0] + Y3[0] − Y4[0]  (4.30)
X[3N/4] = X1[0] − X2[0] − X3[0] + X4[0]  (4.31)
Y[3N/4] = Y1[0] − Y2[0] − Y3[0] + Y4[0],  (4.32)

with the associated double-sized butterfly being referred to as the “Type-I” butterfly.
Similarly, when both the zero-address and Nyquist-address equations need to be carried out – which is always true when the Nyquist-address equations are required – they may be combined in the same fashion to yield the butterfly equations

X[0] = X1[0] + X2[0] + X3[0] + X4[0]  (4.33)
X[N/8] = X1[N/8] + √2·X2[N/8] + X3[N/8]  (4.34)
X[N/4] = X1[0] + X2[0] − X3[0] − X4[0]  (4.35)
X[3N/8] = X1[N/8] − X3[N/8] + √2·X4[N/8]  (4.36)
X[N/2] = X1[0] − X2[0] + X3[0] − X4[0]  (4.37)
X[5N/8] = X1[N/8] − √2·X2[N/8] + X3[N/8]  (4.38)
X[3N/4] = X1[0] − X2[0] − X3[0] + X4[0]  (4.39)
X[7N/8] = X1[N/8] − X3[N/8] − √2·X4[N/8],  (4.40)
with the associated double-sized butterfly being referred to as the “Type-II” butterfly. With the indexing assumed to start from zero, rather than one, the even-indexed equations thus correspond to the zero-address butterfly and the odd-indexed equations to the Nyquist-address butterfly.

Thus, the sets of single-sized butterfly equations may be reformulated in such a way that the resulting composite butterflies now accept eight inputs and produce eight outputs, the same as the standard radix-4 double-sized butterfly, referred to as the “Type-III” butterfly. The result is that the radix-4 FHT, instead of requiring both single and double sized butterflies, may now be carried out with three simple variations of the double-sized butterfly.
4.4 Radix-4 Factorization of the FHT

A radix-4 factorization of the FHT may be obtained in a straightforward fashion in terms of the double-sized butterfly equations through application of the familiar divide-and-conquer [6] principle, as used in the derivation of other fast discrete unitary and orthogonal transforms [4], such as the FFT. This factorization leads to the algorithm described by the pseudo-code of Fig. 4.1, where all instructions within the scope of the outermost “for” loop constitute a single iteration in the temporal domain and all instructions within the scope of the innermost “for” loop constitute a single iteration in the spatial domain. Thus, each iteration in the temporal domain, more commonly referred to as a “stage”, comprises N/8 iterations in the spatial domain, where each iteration corresponds to the execution of a single set of double-sized butterfly equations.
[Fig. 4.1 Pseudo-code for radix-4 factorization of FHT algorithm – the transform length is set to N = 4^α, the input data addresses are di-bit reversed and the algorithm then loops through the log₄N temporal stages, executing the N/8 double-sized butterflies of each stage]
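The loop structure summarized in the Fig. 4.1 caption above may be sketched in C as follows; the helper names (dibit_reverse, butterfly_at) are illustrative placeholders rather than the pseudo-code's own identifiers, with the butterfly body being as outlined in Section 4.2.

void dibit_reverse(double *x, int N);                  /* data re-ordering */
void butterfly_at(double *x, int N, int stage, int j); /* one butterfly set */

/* Radix-4 factorization: alpha = log4(N) temporal stages, each comprising
   N/8 spatial iterations (one double-sized butterfly per iteration). */
void rfht(double *x, int N, int alpha)   /* N = 4^alpha */
{
    dibit_reverse(x, N);
    for (int stage = 0; stage < alpha; stage++)   /* temporal loop */
        for (int j = 0; j < N / 8; j++)           /* spatial loop  */
            butterfly_at(x, N, stage, j);
}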
Φ(n) ∈ {0, 1, . . . , (N/8) − 1},  (6.4)

where the parameter “n” ∈ {0, 1, . . . , N − 1} corresponds to the sample address after di-bit reversal.
To better understand the workings of these rotation-based memory mappings for the storage/accession of the data, it is best to first visualize the data as being stored in a two-dimensional array of four columns and N/4 rows, where the data is stored on a row-by-row basis, with four samples to a row. The effect of the generic address mapping, Ψ₄, as shown in the example given in Table 6.1 below, is to apply a left-sense rotation to each row of data where the amount of rotation is dependent upon the particular (N/4) × 4 sub-array to which it belongs, as well as the particular (N/16) × 4 sub-array within that sub-array, as well as the particular (N/64) × 4 sub-array within that sub-array, etc., until all the relevant partitions have been accounted for – there are log₄N of these. As a result, there is a cyclic rotation being applied to the data over each such sub-array – the cyclic nature of the mapping means that within each sub-array the amount of rotation to be applied to a given row of data is one position greater than that for the preceding row. This property, as will later be seen, may be beneficially exploited by the GD-BFLY through the way in which it stores/accesses the elements of the input/output data sets, for both individual instances of the GD-BFLY, via the address mapping 1, as well as for consecutive pairs of instances, via the address mapping 2, over all eight memory banks. Examples of the address mappings 1 and 2 are given in Tables 6.2 and 6.3, respectively, where each pair of consecutive rows of bank addresses corresponds to the locations of a complete GD-BFLY input/output data set.

Table 6.1 Structure of generic address mapping Ψ₄ for case of length-64 data set

Row   Value of generic address mapping Ψ₄
 0    0 2 4 6
 1    2 4 6 0
 2    4 6 0 2
 3    6 0 2 4
 4    2 4 6 0
 5    4 6 0 2
 6    6 0 2 4
 7    0 2 4 6
 8    4 6 0 2
 9    6 0 2 4
10    0 2 4 6
11    2 4 6 0
12    6 0 2 4
13    0 2 4 6
14    2 4 6 0
15    4 6 0 2

Table 6.2 Structure of address mapping 1 for case of length-64 data set

Row   Value of address mapping 1
 0    0 3 4 7
 1    2 5 6 1
 2    4 7 0 3
 3    6 1 2 5
 4    2 5 6 1
 5    4 7 0 3
 6    6 1 2 5
 7    0 3 4 7
 8    4 7 0 3
 9    6 1 2 5
10    0 3 4 7
11    2 5 6 1
12    6 1 2 5
13    0 3 4 7
14    2 5 6 1
15    4 7 0 3

Suppose now, for ease of exposition, that the arithmetic within the GD-BFLY can be assumed to be carried out fast enough to allow for the data sets processed by the GD-BFLY to be both read from and written back to DM within a single clock cycle – this is not of course actually achievable and a more realistic scenario is to be discussed later in Section 6.4 when the concept of internal pipelining within the PE is introduced.
Table 6.3 Structure of address mapping 2 for case of length-64 data set

Row   Value of address mapping 2
 0    0 2 4 6
 1    2 4 6 0
 2    5 7 1 3
 3    7 1 3 5
 4    2 4 6 0
 5    4 6 0 2
 6    7 1 3 5
 7    1 3 5 7
 8    4 6 0 2
 9    6 0 2 4
10    1 3 5 7
11    3 5 7 1
12    6 0 2 4
13    0 2 4 6
14    3 5 7 1
15    5 7 1 3
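The rotation rule visible in Table 6.1 can be captured compactly: the left-rotation applied to row m accumulates one position per row at each of the log₄N partition levels, which amounts to summing the base-4 digits of m modulo 4. The fragment below is a reconstruction from the table's visible pattern – a hedged C sketch with illustrative names, not the monograph's own formulation of the mapping – and it reproduces the bank values of Table 6.1 for the length-64 case.

/* Bank value at (row m, column col) of the generic mapping of Table 6.1. */
int psi4_bank(int m, int col)
{
    int rot = 0;
    for (int q = m; q > 0; q >>= 2)   /* sum the base-4 digits of m */
        rot += q & 3;
    return 2 * ((col + rot) & 3);     /* rotated row of (0, 2, 4, 6) */
}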
The input/output data set to/from the GD-BFLY comprises four even-address samples and four odd-address samples, where for a given instance of the GD-BFLY for the first temporal stage, each of the eight DM banks will contain just one sample, as required, whilst for a given instance of the GD-BFLY for the remaining α − 1 temporal stages, four of the eight DM banks will each contain one even-address sample and one odd-address sample with the remaining four DM banks being unused. As a result, it is generally not possible to carry out all eight reads/writes for the same data set using all eight DM banks in a single clock cycle. However, if for all but the first temporal stage we consider any pair of consecutive instances of the GD-BFLY, then it may be shown that the sample addresses of the second instance will occupy the four DM banks not utilized by the first, so that every two clock cycles the eight even-address samples and the eight odd-address samples required by the pair of consecutive instances of the GD-BFLY may be both read from and written to DM, as required for conflict-free and in-place memory addressing – see Fig. 6.2 below.

[Fig. 6.2 Addressing of hypothetical pair of consecutive generic double butterflies for all stages other than first: the first butterfly of the pair occupies four of the memory banks 0–7, with one even-address sample (ES) and one odd-address sample (OS) per bank, whilst the second butterfly of the pair occupies the remaining four banks in the same manner]

Thus, based upon our simplistic assumption, all eight DM banks for the first temporal stage may be both read from and written to within a single clock cycle, whilst for the remaining α − 1 temporal stages it can be shown that in any one clock cycle all the input samples for one instance of the GD-BFLY may be both read from DM and processed by the GD-BFLY, whilst all those output samples produced by its predecessor may be written back to DM. As a result, the solution based upon the single-PE R24 FHT architecture will be able to yield complete GD-BFLY output sets at the rate of one set per clock cycle, as required.

An alternative way of handling the pipelining for the last α − 1 temporal stages would be to read just four samples for the first clock cycle, with one sample from each of the four even-addressed memory banks. This would be followed by eight samples for each succeeding clock cycle apart from the last, with four samples for the current instance of the GD-BFLY being read from the four
even-addressed/odd-addressed memory banks and four samples for the succeeding instance of the GD-BFLY being read from the remaining four odd-addressed/even-addressed memory banks. The processing would be completed by reading just four samples for the last clock cycle, with one sample from each of the four odd-addressed memory banks. In this way, for each clock cycle apart from the first and the last, eight samples could be read/written from/to all eight memory banks, one sample per memory bank, with one complete set of eight GD-BFLY outputs being thus produced and another partly produced, to be completed on the succeeding clock cycle. Note, however, that a temporary buffer would be needed to hold one complete GD-BFLY output set as the samples written back to memory would also need to come from consecutive GD-BFLY output sets, rather than from a single GD-BFLY output set, due to the dual-port nature of the memory. For the last clock cycle, the remaining set of eight GD-BFLY outputs could also be written out to all eight memory banks, again one sample per memory bank.

The choice of how best to carry out the pipelining is really down to the individual HDL programmer, but for the purposes of consistency within the current monograph, it will be assumed that all the samples required for a given instance of the GD-BFLY are to be read from the DM within the same clock cycle, two samples per even-addressed/odd-addressed memory bank as originally described, so that all the input samples for one instance of the GD-BFLY may be both read from DM and processed by the GD-BFLY, whilst all those output samples produced by its predecessor are written back to DM.
6.3.2 Trigonometric Coefficient Storage, Accession and Generation

Turning now to the trigonometric coefficients, the GD-BFLY, as described in Chapter 4, requires that six non-trivial trigonometric coefficients be either accessed
from CM or efficiently generated in order to be able to carry out the GD-BFLY processing for a given data set. Two schemes are now outlined for performing this task whereby all six trigonometric coefficients may be accessed simultaneously, within a single clock cycle, these schemes offering a straightforward trade-off of memory requirement against addressing complexity – as measured in terms of the number of arithmetic/logic operations required for computing the necessary addresses. The two schemes considered cater for those extremes whereby the requirement is either to minimize the arithmetic complexity or to minimize the CM requirement. Clearly, other options that fall between these two extremes are also possible, but these may be easily defined and developed given an understanding of the techniques discussed here and in Section 4.6 of Chapter 4.
6.3.2.1 Minimum-Arithmetic Addressing Scheme

The trigonometric coefficient set comprises cosinusoidal and sinusoidal terms for single-angle, double-angle and triple-angle cases. Therefore, in order for all six trigonometric coefficients to be obtained simultaneously, three LUTs may be exploited with the two single-angle coefficients being read from the first LUT, the two double-angle coefficients from the second LUT, and the two triple-angle coefficients from the third LUT. To keep the arithmetic complexity of the addressing to a minimum each LUT may be defined as in Section 4.6.1 of Chapter 4, being sized according to the single-quadrant addressing scheme, whereby the trigonometric coefficients are read from a sampled version of the sinusoidal function with argument defined from 0 up to π/2 radians. Thus, for the case of an N-point R24 FHT, it is required that each of the three single-quadrant LUTs be of size N/4 words, yielding a total CM requirement, denoted C_MEM^Aopt, of

C_MEM^Aopt = 3 × (1/4)N = (3/4)N  (6.5)
words. This scheme would seem to offer a reasonable compromise, therefore, between the CM requirement and the addressing complexity, using more memory than is theoretically necessary, in terms of replicated LUTs, in order to keep the arithmetic/logic requirement of the addressing as simple as possible – namely, a zero arithmetic complexity when using the twelve-multiplier version of the GD-BFLY or six additions when using the nine-multiplier version.
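A minimal sketch of the single-quadrant LUT construction implied by this scheme is given below in C – the function name and the floating-point representation are illustrative only (a silicon implementation would hold fixed-point words). Three physical copies of the same quarter-cycle sine table are built so that the single-angle, double-angle and triple-angle coefficient pairs can all be fetched within the same clock cycle, cosines being read via the complementary address since cos(theta) = sin(pi/2 − theta).

#include <math.h>

/* Three replicated single-quadrant LUTs, each of N/4 words, sampling
   sin(theta) for theta in [0, pi/2). */
void build_quadrant_luts(double *lut1, double *lut2, double *lut3, int N)
{
    for (int j = 0; j < N / 4; j++) {
        double s = sin((M_PI / 2.0) * j / (N / 4));
        lut1[j] = s;    /* single-angle accesses  */
        lut2[j] = s;    /* double-angle accesses  */
        lut3[j] = s;    /* triple-angle accesses  */
    }
}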
6.3.2.2 Minimum-Memory Addressing Scheme

Another approach to the problem is to adopt a two-level LUT for the first of the three angles, where the associated complementary-angle LUTs are as defined in Section 4.6.2 of Chapter 4, comprising one coarse-resolution region of √N/2 words for the sinusoidal function, and one fine-resolution region of √N/2 words for each of the cosinusoidal and sinusoidal functions. To keep the CM requirement to a minimum, the double-angle and triple-angle trigonometric coefficients are then obtained straightforwardly through the application of standard trigonometric identities, as given by Equations 4.65–4.68 of Chapter 4, so that the solution requires that three complementary-angle LUTs be used for just the single-angle trigonometric coefficient case, each LUT of size √N/2 words, yielding a total CM requirement, denoted C_MEM^Mopt, of

C_MEM^Mopt = (3/2)√N  (6.6)

words. The double-angle and triple-angle trigonometric coefficients could also be obtained by assigning a two-level LUT to the storage of each, but the associated arithmetic complexity involved in generating the addresses turns out to be identical to that obtained when the trigonometric coefficients are obtained through the direct application of standard trigonometric identities, so that in this instance the replication of the two-level LUT provides us with three times the memory requirement but with no arithmetic advantage as compensation. With the proposed technique, therefore, the CM requirement, as given by Equation 6.6, is minimized at the expense of additional arithmetic/logic for the addressing – namely, an arithmetic complexity of seven multiplications and eight additions when using the twelve-multiplier version of the GD-BFLY or seven multiplications and 14 additions when using the nine-multiplier version.
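The identity-based generation referred to here may be sketched as follows – a hedged C fragment with illustrative names. Given the single-angle pair (c1, s1) retrieved from the complementary-angle LUTs, the double-angle and triple-angle pairs follow from the standard identities using seven general multiplications (the factor of two being a simple shift in fixed-point hardware), consistent with the multiplication count quoted above.

/* Expand single-angle coefficients to double- and triple-angle pairs. */
void expand_angles(double c1, double s1,
                   double *c2, double *s2, double *c3, double *s3)
{
    *s2 = 2.0 * s1 * c1;             /* sin 2a = 2.sin a.cos a               */
    *c2 = c1 * c1 - s1 * s1;         /* cos 2a = cos^2 a - sin^2 a           */
    *s3 = s1 * (*c2) + c1 * (*s2);   /* sin 3a = sin a.cos 2a + cos a.sin 2a */
    *c3 = c1 * (*c2) - s1 * (*s2);   /* cos 3a = cos a.cos 2a - sin a.sin 2a */
}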
6.3.2.3 Summary of Addressing Schemes

The results of this section are summarized in Table 6.4 below, where the CM requirement and arithmetic complexity for each of the conflict-free parallel addressing schemes are given. A trade-off has clearly to be made between CM requirement and arithmetic complexity, with the choice being ultimately made according to the resources available on the target hardware. Versions I and II of the solution to the R24 FHT correspond to the adoption of the minimum-arithmetic addressing scheme for the twelve-multiplier and nine-multiplier PEs, respectively, whilst Versions III and IV correspond to the adoption of the minimum-memory addressing scheme for the twelve-multiplier and nine-multiplier PEs, respectively. The trigonometric coefficient accession/generation schemes required for Versions I to IV of the above solution are illustrated via Figs. 6.3–6.6, respectively, with the associated arithmetic complexity for the addressing given by zero when using Version I of the R24 FHT solution, six additions when using Version II, seven multiplications and eight additions when using Version III, and seven multiplications and 14 additions when using Version IV.
Table 6.4 Performance/resource comparison for fast multiplier versions of N-point regularized FHT

                Arithmetic complexity                        Memory requirement (words)              Time complexity (clock cycles)
Version of      Coefficient generator   Processing element
solution        Multipliers  Adders     Multipliers  Adders  Data            Coefficients           Update time      Latency
I               0            0          12           22      8×(1/8)N = N    3×(1/4)N = 3N/4        (1/8)N·log₄N     (1/8)N·log₄N
II              0            6          9            25      8×(1/8)N = N    3×(1/4)N = 3N/4        (1/8)N·log₄N     (1/8)N·log₄N
III             7            8          12           22      8×(1/8)N = N    3×(1/2)√N = (3/2)√N    (1/8)N·log₄N     (1/8)N·log₄N
IV              7            14         9            25      8×(1/8)N = N    3×(1/2)√N = (3/2)√N    (1/8)N·log₄N     (1/8)N·log₄N
[Fig. 6.3 Resources required for trigonometric coefficient accession/generation for Version I of solution with one-level LUTs: Sn = sin(nθ) and Cn = cos(nθ) read from LUT[n], each LUT of (1/4)N words, yielding the coefficient set D1–D9]

[Fig. 6.4 Resources required for trigonometric coefficient accession/generation for Version II of solution with one-level LUTs]
Note that with the minimum-memory addressing scheme of Figs. 6.5 and 6.6 pipelining will certainly need to be introduced so as to ensure that a complete new set of trigonometric coefficients is available for input to the GD-BFLY for each new clock cycle.
[Fig. 6.5 Resources required for trigonometric coefficient accession/generation for Version III of solution with two-level LUT – pipelining required to maintain computational throughput: S1 = sin(α) and C1 = cos(α) obtained via LUT[1], S2 = sin(β) via LUT[2] and C2 = cos(β) via LUT[3], each LUT of (1/2)√N words, with delay stages aligning the derived coefficient triples D1–D3, D4–D6 and D7–D9]

[Fig. 6.6 Resources required for trigonometric coefficient accession/generation for Version IV of solution with two-level LUT – pipelining required to maintain computational throughput]
6.4 Design of Pipelined PE for Single-PE Architecture

To exploit the multi-bank memories and LUTs, together with the associated conflict-free and (for the data) in-place parallel memory addressing schemes, the PE needs now to be able to produce one complete GD-BFLY output set per clock cycle, as discussed in Section 6.3.1, bearing in mind that although, for the first temporal stage,
all eight DM banks can be both read from and written to within the same clock cycle, for the remaining temporal stages, only those four DM banks not currently being read from may be written to (and vice versa).
6.4.1 Internal Pipelining of Generic Double Butterfly

The above constraint suggests that a suitable PE design may be achieved if the GD-BFLY is carried out by means of a β-stage computational pipeline, as shown in the simple example of Fig. 6.7, where “β” is an odd-valued integer and where each CS of the pipeline contains its own set of storage registers for holding the current set of processed samples. In this way, if a start-up delay of D_CG clock cycles is required for a pipelined version of the trigonometric coefficient generator and D_PE clock cycles for a pipelined version of the PE, where

D_PE = β − 1,  (6.7)

then after a total start-up delay of D_SU clock cycles for the first temporal stage of the processing, where

D_SU = D_CG + D_PE,  (6.8)

the PE will be able to read in eight samples and write out eight samples every clock cycle, thereby enabling the first temporal stage to be completed in D_SU + N/8 clock cycles, and subsequent temporal stages to be completed in N/8 clock cycles. Note that the pipeline delay D_PE must account not only for the sets of adders and permutators, but also for the fixed-point multipliers which are themselves typically implemented as pipelines, possibly requiring as many as five CSs according to the required precision. As a result, it is likely that at least nine CSs might be required for implementation of the computational pipeline, with each temporal stage of the R24 FHT requiring the PE to execute the pipeline a total of N/8 times.
[Fig. 6.7 Parallel solution for PE using five-stage computational pipeline, PE0–PE4. Stage 0: both even-addressed (EB) and odd-addressed (OB) memory banks are read from & written to at the same time – one sample per memory bank. Stages 1 to α−1: when even-addressed EB memory banks are read from, odd-addressed OB memory banks are written to & vice versa – two samples per memory bank]
[Fig. 6.8 Memory structure and interconnections for internally-pipelined partitioned-memory PE: an address-generation unit drives the computational stages CS0 to CSβ−1, with eight-sample reads/writes to the data memory banks DM0–DM7 and six reads per cycle from the coefficient memories CM0–CM2 (CM – coefficient memory, CS – computational stage, DM – data memory)]
A description of the pipelined PE including the structure of the memory, for both the data and the trigonometric coefficients, together with its associated interconnections, is given in Fig. 6.8. Note, however, that depending upon the relative lengths of the computational pipeline, “β”, and the transform, “N”, an additional delay may need to be applied for every temporal stage, not just the first, in order to ensure that sample sets are not updated in one temporal stage before they have been processed and written back to DM in the preceding temporal stage, as this would result in the production of invalid outputs. If the transform length is sufficiently greater than the pipeline delay, however, this problem may be avoided – these points are discussed further in Section 6.4.3.
6.4.2 Space Complexity Considerations

The space complexity is determined by the combined requirements of the multi-bank dual-port memory and the arithmetic/logic components. Adopting the minimum-arithmetic addressing scheme of Versions I and II of the R24 FHT solution (as detailed in Table 6.4), the worst-case total memory requirement for the partitioned-memory single-PE architecture, denoted M_FHT^(W), is given by

M_FHT^(W) = 8 × (1/8)N + 3 × (1/4)N = (7/4)N  (6.9)

words, where N words are required by the eight-bank DM and 3N/4 words for the three single-quadrant LUTs required for the CM. In comparison, by adopting the minimum-memory addressing scheme of Versions III and IV of the R24 FHT solution (as detailed in Table 6.4), the best-case total memory requirement for the partitioned-memory single-PE architecture, denoted M_FHT^(B), is given by

M_FHT^(B) = 8 × (1/8)N + 3 × (1/2)√N = N + (3/2)√N  (6.10)

words, where N words are required by the eight-bank DM and (3/2)√N words for the three complementary-angle LUTs required for the CM. The arithmetic/logic requirement is dominated by the presence of the dedicated fast fixed-point multipliers, with a total of nine or 12 being required by the GD-BFLY and up to seven for the memory addressing, depending upon the chosen addressing scheme.
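By way of a worked check of Equations 6.9 and 6.10 – the helper names below are illustrative only – a 4K-point transform requires 7 × 4096/4 = 7168 words under the minimum-arithmetic scheme, but only 4096 + (3/2) × 64 = 4192 words under the minimum-memory scheme.

#include <math.h>

double mem_worst_case(double N) { return 7.0 * N / 4.0; }      /* Eq 6.9  */
double mem_best_case (double N) { return N + 1.5 * sqrt(N); }  /* Eq 6.10 */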
6.4.3 Time Complexity Considerations

The partitioned-memory single-PE architecture, based upon the internally-pipelined PE described in Section 6.4.1, enables a new GD-BFLY output set to be produced every clock cycle. Therefore, the first temporal stage will be completed in D_SU + N/8 clock cycles and subsequent temporal stages in either N/8 clock cycles or D_SM + N/8 clock cycles, where the additional delay D_SM provides the necessary safety margin to ensure that the outputs produced from each stage are valid. The delay depends upon the relative lengths of the computational pipeline and the transform and may range from zero to as large as D_PE. As a result, the N-point R24 FHT, where “N” is as given by Equation 5.2, has a worst-case time complexity, denoted T_FHT^(W), of

T_FHT^(W) = (D_SU + (1/8)N) + (α − 1)(D_SM + (1/8)N) = (D_SU + (α − 1)D_SM) + (1/8)N·log₄N  (6.11)

clock cycles, and a best-case or standard time-complexity, denoted T_FHT^(B), for when the safety margin delay is not required, of

T_FHT^(B) = (D_SU + (1/8)N) + (α − 1) × (1/8)N = D_SU + (1/8)N·log₄N  (6.12)
clock cycles, given that α = log₄N. More generally, for any given combination of pipeline length, “β”, and transform length, “N”, it should be a straightforward task to calculate the exact safety margin delay, D_SM, required after each temporal stage in order to guarantee the generation of valid outputs, although for most parameter combinations of practical interest it will almost certainly be set to zero so that the time-complexity for each instance of the transform will be as given by Equation 6.12.

Note that a multi-PE R24 FHT architecture, based upon the adoption of an α-stage computational pipeline, could only yield this level of performance by exploiting up to “α” times as much silicon as the single-PE R24 FHT architecture, assuming that the PEs in the pipeline are working in sequential fashion with the data and trigonometric coefficients stored in global memory – that is, with the reads/writes being performed at a rate of one per clock cycle. Each stage of a pipelined multi-PE R24 FHT architecture requires the reading/writing of all N samples so that α − 1 double-buffered memories – each holding up to 2N samples to cater for both inputs and outputs of the PE – are typically required for connecting the PEs in the pipeline.
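As a simple worked illustration of Equation 6.12: for N = 4096 we have α = 6, so that T_FHT^(B) = D_SU + (1/8) × 4096 × 6 = D_SU + 3072 clock cycles which, at the 200 MHz clock rate considered in the following section, corresponds to a little over 15 μs; similarly, for N = 16384 (α = 7) the figure is D_SU + 14336 clock cycles, or a little under 72 μs – both consistent with the single-channel update times quoted in Table 6.5 below.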
6.5 Performance and Requirements Analysis of FPGA Implementation

The theoretical complexity requirements discussed above have been proven in silicon by TRL Technology (a member of L3 Communications Corporation, U.S.A.) in the U.K., who have produced a generic real-data radix-4 FFT implementation – based upon the R24 FHT – on a Xilinx Virtex-II Pro 100 FPGA [8], running at close to 200 MHz, for use in various wireless communication systems. A simple comparison with the state-of-the-art performances of the RFEL QuadSpeed FFT [4] and Roke Manor Research FFT solutions [5] (both multi-PE designs from the UK whereby a complex-data FFT may be used to process simultaneously two real-valued data sets – packing/unpacking of the input/output data sets therefore needs to be accounted for) is given in Table 6.5 for the case of 4K-point and 16K-point real-data FFTs (where 1K ≡ 1,024), where the RFEL and Roke Virtex-II Pro 100 results are extrapolated from company data sheets and where the Version II solution of the R24 FHT described in Section 6.3.2.3 – using the minimum-arithmetic addressing scheme together with a nine-multiplier PE – is assumed for the TRL solution.

Clearly, many alternatives to these two commercially-available devices could have been used for the purposes of this comparison, but at the time the comparison was made, these devices were both considered to be viable options with performances that were (and still are) quite representative of this particular class of multi-PE streaming FFT solution.
Table 6.5 Performance and resource utilization for 4K-point and 16K-point real-data radix-4 FFTs (clock frequency: 200 MHz; percentages denote fraction of device capacity)

          FFT     Input word  1K×18 RAMs (with   18×18         Logic          Update time per     Latency per         I/O speed
          length  length      double buffering)  multipliers   slices         real-data FFT (μs)  real-data FFT (μs)  (samples/cycle)
TRLᵃ      4K      18          11 (2.5%)          9 (2.0%)      5,000 (5.0%)   15 (1 channel)      15 (1 channel)      1
RFELᵇ     4K      12          33 (7.5%)          30 (6.8%)     5,000 (5.0%)   10 (2 channels)     21 (2 channels)     4
ROKEᵇ     4K      14          42 (9.5%)          48 (10.8%)    3,800 (3.8%)   10 (2 channels)     21 (2 channels)     4
TRLᵃ      16K     18          44 (9.9%)          9 (2.0%)      5,000 (5.0%)   72 (1 channel)      72 (1 channel)      1
RFELᵇ     16K     12          107 (24.1%)        37 (8.3%)     6,500 (6.5%)   41 (2 channels)     83 (2 channels)     4
ROKEᵇ     16K     10          124 (28.0%)        55 (12.4%)    5,800 (5.8%)   41 (2 channels)     83 (2 channels)     4

ᵃ DHT-to-DFT conversion not accounted for in figures
ᵇ Packing/unpacking requirement not accounted for in figures
Note that the particular choice of the real-from-complex strategy for the two commercially-available solutions has been made to ensure that we compare like with like, or as close as we can make it, as the adoption of the DDC-based approach would introduce additional filtering operations to complicate the issue together with an accompanying processing delay. As a matter of interest, for an efficient implementation with the particular device used here, the Virtex-II Pro 100, a complex DDC with 84 dB of spurious-free dynamic range (SFDR) has been shown to require approximately 1,700 slices of programmable logic [1].

Although the performances, in terms of the update time and latency figures, are similar for the solutions described, it is clear from the respective I/O requirements that the RFEL and Roke performance figures are achieved at the expense of having to process twice as much data at a time (two channels yielding two output sets instead of one) as the TRL solution and (for the case of an N-point transform) having to execute N/2 radix-2 butterflies every N/2 clock cycles, so that the pipeline needs to be fed with data generated by the ADC unit(s) at the rate of N complex-valued (or 2N real-valued) samples every N/2 clock cycles. This means generating the samples at four times the speed (four samples per clock cycle instead of just one) of the TRL solution which might in turn involve the use of multiple ADC units.

The results highlight the fact that although the computational densities of the three solutions are not that dissimilar, the TRL solution is considerably more area-efficient, requiring a small fraction of the memory and fast multiplier requirements of the other two solutions in order to satisfy the latency constraint, whilst the logic requirement – as required for controlling the operation and interaction of the various components of the FPGA implementation – which increases significantly with transform length for the RFEL and Roke solutions, remains relatively constant. The scalable nature of the TRL solution means that only the memory requirement needs substantially changing from one transform length to another in order to reflect the increased/decreased quantity of data to be processed, making the cost of adapting the solution for new applications negligible. For longer transforms, better use of the resources could probably be achieved by trading off memory against fast multiplier requirement through the choice of a more memory-efficient addressing scheme – as discussed above in Section 6.3. Note that double buffering is assumed for the sizing of the TRL solution in order to support continuous processing, whereby the I/O is limited to N clock cycles, this resulting in a doubling of the DM requirement.
6.6 Constraining Latency Versus Minimizing Update-Time

An important point to note is that most, if not all, of the commercially-available FFT solutions are multi-PE solutions geared to streaming operation where the requirement relates to the minimization of the update time – so as to maximize the throughput – rather than satisfying some constraint on the latency, as has been addressed in this monograph with the design of the R24 FHT. In fact, the point should
perhaps be re-made here that in the “Statement of Performance Objective No 2” made at the beginning of the chapter, the requirement was simply that the N-point transform be executed within N clock cycles. Now from Equation 6.12, this is clearly achievable for all transform lengths up to and including 64K, so that for transforms larger than this it would be necessary to increase the throughput rate by an appropriate amount in order that continuous operation be maintained – note that two PEs would maintain continuous operation for N ≤ 4¹⁶.

To clarify the situation, whereas with a pipelined FFT approach, as adopted by the multi-PE commercially-available solutions, one is able to attain a high throughput rate by effectively minimizing the update time, with the R24 FHT it is possible to increase the throughput rate by adopting an SIMD-type approach, either via a “multi-R24 FHT” solution whereby multiple R24 FHTs are used to facilitate the simultaneous processing of multiple data sets or via a “multi-PE” solution whereby multiple PEs are used to facilitate the parallel processing of a single data set by means of a multi-PE version of the R24 FHT. The multi-PE solution could thus be used to maintain continuous operation for the case of extremely large transform lengths, whereas the multi-R24 FHT solution could be used to deal with those computationally-demanding applications where the throughput rate for the generation of each new N-point real-data FFT output data set needs to be greater than one set every N clock cycles.

With the multi-R24 FHT approach, the attraction is that it is possible to share both the control logic and the CM between the R24 FHTs, given that the LUTs contain precisely the same information and need to be accessed in precisely the same order for each R24 FHT. Such an approach could also be used to some advantage, for example, when applied to the computation of the complex-data DFT, as discussed in Section 3.4.2 of Chapter 3, where one R24 FHT is applied to the computation of the real component of the data and one R24 FHT is applied to the computation of the imaginary component. A highly-parallel dual-R24 FHT solution such as this would be able to attain, for the case of complex data, the eightfold speed-up already achieved for the real-data case over a purely sequential solution (now processing eight complex-valued samples per clock cycle rather than eight real-valued samples), yet for minimum additional resources.

With the multi-PE approach – restricting ourselves here to the simple case of two PEs – it needs firstly to be noted that as dual-port memory is necessary for the operation of the single-PE solution so quad-port memory would be necessary for the operation of a dual-PE solution, so as to facilitate the reading/writing of two samples from/to each of the eight memory banks for each clock cycle, as well as the reading of four (rather than two) trigonometric coefficients from each of the LUTs, as shared by the PEs. Alternate instances of the GD-BFLY could be straightforwardly assigned to alternate PEs with all eight GD-BFLY inputs/outputs for each PE being read/written from/to memory simultaneously so that conflict-free and in-place parallel memory addressing would be maintained for each PE.

At present, genuine quad-port memory is not available from the major FPGA manufacturers, so that the obtaining of such a facility may only be achieved through the modification of existing dual-port memory at the effective cost of a doubling
of the memory requirement. A simple alternative may be obtained, however, by noting that with current FPGA technology there is typically an approximate factor of 2 difference between the dual-port memory read/write access time and the update time for the fast multipliers – and thus the update time of the GD-BFLY computational pipeline. As a result, by doubling the speed at which the reads/writes are performed, a pseudo quad-port memory capability may be achieved whereby the data is read/written from/to the dual-port memory at twice the rate of the computational pipeline and thus at a sufficient rate to sustain the operation of the pipeline. The ideas considered in this section involving the use of multi-PE and multiR24 FHT solutions would seem to suggest that the throughput rate of the most advanced commercially-available solutions could be achieved for reduced quantities of silicon, so that the GD-BFLY-based PE could thus be used as a building block to define real-data FFT solutions to a range of problems according to whether the particular design objective involves the satisfying of some constraint on the latency, as addressed by this monograph, or the maximization of the throughput rate.
6.7 Discussion

The outcome of this chapter is the specification of a partitioned-memory single-PE computing architecture for the parallel computation of the R24 FHT, together with the specification of conflict-free and (for the data) in-place parallel memory addressing schemes for both the data and the trigonometric coefficients, which enable the outputs from each instance of the GD-BFLY to be produced via this computing architecture within a single clock cycle. Four versions of the PE have been described – all based upon the use of a fixed-point fast multiplier and referred to as Versions I, II, III and IV of the solution – which provide the user with the ability to trade off arithmetic complexity, in terms of both adders and multipliers, against the memory requirement, with a theoretical performance and resource comparison of the four solutions being provided in tabular form. The mathematical/logical correctness of the operation of all four versions of the solution has been proven in software via a computer program written in the “C” programming language.

Silicon implementations of both 4K-point and 16K-point transforms have been studied, each using Version II of the R24 FHT solution – which uses the minimum-arithmetic addressing scheme together with a nine-multiplier version of the PE – and the Xilinx Virtex-II Pro 100 device running at a clock frequency of close to 200 MHz. The R24 FHT results were seen to compare very favourably with those of two commercially-available industry-standard multi-PE solutions, with both the 4K-point and 16K-point transforms achieving the stated performance objective whilst requiring greatly reduced silicon resources compared to their commercial complex-data counterparts. Note that although the target device family may be somewhat old, it was more than adequate for purpose, which was simply to facilitate comparison of the relative merits of the single-PE and multi-PE architectures. As already stated, with real-world applications it is not always possible, for various practical/financial
reasons, to have access to the latest device technologies. Such a situation does tend to focus the mind, however, as one is then forced to work within whatever silicon budget one happens to have been dealt.

Note that a number of scalable single-PE designs for the fixed-radix FFT [2, 6, 7], along the lines of that discussed in this chapter for the R24 FHT, have already appeared in the technical literature over the past 10–15 years for the more straightforward complex-data case, each such solution using a simplified version of the memory addressing scheme discussed here whereby multi-bank memory is again used to facilitate the adoption of partitioned-memory processing.

Another important property of the proposed set of R24 FHT designs discussed here is that they are able, via the application of the block floating-point scaling technique, to optimize the achievable dynamic range of the Hartley-space (and thus Fourier-space) outputs and therefore to outperform the more conventional streaming FFT solutions which, given the need to process the data as and when it arrives, are restricted to the use of various fixed scaling strategies in order to address the fixed-point overflow problem. With fully-optimized streaming operation, the application of block floating-point scaling would involve having to stall the optimal flow of data through the computational pipeline, as the entire set of outputs from each stage of butterflies needs to be passed through the “maximum” function in order that the required common exponent may be found. As a result, the block-based nature of the single-PE R24 FHT operation means that it is also able to produce higher-accuracy transform-space outputs than is achievable by its multi-PE FFT counterparts.

Finally, it should be noted that the data re-ordering – carried out here by means of the di-bit reversal mapping – to be applied to the input data to the transform can be comfortably carried out in less than N clock cycles, for a length-N transform, so that performance may be maintained through the use of double buffering, whereby one data set is being re-ordered and written to one set of DM banks whilst another data set – its predecessor – is being read/written from/to another set of DM banks by the R24 FHT. The functions of the two sets of DM banks are then interchanged after the completion of each R24 FHT. Thus, we may set up what is essentially a two-stage pipeline, where the first stage of the pipeline carries out the task of data re-ordering and the second carries out the R24 FHT on the re-ordered data. The data re-ordering may be carried out in various ways, as already outlined in Section 2.4 of Chapter 2.
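The block floating-point operation described above may be sketched as follows – a hedged C fragment with illustrative names, for a fixed-point word of word_bits bits: the entire block of stage outputs is passed through the “maximum” function and the common left-shift (i.e., the adjustment to the block's common exponent) is then the headroom remaining below full scale.

#include <stdlib.h>

/* Common shift for block floating-point renormalization of one stage. */
int block_shift(const int *x, int n, int word_bits)
{
    long maxmag = 0;
    for (int i = 0; i < n; i++) {          /* block-wide maximum search */
        long m = labs((long)x[i]);
        if (m > maxmag) maxmag = m;
    }
    if (maxmag == 0) return 0;
    long full_scale = 1L << (word_bits - 1);
    int shift = 0;
    while ((maxmag << 1) < full_scale) {   /* count unused leading bits */
        maxmag <<= 1;
        shift++;
    }
    return shift;  /* apply as left-shift; block exponent drops by shift */
}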
References

1. R. Hosking, New FPGAs tackle real-time DSP tasks for defense applications, Boards & Solutions Magazine (November 2006)
2. L.G. Johnson, Conflict-free memory addressing for dedicated FFT hardware. IEEE Trans. Circuits Syst. II: Analog Dig. Signal Proc. 39(5), 312–316 (1992)
3. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
4. RF Engines Ltd., IP Cores – Xilinx FFT Library, product information sheet available at company web site: www.rfel.com
5. Roke Manor Research Ltd., Ultra High Speed Pipeline FFT Core, product information sheet available at company web site: www.roke.com
6. B.S. Son, B.G. Jo, M.H. Sunwoo, Y.S. Kim, A high-speed FFT processor for OFDM systems. Proc. IEEE Int. Symp. Circuits Syst. 3, 281–284 (2002)
7. C.H. Sung, K.B. Lee, C.W. Jen, Design and implementation of a scalable fast Fourier transform core. Proc. 2002 IEEE Asia-Pacific Conference on ASICs, 295–298 (2002)
8. Xilinx Inc., company and product information available at company web site: www.xilinx.com
Chapter 7
Design of Arithmetic Unit for Resource-Constrained Solution
Abstract This chapter discusses a solution to the regularized FHT where the pipelined fixed-point multipliers involving the trigonometric coefficients are now replaced by pipelined CORDIC phase rotators, which eliminate the need for the trigonometric coefficient memory and lead to the specification of a flexible-precision solution. The design is targeted, in particular, at those applications where one is constrained by the limited availability of embedded resources. Theoretical performance figures for a silicon-based implementation of the CORDIC-based solution are derived and the results compared with those for the previously discussed solutions based upon the use of the fast fixed-point multipliers for various combinations of transform length and word length. A discussion is finally provided relating to the results obtained in the chapter.
7.1 Introduction

The last two chapters have provided us with a detailed account of how the R24 FHT is able to be mapped onto a partitioned-memory single-PE computing architecture so as to effectively exploit the computational power of the silicon-based parallel computing technologies. Four versions of this highly-parallel R24 FHT solution have been produced, with PE designs which range from providing optimality in terms of the arithmetic complexity to optimality in terms of the memory requirement, although the common feature of all four versions is that they each involve the use of a fast fixed-point multiplier. No consideration has been given, as yet, as to whether an arithmetic unit based upon the fast multiplier is always the most appropriate to adopt or, when such an arithmetic unit is used, how the fast multiplier might best be implemented. With the use of FPGA technology, however, the fast multiplier is typically available to the user as an embedded resource which, although expensive in terms of silicon resources, is becoming increasingly more power efficient and is therefore the logical solution to adopt.

A problem may arise in practice, however, when the length of the transform to be computed is very large compared to the capability of the target device, such that there are insufficient embedded resources – in terms of fast multipliers, fast RAM,
or both – to enable a successful mapping of the transform onto the device to take place. In such a situation, where the use of a larger and more powerful device is simply not an option, it is thus required that some means be found of facilitating a successful mapping onto the available device, and one way of achieving this is through the design of a more appropriate arithmetic unit, namely one which does not rely too heavily upon the use of embedded resources.

The choice of which type of arithmetic unit to adopt for the proposed resource-constrained solution has been made in favour of the CORDIC unit, rather than the DA unit, as the well-documented optimality of CORDIC arithmetic for the operation of phase rotation [1] – which is shown to be the required operation here – combined with the ability to generate the rotation angles that correspond to the trigonometric coefficients very efficiently on-the-fly, with trivial memory requirement, makes it the obvious candidate to pursue – the DA unit would inevitably involve a considerably larger memory requirement due to the storage of the pre-computed sums or inner products. A number of attractive CORDIC-based FPGA solutions to the FFT have appeared in the technical literature in recent years, albeit for the more straightforward complex-data case, with two such solutions as discussed in references [2, 11].

Note that the sizing to be carried out in this chapter for the various R24 FHT solutions, including those based upon both the fast fixed-point multiplier and the CORDIC phase rotator, is to be performed for hypothetical implementations exploiting only programmable logic in order to facilitate their comparison.
7.2 Accuracy Considerations

To obtain L-bit accuracy in the GD-BFLY outputs it will be necessary to retain sufficient bits out of the multipliers, as well as to use sufficient guard bits, in order to protect both the least significant bit (LSB) and the most significant bit (MSB). This is due to the fact that with fixed-point processing the accuracy may be degraded through the possible word growth of one bit with each stage of adders. For the MSB, the guard bits correspond to those higher-order (initially unoccupied) bits, appended to the left of the L most significant data bits out of the multipliers, that could in theory, after completion of the stages of GD-BFLY adders, contain the MSB of the output data. For the LSB, the guard bits correspond to those lower-order (initially occupied) bits, appearing to the right of the L most significant data bits out of the multipliers, which could in theory, after completion of the stages of GD-BFLY adders, affect or contribute to the LSB of the output data. Thus, the possible occurrence of truncation errors due to the three stages of adders is accounted for by varying the lengths of the registers as the data progresses across the PE. Allowing for word growth in this fashion permits the application of block floating-point scaling [9] – as discussed in Section 4.8 of Chapter 4 – prior to each stage of GD-BFLYs, thereby enabling the dynamic range of any signals present in the data to be maximized at the output of the R24 FHT.
7.3 Fast Multiplier Approach

Apart from the potentially large CM requirement associated with the four PE designs discussed in the previous chapter, an additional limitation relates to their relative inflexibility, in terms of the arithmetic precision offered, due to their reliance on the fast fixed-point multiplier. For example, when the word length, “L”, of one or more of the multiplicands exceeds the word length capability, “K”, of the embedded multiplier, it would typically be necessary to use four embedded multipliers and two 2K-bit adders to carry out each L × L multiplication (assuming that K < L ≤ 2K).

When implemented on an FPGA in programmable logic, it is to be assumed that one L × L pipelined multiplier will require of order 5L²/8 slices [3, 4] in order to produce a new output each clock cycle, whilst one L-bit adder will require L/2 slices [4]. The CM will require L-bit RAM, with the single-port version involving L/2 slices and the dual-port version involving L slices [8]. These logic-based complexity figures will be used later in the chapter for carrying out sizing comparisons of the PE designs discussed in this and the previous chapters.

To obtain L-bit accuracy in the outputs of the twelve-multiplier version of the GD-BFLY, which involves three stages of adders, it is necessary that L + 3 bits be retained from the multipliers, each of size L × L, in order to guard the LSB, whilst the first stage of adders is carried out to (L + 4)-bit precision, the second stage to (L + 5)-bit precision and the third stage to (L + 6)-bit precision, in order to guard the MSB, at which point the data is scaled to yield the L-bit results. Similarly, to obtain L-bit accuracy in the outputs of the nine-multiplier version of the GD-BFLY, which involves four stages of adders, it is necessary that the first stage of adders (preceding the multipliers) be carried out to (L + 1)-bit precision, with L + 4 bits being retained from the multipliers, each of size (L + 1) × (L + 1), whilst the second stage of adders is carried out to (L + 5)-bit precision, the third stage to (L + 6)-bit precision and the fourth stage to (L + 7)-bit precision, at which point the data is scaled to yield the L-bit results.

Thus, given that the twelve-multiplier version of the GD-BFLY involves a total of 12 pipelined multipliers, six stage-one adders, eight stage-two adders and eight stage-three adders, the PE can be constructed with an arithmetic-based logic requirement, denoted L_A^(M12), of

    L_A^(M12) ≈ (1/2)·(15L² + 22L + 112)    (7.1)

slices, whilst the nine-multiplier version of the GD-BFLY, which involves a total of three stage-one adders, nine pipelined multipliers, six stage-two adders, eight stage-three adders and eight stage-four adders, requires an arithmetic-based logic requirement, denoted L_A^(M9), of

    L_A^(M9) ≈ (1/8)·(45L² + 190L + 548)    (7.2)

slices.
These figures, together with the CM requirement – as discussed in Section 6.3 of the previous chapter and given by Equations 6.5 and 6.6 – will combine to form the benchmarks against which to assess the merits of the CORDIC-based arithmetic unit now discussed.
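For quick sizing comparisons, the two estimates of Equations 7.1 and 7.2 (as reconstructed above) may be captured in a pair of trivial “C” helpers – illustrative only, with L the data word length in bits:

    /* Arithmetic-based logic estimates (in slices) for the two PE versions,
       per Equations 7.1 and 7.2; L is the data word length in bits. */
    static unsigned slices_m12(unsigned L) { return (15*L*L + 22*L + 112) / 2; }
    static unsigned slices_m9 (unsigned L) { return (45*L*L + 190*L + 548) / 8; }

Evaluating these for, say, L = 18 gives approximately 2,684 and 2,318 slices for the twelve-multiplier and nine-multiplier versions, respectively.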
7.4 CORDIC Approach

The CORDIC algorithm [12] is an arithmetic technique used for carrying out two-dimensional vector rotations. Its relevance here is in its ability to carry out the phase rotation of a complex number, as this will be seen to be the underlying operation required by the GD-BFLY. The vector rotation, which is a convergent linear process, is performed very simply as a sequence of elementary rotations with an ever-decreasing elementary rotation angle, where each elementary rotation can be carried out using just shift and add–subtract operations.
7.4.1 CORDIC Formulation of Complex Multiplier

For carrying out the particular operation of phase rotation, a vector (X, Y) is rotated by an angle “θ” to obtain the new vector (X′, Y′). For the nth elementary rotation, the fixed elementary rotation angle, arctan(2⁻ⁿ), which is stored in a ROM, is subtracted/added from/to the angle remainder, “θₙ”, so that the angle remainder approaches zero with increasing “n”. The mathematical relations for the conventional non-redundant CORDIC rotation operation [1] are as given below via the four sets of equations:

(a) Phase Rotation Operation:

    X′ = cos(θ)·X − sin(θ)·Y
    Y′ = cos(θ)·Y + sin(θ)·X
    θ′ = 0    (7.3)

(b) Phase Rotation Operation as Sequence of Elementary Rotations:

    X′ = ∏_{n=0}^{K−1} cos(arctan(2⁻ⁿ)) · (Xₙ − σₙ·Yₙ·2⁻ⁿ)
    Y′ = ∏_{n=0}^{K−1} cos(arctan(2⁻ⁿ)) · (Yₙ + σₙ·Xₙ·2⁻ⁿ)
    θ′ = θ − Σ_{n=0}^{K−1} σₙ·arctan(2⁻ⁿ)    (7.4)
(c) Expression for nth Elementary Rotation:

    Xₙ₊₁ = Xₙ − σₙ·2⁻ⁿ·Yₙ
    Yₙ₊₁ = Yₙ + σₙ·2⁻ⁿ·Xₙ
    θₙ₊₁ = θₙ − σₙ·arctan(2⁻ⁿ)    (7.5)

where “σₙ” is either +1 or −1, for non-redundant CORDIC, depending upon the sign of the angle remainder term, denoted here as “θₙ”.

(d) Expression for CORDIC Magnification Factor:

    M = ∏_{n=0}^{K−1} 1/cos(arctan(2⁻ⁿ)) = ∏_{n=0}^{K−1} √(1 + 2⁻²ⁿ) ≈ 1.647 for large K    (7.6)

which may need to be scaled out of the rotated output in order to preserve the correct amplitude of the phase-rotated complex number. The choice of non-redundant CORDIC, rather than a redundant version whereby the term “σₙ” is allowed to be either +1, −1 or 0, ensures that the value of the magnification factor, which is a function of the number of iterations, is independent of the rotation angle being applied and is therefore fixed for every instance of the GD-BFLY, whether it is of Type-I, Type-II or Type-III – for the definitions see Section 4.3 of Chapter 4.
7.4.2 Parallel Formulation of CORDIC-Based PE

From Equation 7.5, the CORDIC algorithm requires one pair of shift/add–subtract operations and one add–subtract operation for each bit of accuracy. When implemented sequentially [1], therefore, the CORDIC unit implements these elementary operations one after another, using a single CS and feeding back the output as the input to the next iteration. A sequential CORDIC unit with L-bit output has a latency of L clock cycles and produces a new output every L clock cycles. On the other hand, when implemented in parallel form [1], the CORDIC unit implements these elementary operations as a computational pipeline – see Fig. 7.1 – using an array of identical CSs. A parallel CORDIC unit with L-bit output has a latency of L clock cycles but produces a new output every clock cycle.

An attraction of the fully-parallel pipelined architecture is that the shifters in each CS involve a fixed right shift, so that they may be implemented very efficiently in the wiring. Also, the elementary rotation angles may be distributed as constants to each CS so that they may also be hardwired. As a result, the entire CORDIC rotator may be reduced to an array of interconnected add–subtract units. Pipelining is achieved by inserting registers between the add–subtract units, although with most FPGA architectures there are already registers present in each logic cell, so that the addition of the pipeline registers involves no additional hardware cost.
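To make the iteration of Equation 7.5 concrete, a minimal fixed-point “C” sketch of the non-redundant CORDIC rotation is given below. It models the sequential form of the unit rather than the pipelined PE, and the chosen format (Q28 angles, 16 iterations) is purely illustrative:

    #include <stdint.h>
    #include <math.h>

    #define K_ITER    16                 /* iterations: K = L for L-bit accuracy */
    #define FRAC_BITS 28                 /* Q28 fixed-point format for angles    */

    static int32_t atan_lut[K_ITER];     /* elementary angles arctan(2^-n)       */

    void cordic_init(void)
    {
        for (int n = 0; n < K_ITER; n++)
            atan_lut[n] = (int32_t)(atan(pow(2.0, -n)) * (double)(1 << FRAC_BITS));
    }

    /* Non-redundant CORDIC rotation of (x, y) through angle theta (Q28
       radians, |theta| <= pi/2). The result carries the magnification
       M ~= 1.647 of Equation 7.6, which the caller must remove; inputs
       therefore need headroom for this growth and the sqrt(2) range
       extension discussed later in the chapter. */
    void cordic_rotate(int32_t *x, int32_t *y, int32_t theta)
    {
        int32_t xn = *x, yn = *y, z = theta;
        for (int n = 0; n < K_ITER; n++) {
            int32_t xs = xn >> n, ys = yn >> n;   /* the fixed right shifts */
            if (z >= 0) { xn -= ys; yn += xs; z -= atan_lut[n]; }
            else        { xn += ys; yn -= xs; z += atan_lut[n]; }
        }
        *x = xn; *y = yn;
    }

In the GD-BFLY itself the growth of M is handled instead by scaling the pass-through components (X₁, Y₁), as discussed in Section 7.4.3.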
[Fig. 7.1 Pipeline architecture for CORDIC rotator – a cascade of identical computational stages, the nth stage applying fixed right shifts (>> n) to Xₙ and Yₙ, forming Xₙ₊₁ = Xₙ ± Ỹₙ and Yₙ₊₁ = Yₙ ± X̃ₙ, and updating the angle remainder Zₙ₊₁ = Zₙ ± αₙ, where the elementary angle αₙ = arctan(2⁻ⁿ) is hardwired to each stage and the add/subtract selection is driven by sign(Zₙ); here X̃ₙ = Xₙ >> n and Ỹₙ = Yₙ >> n for n = 0, 1, …, N−1]
7.4.3 Discussion of CORDIC-Based Solution

The twelve-multiplier version of the GD-BFLY produces eight outputs from eight inputs, these samples denoted by (X₁, Y₁) through to (X₄, Y₄), with the multiplication stage of the GD-BFLY comprising 12 real multiplications which, together with the accompanying set of additions/subtractions, may be expressed for the case of the standard Type-III GD-BFLY via the three sets of equations

    X₂′ = cos(θ)·X₂ + sin(θ)·Y₂
    Y₂′ = sin(θ)·X₂ − cos(θ)·Y₂    (7.7)

    X₃′ = cos(2θ)·X₃ + sin(2θ)·Y₃
    Y₃′ = sin(2θ)·X₃ − cos(2θ)·Y₃    (7.8)

    X₄′ = cos(3θ)·X₄ + sin(3θ)·Y₄
    Y₄′ = sin(3θ)·X₄ − cos(3θ)·Y₄    (7.9)

where “θ” is the single-angle, “2θ” the double-angle and “3θ” the triple-angle rotation angle. These sets of equations are equivalent to what would be obtained if we multiplied the complex number interpretations of (X₂, Y₂) by e^(−iθ), (X₃, Y₃) by e^(−i2θ) and (X₄, Y₄) by e^(−i3θ), followed for the case of the standard Type-III GD-BFLY by negation of the components Y₂, Y₃ and Y₄.

As with the nine-multiplier and twelve-multiplier versions of the GD-BFLY, there are minor changes to the operation of the GD-BFLY, from one instance to another, in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each according to the particular “Type” of GD-BFLY being executed – see Table 4.1 of Chapter 4. In addition, however, there are also minor changes required to the outputs of the CORDIC units, in that if the GD-BFLY is of Type-I then the components Y₂, Y₃ and Y₄ do not need to be negated, whereas if the GD-BFLY is of Type-II then only component Y₄ needs to be negated, and if the GD-BFLY is of Type-III, as discussed in the previous paragraph, then all three components need to be negated.

Note, however, that the outputs will have grown due to the CORDIC magnification factor, “M”, of Equation 7.6, so that this growth needs to be adequately accounted for within the GD-BFLY. The most efficient way of achieving this would be to allow the growth to remain within components (X₂, Y₂) through to (X₄, Y₄) and for the components (X₁, Y₁) to be scaled multiplicatively by the term “M”, this being achieved with just two constant-coefficient multipliers – see Fig. 7.2. This would result in a growth of approximately 1.647 in all eight inputs to the second address permutation “Φ₂”. Note that scaling by such a constant differs from the operation of a standard fast multiplier in that Booth encoding/decoding circuits are no longer required, whilst efficient recoding methods [5] can be used to further reduce the logic requirement of the simplified operation to approximately one third that of the standard fast fixed-point multiplier.

An obvious attraction of the CORDIC-based approach is that the GD-BFLY only requires knowledge of the single-angle, double-angle and triple-angle rotation angles, so that there is no longer any need to construct, maintain and access the potentially large LUTs required for the storage of the trigonometric coefficients – that is, for the storage of sampled versions of the sinusoidal function with argument defined from 0 up to π/2 radians. As a result, the radix-4 factorization of the CORDIC-based FHT may be expressed very simply, with the updating of the rotation angles for the execution of each instance of the GD-BFLY being performed on-the-fly and involving only additions and subtractions.
[Fig. 7.2 Signal flow graph for CORDIC-based version of generic double butterfly – the input data vector passes through address permutation Φ₁; components (X₂, Y₂), (X₃, Y₃) and (X₄, Y₄) are fed, together with their negated rotation angles, to three un-scaled CORDIC rotators whose outputs are negated as required, whilst (X₁, Y₁) passes through a fixed scaler applying M; the results then pass through address permutations Φ₂, Φ₃ and Φ₄ to form the output data vector]
[Fig. 7.3 Computational stage of pipeline for CORDIC rotator with scalar inputs – nine add–subtract units carry out simultaneously the three elementary rotations for the single-angle (S), double-angle (D) and triple-angle (T) channels, each channel applying the fixed right shifts (>> n) and the updates X ± Ỹ, Y ± X̃ and Z ± αₙ, with the add/subtract selection driven by sign(Z^S), sign(Z^D) and sign(Z^T); here X̃ = X >> n and Ỹ = Y >> n, where n and αₙ are fixed for the stage]
The optimum throughput for the GD-BFLY is achieved with the fully-parallel hardwired solution of Fig. 7.3, whereby each CS of the pipeline uses nine add–subtract units to carry out simultaneously the three elementary phase rotations – note that in the figure the superscripts “S”, “D” and “T” stand for “single angle”, “double angle” and “triple angle”, respectively.

Due to the decomposition of the original rotation angle into “K” elementary rotation angles, it is clear that the phase rotation operation can only be approximated, with the accuracy of the outputs of the last iteration being limited by the magnitude of the last elementary rotation angle applied. Thus, if L-bit accuracy is required of the rotated output, one would expect the number of iterations, “K”, to be chosen so that K = L, as the right shifts carried out in the Kth (and last) iteration would be of length L − 1. This, in turn, necessitates two guard bits on the MSB and log₂L guard bits on the LSB – for L = 16, for example, this means 16 iterations with two MSB guard bits and four LSB guard bits. The MSB guard bits cater for the magnification factor
of Equation 7.6 and the maximum possible range extension of √2, whilst the LSB guard bits cater for the accumulated rounding error from the “L” iterations.

Note also, from the definition of the elementary rotation angles,

    tan(θₙ) = ±2⁻ⁿ,    (7.10)

that the CORDIC algorithm is known to converge over the range −π/2 ≤ θ ≤ +π/2, so that in order to cater for rotation angles between ±π an additional rotation angle of ±π/2 may need to be applied prior to the elementary rotation angles in order to ensure that the algorithm converges, thus increasing the number of iterations from K = L to K = L + 1. This may be very simply achieved, however, via application of the equations:

    X′ = −σ·Y
    Y′ = +σ·X
    θ′ = θ − σ·π/2

where

    σ = +1 for θ ≥ 0, and σ = −1 for θ < 0.
With the half-resolution approach it is necessary, for transform lengths N > 256 (from the time-complexity figures of Equations 6.11 and 6.12 in Chapter 6), to effectively double the throughput of the standard R24 FHT-based approach. One way of achieving this is to process alternate data sets on separate R24 FHTs running in parallel and offset by N clock cycles relative to each other – see Fig. 8.5. In this way, the latency of each R24 FHT would be bounded above by 2N clock cycles, for those transform sizes of interest, whilst for the same transform sizes the update time of the dual-R24 FHT system would be bounded above by just N clock cycles.
[Fig. 8.5 Dual-R24 FHT approach to half-resolution processing scheme – two regularized FHTs operate in parallel on alternate data sets, offset by N clock cycles, the timeline running from t = 0 through t = N, 2N, …, 7N with a new transform commencing on each boundary]
Therefore, given that a single 2N-point R24 FHT requires twice the DM requirement and up to twice the CM requirement (depending upon the addressing scheme used) of a single N-point R24 FHT – albeit with the same arithmetic complexity – the required update time, achieved via the use of two 2N-point R24 FHTs, would involve up to four times the memory requirement and twice the arithmetic complexity.

An alternative approach to that described above would be to adopt a single 2N-point R24 FHT, rather than two, but to assign two PEs to its computation, as described in Section 6.6 of Chapter 6, thus doubling the throughput of the R24 FHT and enabling the processing to keep up with the I/O over each block of data. The feasibility of a dual-PE solution such as this would clearly be determined, however, by the viability of using either the more complex quad-port memory or a doubled read/write access rate to the dual-port memory, for both the DM and the CM, as it will be necessary to read/write two samples from/to each of the eight memory banks for each clock cycle, as well as to read four (rather than two) trigonometric coefficients from each of the LUTs. Thus, achieving the required update time via the use of a dual-PE R24 FHT such as this would involve twice the arithmetic complexity of a single N-point R24 FHT solution, together with either the replacement of all dual-port memory by quad-port memory or a doubling of the read/write access rate to the dual-port memory.

As a result, the achievable computational density for a solution to the real-data DFT based upon the half-resolution approach that achieves the required timing constraint may be said to lie between one quarter and one half of that achievable for a 4ⁿ-point real-data DFT via the conventional use of the R24 FHT, the exact fraction being dependent upon the length of the transform – the longer the transform, the larger the relative memory requirement and the lower the relative computational density – and the chosen approach. The practicality of such a solution is therefore very much dependent upon the implementational efficiency of the R24 FHT compared to that of other commercially-available solutions. The results of Chapters 6 and 7, however, would seem to suggest the adoption of the R24 FHT for both 2ⁿ-point and 4ⁿ-point cases to be a perfectly viable option.
8.4 Discussion

The first solution discussed in Section 8.2 has shown how the highly-parallel GD-BFLY may be effectively exploited for the computation of the 2N-point real-data DFT, where “N” is a power of four. The solution was obtained by means of a “double-resolution” approach involving FHT-based processing at double the required transform-space resolution via the application of two half-length regularized FHTs. The R4 FHT-to-R2 FFT converter uses a conflict-free and in-place parallel memory addressing scheme to enable the computation for the 2ⁿ-point case to be carried out in the same highly-parallel fashion as for the 4ⁿ-point case.

The solution has some other interesting properties, even when the complexity is viewed purely in terms of sequential arithmetic operation counts, as the computation of the 2N-point real-data DFT – when N is a power of four – requires a total of
    C_FFT^(mply) = 2N·log₂(2N)    (8.29)

real multiplications when obtained via one of the real-from-complex strategies discussed in Chapter 2, using the standard complex-data Cooley–Tukey algorithm, but only

    C_FHT^(mply) = N·(3·log₄N + 2)    (8.30)

real multiplications when obtained via the combined use of the R24 FHT and the R4 FHT-to-R2 FFT converter. Thus, for the computation of a 2K-point real-data DFT, for example, this means 22,528 real multiplications via the complex-data radix-2 FFT or 15,360 real multiplications via the R24 FHT, implying a reduction of nearly one third by using the solution outlined here. The split-radix algorithm could be used instead of the Cooley–Tukey algorithm to further reduce the multiplication count of the radix-2 FFT, but only at the expense of a loss of regularity in the FFT design.

The second solution discussed in Section 8.3 has shown how the highly-parallel GD-BFLY may be effectively exploited for the computation of the N-point real-data DFT, where “2N” is a power of four. The solution was obtained by means of a “half-resolution” approach involving FHT-based processing at one half the required transform-space resolution via the application of one double-length regularized FHT. A point worth noting is that if

    DHT({x[0], x[1], x[2], x[3]}) = {X^(H)[0], X^(H)[1], X^(H)[2], X^(H)[3]},    (8.31)

say, then it is also true, via a theorem applicable to both the DFT and the DHT, namely the Stretch or Repeat Theorem [2], that

    DHT({x[0], x[1], x[2], x[3], x[0], x[1], x[2], x[3]}) = {2X^(H)[0], 0, 2X^(H)[1], 0, 2X^(H)[2], 0, 2X^(H)[3], 0},    (8.32)
this result being true not just for the four-point sequence shown, but for a data sequence of any length. As a result, an alternative to the zero-padding approach, which instead involves the idea of transforming a repeated or replicated data set, could be used to extract the required FHT outputs from those of a double-length R24 FHT. Note, however, that the magnitudes of the required even-addressed output samples are twice what they should be, so that scaling may be necessary – namely division by two which, in fixed-point hardware, reduces to a simple right-shift operation – in order to achieve the correct magnitudes, this being applied either to the input samples or to the output samples.

The two solutions discussed, based upon the double-resolution and half-resolution approaches – and for which the mathematical/logical correctness of their operation has been proven both in software, via a computer program written in the “C” programming language, and in silicon with a non-optimized Virtex-II Pro 100 FPGA implementation – thus enable the R24 FHT to be applied, potentially, to a great many more problems, including those that might not necessarily be best solved through the direct application of a 4ⁿ-point transform.
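As a quick numerical check of the Stretch Theorem of Equations 8.31 and 8.32, the following self-contained “C” sketch computes a direct O(N²) DHT – illustrative only, the point of the monograph being of course to avoid such direct computation – of a four-point sequence and of its two-fold repetition, confirming that the even-addressed outputs double whilst the odd-addressed outputs vanish:

    #include <stdio.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Direct O(N^2) DHT: X[k] = sum_n x[n]*cas(2*pi*n*k/N),
       with cas(t) = cos(t) + sin(t). */
    static void dht(const double *x, double *X, int N)
    {
        for (int k = 0; k < N; k++) {
            X[k] = 0.0;
            for (int n = 0; n < N; n++) {
                double t = 2.0 * M_PI * n * k / N;
                X[k] += x[n] * (cos(t) + sin(t));
            }
        }
    }

    int main(void)
    {
        double x4[4] = { 1.0, 2.0, -0.5, 3.0 };   /* arbitrary test sequence */
        double x8[8], X4[4], X8[8];

        for (int n = 0; n < 8; n++) x8[n] = x4[n % 4];   /* repeated data set */
        dht(x4, X4, 4);
        dht(x8, X8, 8);

        /* Stretch/Repeat Theorem: X8[2k] = 2*X4[k] and X8[2k+1] = 0 */
        for (int k = 0; k < 4; k++)
            printf("2*X4[%d] = %8.4f   X8[%d] = %8.4f   X8[%d] = %8.4f\n",
                   k, 2.0 * X4[k], 2 * k, X8[2 * k], 2 * k + 1, X8[2 * k + 1]);
        return 0;
    }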
References

1. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
2. J.O. Smith III, Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications (W3K Publishing, Stanford, CA, 2007)
Chapter 9
Applications of Regularized Fast Hartley Transform
Abstract This chapter discusses the application of the regularized FHT to a number of computationally-intensive DSP-based functions that may benefit from the adoption of a transform-space solution, in particular where the data in question is real valued, so that the processing may be efficiently carried out in Hartley space. The functions discussed are those of up-sampling, differentiation, correlation – both auto-correlation and cross-correlation – and channelization. Efficient channelization, for the case of a single channel (or small number of channels), may be achieved by means of a DDC process where the filtering is performed via fast Hartley-space convolution, whilst for the case of multiple channels, efficiency may be achieved via the application of the polyphase DFT filter bank. Each such function might typically be encountered in that increasingly important area of wireless communications relating to the geolocation of signal emitters, with each potentially able to yield both conceptually and computationally simplified solutions when solved via the regularized FHT. A discussion is finally provided relating to the results obtained in the chapter.
9.1 Introduction

Having now seen how the R24 FHT might be used for the efficient parallel computation of an N-point DFT, where N may be a power of either two or four – although for optimal computational density it should be a power of four – the monograph concludes with the description of a number of DSP-based functions where the adoption of Hartley space, rather than Fourier space, as the chosen transform space within which to carry out the processing may potentially lead to conceptually and computationally simplified solutions. Three particular sets of functions common to many modern DSP systems are discussed, namely:

1. The up-sampling and differentiation – for the case of both first and second derivatives – of a real-valued signal, either individually or in combination.
2. The correlation function of two real-valued or complex-valued signals, where the signals may both be of infinite duration, as encountered with cross-correlation,
or where one signal is of finite duration and the other of infinite duration, as encountered with auto-correlation.
3. The channelization of a real-valued signal which, for the case of a single channel (or small number of channels), may be achieved by means of a DDC process where the filtering is carried out via fast Hartley-space convolution, whilst for the case of multiple channels, may be achieved via the application of the polyphase DFT filter bank.

One important area of wireless communications where all three sets of functions might typically be encountered is that relating to the geolocation [8] of signal emitters, where there is a requirement to produce accurate timing measurements from the data gathered at a number of sensors, these measurements being generally obtained from the up-sampled outputs of a correlator. When the signal under analysis is of sufficiently wide bandwidth, however, the data would first have to be partitioned in frequency before such measurements could be made, so as to optimize the SNR of the signal for specific frequency bands of interest prior to the correlation process. For the case of a single channel (or small number of channels) the associated filtering operation may, depending upon the parameters, be most efficiently carried out by means of fast transform-space convolution, whilst when there is a sufficiently large number of equi-spaced and equi-bandwidth channels, this process – which is generally referred to in the technical literature as channelization – is best carried out by means of a polyphase DFT filter bank [1, 4, 12].

The adoption of the transform-space approach in signal processing makes particular sense when a significant amount of the processing is able to be efficiently carried out in the transform space, so that several distinct tasks might be beneficially performed there before the resulting signal is transformed back to data space. A multi-sensor digital signal conditioner [5] has been defined, for example, which exploits the transform-space approach to carry out in a highly efficient manner, in Fourier space, the various tasks of sample-rate conversion, spectral shaping or filtering, and malfunctioning sensor detection and compensation.
9.2 Fast Transform-Space Convolution and Correlation

Given the emphasis placed on the transform-space approach in this chapter, it is perhaps worth first illustrating its importance by considering the simple case of the filtering of a real-valued signal by means of a length-N FIR filter. A linear system [9, 10] such as this is characterized by means of an output signal that is obtained from the convolution of the system input signal with the system impulse response – as represented by a finite set of filter coefficients. A direct data-space formulation of the problem may be written, in un-normalized complex-data form, as

    R^conv_{h,x}[k] = Σ_{n=0}^{N−1} h*[n]·x[k−n],    (9.1)
where the superscript “*” refers to the operation of complex conjugation, so that each filter output requires N multiplications – this yields an O(N²) arithmetic complexity for the production of N filter outputs. Alternatively, a fast Hartley-space convolution approach – see Section 3.5 of Chapter 3 – combined with the familiar overlap-save or overlap-add technique [2] associated with conventional FFT-based linear convolution [2] (where the FHT of the filter coefficient set is fixed and pre-computed), might typically involve the application of two 2N-point FHTs and one transform-space product of length 2N in order to produce N filter outputs – this yields an O(N·log N) arithmetic complexity. Thus, with a suitably chosen FHT algorithm, clear computational gains are achievable via fast Hartley-space convolution for relatively small values of N.

The correlation function is generally defined as measuring the degree of correlation or similarity between a given signal and a shifted replica of that signal. From this, the basic data-space formulation for the cross-correlation function of two arbitrary complex-valued signals may be written, in un-normalized form and with arbitrary upper and lower limits, as

    R^corr_{h,x}[k] = Σ_{n=lower}^{upper} h*[n]·x[k+n],    (9.2)
which is similar in form to that for the convolution function of Equation 9.1, except that there is no need to apply the folding operation [2] to one of the two functions to be correlated. In fact, if either of the two functions to be correlated is an even function, then the operations of convolution and correlation are equivalent. The above expression is such that: (1) when both sequences are of finite length, it corresponds to the cross-correlation function of two finite-duration signals – to be discussed in Section 9.4.2; (2) when one sequence is of infinite length and the other a finite-length stored reference, it corresponds to the auto-correlation function – to be discussed in Section 9.4.3; or (3) when both sequences are of infinite length, it corresponds to the cross-correlation function of two continuous data streams – to be discussed in Section 9.4.4. As evidenced from the discussion above relating to the convolution-based filtering problem, the larger the correlation problem, the greater the potential benefits to be gained from the adoption of a transform-space approach, particularly when the correlation operation is carried out by means of a fast unitary/orthogonal transform such as the FFT or the FHT.
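To illustrate the Hartley-space product that underlies fast Hartley-space convolution (Section 3.5 of Chapter 3), the following self-contained “C” sketch performs a short circular convolution entirely via the DHT; the direct O(N²) DHT and the toy length of eight are purely illustrative, a real implementation using the R24 FHT instead:

    #include <stdio.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define N 8                          /* transform length for this toy example */

    static void dht(const double *x, double *X)   /* direct O(N^2) DHT */
    {
        for (int k = 0; k < N; k++) {
            X[k] = 0.0;
            for (int n = 0; n < N; n++) {
                double t = 2.0 * M_PI * n * k / N;
                X[k] += x[n] * (cos(t) + sin(t));
            }
        }
    }

    /* Hartley-space product for circular convolution: with H = DHT(h) and
       X = DHT(x), the DHT of y = h (*) x is
           Y[k] = ( H[k]*(X[k] + X[N-k]) + H[N-k]*(X[k] - X[N-k]) ) / 2,
       indices taken modulo N. */
    static void hartley_product(const double *H, const double *X, double *Y)
    {
        for (int k = 0; k < N; k++) {
            int km = (N - k) % N;
            Y[k] = 0.5 * (H[k] * (X[k] + X[km]) + H[km] * (X[k] - X[km]));
        }
    }

    int main(void)
    {
        double h[N] = { 1, 2, 3, 0, 0, 0, 0, 0 };       /* filter coefficients */
        double x[N] = { 1, -1, 2, 0, 1, 0, 0, 0 };      /* input data          */
        double H[N], X[N], Y[N], y[N];

        dht(h, H);  dht(x, X);
        hartley_product(H, X, Y);
        dht(Y, y);                       /* inverse DHT = forward DHT / N */
        for (int n = 0; n < N; n++) printf("y[%d] = %7.3f\n", n, y[n] / N);
        return 0;
    }

The combining rule exploits the even and odd parts of the Hartley-space data, which is why both H[k] and H[N−k] appear in the transform-space product.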
9.3 Up-Sampling and Differentiation of Real-Valued Signal

This section looks briefly at how two basic DSP-based functions, namely those of up-sampling and differentiation, might be efficiently carried out by first transforming the real-valued signal from data space to Hartley space, via the application of a
DHT, then modifying in some way the resulting Hartley-space data, before returning to the data space via the application of a second DHT to obtain the data corresponding to an appropriately modified version of the original real-valued signal.
9.3.1 Up-Sampling via Hartley Space

The first function considered is that of up-sampling, where the requirement is to increase the sampling rate of the signal without introducing additional frequency components outside of its frequency range or band of definition – this function being also referred to as band-limited interpolation. Suppose that the signal is initially represented by means of “N” real-valued samples and that it is required to increase or interpolate this by a factor of “L”. To achieve this, the real-valued data is first transformed from data space to Hartley space, via the application of a DHT of length N, with zero-valued samples being then inserted into the Hartley-space data according to the following rule [11]:
    Y[k] = X[k]              for 0 ≤ k < N/2
    Y[k] = X[N/2]/2          for k = N/2
    Y[k] = 0                 for N/2 < k < LN − N/2
    Y[k] = X[N/2]/2          for k = LN − N/2
    Y[k] = X[k − (L−1)N]     for LN − N/2 < k ≤ LN − 1
With an over-sampling ratio of two, it thus becomes necessary, for transform lengths N > 256 (from the time-complexity figures of Equations 6.11 and 6.12 in Chapter 6), to double the throughput of the standard R24 FHT and hence of the real-data FFT. This may be achieved either by using a dual-PE version of the R24 FHT, as discussed in Section 6.6 of Chapter 6, or by computing two R24 FHTs simultaneously, or in parallel, on consecutive overlapped sets of polyphase filter outputs – along the lines of the dual-R24 FHT scheme described in Section 8.3 of the previous chapter. When the over-sampling ratio is reduced to 4/3, however, the real-data DFT problem simplifies to the execution of one N-point real-data FFT every 3N/4 clock cycles, so that for those situations where N ≤ 4K, a single R24 FHT may well suffice, as evidenced from the time-complexity figures given by Equations 6.11 and 6.12 in Chapter 6.
9.6 Discussion

This chapter has focused on the application of the DHT to a number of computationally-intensive DSP-based functions which may benefit from the adoption of transform-space processing. The particular application area of geolocation was discussed in some detail, as it is a potential vehicle for all of the functions considered. With most geolocation systems there is typically a requirement to produce up-sampled correlator outputs from which the TOA or TDOA timing measurements may subsequently be derived. The TOA measurement forms the basis of
those geolocation systems based upon the exploitation of multiple range estimates, whilst the TDOA measurement forms the basis of those geolocation systems based upon the exploitation of multiple relative range estimates.

The up-sampling, differentiation and correlation functions, as was shown, may all be efficiently performed, in various combinations, when the processing is carried out via Hartley space, with the linearity of the complex-data correlation operation also leading to its decomposition into four parallel real-data correlation operations. This parallel decomposition is particularly useful when the quantities of data to be correlated are large and the throughput requirement high, as it enables the correlation to be efficiently computed by running multiple versions of the R24 FHT in parallel.

With regard to the channelization problem, it was suggested that the computational complexity involved in the production of a single channel (or small number of channels) by means of a DDC process may, depending upon the parameters, be considerably reduced compared to that of the direct data-space approach by carrying out the filtering via fast Hartley-space convolution. For the case of multiple channels, it was seen that the channelization of a real-valued signal by means of the polyphase DFT filter bank may also be considerably simplified through the adoption of an FHT for carrying out the associated real-data DFT. In fact, with most RF channelization problems, where the number of channels is large enough to make the question of implementational complexity a serious issue, the sampled IF data is naturally real valued, so that advantage may be taken of this fact in trying to reduce the complexity to manageable levels. This can be done in two ways: firstly, by replacing each pair of short FIR filters – applied to the in-phase and quadrature channels – required by the standard solution for each polyphase branch with a single short FIR filter, as the data remains real valued right the way through the polyphase filtering process; and secondly, by replacing the complex-data DFT at the output of the standard polyphase filter bank by a real-data DFT which, for a suitably chosen number of channels, may be efficiently computed by means of the R24 FHT.

Note that to be able to carry out the real-data DFT component of the polyphase DFT filter bank with a dual-PE solution to the R24 FHT, rather than a single-PE solution, as suggested in Section 9.5.2.2, it would be necessary to use either quad-port memory or a doubled read/write access rate to the dual-port memory, for both the DM and the CM, so as to ensure conflict-free and (for the data) in-place parallel memory addressing of both the data and the trigonometric coefficients for each PE – as discussed in Section 6.6 of Chapter 6 – with all eight GD-BFLY inputs/outputs for each PE being read/written simultaneously from/to memory.
References

1. A.N. Akansu, R.A. Haddad, Multiresolution Signal Decomposition: Transforms – Subbands – Wavelets (Academic Press, San Diego, CA, 2001)
2. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
3. D. Fraser, Interpolation by the FFT revisited – an experimental investigation. IEEE Trans. ASSP. 37(5), 665–675 (1989)
4. F.J. Harris, Multirate Signal Processing for Communication Systems (Prentice Hall, Upper Saddle River, NJ, 2004)
5. K.J. Jones, Digital Signal Conditioning for Sensor Arrays, G.B. Patent Application No: 0112415 (5 May 2001)
6. R. Nielson, Sonar Signal Processing (Artech House, Boston, MA, 1991)
7. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice Hall, Upper Saddle River, NJ, 1989)
8. R.A. Poisel, Electronic Warfare: Target Location Methods (Artech House, Boston, MA, 2005)
9. J.G. Proakis, Digital Communications (McGraw Hill, New York, 2001)
10. B. Sklar, Digital Communications: Fundamentals and Applications (Prentice Hall, Englewood Cliffs, NJ, 2002)
11. C.C. Tseng, S.L. Lee, Design of FIR Digital Differentiator Using Discrete Hartley Transform and Backward Difference (European Signal Processing Conference (EUSIPCO), Lausanne, 2008)
12. P.P. Vaidyanathan, Multirate Systems and Filter Banks (Prentice Hall, Englewood Cliffs, NJ, 1993)
Chapter 10
Summary and Conclusions
Abstract This chapter first outlines the background to the problem addressed by the preceding chapters, namely the computation, using silicon-based hardware, of the real-data DFT, including the specific objectives that were to be met by the research; this is followed by a further discussion of the results obtained from the research and, finally, of the conclusions to be drawn.
10.1 Outline of Problem Addressed

The problem addressed in this monograph has been concerned with the parallel computation of the real-data DFT, targeted at implementation with silicon-based parallel computing equipment, where the application area of interest is that of wireless communications, and in particular that of mobile communications, so that resource-constrained (both silicon and power) solutions based upon the highly-regular fixed-radix FFT design have been actively sought. With the computing power now available via the silicon-based parallel computing technologies, however, it is no longer adequate to view the FFT complexity purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – adders, fast multipliers and CORDIC phase rotators – and multiple banks of fast RAM in order to enhance the FFT performance via its parallel computation. As a result, a whole new set of constraints has arisen relating to the design of efficient FFT algorithms for silicon-based implementation.

With the environment encountered in mobile communications, where a small battery may be the only source of power supply for long periods of time, algorithms are now being designed subject to new and often conflicting performance criteria, where the ideal is either to maximize the throughput (that is, to minimize the update time) or to satisfy some constraint on the latency, whilst at the same time minimizing the required silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available budget.

The traditional approach to the DFT problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion
of real-valued data to complex-valued data via a wideband DDC process or through the adoption of a real-from-complex strategy, whereby two real-data FFTs are computed simultaneously via one full-length complex-data FFT or one real-data FFT is computed via one half-length complex-data FFT. Each such solution, however, involves a computational overhead when compared to the more direct approach of a real-data FFT, in terms of increased memory, increased processing delay to allow for the possible acquisition/processing of pairs of data sets, and additional packing/unpacking complexity. With the DDC approach, where two functions are used instead of just one, the information content of short-duration signals may also be compromised through the introduction of the additional and unnecessary filtering operation. Thus, the traditional approach to the problem of the real-data DFT has effectively been to modify the problem so as to match an existing complex-data solution – the aim of the research carried out in this monograph has been to seek a solution that matches the actual problem.

The DHT, which is an orthogonal real-data transform and close relative of the DFT possessing many of the same properties, was identified as an attractive algorithm for attacking the real-data DFT problem, as the outputs from a real-data DFT may be straightforwardly obtained from the outputs of the DHT, and vice versa, whilst fast algorithms for its solution – referred to generically as the FHT – are now commonly encountered in the technical literature. A drawback of conventional FHTs, however, lies in the lack of regularity arising from the need for two sizes – and thus two separate designs – of butterfly for fixed-radix formulations, where a single-sized radix-R butterfly produces R outputs from R inputs and a double-sized radix-R butterfly produces 2R outputs from 2R inputs.
10.2 Summary of Results

To address the above situation, a generic version of the double-sized butterfly, referred to as the generic double butterfly and abbreviated to “GD-BFLY”, was developed for the radix-4 version of the FHT, which overcame the problem in an elegant fashion. The resulting single-design solution, referred to as the regularized FHT and abbreviated to “R24 FHT”, lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with one of the silicon-based parallel computing technologies.

A partitioned-memory architecture was identified and developed for the parallel computation of the GD-BFLY and of the resulting R24 FHT, whereby both the data and the trigonometric coefficients were partitioned or distributed across multiple banks of memory. The approach exploited a single locally-pipelined high-performance PE that yielded an attractive solution which was both area-efficient and scalable in terms of transform length. High performance was achieved by having the PE able to process the input/output data sets to the GD-BFLY in parallel, this in turn implying the need to be able to access simultaneously, and without conflict,
both multiple data and multiple trigonometric coefficients from their respective memories. The arithmetic and permutation operations performed on the GD-BFLY data within the PE were mapped onto a computational pipeline where, for the implementation considered, it was required that the total number of CSs in the pipeline was an odd-valued integer, so as to avoid any possible conflict problems with the reading/writing of input/output data from/to the DM banks for each new clock cycle. A number of pipelined versions of the PE were thus described, using both fast fixed-point multipliers and CORDIC phase rotators, which enabled the arithmetic complexity to be traded off against the memory requirement.

The result was a set of designs based upon the partitioned-memory single-PE computing architecture which each yield a hardware-efficient solution with universal application, such that each new application necessitates minimal re-design cost. The resulting solutions were shown to be amenable to efficient implementation with the silicon-based technologies and capable of achieving the computational density – that is, the throughput per unit area of silicon – of the most advanced commercially-available complex-data solutions for just a fraction of the silicon resources. The area-efficiency makes each design particularly attractive for those applications where the real-data transform is sufficiently long as to make the associated memory requirement a serious issue for more conventional multi-PE solutions, whilst the block-based nature of their operation means that they are also able, via the block floating-point scaling strategy, to produce higher-accuracy transform-domain outputs when using fixed-point arithmetic than is achievable by their streaming FFT counterparts.

Finally, it was seen how the applicability of the R24 FHT – which is a radix-4 algorithm – could be generalized, without significantly compromising performance, to the efficient parallel computation of the real-data DFT whose length is a power of two, but not a power of four. This enables it to be applied, potentially, to a great many more problems, including those that might not necessarily be best solved through the direct application of a 4ⁿ-point transform. This was followed by its application to the computation of some of the more familiar and computationally-intensive DSP-based functions, such as those of correlation – both auto-correlation and cross-correlation – and of the wideband channelization of RF data via the polyphase DFT filter bank. With each such function, which might typically be encountered in that increasingly important area of wireless communications relating to the geolocation of signal emitters, the adoption of the R24 FHT may potentially result in both conceptually and computationally simplified solutions.

Note that the mathematical/logical correctness of the operation of the various functions used by the partitioned-memory single-PE solution to the R24 FHT has been proven in software with a computer program written in the “C” programming language. This code provides the user with various choices of PE design and of storage/accession scheme for the trigonometric coefficients, helping the user to identify how the algorithm might be efficiently mapped onto suitable parallel computing equipment following translation of the sequential “C” code to the parallel code produced by a suitably chosen HDL.
10.3 Conclusions

The aims of this research, as described above, have been successfully achieved, with a highly-parallel formulation of the real-data FFT being defined without recourse to the use of a complex-data FFT and, in so doing, a solution being obtained that yields clear implementational advantages, both theoretical and practical, over the more conventional complex-data solutions to the problem. The highly-parallel formulation of the real-data FFT described in the monograph has been shown to lead to scalable and device-independent solutions to the latency-constrained version of the problem which are able to optimize the use of the available silicon resources, and thus to maximize the achievable computational density, thereby making the solution a genuine advance in the design and implementation of high-performance parallel FFT algorithms.
Appendix A
Computer Program for Regularized Fast Hartley Transform
Abstract This appendix outlines the various functions of which the regularized FHT is comprised and provides a detailed description of the computer code, written in the “C” programming language, for executing the said functions, where integer-only arithmetic is used to model the fixed-point nature of the associated arithmetic operations. The computer source code for the complete solution, which is listed in Appendix B, is to be found on the CD accompanying the monograph.
A.1 Introduction

The processing functions required for a fixed-point implementation of the R24 FHT break down into two quite distinct categories, namely those pre-processing functions that need to be carried out in advance of the real-time processing for performing the following tasks:

– Setting up of LUTs for trigonometric coefficients
– Setting up of permutation mappings for GD-BFLY

and those processing functions that need to be carried out as part of the real-time solution:

– Di-bit reversal
– DM reads and writes
– CM reads and trigonometric coefficient generation
– GD-BFLY computation
– FHT-to-FFT conversion

The individual modules – written in the “C” programming language with the Microsoft Visual C++ compiler under their Visual Studio computing environment – that have been developed to implement these particular pre-processing and processing functions are now outlined, where integer-only arithmetic has been used to model the fixed-point nature of the associated arithmetic operations. This is followed by a brief guide on how to run the program and on the scaling strategies
available to the user. Please note, however, that the program has not been exhaustively tested, so it is quite conceivable that various bugs may still be present in the current version of the code. The notification of any such bugs, if identified, would be greatly welcomed by the author. The computer code for the complete solution, which is listed in Appendix B, is to be found on the CD accompanying the monograph.
A.2 Description of Functions

Before the R24 FHT can be executed it is first necessary that a main module or program be produced:

“RFHT4 Computer Program.c”

which carries out all the pre-processing functions, as required for providing the necessary inputs to the R24 FHT, as well as setting up the input data to the R24 FHT through the calling of a separate module:

“SignalGeneration.c”

such that the data – real valued or complex valued – may be either accessed from an existing binary or text file or generated by the signal generation module.
A.2.1 Control Routine

Once all the pre-processing functions have been carried out and the input data made ready for feeding to the R24 FHT, a control module:

“RFHT4 Control.c”

called from within the main program then carries out, in the required order, all the processing functions that make up the real-time solution, this starting with the di-bit reversal of the input data, followed by the execution of the R24 FHT, and finishing with the conversion of the output data, should it be required, from Hartley space to Fourier space.
A.2.2 Generic Double Butterfly Routines

Three versions of the GD-BFLY have been produced, as discussed in Chapters 6 and 7, with the first version, involving 12 fast fixed-point multipliers, being carried out by means of the module:
“Butterfly V12M.c”

the second version, involving nine fast fixed-point multipliers, by means of the module:

“Butterfly V09M.c”

and the third version, involving three CORDIC rotation units, by means of the module:

“Butterfly Cordic.c”

The last version makes use of a separate module:

“Rotation.c”

for carrying out the individual phase rotations.
A.2.3 Address Generation and Data Re-ordering Routines

The generation of the four permutation mappings used by the GD-BFLY, as discussed in Chapter 4, is carried out by means of the module:

“ButterflyMappings.c”

whilst the di-bit reversal of the input data to the R24 FHT is carried out with the module:

“DibitReversal.c”

and the addresses of the eight-sample data sets required for input to the GD-BFLY are obtained by means of the module:

“DataIndices.c”

Note that, for optimal efficiency, the four permutation mappings used by the GD-BFLY only store information relating to the non-trivial exchanges.
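For illustration, the underlying di-bit reversal mapping – though not the CD routine itself – may be sketched in “C” as a radix-4 digit reversal of the sample address:

    /* Di-bit reversal of a sample address for an N-point transform, N = 4^m:
       the 2m-bit address is reversed two bits (one radix-4 digit) at a time.
       Illustrative sketch only -- not the "DibitReversal.c" module. */
    static unsigned dibit_reverse(unsigned addr, unsigned m)   /* m = log4(N) */
    {
        unsigned rev = 0;
        for (unsigned i = 0; i < m; i++) {
            rev = (rev << 2) | (addr & 3U);   /* peel off least-significant di-bit */
            addr >>= 2;
        }
        return rev;
    }

For N = 64 (m = 3), for example, address 9 = 00|10|01 in di-bits maps to 01|10|00 = 24, so that dibit_reverse(9, 3) returns 24.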
A.2.4 Data Memory Accession and Updating Routines

The reading/writing of multiple samples of data from/to DM, which requires the application of the memory address mappings discussed in Chapter 6, is carried out by means of the module:
“MemoryBankAddresses.c” which, given the address of a single di-bit reversed sample, produces both the memory bank address together with the address offset within that particular memory bank. For optimal efficiency, this piece of code should be tailored to the particular transform length under consideration, although a mapping that is implemented for one particular transform length will also be valid for every transform length shorter than it, albeit somewhat wasteful in terms of unnecessary arithmetic/logic.
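Purely for illustration, the simplest possible decomposition of a sample address into a bank address and an offset across eight banks of N/8 words each is sketched below. Note that this naive modulo-8 interleaving is not the conflict-free mapping of Chapter 6 implemented by “MemoryBankAddresses.c”, which must additionally guarantee that the eight samples of every double butterfly fall into eight distinct banks.

/* Naive bank/offset decomposition across eight memory banks of N/8
   words each - for illustration only, as this modulo-8 interleaving
   is NOT the conflict-free mapping of Chapter 6. */
void BankAndOffset (int address, int *bank, int *offset)
{
    *bank   = address & 7;    /* address mod 8 */
    *offset = address >> 3;   /* address div 8 */
}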
A.2.5 Trigonometric Coefficient Generation Routines

The trigonometric coefficient sets accessed from CM for the execution of the GD-BFLY depend upon both the particular version of the GD-BFLY used, namely whether it involves twelve or nine fast fixed-point multipliers, and the type of addressing scheme used, namely whether the storage of the trigonometric coefficients is based upon the adoption of one-level or two-level LUTs. For the combination of the twelve-multiplier version of the GD-BFLY and three one-level LUTs, the trigonometric coefficients are generated via the module: “Coefficients_V12M_1Level.c”; for the nine-multiplier version with three one-level LUTs, via the module: “Coefficients_V09M_1Level.c”; for the twelve-multiplier version with three two-level LUTs, via the module: “Coefficients_V12M_2Level.c”; and for the nine-multiplier version with three two-level LUTs, via the module: “Coefficients_V09M_2Level.c”. All four versions produce the sets of nine trigonometric coefficients required for the execution of the GD-BFLY.
A.2.6 Look-Up-Table Generation Routines

The generation of the LUTs required for the storage of the trigonometric coefficients is carried out by means of the module: “LookUpTable_1Level.c” for the case of the one-level LUTs, or the module: “LookUpTable_2Level.c” for the case of the two-level LUTs.
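The two-level scheme rests on the addition formula sin(Ac + Af) = sin(Ac)cos(Af) + cos(Ac)sin(Af), whereby a coarse-resolution table and a pair of fine-resolution tables, each of length O(sqrt(N)), replace a single table of length N/4 at the cost of a few arithmetic operations per access. A floating-point sketch of the idea follows; the listed routines work with quantized integer tables of length sqrt(N)/2 and obtain the coarse cosine from the sine table via symmetry.

#include <math.h>

/* Floating-point sketch of two-level trigonometric coefficient
   generation: the angle index m is split into coarse and fine parts
   and the identity sin(Ac+Af) = sin(Ac)cos(Af) + cos(Ac)sin(Af)
   applied. In the real program the trigonometric values would be
   read from small LUTs rather than computed directly. */
double SinTwoLevel (int m, int N)
{
    double twopi = 8.0 * atan(1.0);
    int    RootN = (int) (sqrt((double) N) + 0.5);
    double Ac    = (twopi / N) * (double) ((m / RootN) * RootN);  /* coarse angle */
    double Af    = (twopi / N) * (double) (m % RootN);            /* fine angle   */

    return sin(Ac) * cos(Af) + cos(Ac) * sin(Af);
}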
A.2.7 FHT-to-FFT Conversion Routines

Upon completion of the R24 FHT, the outputs may be converted from Hartley space to Fourier space, if required, this being carried out by means of the module: “Conversion.c”. The routine is able to operate with FHT outputs obtained from the processing of either real-data inputs or complex-data inputs.
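The conversion rests upon the standard relationship between the two transforms: writing H(k) for the N-point DHT output of a real-valued input and F(k) for the corresponding DFT output,

    Re F(k) = (H(k) + H(N–k)) / 2,   Im F(k) = (H(N–k) – H(k)) / 2,   for k = 1, 2, ..., N/2 – 1,

with F(0) = H(0) and F(N/2) = H(N/2) both purely real – this pairwise combining being precisely what the loop within “Conversion.c” carries out, the complex-data case then appropriately combining the conversions of the two channels.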
A.3 Brief Guide to Running the Program

The parameters that define the operation of the R24 FHT are listed as constants at the top of the main program, “RFHT4_Computer_Program.c”, these constants enabling the various versions of the GD-BFLY to be selected, as required, as well as the transform length, word lengths (for both data and trigonometric coefficients), input/output data formats, scaling strategy, etc., to be set up by the user at run time. The complete list of parameters is reproduced in Fig. A.1, including a typical set of parameter values and an accompanying description of each parameter.

The input data set used for testing the various double butterfly and memory addressing combinations may be either read from a binary or text file (real data or complex data), with the appropriate file name being as specified in the signal generation routine, “SignalGeneration.c”, or mathematically generated to model a signal in the form of a single tone (real data or complex data versions), where the address of the excited FHT/FFT bin is as specified on the last line of Fig. A.1. For a real-valued input data set the program is able to produce transform outputs in either Hartley space or Fourier space, whilst when the input data set is complex valued the program will automatically produce the outputs in Fourier space.
//  SYSTEM PARAMETERS:
#define FHT_length          1024  // transform length: must be a power of 4
#define data_type           1     // 1 => real-valued data, 2 => complex-valued data
#define FHT_FFT_flag        1     // 1 => FHT outputs, 2 => FFT outputs
#define BFLY_type           3     // Bfly type: 1 => 12 mplys, 2 => 9 mplys, 3 => 3 Cordics
#define MEM_type            1     // Memory type: 1 => one-level LUT, 2 => two-level LUT
#define scaling             2     // 1 => FIXED, 2 => BFP
//  REGISTER-LENGTH PARAMETERS:
#define no_of_bits_data     18    // no of bits representing input data
#define no_of_bits_coeffs   24    // no of bits representing trigonometric coefficients
//  CORDIC BUTTERFLY PARAMETERS:
#define no_of_iterations    18    // no of Cordic iterations = output accuracy (bits)
#define no_of_bits_angle    27    // no of bits representing Cordic rotation angle
#define LSB_guard_bits      5     // no of guard bits for LSB: ~ log2(no_of_iterations)
//  FILE PARAMETERS:
#define input_file_format   2     // 1 => HEX, 2 => DEC
#define output_file_format  2     // 1 => HEX, 2 => DEC
//  FIXED SCALING PARAMETERS - ONE FACTOR PER FHT STAGE:
#define scale_factor_0      2     // bits to shift for stage = 0
#define scale_factor_1      2     // bits to shift for stage = 1
#define scale_factor_2      2     // bits to shift for stage = 2
#define scale_factor_3      2     // bits to shift for stage = 3
#define scale_factor_4      2     // bits to shift for stage = 4 - last stage for 1K FHT
#define scale_factor_5      2     // bits to shift for stage = 5 - last stage for 4K FHT
#define scale_factor_6      1     // bits to shift for stage = 6 - last stage for 16K FHT
#define scale_factor_7      0     // bits to shift for stage = 7 - last stage for 64K FHT
//  SYNTHETIC DATA PARAMETERS:
#define data_input          1     // 0 => read data from file, 1 => generate data
#define dft_bin_excited     117   // tone excited: between 0 and FHT_length/2-1

Fig. A.1 Typical parameter set for regularized FHT program
Note that when writing the outputs of an N-point FHT to file, the program stores one sample to a line; when writing the outputs of an N-point real-data FFT to file, it stores the zero-frequency term on the first line followed by the positive frequency terms on the next N/2 – 1 lines, with the real and imaginary components of each term appearing on the same line; and finally, when writing the outputs of an N-point complex-data FFT to file, it stores the zero-frequency term on the first line followed by the positive and then negative frequency terms on the next N – 1 lines, with the real and imaginary components of each term appearing on the same line – although the Nyquist-frequency term, like the zero-frequency term, possesses only a real component. Bear in mind that for the case of the real-data FFT, the magnitude of a zero-frequency tone (or Nyquist-frequency tone, if computed), if measured in
the frequency domain, will be twice that of a comparable positive frequency tone (i.e. having the same signal amplitude) which shares its energy equally with its negative-frequency counterpart.
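This factor of two follows directly from the (unnormalized) DFT of a real-valued tone:

    x[n] = A·cos(2πkn/N)  =>  |X(k)| = |X(N–k)| = A·N/2   for 0 < k < N/2,
    x[n] = A              =>  |X(0)| = A·N,

so the single zero-frequency (or Nyquist-frequency) bin carries the full energy that a mid-band tone splits between a positive-frequency bin and its negative-frequency image.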
A.4 Available Scaling Strategies

With regard to the fixed-point scaling strategies, note that when the scaling of the intermediate results is carried out via the conditional block floating-point technique, it is applied at the input to each stage of GD-BFLYs. As a result, any possible magnification incurred during the last stage of GD-BFLYs is not scaled out of the results, so that up to three bits of growth will still need to be accounted for in the R24 FHT outputs according to the particular post-FHT processing requirements. Examples of block floating-point scaling for both the twelve-multiplier and the nine-multiplier versions of the GD-BFLY are given in Figs. A.2 and A.3, respectively, each geared to the use of an 18-bit fast multiplier – the scaling for the CORDIC version is essentially the same as that for the twelve-multiplier version. The program provides the user with specific information relating to the chosen parameter set, printing to the screen the amount of scaling required, if any, for each stage of GD-BFLYs required by the transform.
[Figure: signal-flow diagram of the block floating-point scaling applied across the generic double butterfly. Input data: 18 bits + zero growth; output data: 18 bits + growth. Register details: PE internal = 21 (min) to 24 (max) bits, PE external = 21 bits. Note: growth ∈ {0, 1, 2, 3}.]

Fig. A.2 Block floating-point scaling for use with twelve-multiplier and CORDIC versions of generic double butterfly
[Figure: signal-flow diagram of the block floating-point scaling applied across the generic double butterfly. Input data: 17 bits + zero growth; output data: 17 bits + growth (theory => 23 bits maximum). Register details: PE internal = 20 (min) to 23 (max) bits, PE external = 20 bits. Note: growth ∈ {0, 1, 2, 3}.]

Fig. A.3 Block floating-point scaling for use with nine-multiplier version of generic double butterfly
For the case of the unconditional fixed scaling technique – the individual scale factors to be applied for each stage of GD-BFLYs are as specified by the set of constants given in Fig. A.1 – a small segment of code has been included within the generic double butterfly routines which prints to the screen an error message whenever the register for holding the input data to either the fast multiplier or the CORDIC arithmetic unit overflows. For the accurate simulation of a given hardware device this segment of code needs to be replaced by a routine that mimics the “actual” behaviour of the device in response to such an overflow – such a response being dependent upon the particular device used. When the nine-multiplier version of the GD-BFLY is adopted the presence of the stage of adders prior to that of the fast fixed-point multipliers is far more likely to result in an overflow unless additional scaling is applied immediately after this stage of adders has been completed, as is performed by the computer program, or alternatively, unless the data word-length into the GD-BFLY is constrained to be one bit shorter than that for the twelve-multiplier version. Clearly, in order to prevent fixed-point overflow, the settings for the individual scale factors will need to take into account both the transform length and the particular version of the GD-BFLY chosen, with experience invariably dictating when an optimum selection of scale factors has been made. Bear in mind, however, that with the CORDIC-based version of the GD-BFLY there is an associated magnification of the data magnitudes by approximately 1.647 with each temporal stage of GD-BFLYs which needs to be accounted for by the scale factors. Finally, note that when the CORDIC-based GD-BFLY is selected, regardless of the scaling strategy adopted, the program will also print to the screen
exactly how many non-trivial shifts/additions are required for carrying out the two fixed-coefficient multiplications for the chosen parameter set. For the case of an 18-stage CORDIC arithmetic unit, for example, a total of nine such non-trivial shifts/additions are required.
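The magnification figure of 1.647 quoted above is the standard CORDIC gain, K(n) = product over i = 0, ..., n–1 of sqrt(1 + 2^(–2i)), which converges rapidly to approximately 1.6468 as the number of micro-rotations n grows. The following fragment may be used to verify the value for any iteration count.

#include <math.h>

/* CORDIC magnification (gain) factor for a given number of
   micro-rotations: K(n) = prod_{i=0}^{n-1} sqrt(1 + 2^(-2i)).
   K(18) = 1.6468..., consistent with the factor quoted above. */
double CordicGain (int no_of_iterations)
{
    int    i;
    double K = 1.0;

    for (i = 0; i < no_of_iterations; i++)
        K *= sqrt(1.0 + pow(2.0, -2.0 * (double) i));
    return K;
}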
Appendix B
Source Code Listings for Regularized Fast Hartley Transform
Abstract This appendix lists the source code, written in the “C” programming language, for the various functions of which the regularized FHT is comprised. The actual computer source code is to be found on the CD accompanying the monograph.
B.1 Listings for Main Program and Signal Generation Routine

#include "stdafx.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//  DEFINE PARAMETERS
//  -----------------
//  SYSTEM PARAMETERS:
#define FHT_length          1024  // transform length: must be a power of 4
#define data_type           1     // 1 => real-valued data, 2 => complex-valued data
#define FHT_FFT_flag        1     // 1 => FHT outputs, 2 => FFT outputs
#define BFLY_type           3     // Bfly type: 1 => 12 mplys, 2 => 9 mplys, 3 => 3 Cordics
#define MEM_type            1     // Memory type: 1 => one-level LUT, 2 => two-level LUT
#define scaling             2     // 1 => FIXED, 2 => BFP
//  REGISTER-LENGTH PARAMETERS:
#define no_of_bits_data     18    // no of bits representing input data
#define no_of_bits_coeffs   24    // no of bits representing trigonometric coefficients
//  CORDIC BUTTERFLY PARAMETERS:
#define no_of_iterations    18    // no of Cordic iterations = output accuracy (bits)
#define no_of_bits_angle    27    // no of bits representing Cordic rotation angle
#define LSB_guard_bits      5     // no of guard bits for LSB: ~ log2(no_of_iterations)
//  FILE PARAMETERS:
#define input_file_format   2     // 1 => HEX, 2 => DEC
#define output_file_format  2     // 1 => HEX, 2 => DEC
//  FIXED SCALING PARAMETERS - ONE FACTOR PER FHT STAGE:
#define scale_factor_0      2     // bits to shift for stage = 0
#define scale_factor_1      2     // bits to shift for stage = 1
#define scale_factor_2      2     // bits to shift for stage = 2
#define scale_factor_3      2     // bits to shift for stage = 3
#define scale_factor_4      2     // bits to shift for stage = 4 - last stage for 1K FHT
#define scale_factor_5      2     // bits to shift for stage = 5 - last stage for 4K FHT
#define scale_factor_6      2     // bits to shift for stage = 6 - last stage for 16K FHT
#define scale_factor_7      2     // bits to shift for stage = 7 - last stage for 64K FHT
//  SYNTHETIC DATA PARAMETERS:
#define data_input          1     // 0 => read data from file, 1 => generate data
#define dft_bin_excited     256   // tone excited: between 0 and FHT_length/2-1
void main ()
{
//  REGULARIZED FAST HARTLEY TRANSFORM ALGORITHM
//  --------------------------------------------
//  Author: Dr. Keith John Jones, June 14th 2009
//
//  FIXED-POINT FHT IMPLEMENTATION FOR FPGA - DATA & COEFFICIENTS QUANTIZED
//  UTILIZES ONE DOUBLE-SIZED BUTTERFLY - TYPE =
//      12 fast multipliers & 22 adders, or
//      9 fast multipliers & 25 adders, or
//      3 Cordic arithmetic units & 2 fixed multipliers & 16 adders
//  UTILIZES EIGHT DATA MEMORY BANKS - SIZE = N/8 words per bank
//  UTILIZES THREE COEFFICIENT MEMORY BANKS - SIZE = N/4 words or
//      sqrt(N)/2 words or zero words per bank
//
//  Description:
//  ------------
//  This program carries out the FHT using a generic radix-4 double-sized
//  butterfly. The solution performs 8 simultaneous reads/writes using 8
//  memory banks, each of length N/8 words. Three LUTs, each of length N/4
//  words or sqrt(N)/2 words, may also be used for holding the trigonometric
//  coefficients, enabling all six coefficients to be accessed simultaneously
//  - these LUTs are not required, however, when the arithmetic is performed
//  with the Cordic unit.
//  Three types of double-sized butterfly are available for use by the FHT:
//  one involves the use of 12 fast fixed-point multipliers and 22 adders,
//  another involves the use of 9 fast fixed-point multipliers and 25 adders,
//  whilst a third involves the use of 3 Cordic arithmetic units, 2 fixed
//  multipliers and 16 adders. Two coefficient memory addressing schemes are
//  also available for use by the FHT: one involves the use of 3 LUTs, each
//  of length N/4 words, whilst another involves the use of 3 LUTs, each of
//  length sqrt(N)/2 words. The following combinations of arithmetic and
//  memory are thus possible:
//    1) for a 12-multiplier double-sized butterfly & N/4 word LUTs, the
//       coefficient generation involves no arithmetic operations;
//    2) for a 12-multiplier double-sized butterfly & sqrt(N)/2 word LUTs,
//       the coefficient generation involves 7 multiplications and 8 additions;
//    3) for a 9-multiplier double-sized butterfly & N/4 word LUTs, the
//       coefficient generation involves just additions;
//    4) for a 9-multiplier double-sized butterfly & sqrt(N)/2 word LUTs, the
//       coefficient generation involves 7 multiplications and 14 additions;
//       whilst
//    5) for a Cordic double-sized butterfly, the coefficients are efficiently
//       generated on-the-fly.
//  Scaling may be carried out within the regularized FHT to prevent overflow
//  in the data registers - this may be carried out with either fixed scaling
//  coefficients after each temporal stage, or by means of a block
//  floating-point scheme in order to optimize the dynamic range out of the
//  FHT. The program may produce either FHT or FFT output, where the input
//  data may be either real valued or complex valued. For the case of
//  complex-valued data, the FHT is simply applied to the real and imaginary
//  components of the data separately before being appropriately combined via
//  the FHT-to-FFT conversion routine. The inputs/outputs may be read/written
//  from/to file with either decimal or hexadecimal formats.
//
//  Files Used:
//  -----------
//  For input/output data memory:
//    input_data_read.txt     - input file from which data is read.
//    output_data_fht_fft.txt - FHT/FFT output data file.
//  For one-level trigonometric coefficient memory:
//    LUT_A1.txt              - LUT for single-angle argument.
//    LUT_A2.txt              - LUT for double-angle argument.
//    LUT_A3.txt              - LUT for triple-angle argument.
//  For two-level trigonometric coefficient memory:
//    LUT_Sin_Coarse.txt      - coarse resolution sin LUT for single-angle argument.
//    LUT_Sin_Fine.txt        - fine resolution sin LUT for single-angle argument.
//    LUT_Cos_Fine.txt        - fine resolution cos LUT for single-angle argument.
//  Functions Used:
//  ---------------
//    FHT_Computer_Program     - main program.
//    SignalGeneration         - signal generation routine.
//    RFHT4_Control            - regularized FHT control routine.
//    LookUpTable_1Level       - one-level LUT generation routine.
//    LookUpTable_2Level       - two-level LUT generation routine.
//    ButterflyMappings        - address permutation generation routine.
//    DibitReversal            - sequential di-bit reversal routine & 1-D to 2-D conversion.
//    Butterfly_V12M           - double butterfly calculation routine: 12-multiply version.
//    Butterfly_V09M           - double butterfly calculation routine: 9-multiply version.
//    Butterfly_Cordic         - double butterfly calculation routine: Cordic version.
//    Coefficients_V12M_1Level - one-level coefficient generation: 12-multiply version.
//    Coefficients_V09M_1Level - one-level coefficient generation: 9-multiply version.
//    Coefficients_V12M_2Level - two-level coefficient generation: 12-multiply version.
//    Coefficients_V09M_2Level - two-level coefficient generation: 9-multiply version.
//    DataIndices              - data address generation routine.
//    Conversion               - DHT-to-DFT conversion routine.
//    MemoryBankAddress        - memory bank address/offset calculation routine.
//    Rotation                 - Cordic phase rotation routine.
//
//  Externs:
//  --------
    void RFHT4_Control (int**, int*, int*, int*, int*, int*, int*, int*, int*,
                        int*, int*, int*, int, int, int, int, int, int, int, int,
                        int*, int*, int, int*, int*, int*, int, int, int, int,
                        int*, int, int, int, int, int, int, int);
    void SignalGeneration (int*, int*, int, int, int, int, int, int);
    void LookUpTable_1Level (int, int, int*, int*, int*, int);
    void LookUpTable_2Level (int, int, int*, int*, int*, int);
    void ButterflyMappings (int*, int*, int*, int*);
    void DibitReversal (int, int, int*, int, int*, int**);
    void Conversion (int, int, int, int*, int*);
    void MemoryBankAddress (int, int, int, int, int*, int*);
//
//  Declarations:
//  -------------
//  Integers:
    int wordsize, m, M, n, n1, n2, N, N2, N4, N8, no_of_bits, data_levels, coef_levels;
    int zero = 0, count, RootN, RootNd2, max_magnitude, real_type = 1, imag_type = 2;
    int fft_length, offset, halfpi, growth, growth_copy, angle_levels, minusquarterpi;
    int Root_FHT_length, alpha, lower, upper;
//  Integer Arrays:
    int index1[4], index2[16], index3[16], index4[8];
    int scale_factors[8], power_of_two_A[15], power_of_two_B[8];
    int beta1[8], beta2[8], beta3[8], growth_binary[32], arctans[32];
//  Floats:
    double pi, halfpi_float, quarterpi_float, twopi, angle, growth_float;
//  Pointer Variables:
    int *XRdata, *XIdata;
    int *bank1, *offset1, *bank2, *offset2, *scale_total;
    int *Look_Up_Sin_A1, *Look_Up_Sin_A2, *Look_Up_Sin_A3;
    int *Look_Up_Sin_Coarse, *Look_Up_Cos_Fine, *Look_Up_Sin_Fine;
    int **XRdata_2D = new int*[8];
//  Files:
    FILE *myinfile, *output;
//  ***********************************************************************
//  R E G U L A R I S E D   F H T   I N I T I A L I S A T I O N.
//  Set up transform parameters.
    Root_FHT_length = (int) (sqrt(FHT_length+0.5));
    for (n = 3; n < 9; n++)
    {
        if (FHT_length == (int) (pow(4,n)))
        {
            alpha = n;
        }
    }
//  Set up standard angles.
    pi = atan(1.0)*4.0;
    halfpi_float = atan(1.0)*2.0;
    twopi = atan(1.0)*8.0;
    quarterpi_float = atan(1.0);
    wordsize = sizeof (int);
    memset (&index1[0], 0, wordsize 1);
    N8 = (N4 >> 1);
    RootN = Root_FHT_length;
    RootNd2 = RootN / 2;
    if (data_type == 1)
    {
        fft_length = N2;
    }
    else
    {
        fft_length = N;
    }
//  Set up number of quantisation levels for data.
    data_levels = (int) (pow(2,(no_of_bits_data-1))-1);
//  Set up number of quantisation levels for coefficients.
    coef_levels = (int) (pow(2,(no_of_bits_coeffs-1))-1);
//  Set up number of quantisation levels for Cordic rotation angles.
    angle_levels = (int) (pow(2,(no_of_bits_angle-1))-1);
//  Set up maximum allowable data magnitude into double butterfly.
    max_magnitude = (int) (pow(2,(no_of_bits_data-1)));
//  Set up register overflow bounds for use with unconditional fixed scaling strategy.
    lower = -(data_levels+1);
    upper = data_levels;
//  Set up power-of-two array.
    no_of_bits = alpha

void SignalGeneration (int *XRdata, int *XIdata, int N, int data_type,
                       int dft_bin_excited, int data_input, int data_levels,
                       int input_file_format)
{
//  Description:
//  ------------
//  Routine to generate the signal data required for input to the Regularized FHT.
//  Parameters:
//  -----------
//    XRdata            = real component of 1-D data.
//    XIdata            = imaginary component of 1-D data.
//    N                 = transform length.
//    data_type         = data type: 1 => real valued data, 2 => complex valued data.
//    dft_bin_excited   = integer representing DFT bin excited.
//    data_input        = data input: 0 => read data from file, 1 => generate data.
//    data_levels       = no of quantized data levels.
//    input_file_format = input file format: 1 => HEX, 2 => DEC.
//  Note:
//  -----
//  Complex data is stored in data file in the form of alternating real and
//  imaginary components.
//  Declarations:
//  -------------
//  Integers:
    int n;
//  Floats:
    double twopi, argument;
//  ***********************************************************************
//  T E S T   D A T A   G E N E R A T I O N.
    if (data_input == 0)
    {
//      Read in FHT input data from file.
        FILE *input;
        if ((input = fopen("input_data_fht.txt", "r")) == NULL)
            printf ("\n\n Error opening input data file to read from");
        if (input_file_format == 1)
        {
“H E X” file format. if (data type == 1) f for (n = 0; n < N; n++) fscanf (input, “%x”, &XRdata[n]); g else f for (n = 0; n < N; n++) fscanf (input, “%x %x”, &XRdata[n], &XIdata[n]); g g else f “D E C” file format. if (data type == 1) f for (n = 0; n < N; n++) fscanf (input, “%d”, &XRdata[n]); g else f for (n = 0; n < N; n++) fscanf (input, “%d %d”, &XRdata[n], &XIdata[n]); g g Close file. fclose (input); g else f Generate single-tone signal for FHT input data. twopi = 8*atan(1.0); for (n = 0; n < N; n++) f argument = (twopi*n*dft bin excited)/N; XRdata[n] = (int) (cos(argument)*data levels); if (data type == 2) f XIdata[n] = (int) (sin(argument)*data levels); g g g End of function.
B.2 Listings for Pre-processing Functions #include “stdafx.h” #include <math.h> #include <stdio.h> void LookUpTable 1Level(int N, int N4, int *Look Up Sin A1, int *Look Up Sin A2, int *Look Up Sin A3, int coef levels)
186 f // // // // // // // // // // // // // // // // // // //
//
//
B Source Code Listings for Regularized Fast Hartley Transform
//  Description:
//  ------------
//  Routine to set up the one-level LUTs containing the trigonometric coefficients.
//  Parameters:
//  -----------
//    N              = transform length.
//    N4             = N / 4.
//    Look_Up_Sin_A1 = look-up table for single-angle argument.
//    Look_Up_Sin_A2 = look-up table for double-angle argument.
//    Look_Up_Sin_A3 = look-up table for triple-angle argument.
//    coef_levels    = number of trigonometric coefficient quantisation levels.
//  Declarations:
//  -------------
//  Integers:
    int i;
//  Floats:
    double angle, twopi, rotation;
//  ***********************************************************************
//  Set up output files for holding LUT contents.
    FILE *output1;
    if ((output1 = fopen("LUT_A1.txt", "w")) == NULL)
        printf ("\n\n Error opening 1st LUT file");
    FILE *output2;
    if ((output2 = fopen("LUT_A2.txt", "w")) == NULL)
        printf ("\n\n Error opening 2nd LUT file");
    FILE *output3;
    if ((output3 = fopen("LUT_A3.txt", "w")) == NULL)
        printf ("\n\n Error opening 3rd LUT file");
    twopi = (double) (atan(1.0) * 8.0);
    rotation = (double) (twopi / N);
//  Set up size N/4 LUT for single-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++)
    {
        Look_Up_Sin_A1[i] = (int) (sin(angle) * coef_levels);
        angle += (double) rotation;
        fprintf (output1, "%x\n", Look_Up_Sin_A1[i]);
    }
//  Set up size N/4 LUT for double-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++)
    {
        Look_Up_Sin_A2[i] = (int) (sin(angle) * coef_levels);
        angle += (double) rotation;
        fprintf (output2, "%x\n", Look_Up_Sin_A2[i]);
    }
//  Set up size N/4 LUT for triple-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++)
    {
        Look_Up_Sin_A3[i] = (int) (sin(angle) * coef_levels);
        angle += (double) rotation;
        fprintf (output3, "%x\n", Look_Up_Sin_A3[i]);
    }
//  Close files.
    fclose (output1);
    fclose (output2);
    fclose (output3);
}
//  End of function.
#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void LookUpTable_2Level(int N, int RootNd2, int *Look_Up_Sin_Coarse,
                        int *Look_Up_Cos_Fine, int *Look_Up_Sin_Fine, int coef_levels)
{
//  Description:
//  ------------
//  Routine to set up the two-level LUTs containing the trigonometric coefficients.
//  Parameters:
//  -----------
//    N                  = transform length.
//    RootNd2            = sqrt(N) / 2.
//    Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
//    Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
//    Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
//    coef_levels        = number of trigonometric coefficient quantisation levels.
//  Declarations:
//  -------------
//  Integers:
    int i;
//  Floats:
    double angle_coarse, angle_fine, twopi, rotation_coarse, rotation_fine;
//  ***********************************************************************
//  Set up output files for holding LUT contents.
    FILE *output1;
    if ((output1 = fopen("LUT_Sin_Coarse.txt", "w")) == NULL)
        printf ("\n\n Error opening 1st LUT file");
    FILE *output2;
    if ((output2 = fopen("LUT_Cos_Fine.txt", "w")) == NULL)
        printf ("\n\n Error opening 2nd LUT file");
    FILE *output3;
    if ((output3 = fopen("LUT_Sin_Fine.txt", "w")) == NULL)
        printf ("\n\n Error opening 3rd LUT file");
    twopi = (double) (atan(1.0) * 8.0);
    rotation_coarse = (double) (twopi / (2*sqrt((float)N)));
    rotation_fine = (double) (twopi / N);
//  Set up size sqrt(N) LUT for single-angle argument.
    angle_coarse = (double) 0.0;
    angle_fine = (double) 0.0;
    for (i = 0; i < RootNd2; i++)
    {
        Look_Up_Sin_Coarse[i] = (int) (sin(angle_coarse) * coef_levels);
        Look_Up_Cos_Fine[i] = (int) (cos(angle_fine) * coef_levels);
        Look_Up_Sin_Fine[i] = (int) (sin(angle_fine) * coef_levels);
        fprintf (output1, "%x\n", Look_Up_Sin_Coarse[i]);
        fprintf (output2, "%x\n", Look_Up_Cos_Fine[i]);
        fprintf (output3, "%x\n", Look_Up_Sin_Fine[i]);
        angle_coarse += (double) rotation_coarse;
        angle_fine += (double) rotation_fine;
    }
    Look_Up_Sin_Coarse[RootNd2] = coef_levels;
//  Close files.
    fclose (output1);
    fclose (output2);
    fclose (output3);
}
//  End of function.
#include "stdafx.h"

void ButterflyMappings(int *index1, int *index2, int *index3, int *index4)
{
//  Description:
//  ------------
//  Routine to set up the address permutations for the generic double butterfly.
//  Parameters:
//  -----------
//    index1 = 1st address permutation.
//    index2 = 2nd address permutation.
//    index3 = 3rd address permutation.
//    index4 = 4th address permutation.
//  ***********************************************************************
//  1st address permutation for Type-I and Type-II generic double butterflies.
    index1[0] = 6; index1[1] = 3;
//  1st address permutation for Type-III generic double butterfly.
    index1[2] = 3; index1[3] = 6;
//  2nd address permutation for Type-I and Type-II generic double butterflies.
    index2[0] = 0; index2[1] = 4; index2[2] = 3; index2[3] = 2;
    index2[4] = 1; index2[5] = 5; index2[6] = 6; index2[7] = 7;
//  2nd address permutation for Type-III generic double butterfly.
    index2[8] = 0; index2[9] = 4; index2[10] = 2; index2[11] = 6;
    index2[12] = 1; index2[13] = 5; index2[14] = 3; index2[15] = 7;
//  3rd address permutation for Type-I and Type-II generic double butterflies.
    index3[0] = 0; index3[1] = 4; index3[2] = 1; index3[3] = 5;
    index3[4] = 2; index3[5] = 6; index3[6] = 3; index3[7] = 7;
//  3rd address permutation for Type-III generic double butterfly.
    index3[8] = 0; index3[9] = 4; index3[10] = 1; index3[11] = 3;
    index3[12] = 2; index3[13] = 6; index3[14] = 7; index3[15] = 5;
//  4th address permutation for Type-I, Type-II and Type-III generic double butterflies.
    index4[0] = 0; index4[1] = 4; index4[2] = 1; index4[3] = 5;
    index4[4] = 6; index4[5] = 2; index4[6] = 3; index4[7] = 7;
}
//  End of function.
B.3 Listings for Processing Functions

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void RFHT4_Control(int **Xdata_2D, int *index1, int *index2, int *index3,
                   int *index4, int *Look_Up_Sin_A1, int *Look_Up_Sin_A2,
                   int *Look_Up_Sin_A3, int *Look_Up_Sin_Coarse,
                   int *Look_Up_Cos_Fine, int *Look_Up_Sin_Fine,
                   int *power_of_two, int alpha, int N, int N2, int N4,
                   int RootNd2, int coef_levels, int no_of_bits_coeffs,
                   int scaling, int *scale_factors, int *scale_total,
                   int max_magnitude, int *beta1, int *beta2, int *beta3,
                   int angle_levels, int halfpi, int minusquarterpi,
                   int growth, int *arctans, int no_of_iterations,
                   int no_of_bits_angle, int LSB_guard_bits, int lower,
                   int upper, int BFLY_type, int MEM_type)
{
//  Description:
//  ------------
//  Routine to carry out the regularized FHT algorithm, with options to use
//  either twelve-multiplier, nine-multiplier or Cordic versions of the generic
//  double butterfly and N/4 word, sqrt(N)/2 word or zero word LUTs for the
//  storage of the trigonometric coefficients.
//  Externs:
//  --------
    void Butterfly_V12M (int, int, int, int*, int*, int*, int*, int*, int*,
                         int*, int, int, int, int*, int, int, int);
    void Butterfly_V09M (int, int, int, int*, int*, int*, int*, int*, int*,
                         int*, int, int, int, int*, int, int, int, int);
    void Butterfly_Cordic (int*, int*, int*, int*, int*, int*, int*, int, int,
                           int, int*, int, int, int, int, int*, int, int, int, int);
    void Coefficients_V12M_1Level (int, int, int, int, int, int*, int*, int*, int*, int);
    void Coefficients_V09M_1Level (int, int, int, int, int, int*, int*, int*, int*, int);
    void Coefficients_V12M_2Level (int, int, int, int, int, int, int, int*, int*, int*, int*, int, int);
    void Coefficients_V09M_2Level (int, int, int, int, int, int, int, int*, int*, int*, int*, int, int);
    void DataIndices (int, int, int, int, int*, int[2][4], int[2][4], int, int);
//  Parameters:
//  -----------
//    Xdata_2D = 2-D data.
//    index1   = 1st address permutation.
//    index2   = 2nd address permutation.
//    index3   = 3rd address permutation.
//    index4   = 4th address permutation.
//    Look_Up_Sin_A1     = LUT for single-angle argument.
//    Look_Up_Sin_A2     = LUT for double-angle argument.
//    Look_Up_Sin_A3     = LUT for triple-angle argument.
//    Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
//    Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
//    Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
//    power_of_two       = array containing powers of 2.
//    alpha              = no of temporal stages for transform.
//    N                  = transform length.
//    N2                 = N / 2.
//    N4                 = N / 4.
//    RootNd2            = sqrt(N) / 2.
//    coef_levels        = number of trigonometric coefficient quantisation levels.
//    no_of_bits_coeffs  = number of bits representing trigonometric coefficients.
//    scaling            = scaling flag: 1 => FIXED, 2 => BFP.
//    scale_factors      = bits to shift for double butterfly stages.
//    scale_total        = total number of BFP scaling bits.
//    max_magnitude      = maximum magnitude of data into double butterfly.
//    beta1              = initial single-angle Cordic rotation angle.
//    beta2              = initial double-angle Cordic rotation angle.
//    beta3              = initial triple-angle Cordic rotation angle.
//    angle_levels       = number of Cordic rotation angle quantisation levels.
//    halfpi             = integer value of +(pi/2).
//    minusquarterpi     = integer value of -(pi/4).
//    growth             = integer value of Cordic magnification factor.
//    arctans            = Cordic micro-rotation angles.
//    no_of_iterations   = no of Cordic iterations.
//    no_of_bits_angle   = no of bits representing Cordic rotation angle.
//    LSB_guard_bits     = no of bits for guarding LSB.
//    lower              = lower bound for register overflow with unconditional scaling.
//    upper              = upper bound for register overflow with unconditional scaling.
//    BFLY_type          = BFLY type: 1 => 12 multipliers, 2 => 9 multipliers, 3 => 3 Cordic units.
//    MEM_type           = MEM type: 1 => LUT = one-level, 2 => LUT = two-level.
//  Declarations:
//  -------------
//  Integers:
    int i, j, k, n, n2, offset, M, beta, bfly_count, Type, negate_flag, shift;
//  Integer Arrays:
    int X[9], kk[4], kbeta[3], Data_Max[1], coeffs[9], threshold[3];
    int index_even_2D[2][4], index_odd_2D[2][4];
//  ***********************************************************************
//  Set up offset for address permutations.
    kk[3] = 0;
//  Set up block floating-point thresholds.
threshold[0] = max magnitude; threshold[1] = max magnitude scale factors[i]); g W R I T E S - Set up output data vector for double butterfly. for (n = 0; n < 4; n++) f n2 = (n 0. Set up data indices for double butterfly. DataIndices (i, j, k, offset, kk, index even 2D, index odd 2D, bfly count, alpha); bfly count ++; Set up trigonometric coefficients for double butterfly. if (BFLY type == 1)
B Source Code Listings for Regularized Fast Hartley Transform f Butterfly is twelve-multiplier version. if (MEM type == 1) f Standard arithmetic & standard memory solution. Coefficients V12M 1Level (i, k, N2, N4, kbeta[0], Look Up Sin A1, Look Up Sin A2, Look Up Sin A3, coeffs, coef levels); g else f Standard arithmetic & reduced memory solution. Coefficients V12M 2Level (i, k, N2, N4, RootNd2, alpha, kbeta[0], Look Up Sin Coarse, Look Up Cos Fine, Look Up Sin Fine, coeffs, coef levels, no of bits coeffs); g Increment address offset. kbeta[0] += beta; g else f Butterfly is nine-multiplier version. if (BFLY type == 2) f if (MEM type == 1) f Reduced arithmetic & standard memory solution. Coefficients V09M 1Level (i, k, N2, N4, kbeta[0], Look Up Sin A1, Look Up Sin A2, Look Up Sin A3, coeffs, coef levels); g else f Reduced arithmetic & reduced memory solution. Coefficients V09M 2Level (i, k, N2, N4, RootNd2, alpha, kbeta[0], Look Up Sin Coarse, Look Up Cos Fine, Look Up Sin Fine, coeffs, coef levels, no of bits coeffs); g g Increment address offset. kbeta[0] += beta; g R E A D S - Set up input data vector for double butterfly. for (n = 0; n < 4; n++) f n2 = (n > scale factors[i]); g W R I T E S - Set up output data vector for double butterfly. for (n = 0; n < 4; n++) f n2 = (n void DibitReversal(int N8, int no of bits, int *power of two, int alpha, int *Xdata, int **Xdata 2D) f // Description: // --------// Routine to carry out in sequential fashion the in-place di-bit reversal mapping of the // input data and to store data in 2-D form. // Parameters: // --------// N8 = N / 8. = number of bits corresponding to N. // no of bits // power of two = array containing powers of 2. // alpha = no of temporal stages for transform. // Xdata = 1-D data. = 2-D data. // Xdata 2D // Externs: // -----void MemoryBankAddress (int, int, int, int, int*, int*); // Declarations: // ---------// Integers: // ------int i1, i2, i3, j1, j2, k, n, store; // Pointer Variables: // ------------int *bank, *offset; // *********************************************************************** // Set up dynamic memory. bank = new int [1]; bank[0] = 0; offset = new int [1]; offset[0] = 0; // Re-order data. i3 = 0; for (i1 = 0; i1 < N8; i1++) f for (i2 = 0; i2 < 8; i2++) f j1 = 0; j2 = (i3%2); for (k = 0; k < no of bits; k += 2) f n = no of bits - k; if (i3 & power of two[k]) j1 += power of two[n-2]; if (i3 & power of two[k+1]) j1 += power of two[n-1]; g if (j1 > i3) f store = Xdata[i3]; Xdata[i3] = Xdata[j1]; Xdata[j1] = store; g
//          Convert to 2-D form.
            MemoryBankAddress (i3, j2, 0, alpha, bank, offset);
            Xdata_2D[*bank][i1] = Xdata[i3];
            i3 ++;
        }
    }
//  Delete dynamic memory.
    delete bank, offset;
}
//  End of function.
#include "stdafx.h"

void Conversion(int channel_type, int N, int N2, int *XRdata, int *XIdata)
{
//  Description:
//  ------------
//  Routine to convert DHT coefficients to DFT coefficients. If the FHT is to be
//  used for the computation of the real-data FFT, as opposed to being used for
//  the computation of the complex-data FFT, the complex-valued DFT coefficients
//  are optimally stored in the following way:
//
//    XRdata[0]     = zero'th frequency component
//    XRdata[1]     = real component of 1st frequency component
//    XRdata[N-1]   = imag component of 1st frequency component
//    XRdata[2]     = real component of 2nd frequency component
//    XRdata[N-2]   = imag component of 2nd frequency component
//      ---
//    XRdata[N/2-1] = real component of (N/2-1)th frequency component
//    XRdata[N/2+1] = imag component of (N/2-1)th frequency component
//    XRdata[N/2]   = (N/2)th frequency component
//
//  For the case of the complex-valued FFT, however, the array "XRdata" stores
//  the real component of both the input and output data, whilst the array
//  "XIdata" stores the imaginary component of both the input and output data.
//  Parameters:
//  -----------
//    channel_type = 1 => real input channel, 2 => imaginary input channel.
//    N            = transform length.
//    N2           = N / 2.
//    XRdata       = on input: FHT output for real input channel;
//                   on output: as in "description" above.
//    XIdata       = on input: FHT output for imaginary input channel;
//                   on output: as in "description" above.
//  Declarations:
//  -------------
//  Integers:
    int j, k, store, store1, store2, store3;
//  ***********************************************************************
    if (channel_type == 1)
    {
//      R E A L   D A T A   C H A N N E L.
        k = N - 1;
//      Produce DFT output for this channel.
        for (j = 1; j < N2; j++)
        {
            store = XRdata[k] + XRdata[j];
            XRdata[k] = XRdata[k] - XRdata[j];
            XRdata[j] = store;
            XRdata[j] /= 2;
            XRdata[k] /= 2;
            k--;
        }
    }
    else
    {
//      I M A G I N A R Y   D A T A   C H A N N E L.
        k = N - 1;
//      Produce DFT output for this channel.
        for (j = 1; j < N2; j++)
        {
            store = XIdata[k] + XIdata[j];
            XIdata[k] = XIdata[k] - XIdata[j];
            XIdata[j] = store;
            XIdata[j] /= 2;
            XIdata[k] /= 2;
//          Produce DFT output for complex data.
            store1 = XRdata[j] + XIdata[k];
            store2 = XRdata[j] - XIdata[k];
            store3 = XIdata[j] + XRdata[k];
            XIdata[k] = XIdata[j] - XRdata[k];
            XRdata[j] = store2;
            XRdata[k] = store1;
            XIdata[j] = store3;
            k--;
        }
    }
}
//  End of function.
#include "stdafx.h"
#include <math.h>

void Coefficients_V09M_1Level(int i, int k, int N2, int N4, int kbeta,
                              int *Look_Up_Sin_A1, int *Look_Up_Sin_A2,
                              int *Look_Up_Sin_A3, int *coeffs, int coef_levels)
{
//  Description:
//  ------------
//  Routine to set up the trigonometric coefficients for use by the
//  nine-multiplier version of the generic double butterfly where one-level
//  LUTs are exploited.
//  Parameters:
//  -----------
//    i  = temporal addressing index.
//    k  = spatial addressing index.
//    N2 = N / 2.
//    N4 = N / 4.
//
//
//
//
//
199
kbeta = temporal/spatial index. = look-up table for single-angle argument. Look Up Sin A1 = look-up table for double-angle argument. Look Up Sin A2 = look-up table for triple-angle argument. Look Up Sin A3 coeffs = current set of trigonometric coefficients. = number of trigonometric coefficient quantisation levels. coef levels Declarations: --------Integers: ------int m, n, n3, store 00, store 01; static int startup, coeff 00, coeff 01, coeff 02, coeff 03, coeff 04; *********************************************************************** if (startup == 0) f Set up trivial trigonometric coefficients - valid for each type of double butterfly. coeff 00 = +coef levels; coeff 01 = 0; coeff 02 = coef levels; Set up additional constant trigonometric coefficient for Type-II double butterfly. coeff 03 = (int) ((sqrt(2.0) / 2) * coef levels); coeff 04 = coeff 03 + coeff 03; startup = 1; g if (i == 0) f Set up trigonometric coefficients for Type-I double butterfly. n3 = 0; for (n = 0; n < 3; n++) f coeffs[n3++] = coeff 00; coeffs[n3++] = coeff 01; coeffs[n3++] = coeff 00; g g else f if (k == 0) f Set up trigonometric coefficients for Type-II double butterfly. n3 = 0; for (n = 0; n < 2; n++) f coeffs[n3++] = coeff 00; coeffs[n3++] = coeff 01; coeffs[n3++] = coeff 00; g
200
B Source Code Listings for Regularized Fast Hartley Transform coeffs[6] = coeff 04; coeffs[7] = coeff 03; coeffs[8] = 0;
// //
//
//
// g
g else f Set up trigonometric coefficients for Type-III double butterfly. m = kbeta; Set up single-angle sinusoidal & cosinusoidal terms. store 00 = Look Up Sin A1[N4-m]; store 01 = Look Up Sin A1[m]; coeffs[0] = store 00 + store 01; coeffs[1] = store 00; coeffs[2] = store 00 store 01; Set up double-angle sinusoidal & cosinusoidal terms. m alpham1; ca1 = RootNd2 - sa1; sca2 = m % RootNd2; cv1 = Look Up Sin Coarse[ca1]; sv1 = Look Up Sin Coarse[sa1]; cv2 = Look Up Cos Fine[sca2]; sv2 = Look Up Sin Fine[sca2]; sum1 = cv1 + sv1; sum2 = cv2 + sv2; sum3 = cv2 - sv2; store1 = (( int64)sum1*cv2) >> bits to shift; store2 = (( int64)sum2*sv1) >> bits to shift; store3 = (( int64)sum3*cv1) >> bits to shift; store 00 = (int) (store1 - store2); store 01 = (int) (store1 - store3); coeffs[0] = store 00 + store 01; coeffs[1] = store 00; coeffs[2] = store 00 store 01; Set up double-angle sinusoidal & cosinusoidal terms. store1 = (( int64)store 00*store 00) >> bits to shift m1; store2 = (( int64)store 00*store 01) >> bits to shift m1;
B.3 Listings for Processing Functions
//
// g
store 02 = (int) (store1 - coef levels); store 03 = (int) store2; coeffs[3] = store 02 + store 03; coeffs[4] = store 02; coeffs[5] = store 02 store 03; Set up triple-angle sinusoidal & cosinusoidal terms. store1 = (( int64)store 02*store 00) >> bits to shift m1; store2 = (( int64)store 02*store 01) >> bits to shift m1; store 04 = (int) (store1 - store 00); store 05 = (int) (store2 + store 01); coeffs[6] = (int) (store 04 + store 05); coeffs[7] = (int) (store 04); coeffs[8] = (int) (store 04 - store 05); g g End of function.
#include "stdafx.h"
#include <math.h>

void Coefficients_V12M_1Level(int i, int k, int N2, int N4, int kbeta,
                              int *Look_Up_Sin_A1, int *Look_Up_Sin_A2,
                              int *Look_Up_Sin_A3, int *coeffs, int coef_levels)
{
//  Description:
//  ------------
//  Routine to set up the trigonometric coefficients for use by the
//  twelve-multiplier version of the generic double butterfly where one-level
//  LUTs are exploited.
//  Parameters:
//  -----------
//    i              = temporal addressing index.
//    k              = spatial addressing index.
//    N2             = N / 2.
//    N4             = N / 4.
//    kbeta          = temporal/spatial index.
//    Look_Up_Sin_A1 = look-up table for single-angle argument.
//    Look_Up_Sin_A2 = look-up table for double-angle argument.
//    Look_Up_Sin_A3 = look-up table for triple-angle argument.
//    coeffs         = current set of trigonometric coefficients.
//    coef_levels    = number of trigonometric coefficient quantisation levels.
//  Declarations:
//  -------------
//  Integers:
    int m, n, n3;
    static int startup, coeff_00, coeff_01, coeff_02, coeff_03;
//  **********************************************************************
    if (startup == 0)
    {
//      Set up trivial trigonometric coefficients - valid for each type of double butterfly.
        coeff_00 = +coef_levels;
        coeff_01 = 0;
        coeff_02 = -coef_levels;
//      Set up additional constant trigonometric coefficient for Type-II double butterfly.
        coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
        startup = 1;
    }
    if (i == 0)
    {
//      Set up trigonometric coefficients for Type-I double butterfly.
        n3 = 0;
        for (n = 0; n < 3; n++)
        {
            coeffs[n3++] = coeff_00;
            coeffs[n3++] = coeff_01;
            coeffs[n3++] = coeff_02;
        }
    }
    else
    {
        if (k == 0)
        {
//          Set up trigonometric coefficients for Type-II double butterfly.
            n3 = 0;
            for (n = 0; n < 2; n++)
            {
                coeffs[n3++] = coeff_00;
                coeffs[n3++] = coeff_01;
                coeffs[n3++] = coeff_02;
            }
            for (n = 6; n < 9; n++)
            {
                coeffs[n] = coeff_03;
            }
        }
        else
        {
//          Set up trigonometric coefficients for Type-III double butterfly.
            m = kbeta;
//          Set up single-angle sinusoidal & cosinusoidal terms.
            coeffs[0] = Look_Up_Sin_A1[N4-m];
            coeffs[1] = Look_Up_Sin_A1[m];
//
//
// g
Set up double-angle sinusoidal & cosinusoidal terms. m > alpham1; ca1 = RootNd2 - sa1; sca2 = m % RootNd2; cv1 = Look Up Sin Coarse[ca1]; sv1 = Look Up Sin Coarse[sa1]; cv2 = Look Up Cos Fine[sca2]; sv2 = Look Up Sin Fine[sca2]; sum1 = cv1 + sv1; sum2 = cv2 + sv2; sum3 = cv2 - sv2; store1 = (( int64)sum1*cv2) >> bits to shift; store2 = (( int64)sum2*sv1) >> bits to shift; store3 = (( int64)sum3*cv1) >> bits to shift; coeffs[0] = (int) (store1 - store2); coeffs[1] = (int) (store1 - store3); Set up double-angle sinusoidal & cosinusoidal terms. cv1 = coeffs[0]; sv1 = coeffs[1]; store1 = (( int64)cv1*cv1) >> bits to shift m1; store2 = (( int64)cv1*sv1) >> bits to shift m1; coeffs[3] = (int) (store1 coef levels); coeffs[4] = (int) store2; Set up triple-angle sinusoidal & cosinusoidal terms. cv2 = coeffs[3]; store1 = (( int64)cv1*cv2) >> bits to shift m1; store2 = (( int64)sv1*cv2) >> bits to shift m1; coeffs[6] = (int) (store1 cv1); coeffs[7] = (int) (store2 + sv1); Set up remaining trigonometric coefficients through symmetry. coeffs[2] = coeffs[0]; coeffs[5] = coeffs[3]; coeffs[8] = coeffs[6]; g g End of function.
#include "stdafx.h"
#include <math.h>

void Butterfly_V12M(int i, int j, int k, int *X, int *coeffs, int *kk,
                    int *index1, int *index2, int *index3, int *index4,
                    int coef_levels, int no_of_bits_coeffs, int scaling,
                    int *Data_Max, int shift, int lower, int upper)
208 f // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // //
//
B Source Code Listings for Regularized Fast Hartley Transform
Description: --------Routine to carry out the generic double butterfly computation using twelve fixed-point fast multipliers. Parameters: --------i = index for temporal loop. j = index for outer spatial loop. k = index for inner spatial loop. X = 1-D data array. coeffs = current set of trigonometric coefficients. kk = offsets for address permutations. index1 = 1st address permutation. index2 = 2nd address permutation. index3 = 3rd address permutation. index4 = 4th address permutation. = number of trigonometric coefficient quantisation levels. coef levels no of bits coeffs = number of bits representing trigonometric coefficients. scaling = scaling flag: 1 => FIXED, 2 => BFP. = maximum magnitude of output data set. Data Max shift = no of bits for input data to be shifted. lower = lower bound for register overflow with unconditional scaling. upper = upper bound for register overflow with unconditional scaling. Declarations: ---------Integers: ------int m, n, n2, n2p1, n3, n3p1, store, bits to shift1, bits to shift2; Long Integers: ---------int64 m1, m2, m3, m4; Integer Arrays: --------int Y[8]; *********************************************************************** Apply 1st address permutation - comprising one data exchange. m = kk[0]; store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store; Set up scaling factor for multiplication stage. bits to shift2 = no of bits coeffs - 1; if (scaling == 1) f Y[0] = X[0]; Y[1] = X[1];
B.3 Listings for Processing Functions // ###
// ###
// //
//
//
//
//
//
//
Check for register overflow & flag when overflow arises. for (n = 0; n < 8; n++) f if ((X[n] < lower) jj (X[n] > upper)) f printf (“nnnn Overflow occurred on input register”); g g Check for register overflow completed. g else f Set up scaling factor for first two samples of input data set. bits to shift1 = 3 - shift; Shift data so that MSB occupies optimum position. Y[0] = X[0] shift; g Build in three guard bits for LSB. bits to shift2 -= 3; g Apply trigonometric coefficients and 1st set of additions/subtractions. n3 = 0; for (n = 1; n < 4; n++) f n2 = (n > bits to shift2; m2 = (( int64)coeffs[n3p1]*X[n2p1]) >> bits to shift2; Y[n2] = (int) (m1 + m2); Truncate contents of registers to required levels. m3 = (( int64)coeffs[n3p1]*X[n2]) >> bits to shift2; m4 = (( int64)coeffs[n3+2]*X[n2p1]) >> bits to shift2; Y[n2p1] = (int) (m3 - m4); n3 += 3; g Apply 2nd address permutation. m = kk[1]; for (n = 0; n < 8; n++) f X[index2[m++]] = Y[n]; g Apply 2nd set of additions/subtractions. for (n = 0; n < 4; n++) f n2 = (n 3); Update maximum magnitude of output data set. if (abs(Y[n]) > abs(Data Max[0])) Data Max[0] = Y[n]; g Apply 4th address permutation. X[index4[m++]] = Y[n]; g End of function.
#include "stdafx.h"
#include <math.h>

void Butterfly_V09M(int i, int j, int k, int *X, int *coeffs, int *kk,
                    int *index1, int *index2, int *index3, int *index4,
                    int coef_levels, int no_of_bits_coeffs, int scaling,
                    int *Data_Max, int shift, int Type, int lower, int upper)
{
//  Description:
//  ------------
//  Routine to carry out the generic double butterfly computation using nine
//  fixed-point fast multipliers.
//  Parameters:
//  -----------
//    i      = index for temporal loop.
//    j      = index for outer spatial loop.
//    k      = index for inner spatial loop.
//    X      = 1-D data array.
//    coeffs = current set of trigonometric coefficients.
//    kk     = offsets for address permutations.
//    index1 = 1st address permutation.
//
// //
//
//
211
index2 = 2nd address permutation. index3 = 3rd address permutation. index4 = 4th address permutation. = number of trigonometric coefficient quantisation levels. coef levels = number of bits representing trigonometric coefficients. no of bits coeffs scaling = scaling flag: 1 => FIXED, 2 => BFP. = maximum magnitude of output data set. Data Max shift = no of bits for input data to be shifted. Type = butterfly type indicator: I, II or III. lower = lower bound for register overflow with unconditional scaling. upper = upper bound for register overflow with unconditional scaling. Note: ---Dimension array X[n] from 0 to 8 in calling routine RFHT4 Control. Declarations: ---------Integers: ------int m, n, n2, n2p1, store, bits to shift1, bits to shift2; Long Integers: ------int64 product; Integer Arrays: -------int Y[11]; *********************************************************************** Apply 1st address permutation - comprising one data exchange. m = kk[0]; store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store; Set up scaling factor for multiplication stage. bits to shift2 = no of bits coeffs - 1; if (scaling == 2) f Set up scaling factor for first two samples of input data set. bits to shift1 = 3 - shift; Shift data so that MSB occupies optimum position. X[0] = X[0] shift; g Build in three guard bits for LSB. bits to shift2 -= 3; g Apply 1st set of additions/subtractions. Y[0] = X[0]; Y[1] = X[1]; Y[2] = X[2]; Y[3] = X[2] + X[3]; Y[4] = X[3];
Y[5] = X[4]; Y[6] = X[4] + X[5]; Y[7] = X[5]; Y[8] = X[6]; Y[9] = X[6] + X[7]; Y[10] = X[7]; if (scaling == 1) f // Scale outputs of 1st set of additions/subtractions. For (n = 0; n < 11; n++) Y[n] = (Y[n]>>1); // ### Check for register overflow & flag when overflow arises. for (n = 0; n < 11; n++) f if ((Y[n] < lower) jj (Y[n] > upper)) f printf (“nnnn Overflow occurred on input register”); g g // ### Check for register overflow completed. g // Apply trigonometric coefficients. for (n = 0; n < 9; n++) f product = (( int64)coeffs[n]*Y[n+2]) >> bits to shift2; X[n] = (int) product; g // Apply 2nd set of additions/subtractions. if (Type < 3) f Y[2] = X[0] + X[1]; Y[3] = X[1] + X[2]; Y[4] = X[3] + X[4]; Y[5] = X[4] + X[5]; g else f Y[2] = X[1] - X[2]; Y[3] = X[0] - X[1]; Y[4] = X[4] - X[5]; Y[5] = X[3] - X[4]; g if (Type < 2) f Y[6] = X[6] + X[7]; Y[7] = X[7] + X[8]; g else f Y[6] = X[7] - X[8]; Y[7] = X[6] - X[7]; g // Apply 2nd address permutation. m = kk[1]; for (n = 0; n < 8; n++) f X[index2[m++]] = Y[n]; g
B.3 Listings for Processing Functions //
//
//
//
// //
//
// g
213
Apply 3rd set of additions/subtractions. for (n = 0; n < 4; n++) f n2 = (n 3); Update maximum magnitude of output data set. if (abs(Y[n]) > abs(Data Max[0])) Data Max[0] = Y[n]; g Apply 4th address permutation. X[index4[m++]] = Y[n]; g End of function.
#include "stdafx.h"
#include <math.h>

void Butterfly_Cordic(int *X, int *kbeta, int *kk, int *index1, int *index2,
                      int *index3, int *index4, int halfpi, int minusquarterpi,
                      int growth, int *arctans, int no_of_iterations,
                      int no_of_bits_angle, int negate_flag, int scaling,
                      int *Data_Max, int shift, int LSB_guard_bits,
                      int lower, int upper)
{
//  Description:
//  ------------
//  Routine to carry out the generic double butterfly computation using three
//  Cordic arithmetic units.
//  Externs:
//  --------
    void Rotation (int*, int*, int*, int, int, int*);
//
//
//
B Source Code Listings for Regularized Fast Hartley Transform Parameters: --------X = data. kbeta = current set of rotation angles. kk = offsets for address permutations. index1 = 1st address permutation. index2 = 2nd address permutation. index3 = 3rd address permutation. index4 = 4th address permutation. halfpi = integer version of +(pi/2). minusquarterpi = integer version of -(pi/4). growth = integer version of Cordic magnification factor. arctans = micro-rotation angles. no of iterations = no of Cordic iterations. no of bits angle = no of bits to represent Cordic rotation angle. = negation flag for Cordic output. negate flag scaling = scaling flag: 1 => FIXED, 2 => BFP. = maximum magnitude of output data set. Data Max shift = no of bits for input data to be shifted. LSB guard bits = no of bits for guarding LSB. lower = lower bound for register overflow with unconditional scaling. upper = upper bound for register overflow with unconditional scaling. Declarations: ---------Integers: ------int m, n, n2, n2p1, store, bits to shift1, bits to shift2; Integer Arrays: ---------int Y[8], xs[3], ys[3], zs[3]; ***************************************************************** Apply 1st address permutation - comprising one data exchange. m = kk[0]; store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store; Set up scaling factor for multiplication stage. bits to shift1 = no of bits angle - 1; if (scaling == 1) f ### Check for register overflow & flag when overflow arises. for (n = 0; n < 8; n++) f if ((X[n] < lower) jj (X[n] > upper)) f printf (“nnnn Overflow occurred on input register”); g g ### Check for register overflow completed. g else
B.3 Listings for Processing Functions
// //
//
//
//
// //
//
//
//
f Set up scaling factor for first two samples of input data set. bits to shift2 = LSB guard bits - shift + 2; Shift data so that MSB occupies optimum position. X[0] = X[0] >> shift; X[1] = X[1] >> shift; for (n = 2; n < 8; n++) f X[n] = X[n] > bits to shift1); Y[1] = (int) ((( int64)growth*X[1]) >> bits to shift1); Set up inputs to Cordic phase rotations of remaining permuted inputs. xs[0] = X[2]; xs[1] = X[4]; xs[2] = X[6]; ys[0] = X[3]; ys[1] = X[5]; ys[2] = X[7]; zs[0] = kbeta[0]; zs[1] = kbeta[1]; zs[2] = kbeta[2]; if (negate flag == 1) zs[2] = minusquarterpi; Carry out Cordic phase rotations of remaining permuted inputs. Rotation (xs, ys, zs, halfpi, no of iterations, arctans); Set up outputs from Cordic phase rotations of remaining permuted inputs. Y[2] = xs[0]; Y[4] = xs[1]; Y[6] = xs[2]; Y[3] = ys[0]; Y[5] = ys[1]; Y[7] = ys[2]; if (scaling == 2) f Scale Cordic outputs to remove LSB guard bits. for (n = 2; n < 8; n++) f Y[n] = Y[n] >> LSB guard bits; g g Negate, where appropriate, phase rotated outputs. if (negate flag > 0) f Y[7] = -Y[7]; if (negate flag > 1) f Y[3] = -Y[3]; Y[5] = -Y[5]; g g Apply 2nd address permutation. m = kk[1]; for (n = 0; n < 8; n++) f X[index2[m++]] = Y[n]; g
   // Apply 1st set of additions/subtractions.
   for (n = 0; n < 4; n++)
   {
      n2 = (n << 1);
      n2p1 = n2 + 1;
      Y[n2]   = X[n2] + X[n2p1];
      Y[n2p1] = X[n2] - X[n2p1];
   }

   // ... 3rd address permutation and 2nd set of additions/subtractions ...

   // Update maximum magnitude of output data set.
   for (n = 0; n < 8; n++)
   {
      if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n];
   }

   // Apply 4th address permutation.
   m = kk[2];
   for (n = 0; n < 8; n++)
   {
      X[index4[m++]] = Y[n];
   }

// End of function.
}
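The double butterfly listing above consumes a set of pre-quantized Cordic constants (halfpi, minusquarterpi, growth and the arctans table) whose preparation is not shown in this excerpt. The routine below is a minimal sketch of one plausible pre-computation, assuming that all angles and the magnification factor are held in Q(no_of_bits_angle - 1) fixed-point format, consistent with the shift by bits_to_shift1 = no_of_bits_angle - 1 in the multiplication stage; the helper name SetUpCordicConstants and the rounding convention are illustrative assumptions rather than details taken from the book's software.

#include <math.h>

// Illustrative sketch: quantize the Cordic constants consumed by Butterfly_Cordic.
// All values are stored in Q(no_of_bits_angle - 1) format, so that +(pi/2) maps to
// (pi/2)*2^(no_of_bits_angle - 1).
void SetUpCordicConstants (int no_of_bits_angle, int no_of_iterations, int *halfpi,
                           int *minusquarterpi, int *growth, int *arctans)
{
   const double PI = 3.14159265358979323846;
   double scale = pow (2.0, (double)(no_of_bits_angle - 1));
   double K = 1.0;
   int k;

   *halfpi         = (int) floor ((PI/2.0)*scale + 0.5);
   *minusquarterpi = -((int) floor ((PI/4.0)*scale + 0.5));

   for (k = 0; k < no_of_iterations; k++)
   {
      // Micro-rotation angle for iteration k: arctan(2^-k), quantized.
      arctans[k] = (int) floor (atan (pow (2.0, -(double)k))*scale + 0.5);

      // Accumulate the Cordic magnification (growth) factor.
      K *= sqrt (1.0 + pow (2.0, -2.0*(double)k));
   }

   *growth = (int) floor (K*scale + 0.5);
}

With sixteen iterations the accumulated factor K converges to approximately 1.6468; the multiplication stage applies this growth to the two non-rotated samples so that all eight butterfly outputs share the same scaling as the Cordic-rotated samples.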
#include "stdafx.h"

void Rotation (int *xs, int *ys, int *zs, int halfpi, int no_of_iterations, int *arctans)
{
   // Description:
   // ------------
   // Routine to carry out the phase rotations required by the Cordic arithmetic unit for the
   // single angle, double angle and triple angle cases.

   // Parameters:
   // -----------
   // xs               = X coordinates.
   // ys               = Y coordinates.
   // zs               = rotation angles.
   // halfpi           = +(pi/2).
   // no_of_iterations = no of Cordic iterations.
   // arctans          = set of micro-rotation angles.
   // Declarations:
   // -------------
   // Integers:
   // ---------
   int k, n;

   // Integer Arrays:
   // ---------------
   int temp[3];

   // ***********************************************************************
   //             P H A S E   R O T A T I O N   R O U T I N E.

   // Reduce three rotation angles to region of convergence: [-pi/2,+pi/2].
   for (n = 0; n < 3; n++)
   {
      if (zs[n] < -halfpi)
      {
         temp[n] = +ys[n];   ys[n] = -xs[n];   xs[n] = temp[n];
         zs[n] += halfpi;
      }
      else if (zs[n] > +halfpi)
      {
         temp[n] = -ys[n];   ys[n] = +xs[n];   xs[n] = temp[n];
         zs[n] -= halfpi;
      }
   }

   // Loop through Cordic iterations.
   for (k = 0; k < no_of_iterations; k++)
   {
      // Carry out phase micro-rotation of three complex data samples.
      for (n = 0; n < 3; n++)
      {
         if (zs[n] < 0)
         {
            temp[n] = xs[n] + (ys[n] >> k);
            ys[n] -= (xs[n] >> k);
            xs[n] = temp[n];
            zs[n] += arctans[k];
         }
         else
         {
            temp[n] = xs[n] - (ys[n] >> k);
            ys[n] += (xs[n] >> k);
            xs[n] = temp[n];
            zs[n] -= arctans[k];
         }
      }
   }

// End of function.
}
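As a stand-alone sanity check of the Rotation listing, the following small driver rotates the vector (1,0) through +(pi/4) in all three Cordic channels and compares the result with the expected value. The Q14 angle format, the 16-entry arctangent table and the magnification factor of approximately 1.6468 are illustrative assumptions rather than parameters taken from the book's test software.

#include <stdio.h>
#include <math.h>

extern void Rotation (int *xs, int *ys, int *zs, int halfpi, int no_of_iterations, int *arctans);

int main (void)
{
   const double PI = 3.14159265358979323846;
   const double SCALE = 16384.0;           // Q14 format (assumption).
   const double GROWTH = 1.64676;          // Cordic magnification for 16 iterations.
   int arctans[16], xs[3], ys[3], zs[3], halfpi, k;

   // Build the micro-rotation angle table: arctan(2^-k) in Q14.
   for (k = 0; k < 16; k++)
      arctans[k] = (int) floor (atan (pow (2.0, -(double)k))*SCALE + 0.5);
   halfpi = (int) floor ((PI/2.0)*SCALE + 0.5);

   // Rotate the unit vector (1,0) through +(pi/4) in each of the three channels.
   for (k = 0; k < 3; k++)
   {
      xs[k] = (int) SCALE;   ys[k] = 0;
      zs[k] = (int) floor ((PI/4.0)*SCALE + 0.5);
   }

   Rotation (xs, ys, zs, halfpi, 16, arctans);

   // Both outputs should be close to cos(pi/4) = sin(pi/4) = 0.7071 once the
   // magnification factor has been divided out.
   printf ("x = %.4f, y = %.4f\n", xs[0]/(SCALE*GROWTH), ys[0]/(SCALE*GROWTH));
   return 0;
}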
#include "stdafx.h"
#include <stdlib.h>

void DataIndices (int i, int j, int k, int offset, int *kk, int index_even_2D[2][4],
                  int index_odd_2D[2][4], int bfly_count, int alpha)
{
   // Description:
   // ------------
   // Routine to set up the data indices for accessing the input data for the generic
   // double butterfly.

   // Parameters:
   // -----------
   // i             = index for temporal loop.
   // j             = index for outer spatial loop.
   // k             = index for inner spatial loop.
   // offset        = element of power-of-two array.
   // kk            = offsets for address permutations.
   // index_even_2D = even data address indices.
   // index_odd_2D  = odd data address indices.
   // bfly_count    = double butterfly address for stage.
   // alpha         = no of temporal stages for transform.

   // Externs:
   // --------
   void MemoryBankAddress (int, int, int, int, int*, int*);

   // Declarations:
   // -------------
   // Integers:
   // ---------
   int n, n1, n2, twice_offset;

   // Pointer Variables:
   // ------------------
   int *bank1, *offset1, *bank2, *offset2;

   // ***********************************************************************

   // Set up dynamic memory.
   bank1 = new int [1];     bank1[0] = 0;
   bank2 = new int [1];     bank2[0] = 0;
   offset1 = new int [1];   offset1[0] = 0;
   offset2 = new int [1];   offset2[0] = 0;

   // Calculate data indices.
   if (i == 0)
   {
      // S T A G E = 0.
      twice_offset = offset;

      // Set up even and odd data indices for Type-I double butterfly.
      n1 = j - twice_offset;
      n2 = n1 + 4;
      for (n = 0; n < 4; n++)
      {
         n1 += twice_offset;
         n2 += twice_offset;
         MemoryBankAddress (n1, n, 1, alpha, bank1, offset1);
         index_even_2D[0][n] = *bank1;
         index_even_2D[1][n] = *offset1;
         MemoryBankAddress (n2, n, 1, alpha, bank2, offset2);
         index_odd_2D[0][n] = *bank2;
         index_odd_2D[1][n] = *offset2;
      }
      // Set up offsets for address permutations.
      kk[0] = 0;   kk[1] = 0;   kk[2] = 0;
   }
   else
   {
      // S T A G E > 0.
      twice_offset = (offset << 1);

      // ...
   }

   // Free dynamic memory.
   delete [] bank1;     delete [] bank2;
   delete [] offset1;   delete [] offset2;

// End of function.
}

#include "stdafx.h"

void MemoryBankAddress (int address, int n, int flag, int alpha, int *bank, int *offset)
{
   // Description:
   // ------------
   // Routine to calculate the memory bank address of a data sample together with the
   // address offset within the memory bank.

   // Parameters:
   // -----------
   // address = address of data sample.
   // n       = sample index within double butterfly.
   // flag    = 0 => start up, 1 => butterfly.
   // alpha   = no of temporal stages for transform.
   // bank    = memory bank address of sample: [0,1,2,3,4,5,6,7].
   // offset  = address offset within memory bank: [0,...,N/8-1].

   // Note:
   // -----
   // For optimum arithmetic efficiency, comment out coding options not relevant to the
   // current application.

   // Declarations:
   // -------------
   // Integers:
   // ---------
   int k1, k2, sub_block_size, mapping;

   // ***********************************************************************

   // Calculate memory bank address for N up to and including 1K.
   // bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+
   //            ((address%256)>>6)+(address>>8)) % 4) ...

   // Calculate memory bank address for N up to and including 4K.
   // bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+((address%256)>>6)+
   //            ((address%1024)>>8)+(address>>10)) % 4) ...

   // Calculate memory bank address for N up to and including 16K.
   // bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+((address%256)>>6)+
   //            ((address%1024)>>8)+((address%4096)>>10)+(address>>12)) % 4) ...

   // Calculate address offset within memory bank.
   offset[0] = address >> 3;

// End of function.
}
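The bank equations in the listing above are truncated in this reproduction, but the recoverable portion shows the bank index being driven by the sum of the radix-4 digits of the sample address, taken modulo 4. The fragment below illustrates that digit-sum idea for N = 64; the final fold from four residues into eight banks is an assumption made purely for illustration, since that part of the equation is lost, and the function name BankOf is invented.

#include <stdio.h>

// Illustrative only: a radix-4 digit-sum bank mapping for N = 64. The '% 4'
// digit-sum term follows the truncated comment in MemoryBankAddress; combining
// it with one further address bit to reach eight banks is an assumption.
static int BankOf (int address)
{
   int sum = (address % 4) + ((address % 16) >> 2) + ((address % 64) >> 4);
   return ((sum % 4) << 1) + ((address >> 2) & 1);
}

int main (void)
{
   // Count how many of the 64 sample addresses fall in each of the eight banks:
   // a balanced mapping assigns exactly eight addresses to every bank.
   int counts[8] = {0}, a;

   for (a = 0; a < 64; a++) counts[BankOf (a)]++;
   for (a = 0; a < 8; a++) printf ("bank %d : %d samples\n", a, counts[a]);
   return 0;
}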
Glossary
ADC – analog-to-digital conversion
ASIC – application-specific integrated circuit
AWGN – additive white Gaussian noise
CD – compact disc
CFA – Common Factor Algorithm
CLB – configurable logic block
CM – trigonometric coefficient memory
CN – linear space of complex-valued N-tuples
CORDIC – Co-Ordinate Rotation DIgital Computer
CRT – Chinese Remainder Theorem
CS – computational stage
DA – distributed arithmetic
DDC – digital down conversion
DFT – discrete Fourier transform
DHT – discrete Hartley transform
DIF – decimation-in-frequency
DIT – decimation-in-time
DM – data memory
DMER – even-real data memory
DMI – intermediate data memory
DMOI – odd-imaginary data memory
DSP – digital signal processing
DTMF – dual-tone multi-frequency
FDM – frequency division multiplexed
FFT – fast Fourier transform
FHT – fast Hartley transform
FNT – Fermat number transform
FPGA – field-programmable gate array
GD-BFLY – generic double butterfly
HDL – hardware description language
IF – intermediate frequency
I/O – input–output
IP – intellectual property
LSB – least significant bit
LUT – look-up table
MAC – multiplier and accumulator
MNT – Mersenne number transform
MSB – most significant bit
NTT – number-theoretic transform
O – order
PE – processing element
PFA – Prime Factor Algorithm
PSD – power spectral density
RAM – random access memory
R24 FHT – regularized radix-4 fast Hartley transform
RF – radio frequency
RN – linear space of real-valued N-tuples
ROM – read only memory
SFDR – spurious-free dynamic range
SFG – signal flow graph
SIMD – single-instruction multiple-data
SNR – signal-to-noise ratio
TDOA – time-difference-of-arrival
TOA – time-of-arrival
Index
A
Alias-free formulation, 154–155
Analog-to-digital conversion (ADC), 42, 72, 95
Application-specific integrated circuit (ASIC), 1, 2, 8, 39, 65–67, 77, 114
Area efficient, 62, 70–72, 77–98, 160
Arithmetic complexity, 1, 8, 18–20, 25, 37, 39, 61, 72–74, 77, 85, 86, 97, 110, 112, 114, 124, 125, 132, 137, 140, 145, 151, 155
Auto-correlation, 12, 135–137, 141–146, 161
B
Bergland algorithm, 16–18
Bit reversal mapping, 23, 47, 98
Bruun algorithms, 16, 18–19, 24
Butterfly, 3, 7, 8, 11, 17, 37–39, 41–56, 59, 61, 62, 81, 84, 90–91, 108, 160
C
Channelization, 12, 61, 135, 136, 149–156, 161
Chinese remainder theorem (CRT), 5
Circular convolution, 36, 140, 150
Circular correlation, 140, 144
Clock frequency, 66, 68, 71, 97, 114
Coarse grain parallelism, 68
Common factor algorithm (CFA), 5
Complementary angle LUTs, 85, 86, 92, 125, 126
Computational density, 11, 42, 66, 71, 74, 77, 78, 127, 132, 135, 161, 162
Computational stage (CS), 4, 62, 90, 105, 108, 127
Configurable logic block (CLB), 67
Convolution, 34, 36, 135–137, 140, 144, 149–151, 156
Cooley-Tukey algorithm, 4, 5, 17, 133
Co-Ordinate Rotation Digital Computer (CORDIC), 10, 12, 70, 72, 73, 101, 102, 104–114, 159, 161, 165, 168–171, 173–176, 178–180, 189, 190, 193, 195, 213–215, 217
Correlation, 12, 35, 37, 135–137, 140–149, 151, 156, 161
Cross-correlation, 12, 135, 137, 141–143, 145–148

D
Data memory (DM), 12, 59, 79–84, 90–92, 95, 98, 114, 120, 122, 124–127, 130, 132, 156
Data space, 5, 8, 11, 12, 36, 39, 129, 136–140, 142, 143, 149, 156
Decimation-in-frequency (DIF), 5, 16, 17, 20, 23, 42
Decimation-in-time (DIT), 5, 16, 17, 20, 23, 37, 42, 45, 59
Di-bit reversal mapping, 47, 98
Differentiation, 135, 137–140, 147, 156
Digital down conversion (DDC), 7, 24, 95, 135, 136, 149, 151, 153, 156, 160
Digit reversal mapping, 23
Discrete Fourier transform (DFT), 1–8, 10–12, 15–25, 27–37, 39, 41, 59, 65, 78, 96, 114, 117–136, 149, 151–156, 159–161
Discrete Hartley transform (DHT), 1, 6–8, 10–12, 27–39, 42, 44, 129, 130, 133, 138–140, 143–145, 147, 148, 150, 155, 160
Distributed arithmetic (DA), 10, 70, 102
Divide-and-conquer, 4, 46
Double buffering, 95, 98, 127
Double-resolution approach, 117, 118, 124–127, 132
Dragonfly, 62
Dual-port memory, 91, 96, 97, 132, 156
E
Equivalency theorem, 151
F
Fast Fourier transform (FFT), 1, 3–8, 10, 11, 15–24, 37, 39, 41, 46, 59–62, 65, 67, 71, 72, 74, 78, 93–98, 102, 118–122, 124, 128–130, 133, 137, 139, 141, 142, 144, 153, 155, 159–162
Fast Hartley transform (FHT), 1, 6–10, 12, 27–39, 41–63, 65, 69, 74, 77–79, 87, 101, 107, 111, 117–133, 135–156, 160
Field-programmable gate array (FPGA), 1, 2, 8, 23, 39, 65–67, 73, 74, 77, 79, 93–97, 101–103, 105, 110, 112, 114, 134, 153
Fine grain parallelism, 68
Fixed point, 2, 39, 42, 61–62, 90, 92, 97, 101–103, 107, 110, 112, 114, 127, 133, 161
Fourier matrix, 2, 5
Fourier space, 11, 30, 31, 39, 98, 118, 120, 121, 129–131, 135, 136, 141
Frequency division multiplexed (FDM), 149
G
Generic double butterfly (GD-BFLY), 8, 48, 50, 52, 54–56, 90–91, 108, 160
Global pipelining, 69, 78
H
Half-resolution approach, 117, 118, 131–133
Hardware description language (HDL), 13, 73, 84, 161
Hartley matrix, 6
Hartley space, 8, 11, 30, 32, 36, 39, 98, 118, 120, 121, 129–131, 135–142, 147–151, 156
I
In-place processing, 80, 123, 132
Input-output (I/O), 9, 11, 66, 70–72, 78, 95, 132

K
Kernels, 2–4, 6, 10, 18, 27, 29, 63

L
Latency, 8, 10, 25, 37, 66, 71, 72, 78, 95–97, 105, 114, 121, 127, 129, 131, 155, 159, 162
Linear convolution, 36, 137, 139, 144
Linear correlation, 140
Local pipelining, 69

M
Matched filter, 2, 39, 144
Memory requirement, 7, 8, 10, 18, 24, 25, 37, 58, 59, 61, 63, 68, 70, 71, 73, 74, 86, 91, 92, 95, 97, 101, 102, 110, 112, 114, 124–126, 131, 132, 145, 151, 161
Minimum arithmetic addressing, 57, 85, 86, 91, 93, 112, 124, 126
Minimum memory addressing, 57–58, 85–86, 88, 92, 112, 125, 126
Mobile communications, 2, 7–9, 11, 65–67, 79, 159
Multiplier and accumulator (MAC), 7

N
Nearest-neighbour communication, 61, 68
Noble identity, 151

O
Orthogonal, 2, 6, 10, 11, 27, 36, 46, 53, 129, 137, 140, 160
Overlap-add technique, 137, 144, 149
Overlap-save technique, 36, 139, 144, 150

P
Parseval’s theorem, 36, 129, 130
Partitioned-memory processing, 66, 71–72, 74, 98
Pipeline delay, 90, 114, 127, 131
Pipelining, 4, 62, 68, 69, 78, 82–84, 88, 90–91, 105
Polyphase DFT filter bank, 135, 136, 151, 153, 154, 156
Power spectral density (PSD), 39
Prime factor algorithm (PFA), 5
Processing element (PE), 9

Q
Quad-port memory, 96, 97, 132, 156
R
Radix, 3–5, 7, 8, 11, 16, 19, 20, 23, 24, 37, 41–56, 61, 62, 69, 78, 80, 93–95, 98, 117, 133, 159, 160
Random access memory (RAM), 65–67, 73, 101, 103, 110, 112, 114, 123, 159
Read only memory (ROM), 67, 104
Real-from-complex strategy, 7, 15, 16, 59, 95, 153, 160
Regularity, 3, 8, 24, 37, 39, 41, 45, 47, 52, 54, 59, 61, 62, 114, 133, 160
Regularized radix-4 fast Hartley transform (R24 FHT), 8–10, 12, 13, 36–38, 41, 42, 45–48, 54, 56–59, 61, 62, 66–71, 73, 74, 77, 78, 80, 81, 83, 85, 86, 91–93, 95–98, 101, 102, 107, 110, 112, 114, 117–121, 123–127, 129–135, 142, 153, 155, 156, 160, 161, 163–165, 167, 169, 175

S
Scalable, 70, 71, 77–98, 112, 160, 162
Scaling strategy, 28, 39, 61, 62, 98, 161, 163, 167, 169–171, 178, 180
Shannon-Hartley theorem, 9
Signal flow graph (SFG), 16, 17, 38, 45, 47, 48, 52, 54–56, 62, 108, 109, 120–122
Signal-to-noise ratio (SNR), 42, 136, 144
Silicon area, 11, 66, 68–71, 78
Single-instruction multiple-data (SIMD), 4, 96
Single-port memory, 110
Single-quadrant addressing scheme, 57, 85
Space complexity, 69, 70, 91–92
Spider, 62
Spurious-free dynamic range (SFDR), 95
Start-up delay, 90, 123–125
Switching frequency, 66, 68, 70
Symmetry, 3, 10, 16, 29, 52, 62, 63, 205, 207
Systolic, 4
T
Time complexity, 23, 70, 78, 92–93, 117, 124, 125, 127, 131, 155
Time-difference-of-arrival (TDOA), 141, 148, 155, 156
Time-of-arrival (TOA), 141, 148, 155
Transform space, 8, 12, 31, 32, 39, 98, 118, 132, 133, 135–137, 140, 142, 143, 147, 149, 150, 155
Trigonometric coefficient memory (CM), 12, 56–58, 69, 70, 72, 79, 80, 85, 86, 91, 92, 96, 103, 104, 110, 112, 132, 156, 163, 166
U
Unitary, 2, 6, 10, 11, 27, 28, 36, 46, 129, 137, 140
Update-time, 8, 65, 71, 95–97, 132, 159
Up-sampling, 135, 137–140, 147, 156
W
Wireless communications, 9, 13, 135, 136, 159, 161
Z
Zero-padding, 118, 133, 144, 150