R. Todd Ogden
Essential Wavelets for Statistical Applications and Data Analysis
Birkhauser Boston • Basel • Berlin
R. Todd Ogden
Department of Statistics
University of South Carolina
Columbia, SC 29208
Library of Congress Cataloging-in-Publication Data

Ogden, R. Todd, 1965-
  Essential wavelets for statistical applications and data analysis / R. Todd Ogden.
    p. cm.
  Includes bibliographical references (p. 191-198) and index.
  ISBN 0-8176-3864-4 (hardcover : alk. paper). -- ISBN 3-7643-3864-4 (hardcover : alk. paper)
  1. Wavelets (Mathematics) 2. Mathematical statistics I. Title.
  QA403.3.O43 1997
  519.5--dc20
  97-27379 CIP
Printed on acid-free paper © 1997 Birkhauser Boston
Copyright is not claimed for works of U.S. Government employees. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior permission of the copyright owner. Permission to photocopy for internal or personal use of specific clients is granted by Birkhauser Boston for libraries and other users registered with the Copyright Clearance Center (CCC), provided that the base fee of $6.00 per copy, plus $0.20 per page is paid directly to CCC, 222 Rosewood Drive, Danvers, MA 01923, U.S.A. Special requests should be addressed directly to Birkhauser Boston, 675 Massachusetts Avenue, Cambridge, MA 02139, U.S.A.

ISBN 0-8176-3864-4
ISBN 3-7643-3864-4

Typeset in LaTeX by ShadeTree Designs, Minneapolis, MN. Cover design by Spencer Ladd, Somerville, MA. Printed and bound by Maple-Vail, York, PA. Printed in the U.S.A.
9 8 7 6 5 4 3 2 1
To Christine
Contents

Preface  ix
Prologue: Why Wavelets?  xiii

1  Wavelets: A Brief Introduction  1
   1.1  The Discrete Fourier Transform  1
   1.2  The Haar System  7
        Multiresolution Analysis  14
        The Wavelet Representation  16
        Goals of Multiresolution Analysis  22
   1.3  Smoother Wavelet Bases  23

2  Basic Smoothing Techniques  29
   2.1  Density Estimation  29
        Histograms  31
        Kernel Estimation  32
        Orthogonal Series Estimation  35
   2.2  Estimation of a Regression Function  38
        Kernel Regression  39
        Orthogonal Series Estimation  42
   2.3  Kernel Representation of Orthogonal Series Estimators  45

3  Elementary Statistical Applications  49
   3.1  Density Estimation  49
        Haar-Based Histograms  49
        Estimation with Smoother Wavelets  52
   3.2  Nonparametric Regression  54

4  Wavelet Features and Examples  59
   4.1  Wavelet Decomposition and Reconstruction  59
        Two-Scale Relationships  60
        The Decomposition Algorithm  62
        The Reconstruction Algorithm  63
   4.2  The Filter Representation  66
   4.3  Time-Frequency Localization  69
        The Continuous Fourier Transform  69
        The Windowed Fourier Transform  72
        The Continuous Wavelet Transform  74
   4.4  Examples of Wavelets and Their Constructions  79
        Orthogonal Wavelets  81
        Biorthogonal Wavelets  83
        Semiorthogonal Wavelets  87

5  Wavelet-based Diagnostics  89
   5.1  Multiresolution Plots  89
   5.2  Time-Scale Plots  92
   5.3  Plotting Wavelet Coefficients  95
   5.4  Other Plots for Data Analysis  100

6  Some Practical Issues  103
   6.1  The Discrete Fourier Transform of Data  104
        The Fourier Transform of Sampled Signals  104
        The Fast Fourier Transform  105
   6.2  The Wavelet Transform of Data  107
   6.3  Wavelets on an Interval  110
        Periodic Boundary Handling  111
        Symmetric and Antisymmetric Boundary Handling  112
        Meyer Boundary Wavelets  113
        Orthogonal Wavelets on the Interval  114
   6.4  When the Sample Size is Not a Power of Two  115

7  Other Applications  119
   7.1  Selective Wavelet Reconstruction  119
        Wavelet Thresholding  124
        Spatial Adaptivity  126
        Global Thresholding  128
        Estimation of the Noise Level  131
   7.2  More Density Estimation  132
   7.3  Spectral Density Estimation  133
   7.4  Detection of Jumps and Cusps  140

8  Data Adaptive Wavelet Thresholding  143
   8.1  SURE Thresholding  144
   8.2  Threshold Selection by Hypothesis Testing  149
        Recursive Testing  151
        Minimizing False Discovery  154
   8.3  Cross-Validation Methods  156
   8.4  Bayesian Methods  161

9  Generalizations and Extensions  167
   9.1  Two-Dimensional Wavelets  167
   9.2  Wavelet Packets  173
        Wavelet Packet Functions  174
        The Best Basis Algorithm  177
   9.3  Translation Invariant Wavelet Smoothing  180

Appendix  185
References  191
Glossary of Notation  199
Glossary of Terms  201
Index  205
Preface

I once heard the book by Meyer (1993) described as a "vulgarization" of wavelets. While this is true in one sense of the word, that of making a subject popular (Meyer's book is one of the early works written with the nonspecialist in mind), the implication seems to be that such an attempt somehow cheapens or coarsens the subject. I have to disagree that popularity goes hand-in-hand with debasement. While there is certainly a beautiful theory underlying wavelet analysis, there is plenty of beauty left over for the applications of wavelet methods. This book is also written for the non-specialist, and therefore its main thrust is toward wavelet applications. Enough theory is given to help the reader gain a basic understanding of how wavelets work in practice, but much of the theory can be presented using only a basic level of mathematics. Only one theorem is formally stated in this book, with only one proof. And these are only included to introduce some key concepts in a natural way.
Aim and Scope

This book was written to be the reference that I wanted when I began my own study of wavelets. I had books and papers, I studied theorems and proofs, but no single one of these sources by itself answered the specific questions I had: In order to apply wavelets successfully, what do I need to know? And why do I need to know it? It is my hope that this book will answer these questions for others in the same situation. In keeping with the title of this book, I have attempted to pare down the possible topics of coverage to just the essentials required for statistical applications and analysis of data. New statistical applications are being developed quickly, so due to the combination of careful choice of topics and natural delays in writing and printing, this book is necessarily incomplete. It is hoped, however, that the introduction provided in this text will provide a suitable foundation for readers to jump off into other wavelet-related topics. I am of the opinion that basic wavelet methods of smoothing functions, for example, should be as widely understood as standard kernel methods are now. Admittedly, understanding wavelet methods requires a substantial amount of overhead, in terms of time and effort, but the richness of wavelet
applications makes such an investment well worth it. This modest work is thus put forward to widen the circle of wavelet literacy. It is important to point out that I am not at all advocating the complete abandonment of all other methods. In a recent article, Fan et al. (1996) discuss local versions of some standard smoothing techniques and show that they provide a good alternative to wavelet methods, and in fact may be preferred in many applications because of their familiarity. This book was written primarily to increase the familiarity of wavelets in data analysis: wavelets are simply another useful tool in the toolbag of applied statisticians and data analysts. The treatment of topics in this book assumes only that the reader is familiar with calculus and linear algebra, with a basic understanding of elementary statistical theory. With this background, this book is essentially self-contained, with other topics (Fourier analysis, L² function space, function estimation, etc.) treated when introduced. A brief overview of L² function space is given as an appendix, along with glossaries of notation and terms. Thus, the material is accessible to a wide audience, including graduate students and advanced undergraduates in mathematics and statistics, as well as those in other disciplines interested in data analysis. Mathematically sophisticated readers can use this reference as quick reading to gain a basic understanding of how wavelets can be used.
Chapter Synopses

The Prologue gives a basic overview of the topic of wavelets and describes their most important features in nonmathematical language. Chapter 1 provides a fundamental introduction to what wavelets are, with brief hints as to how they can be used in practice. Though the results of this chapter apply to general orthogonal wavelets, the material is presented primarily in terms of the simplest wavelet basis: the Haar basis. This greatly simplifies the treatment in introducing wavelet features, and once the basic Haar framework is understood, the ideas are readily extended to smoother wavelet bases. Leaving the treatment of wavelets momentarily, Chapter 2 gives a general introduction to fundamental methods of statistical function estimation in such a way as to lead naturally to basic applications of wavelets. This will of course be review material for readers already familiar with kernel and orthogonal series methods; it is included primarily for the non-specialist. Chapter 3 treats the wavelet versions of the smoothing methods described in Chapter 2, applied to density estimation and nonparametric regression. Chapter 4 returns to describing wavelets, continuing the coverage of Chapter 1. It covers more details of the earlier introduction to wavelets, and treats wavelets in more generality, introducing some of the fundamental properties of wavelet methods: algorithms, filtering, wavelet extension of the Fourier transform, and examples of wavelet families. This chapter is not,
strictly speaking, essential for applying wavelet methods, but it provides the reader with a better understanding of the principles that make wavelets work well in practice. Chapters 5-9 deal with applying wavelet methods to various statistical problems. Chapter 5 describes diagnostic methods essential to a complete data analysis. Chapter 6 discusses the important practical issues that arise in wavelet analysis of real data. Chapter 7 extends and enhances the basic wavelet methods of Chapter 3. Chapter 8 gives an overview of current research in data-dependent wavelet threshold selection. Finally, Chapter 9 provides a basic background on wavelet-related methods which are not explicitly treated in earlier chapters. The information in this book could have been arranged in a variety of orders. If it were intended strictly as a reference book, a natural way to order the information might be to place the chapters dealing primarily with the mathematics of wavelets at the beginning, followed by the statistical application chapters, with the diagnostic chapter last, the smoothing chapter being included as an appendix. Instructors using this book in a classroom might cover the topics roughly in the order given, but with the miscellaneous topics in Chapter 4 distributed strategically within subsequent applications chapters. The current order was carefully selected so as to provide a natural path through wavelet introduction and application to facilitate the reader's first learning of the subject, but with like topics grouped sufficiently close together so that the book will have some value for subsequent reference.
Supplements on the World Wide Web

The figures in this book were mostly generated using the commercial S-Plus software package, some using the S-Plus Wavelet Toolkit, and some using the freely available set of S-Plus wavelet subroutines by Guy Nason, available through StatLib (http://lib.stat.cmu.edu/). To encourage readers' experimentation with wavelet methods and facilitate other applications, I have made available the S-Plus functions for generating most of the pictures in this book over the World Wide Web (this is in lieu of including source code in the text). These will be located both on Birkhauser's web site (http://www.birkhauser.com/books/isbn/0-8176-3864-4/),
and as a link from my personal home page (http://www.stat.sc.edu/~ogden/), which will also contain errata and other information regarding this book. As they become available, new routines for wavelet-based analysis will be included on these pages as well. Though I have only used the S-Plus software, there are many other software packages available, such as WaveLab, an extensive collection of MATLAB-based routines for wavelet analysis which is available free from Stanford's Statistics Department WWW site. Vast amounts of wavelet-related material are available through the WWW,
including technical reports, a wavelet newsletter, Java applets, lecture notes, and other forms of information. The web pages for this book, which will be updated periodically, will also describe and link relevant information sites.
Acknowledgments

This book represents the combination of efforts of many different people, some of whom I will acknowledge here. Thanks are due to Manny Parzen and Charles Chui for their kind words of encouragement at the outset of this project. I gratefully acknowledge Andrew Bruce, Hong-Ye Gao and others at StatSci for making available their S-PLUS Wavelet software. The suggestions and comments by Jon Buckheit, Christian Cenker, Cheng Cheng, and Webster West were invaluable in improving the presentation of the book and correcting numerous errors. I am deeply indebted to each of them. Mike Hilton and Wim Sweldens have the ability to explain difficult concepts in an easily understandable way; my writing of this book has been motivated by their examples in this regard. Carolyn Artin read the entire manuscript and made countless excellent suggestions on grammar and wording. Joe Padgett, John Spurrier, Jim Lynch, and my other colleagues at the University of South Carolina have been immensely supportive and helpful; I thank them as well. Thanks are also due to Wayne Yuhasz and Lauren Lavery at Birkhauser for their support and encouragement of the project. Finally, my deepest thanks go to my family: my wife Christine and daughter Caroline, who stood beside me every word of the way.
PROLOGUE
Why Wavelets?

The development of wavelets is fairly recent in applied mathematics, but wavelets have already had a remarkable impact. A lot of people are now applying wavelets to a lot of situations, and all seem to report favorable results. What is it about wavelets that makes them so popular? What is it that makes them so useful? This prologue will present an overview in broad strokes (using descriptions and analogies in lieu of mathematical formulas). It is intended to be a brief preview of topics to be covered in more detail in the chapters. It might be useful for the reader to refer back to the prologue from time to time, to prevent the possibility of getting bogged down in mathematical detail to the extent that the big picture is lost. The prologue describes the forest; the trees are the subjects of the chapters. Broadly defined, a wavelet is simply a wavy function carefully constructed so as to have certain mathematical properties. An entire set of wavelets is constructed from a single "mother wavelet" function, and this set provides useful "building block" functions that can be used to describe any in a large class of functions. Several different possibilities for mother wavelet functions have been developed, each with its associated advantages and disadvantages. In applying wavelets, one only has to choose one of the available wavelet families; it is never necessary to construct new wavelets from scratch, so there is little emphasis placed on construction of specific wavelets. Roughly speaking, wavelet analysis is a refinement of Fourier analysis. The Fourier transform is a method of describing an input signal (or function) in terms of its frequency components. Consider a simple musical analogy, following Meyer (1993) and others. Suppose someone were to play a sustained three-note chord on an organ.
The Fourier transform of the resulting digitized acoustic signal would be able to pick out the exact frequencies of the three component notes, and the chord could be analyzed by studying the relationships among the frequencies. Suppose the organist plays the same chord for a measure, then abruptly changes to a different chord and sustains that for another measure. Here, classical Fourier analysis becomes confused. It is able to determine the frequencies of all the notes in either chord, but it is unable to distinguish which frequencies belong to the first chord and which are part of the second. Essentially, the frequencies are averaged over the two measures, and the
xiv
WHY WAVELETS?
Fourier reconstruction would sound all frequencies simultaneously, possibly sounding quite dissonant. While usual Fourier methods do a very good job at picking out frequencies from a signal consisting of many frequencies, they are utterly incapable of dealing properly with a signal that is changing over time. This fact has been well-known for years. To increase the applicability of Fourier analysis, various methods such as "windowed Fourier transforms" have been developed to adapt the usual Fourier methods to allow analysis of the frequency content of a signal at each time. While some success has been achieved, these adaptations to the Fourier methods are not completely satisfactory. Windowed transforms can localize simultaneously in time and in frequency, but the amount of localization in each dimension remains fixed. With wavelets, the amount of localization in time and in frequency is automatically adapted, in that only a narrow time-window is needed to examine high-frequency content, but a wide time-window is allowed when investigating low-frequency components. This good time-frequency localization is perhaps the most important advantage that wavelets have over other methods. It might not be immediately clear, however, how this time-frequency localization is helpful in statistics. In statistical function estimation, standard methods (e.g., kernel smoothers or orthogonal series methods) rely upon certain assumptions about the smoothness of the function being estimated. With wavelets, such assumptions are relaxed considerably. Wavelets have a built-in "spatial adaptivity" that allows efficient estimation of functions with discontinuities in derivatives, sharp spikes, and discontinuities in the function itself. Thus, wavelet methods are useful in nonparametric regression for a much broader class of functions. Wavelets are intrinsically connected to the notion of "multiresolution analysis."
That is, objects (signals, functions, data) can be examined using widely varying levels of focus. As a simple analogy, consider looking at a house. The observation can be made from a great distance, at which the viewer can discern only the basic shape of the structure: the pitch of the roof, whether or not it has an attached garage, etc. As the observer moves closer to the building, various other features of the house come into focus. One can now count the number of windows and see where the doors are located. Moving closer still, even smaller features come into clear view: the house number, the pattern on the curtains. Continuing, it is possible even to examine the pattern of the wood grain on the front door. The basic framework of all these views is essentially the same using wavelets. This capability of multiresolution analysis is known as the "zoom-in, zoom-out" property. Thus, frequency analysis using the Fourier decomposition becomes "scale analysis" using wavelets. This means that it is possible to examine features of the signal (the function, the house) of any size by adjusting a scaling parameter in the analysis. Wavelets are regarded by many as primarily a new subject in pure mathematics. Indeed, many papers published on wavelets contain esoteric-looking theorems with complicated proofs. This type of paper might scare away people who are primarily interested in applications, but the vitality of wavelets lies in their applications and the diversity of these applications. The objective of this book is to introduce wavelets with an eye toward data analysis, giving only the mathematics necessary for a good understanding of how wavelets work and a knowledge of how to apply them. Since no wavelet application exists in complete isolation (in the sense that substantial overlap can be found among virtually all applications), we review here some of the ways wavelets have been applied in various fields and consider how specific advantages of wavelets in these fields can be exploited in statistical analysis as well. Certainly, wavelets have an "interdisciplinary" flavor. Much of the early development of the foundations of what is now known as wavelet analysis was led by Yves Meyer, Jean Morlet, and Alex Grossman in France (a mathematician, a geophysicist, and a theoretical physicist, respectively). With their common interest in time-frequency localization and multiresolution analysis, they built a framework and dubbed their creation ondelette (little wave), which became "wavelet" in English. The subject really caught on with the innovations of Ingrid Daubechies and Stephane Mallat, which had direct applicability to signal processing, and a veritable explosion of activity in wavelet theory and application ensued.
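Returning to the organ-chord analogy above, the point can be made concrete in a few lines of code. The sketch below (the specific note frequencies and one-second "measures" are illustrative choices, not taken from the text) concatenates two chords and takes a single Fourier transform of the whole recording: the spectrum shows strong peaks at all six note frequencies at once, with no indication of which chord held which note.

```python
import numpy as np

# Illustrative sketch: two successive chords, one Fourier transform.
fs = 8192                          # sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)      # one second per measure
notes1 = (261.6, 329.6, 392.0)     # C-E-G (C major), assumed frequencies
notes2 = (349.2, 440.0, 523.3)     # F-A-C (F major), assumed frequencies
chord1 = sum(np.sin(2 * np.pi * f * t) for f in notes1)
chord2 = sum(np.sin(2 * np.pi * f * t) for f in notes2)
signal = np.concatenate([chord1, chord2])

# The magnitude spectrum of the entire recording has peaks near all six
# note frequencies simultaneously: the time ordering of the chords is lost.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, 1 / fs)
peaks = freqs[spectrum > 0.5 * spectrum.max()]
print(sorted(set(np.round(peaks))))   # values clustered near all six notes
```

A windowed or wavelet analysis, by contrast, would separate the two measures in time.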
What are Wavelets Used For?

Here, we describe three general fields of application in which wavelets have had a substantial impact, then we briefly explore the relationships these fields have with statistical analysis.
1. Signal processing

Perhaps the most common application of wavelets (and certainly the impetus behind much of their development) is in signal processing. A signal, broadly defined, is a sequence of numerical measurements, typically obtained electronically. This could be weather readings, a radio broadcast, or measurements from a seismograph. In signal processing, the interest lies in analyzing and coding the signal, with the eventual aim of transmitting the encoded signal so that it can be reconstructed with only minimal loss upon receipt. Signals are typically contaminated by random noise, and an important part of signal processing is accounting for this noise. A particular emphasis is on denoising, i.e., extracting the "true" (pure) signal from the noisy version actually observed. This endeavor is precisely the goal in statistical function estimation as well: to "smooth" the noisy data points to obtain an estimate of the underlying function. Wavelets have performed admirably in both of these fields. Signal processors now have new, fast tools at their disposal that are
well-suited for denoising signals, not only those with smooth, well-behaved natures, but also those signals with abrupt jumps, sharp spikes, and other irregularities. These advantages of wavelets translate directly over to statistical data analysis. If signal processing is to be done in "real time," i.e., if the signals are treated as they are observed, it is important that fast algorithms are implemented. It doesn't matter how well a particular denoising technique works if the algorithm is too complex to work in real time. One of the key advantages that wavelets have in signal processing is the associated fast algorithms: faster, even, than the fast Fourier transform.
2. Image analysis

Image analysis is actually a special case of signal processing, one that deals with two-dimensional signals representing digital pictures. Again, typically, random noise is included with the observed image, so the primary goal is again denoising. In image processing, the denoising is done with a specific purpose in mind: to transform a noisy image into a "nice-looking" image. Though there might not be widespread agreement as to how to quantify the "niceness" of a reconstructed image, the general aim is to remove as much of the noise as possible, but not at the expense of fine-scale details. Similarly, in statistics, it is important to those seeking analysis of their data that estimated regression functions have a nice appearance (they should be smooth), but sometimes the most important feature of a data set is a sharp peak or abrupt jump. Wavelets help in maintaining real features while smoothing out spurious ones, so as not to "throw out the baby with the bathwater."
3. Data compression

Electronic means of data storage are constantly improving. At the same time, with the continued gathering of extensive satellite and medical image data, for example, amounts of data requiring storage are increasing too, placing a constant strain on current storage facilities. The aim in data compression is to transform an enormous data set, saving only the most important elements of the transformed data, so that it can be reconstructed later with only a minimum of loss. As an example, Wickerhauser (1994) reports that the United States Federal Bureau of Investigation (FBI) has collected 30 million sets of fingerprints. For these to be digitally scanned and stored in an easily accessible form would require an enormous amount of space, as each digital fingerprint requires about 0.6 megabytes of storage. Wavelets have proven extremely useful in solving such problems, often requiring less than 30 kilobytes of storage space for an adequate representation of the original data, an impressive compression ratio of 20:1.
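The flavor of wavelet compression can be sketched in a few lines. The example below is only an illustration (a hand-rolled Haar transform applied to a made-up piecewise-smooth signal, not the FBI's actual scheme): it keeps just the 32 largest of 1024 wavelet coefficients and still reconstructs the signal closely, because most Haar coefficients of a piecewise-smooth signal are nearly zero.

```python
import numpy as np

def haar_dwt(x):
    """Full Haar decomposition of a length-2^J array: list of detail arrays
    from fine to coarse, with the final single approximation value last."""
    coeffs = []
    s = np.asarray(x, dtype=float)
    while s.size > 1:
        d = (s[0::2] - s[1::2]) / np.sqrt(2)   # detail (difference) part
        s = (s[0::2] + s[1::2]) / np.sqrt(2)   # smooth (average) part
        coeffs.append(d)
    coeffs.append(s)
    return coeffs

def haar_idwt(coeffs):
    """Invert haar_dwt exactly."""
    s = coeffs[-1]
    for d in reversed(coeffs[:-1]):
        x = np.empty(2 * s.size)
        x[0::2] = (s + d) / np.sqrt(2)
        x[1::2] = (s - d) / np.sqrt(2)
        s = x
    return s

# A piecewise-smooth test signal: a jump plus a gentle linear trend.
x = np.concatenate([np.ones(512), np.zeros(512)]) + 0.001 * np.arange(1024)
coeffs = haar_dwt(x)
flat = np.concatenate(coeffs)
threshold = np.sort(np.abs(flat))[-32]   # magnitude of the 32nd largest coefficient
kept = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]
xhat = haar_idwt(kept)
print(np.max(np.abs(x - xhat)))          # small error, despite ~97% of coefficients discarded
```

Note that the jump in the signal survives the compression: the few coefficients describing it are among the largest and are therefore retained.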
linear regression, for example, it is desired to choose the simplest model that represents the data adequately, to achieve a parsimonious representation. With wavelets, a large data set can often be summarized well with only a relatively small number of wavelet coefficients. To summarize, there are three main answers to the question "Why wavelets?":
1. good time-frequency localization, 2. fast algorithms, 3. simplicity of form.
This chapter has spent some time covering Answer 1 and how it is important in statistics. Answer 2 is perhaps more important in pure signal processing applications, but it is certainly valuable in statistical analysis as well. Some brief comments on Answer 3 are in order here. An entire set of wavelet functions is constructed by means of two simple operations on a single prototype function (referred to earlier as the "mother wavelet"): dilation and translation. The prototype function need never be computed when taking the wavelet transform of data. Just as the Fourier transform describes a function in terms of simple functions (sines and cosines), the wavelet transform describes a function in terms of simple wavelet component functions. The nature of this book is expository. Thus, it consists of an introduction to wavelets and descriptions of various applications in data analysis. For many of the statistical problems treated, more than one methodology is discussed. While some discussion of relative advantages and disadvantages of each competing method is in order, ultimately, the specific application of interest must guide the data analyst to choose the method best suited for his/her situation. In statistics and data analysis, there is certainly room for differences of opinion as to which method is most appropriate for a given application, so the discussion of various methods in this book stops short of making specific recommendations on which method is "best," leaving this entirely to the reader to determine. With the basic introduction of wavelets and their applications in this text, readers will gain the necessary background to continue their study of other applications and more advanced wavelet methods. As increasingly more researchers become interested in wavelet methods, the class of problems to which wavelets have application is rapidly expanding. 
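The dilation-and-translation construction mentioned above can be sketched concretely with the Haar mother wavelet (the grid size and the particular indices j and k below are arbitrary choices for illustration): every member of the family is psi_{j,k}(x) = 2^(j/2) psi(2^j x - k), a squeezed and shifted copy of the one prototype.

```python
import numpy as np

def mother_haar(x):
    """The Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where((0 <= x) & (x < 0.5), 1.0,
                    np.where((0.5 <= x) & (x < 1), -1.0, 0.0))

def psi_jk(x, j, k):
    """Dilated (by j) and translated (by k) wavelet: 2^(j/2) psi(2^j x - k)."""
    return 2 ** (j / 2) * mother_haar(2 ** j * np.asarray(x) - k)

# The prototype is never recomputed; the whole family comes from two
# simple operations. Check unit norm and orthogonality numerically.
x = np.linspace(0, 1, 1024, endpoint=False)
f = psi_jk(x, 3, 2)                       # supported on [2/8, 3/8)
print(np.sum(f * f) / 1024)               # approximately 1 (unit norm)
print(np.sum(f * psi_jk(x, 3, 3)) / 1024) # approximately 0 (orthogonal)
```

Smoother wavelet families follow the same recipe; only the prototype changes.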
The References section at the end of this book lists several articles not covered in this book that provide further reading on wavelet methods and applications. There are many good introductory papers on wavelets. Rioul and Vetterli (1991) give a basic introduction focusing on the signal processing uses
of wavelets. Graps (1995) describes wavelets for a general audience, giving some historical background and describing various applications. Jawerth and Sweldens (1994) give a broad overview of practical and mathematical aspects of wavelet analysis. Statistical issues pertaining to the application of wavelets are given in Bock (1992), Bock and Pliego (1992), and Vidakovic and Muller (1994). There have been many books written on the subject of wavelets as well. Some good references are Daubechies (1992), Chui (1992), and Kaiser (1994); these are all at a higher mathematical level than this book. The book by Strang and Nguyen (1996) provides an excellent introduction to wavelets from an engineering/signal processing point of view. Echoing the assertion of Graps (1995), most of the work in developing the mathematical foundations of wavelets has been completed. It remains for us to study their applications in various areas. We now embark upon an exploration of wavelet uses in statistics and data analysis.
CHAPTER ONE

Wavelets: A Brief Introduction
This chapter gives an introductory treatment of the basic ideas concerning wavelets. The wavelet decomposition of functions is related to the analogous Fourier decomposition, and the wavelet representation is presented first in terms of its simplest paradigm, the Haar basis. This piecewise constant Haar system is used to describe the concepts of the multiresolution analysis, and these ideas are generalized to other types of wavelet bases. This treatment is meant to be merely an introduction to the relevant concepts of wavelet analysis. As such, this chapter provides most of the background for the rest of this book. It is important to stress that this book covers only the essential elements of wavelet analysis. Here, we assume knowledge of only elementary linear algebra and calculus, along with a basic understanding of statistical theory. More advanced topics will be introduced as they are encountered.
1.1 The Discrete Fourier Transform
Transformation of a function into its wavelet components has much in common with transforming a function into its Fourier components. Thus, an introduction to wavelets begins with a discussion of the usual discrete Fourier transform. This discussion is not by any means intended to be a complete treatment of Fourier analysis, but merely an overview of the subject to highlight the concepts that will be important in the development of wavelet analysis. While studying heat conduction near the beginning of the nineteenth century, the French mathematician and physicist Jean-Baptiste Fourier discovered that he could decompose any of a large class of functions into component functions constructed of only standard periodic trigonometric functions. Here, we will only consider functions defined on the interval [−π, π]. (If a particular function of interest g is defined instead on a different finite interval [a, b], then it can be transformed via f(2πx/(b − a) − (a + b)π/(b − a)) = g(x) for x ∈ [a, b].) The sine and cosine functions are defined on all of ℝ and have period 2π, so the Fourier decomposition can be thought of either as representing all such periodic functions, or as representing functions defined only on [−π, π] by simply restricting attention to only this interval. Here, we will take the latter approach. The Fourier representation applies to square-integrable functions. Specifically, we say that a function f belongs to the square-integrable function space L²[a, b] if

    ∫_a^b f²(x) dx < ∞.
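As a quick numerical sanity check on this change of variable (with an arbitrary illustrative interval [a, b] = [2, 10]), the map y = 2πx/(b − a) − (a + b)π/(b − a) carries the endpoints a and b onto −π and π, so a function g on [a, b] induces a function f on [−π, π]:

```python
import math

# Arbitrary illustrative interval; any finite [a, b] with a < b works.
a, b = 2.0, 10.0
y = lambda x: 2 * math.pi * x / (b - a) - (a + b) * math.pi / (b - a)

print(y(a))             # -pi
print(y(b))             # pi
print(y((a + b) / 2))   # 0.0: the midpoint of [a, b] maps to the center
```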
Fourier's result states that any function f ∈ L²[−π, π] can be expressed as an infinite sum of dilated cosine and sine functions:

f(x) = (1/2) a_0 + Σ_{j=1}^∞ (a_j cos(jx) + b_j sin(jx)),   (1.1)

for an appropriately computed set of coefficients {a_0, a_1, b_1, ...}.

A word of caution is in order about the representation (1.1). The equality is only meant in the L² sense, i.e.,

lim_{J→∞} ∫_{−π}^{π} ( f(x) − (1/2) a_0 − Σ_{j=1}^J (a_j cos(jx) + b_j sin(jx)) )² dx = 0.

It is possible that f and its Fourier representation differ on a few points (and this is, in fact, the case at discontinuity points). Since this book is concerned primarily with analyzing functions in L² space, this point will usually be neglected hereafter in similar representations. It is important to keep in mind, however, that such an expression does not imply pointwise convergence.

The summation in (1.1) is up to infinity, but a function can be well approximated (in the L² sense) by a finite sum with upper summation limit index J:

S_J(x) = (1/2) a_0 + Σ_{j=1}^J (a_j cos(jx) + b_j sin(jx)).   (1.2)
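To make (1.2) concrete, here is a small sketch (our own Python, not code from the book) that evaluates the partial sum S_J at one or more points, given coefficient arrays; the function name and argument layout are our own choices.

```python
import numpy as np

def partial_sum(x, a, b):
    """Evaluate the truncated Fourier series (1.2).

    a = (a_0, a_1, ..., a_J) and b = (b_1, ..., b_J); x may be a
    scalar or a NumPy array of points in [-pi, pi].
    """
    x = np.asarray(x, dtype=float)
    s = 0.5 * a[0] * np.ones_like(x)
    for j in range(1, len(a)):
        s = s + a[j] * np.cos(j * x) + b[j - 1] * np.sin(j * x)
    return s

# With a_0 = 1, a_1 = 1, b_1 = 0 the sum at x = 0 is 1/2 + cos(0):
print(partial_sum(0.0, [1.0, 1.0], [0.0]))  # 1.5
```

As J grows, S_J converges to f in the L² sense, which is the convergence illustrated in Figure 1.2.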
Wavelets: A Brief Introduction
Figure 1.1: The first three sets of basis functions for the discrete Fourier transform

This Fourier series representation is extremely useful in that any L² function can be written in terms of very simple building-block functions: sines and cosines. This is due to the fact that the set of functions {sin(j·), cos(j·), j = 1, 2, ...}, together with the constant function, forms a basis for the function space L²[−π, π]. We now examine the appearance of some of these basis functions and how they combine to reconstruct an arbitrary L² function. Figure 1.1 plots the first three pairs of Fourier basis elements (not counting the constant function): sine and cosine functions dilated by j for j = 1, 2, 3. Increasing the dilation index j has the effect of increasing the function's frequency (and thus decreasing its period).

Next, we examine the finite-sum Fourier representation of a simple example function, as this will lead into the discussion of wavelets in the next section. The truncated Fourier series representations (1.2) for J = 1, 2, and 3 are displayed in Figure 1.2 for the piecewise linear function

f(x) = {  x + π,   −π ≤ x ≤ −π/2,
       {  π/2,     −π/2 < x ≤ π/2,        (1.3)
       {  π − x,   π/2 < x ≤ π.

Figure 1.2: An example function and its Fourier sum representations
Figure 1.2 shows the original example function and the three representations of it. As the summation limit J gets larger, more terms are included in the reconstruction, so the resulting sum does a better job of approximating f. In this simple example, using three pairs of basis functions in the reconstruction gives a fairly good representation of the original function. Of course, even this good representation could be improved by allowing J to increase even more.

The next issue to consider is calculation of the coefficients {a_0, a_1, b_1, a_2, b_2, ...}. The Fourier coefficients can be computed by taking the inner product of the function f with the corresponding basis functions:

a_j = (1/π) ⟨f, cos(j·)⟩ = (1/π) ∫_{−π}^{π} f(x) cos(jx) dx,   j = 0, 1, ...,   (1.4)

b_j = (1/π) ⟨f, sin(j·)⟩ = (1/π) ∫_{−π}^{π} f(x) sin(jx) dx,   j = 1, 2, ....   (1.5)
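As a check on (1.4) and (1.5), the coefficients of the example function (1.3) can be approximated by numerical integration; the sketch below (our own Python, using a simple midpoint rule, not code from the book) recovers the values that appear in Table 1.1.

```python
import numpy as np

def example_f(x):
    """The piecewise linear example function (1.3)."""
    return np.where(x <= -np.pi / 2, x + np.pi,
                    np.where(x <= np.pi / 2, np.pi / 2, np.pi - x))

# Midpoint rule on [-pi, pi].
n = 200_000
w = 2 * np.pi / n
x = -np.pi + (np.arange(n) + 0.5) * w

def a_coef(j):  # (1.4)
    return (example_f(x) * np.cos(j * x)).sum() * w / np.pi

def b_coef(j):  # (1.5)
    return (example_f(x) * np.sin(j * x)).sum() * w / np.pi

print(a_coef(0))  # close to 3*pi/4
print(a_coef(1))  # close to 2/pi
print(b_coef(1))  # essentially 0 (f is even)
```

Since f is even, every b_j vanishes, and the a_j values match the entries of Table 1.1.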
The coefficients a_j and b_j are said to measure the "frequency content" of the function f at the level of resolution j. Examining the set of Fourier coefficients can aid in understanding the nature of the corresponding function. The coefficients in (1.4) and (1.5) are given in terms of the L² inner product of two functions:

⟨f, g⟩ = ∫ f(x) g(x) dx,

where the integral is taken over the appropriate subset of ℝ. The L² norm of a function is defined to be

‖f‖ = √⟨f, f⟩ = ( ∫ f²(x) dx )^{1/2}.
Let us return to our earlier example and look at some of the coefficients, which are given in Table 1.1. First, note that all the b_j's (corresponding to the sine basis functions) are zero. The reason for this is that the example function is an even function, so the inner product of f with each of the odd sine functions is zero. From inspection of Table 1.1, we note that the even-index cosine coefficients are also zero (for j ≥ 4) and that odd-index coefficients are given by a_j = 2/(j²π), with the coefficients a_j becoming small quickly as j gets large. This indicates that most of the frequency content of this example function is concentrated at low frequencies, which can be seen in the reconstructions in Figure 1.2. The only relatively large coefficients are a_0, a_1, a_2, and a_3, so the third reconstruction (J = 3) does a very good job at piecing f back together. By increasing J further, the approximation will only improve (in the L² sense), but the amount of the improvement will be smaller.
Table 1.1: Fourier coefficients for the example function.

  j   a_j        b_j   |   j   a_j        b_j
  0   3π/4             |   5   2/(25π)    0
  1   2/π        0     |   6   0          0
  2   −1/π       0     |   7   2/(49π)    0
  3   2/(9π)     0     |   8   0          0
  4   0          0     |   9   2/(81π)    0
The representation (1.1) holds uniformly for all x ∈ [−π, π] under certain restrictions on f (for instance, if f has one continuous derivative, f(π) = f(−π), and f′(π) = f′(−π); see, e.g., Dym and McKean (1972)). The example function in Figure 1.2 has discontinuities in its derivative, but the Fourier representation will converge at all other points. For any L²[−π, π] function, the truncated representation (1.2) converges in the L² sense:

‖f − S_J‖ → 0

as J → ∞. In practical terms, this means that many functions can be described using only a handful of coefficients. The extension of this to wavelets will become clear in the following section.

Though not mentioned previously, the Fourier basis has an important property: it is an orthogonal basis.

Definition 1.1  Two functions f_1, f_2 ∈ L²[a, b] are said to be orthogonal if ⟨f_1, f_2⟩ = 0.

The orthogonality of the Fourier basis can be seen through orthogonality properties inherent in the sine and cosine functions:
⟨sin(m·), sin(n·)⟩ = ∫_{−π}^{π} sin(mx) sin(nx) dx = { 0, m ≠ n;  π, m = n > 0 },

⟨cos(m·), cos(n·)⟩ = ∫_{−π}^{π} cos(mx) cos(nx) dx = { 0, m ≠ n;  π, m = n > 0;  2π, m = n = 0 },

⟨sin(m·), cos(n·)⟩ = ∫_{−π}^{π} sin(mx) cos(nx) dx = 0 for all m, n ≥ 0.

The three expressions can be verified easily by applying the standard product-to-sum trigonometric identities for sin α sin β, cos α cos β, and sin α cos β.

A minor modification of the sine and cosine functions will yield an orthonormal basis with another important property.

Definition 1.2  A sequence of functions {f_j} is said to be orthonormal if the f_j's are pairwise orthogonal and ‖f_j‖ = 1 for all j.
The orthogonality requirement is already satisfied with the sine and cosine functions. Defining g_j(x) = π^{−1/2} sin(jx) for j = 1, 2, ... and h_j(x) = π^{−1/2} cos(jx) for j = 1, 2, ..., with the constant function h_0(x) = 1/√(2π) on x ∈ [−π, π], makes the set of functions {h_0, g_1, h_1, ...} orthonormal as well. Normalizing the basis in this manner allows us to write the Fourier representation (1.1), along with the expressions for computing the coefficients (1.4) and (1.5), as

f(x) = ⟨f, h_0⟩ h_0(x) + Σ_{j=1}^∞ ( ⟨f, g_j⟩ g_j(x) + ⟨f, h_j⟩ h_j(x) ).

Definition 1.3  A sequence of functions {f_j} is said to be a complete orthonormal system (CONS) if the f_j's are pairwise orthogonal, ‖f_j‖ = 1 for each j, and the only function orthogonal to each f_j is the zero function.

Thus defined, the set {h_0, g_j, h_j : j = 1, 2, ...} is a complete orthonormal system for L²[−π, π]. The Fourier basis is not the only CONS for intervals. Others include Legendre polynomials and wavelets, the latter to be studied in detail.
1.2  The Haar System
The extension from Fourier analysis to wavelet analysis will be made via the Haar basis. The Haar function is a bona fide wavelet, though it is not used much in current practice. The primary reason for this will become apparent. Nevertheless, the Haar basis is an excellent place to begin a discussion of wavelets. This section will begin with a definition of the Haar wavelet and go on to derive the Haar scaling function. Following this development, we will begin with the Haar scaling function and then rederive the Haar wavelet. Of course, terms like "wavelet" and "scaling function" have not yet been defined. Their meaning will become clear as we progress through a discussion of issues associated with wavelets. The Haar wavelet system provides a paradigm for all wavelets, so it is important to keep in mind that the simple developments in this chapter have much broader application: all the principles discussed in this chapter pertaining to the Haar wavelet hold generally for all orthogonal wavelets.

The Haar wavelet is nothing new, having been developed in 1910 (Haar, 1910), long before anyone began speaking of "wavelets." The Haar function, given by

ψ(x) = {   1,   0 ≤ x < 1/2,
       {  −1,   1/2 ≤ x < 1,        (1.6)
       {   0,   otherwise,
Figure 1.3: The Haar function
is better expressed by a picture, shown in Figure 1.3. The Haar function is not particularly awe-inspiring, either in appearance or in expression (1.6). It is piecewise constant over intervals of length one-half. What can be so important about these wavelets?

The Haar function ψ defined above is called a mother wavelet. The mother wavelet "gives birth" to an entire family of wavelets by means of two operations: dyadic dilations and integer translations. Let j denote the dilation index and k represent the translation index. Each wavelet born of the mother wavelet will be indexed by both of these indices:

ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k),

for integer-valued j and k. As in the Fourier series, dilation by larger j "compresses" the function on the x-axis. Altering k has the effect of sliding the function along the x-axis. Some of these dilated and translated wavelet functions are plotted in Figure 1.4. The primary importance of this set of functions is expressed in the following theorem. Another very useful property is brought out in its proof, which is sketched here informally only to help in the understanding of the Haar system and to bring out some new notation in a natural way. Though the ideas in this development are not particularly difficult, the notation gets fairly complex, so repeated reading of this section might help with understanding. Eventually, the Haar system will be extended to general wavelet bases, and all the same principles will apply.
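The orthonormality of the ψ_{j,k} can be checked numerically. The sketch below (our own Python, not code from the book) samples at the midpoints of a fine dyadic grid, so the integrals of these piecewise constant functions are exact up to rounding.

```python
import numpy as np

def haar_psi(x):
    """The Haar mother wavelet (1.6)."""
    x = np.asarray(x, dtype=float)
    return (np.where((x >= 0) & (x < 0.5), 1.0, 0.0)
            - np.where((x >= 0.5) & (x < 1.0), 1.0, 0.0))

def psi_jk(x, j, k):
    """Dilated and translated wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2.0 ** (j / 2) * haar_psi(2.0 ** j * x - k)

# Midpoints of a dyadic grid on [0, 1); no breakpoint of any psi_{j,k}
# with modest j falls on a midpoint, so sums give exact integrals.
h = 2.0 ** -12
x = (np.arange(int(1 / h)) + 0.5) * h

def inner(f, g):
    return (f * g).sum() * h

print(inner(psi_jk(x, 2, 1), psi_jk(x, 2, 1)))  # 1.0  (unit norm)
print(inner(psi_jk(x, 0, 0), psi_jk(x, 1, 0)))  # essentially 0 (orthogonal)
```

The two cases in the proof sketch below (same j with disjoint supports, and different j with one wavelet constant on the other's support) both show up as zero inner products here.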
Figure 1.4: Haar wavelet examples
Theorem 1.1  The set {ψ_{j,k}, j, k ∈ ℤ} constitutes a complete orthonormal system for L²(ℝ).

The informal proof of this theorem generally follows that given in Chapter 1 of Daubechies (1992). This is the only "proof" given in this book, included here because it leads into a discussion of the principles of wavelet analysis. To establish the theorem's result, it is necessary to show two things:

1. The set is orthonormal;
2. Any function f ∈ L²(ℝ) can be approximated arbitrarily well by a finite linear combination of the ψ_{j,k}'s.

First, we note that each Haar wavelet satisfies

∫_{−∞}^{∞} ψ_{j,k}(x) dx = 0

for all j, k ∈ ℤ. To show (1), note that the support of the wavelet function ψ_{j,k} is

supp ψ_{j,k} = [2^{−j}k, 2^{−j}(k + 1)].

Two wavelets with the same dilation index j but differing k can never have overlapping support and are therefore orthogonal. If two wavelets have different dilation indices, say j′ < j, then supp ψ_{j,k} is contained in a region in which the other wavelet is constant, so they are also orthogonal. Since ‖ψ_{j,k}‖ = 1 for all j, k ∈ ℤ, the set is orthonormal.
Figure 1.5: A function and its approximations of increasing resolution. At each step, the width of the constant intervals is halved.

To establish (2) above, we first approximate any L²(ℝ) function f by a function with compact support: since

‖f − f·1_{[−2^{J_1}, 2^{J_1})}‖ → 0

as J_1 → ∞, we can approximate f arbitrarily well in the L² sense by choosing an integer J_1 to be large; the first approximation of f is thus the restriction of f to the interval [−2^{J_1}, 2^{J_1}), which we will denote f|_{[−2^{J_1}, 2^{J_1})}. This first approximation is approximated further by a function that is piecewise constant over all intervals of the form [ℓ2^{−J_0}, (ℓ+1)2^{−J_0}), where the integer J_0 is chosen to be large enough to make the approximation as good as desired. (This approximation of a smooth function by a piecewise constant function is illustrated in Figure 1.5.) Since these approximations can be made for any function in L²(ℝ), we now restrict attention to such piecewise constant functions with compact support.

Let f^{J_0} represent a function that is piecewise constant on intervals of length 2^{−J_0} as described above, which is understood to have support [−2^{J_1}, 2^{J_1}]. Then let f^{J_0}_ℓ represent the constant value of the function on the interval [ℓ2^{−J_0}, (ℓ+1)2^{−J_0}), i.e.,

f^{J_0}|_{[ℓ2^{−J_0}, (ℓ+1)2^{−J_0})} = f^{J_0}_ℓ.
It is now possible to write f^{J_0} as the sum of two functions:

f^{J_0} = f^{J_0−1} + g^{J_0−1},   (1.7)

where f^{J_0−1} is an approximation to f^{J_0} that is piecewise constant over intervals of length 2^{−(J_0−1)}, twice as large as before, i.e.,

f^{J_0−1}|_{[ℓ2^{−(J_0−1)}, (ℓ+1)2^{−(J_0−1)})} = constant = f^{J_0−1}_ℓ.

The values of this coarser approximation to f^{J_0} are obtained by averaging the two corresponding constant values of the function f^{J_0}:

f^{J_0−1}_ℓ = (1/2) ( f^{J_0}_{2ℓ} + f^{J_0}_{2ℓ+1} ).

We can now define a "detail function" g^{J_0−1}, which is piecewise constant over the same intervals as those for f^{J_0}. Following the subscripting conventions established earlier, it follows that

g^{J_0−1}_{2ℓ} = f^{J_0}_{2ℓ} − f^{J_0−1}_ℓ = (1/2) ( f^{J_0}_{2ℓ} − f^{J_0}_{2ℓ+1} )

and

g^{J_0−1}_{2ℓ+1} = f^{J_0}_{2ℓ+1} − f^{J_0−1}_ℓ = −(1/2) ( f^{J_0}_{2ℓ} − f^{J_0}_{2ℓ+1} ).

This detail function, which is constant over intervals of length 2^{−J_0}, is the part that must be added to the coarser approximation f^{J_0−1} to get the finer approximation f^{J_0}. The decomposition in (1.7) is illustrated in Figure 1.6.

Using the Haar function (1.6), we can now get a useful expression for the detail function g^{J_0−1} in terms of dilated and translated Haar wavelets:

g^{J_0−1}(x) = Σ_ℓ d_{J_0−1,ℓ} ψ_{J_0−1,ℓ}(x).

Hence the original piecewise constant function can be written

f^{J_0} = f^{J_0−1} + Σ_ℓ d_{J_0−1,ℓ} ψ_{J_0−1,ℓ},

where

d_{J_0−1,ℓ} = 2^{−(J_0+1)/2} ( f^{J_0}_{2ℓ} − f^{J_0}_{2ℓ+1} ).   (1.8)
Figure 1.6: The decomposition of a piecewise constant approximation function into a coarser approximation and a detail function
The approximation function f^{J_0−1} can be broken down again, giving

f^{J_0} = f^{J_0−1} + g^{J_0−1} = f^{J_0−2} + g^{J_0−2} + g^{J_0−1}.

Note that f^{J_0−2} has the same support as f^{J_0} but is piecewise constant over wider intervals: [ℓ2^{−(J_0−2)}, (ℓ+1)2^{−(J_0−2)}). Using the wavelet representation of g^{J_0−2}, the function f^{J_0} can be written

f^{J_0} = f^{J_0−2} + Σ_ℓ d_{J_0−2,ℓ} ψ_{J_0−2,ℓ} + Σ_ℓ d_{J_0−1,ℓ} ψ_{J_0−1,ℓ},

where the d_{j,k}'s may be computed according to (1.8).
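One step of this averaging-and-differencing scheme is easy to carry out on a vector of fine-scale constant values f^{J_0}_ℓ. The sketch below (our own Python; the wavelet coefficients d_{j,k} differ from these half-differences only by the power-of-two normalization in (1.8)) splits such a vector into coarse averages and detail values and then reconstructs it exactly.

```python
import numpy as np

def haar_split(fine):
    """One Haar step: pairwise averages (coarse) and half-differences (detail)."""
    fine = np.asarray(fine, dtype=float)
    coarse = 0.5 * (fine[0::2] + fine[1::2])
    detail = 0.5 * (fine[0::2] - fine[1::2])
    return coarse, detail

def haar_merge(coarse, detail):
    """Invert haar_split: f_{2l} = coarse + detail, f_{2l+1} = coarse - detail."""
    fine = np.empty(2 * len(coarse))
    fine[0::2] = coarse + detail
    fine[1::2] = coarse - detail
    return fine

f = np.array([4.0, 2.0, 5.0, 5.0])
c, d = haar_split(f)
print(c)                  # [3. 5.]
print(d)                  # [1. 0.]
print(haar_merge(c, d))   # [4. 2. 5. 5.]
```

Applying haar_split repeatedly to the coarse output is exactly the recursion f^{J_0} = f^{J_0−2} + g^{J_0−2} + g^{J_0−1} described above.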
Continuing in this way, we obtain

f^{J_0} = f^{−J_1} + Σ_{j=−J_1}^{J_0−1} Σ_k d_{j,k} ψ_{j,k}.

The coarsest approximation f^{−J_1} has two constant pieces: f^{−J_1}|_{[0, 2^{J_1})} = f^{−J_1}_0, the average of f^{J_0} over [0, 2^{J_1}), and f^{−J_1}|_{[−2^{J_1}, 0)} = f^{−J_1}_{−1}, the average of f^{J_0} over [−2^{J_1}, 0). This represents a fundamental concept of wavelet analysis: breaking a function down into a very coarse approximation, with an ordered sequence of detail functions {g^{−J_1}, g^{−J_1+1}, ..., g^{J_0−1}} making up the difference.

Even though the entire support of f^{J_0} has been represented, it is possible to double the support of the approximation to f^{J_0}, say from −2^{J_1+1} to 2^{J_1+1}. Then f^{−J_1} can be broken down as well: f^{−J_1} = f^{−(J_1+1)} + g^{−(J_1+1)}, where the coarser approximation f^{−(J_1+1)} and the detail function g^{−(J_1+1)} are obtained by the same averaging and differencing operations as before. Repeating this process K times results in

f^{J_0} = f^{−(J_1+K)} + Σ_{j=−(J_1+K)}^{J_0−1} Σ_k d_{j,k} ψ_{j,k}.

Using only the sequence of detail functions to approximate f^{J_0}, we can compute the resulting L² error of approximation:

‖ f^{J_0} − Σ_{j=−(J_1+K)}^{J_0−1} Σ_k d_{j,k} ψ_{j,k} ‖² = ‖ f^{−(J_1+K)} ‖².

This error can be made as small as may be desired by choosing K large enough. Since the approximation in the previous expression is written only in terms of dilated and translated ψ functions, (2) is established and the "proof" is complete. □

This "proof" brings out an important property of wavelets, that of the multiresolution analysis, which is discussed in the next section.
Multiresolution Analysis

In the proof of Theorem 1.1, it was seen that any function f ∈ L²(ℝ) can be approximated by a piecewise constant function f^j, and that as j gets larger, the approximation gets better (at least in the L² sense). Figure 1.5 illustrates a smooth function and three such approximations. Once the principles of the multiresolution analysis are presented, the example function from Section 1.1 will also be approximated using piecewise constant functions, to allow comparison with the Fourier series approximations in Section 1.1.

From the proof of Theorem 1.1, it is seen that, using the Haar function and the corresponding piecewise constant approximations, for each level j, one can construct f^j, an approximation of the original function. This approximation can be written as the sum of the next coarser approximation f^{j−1} and a detail function g^{j−1}. As the index j runs from small to large, the corresponding approximations run from coarse to fine. Each detail function g^j can be written as a linear combination of the corresponding ψ_{j,k} functions. This illustrates the basic principles of multiresolution analysis, which will be discussed more formally later in this section.

To make this argument more rigorous, define a function space V_j, j ∈ ℤ, to be

V_j = { f ∈ L²(ℝ) : f is piecewise constant on [k2^{−j}, (k+1)2^{−j}), k ∈ ℤ }.

Then the sequence of spaces (V_j)_{j∈ℤ} represents a ladder of subspaces of increasing resolution (as j increases). Each subspace V_j consists of functions that are piecewise constant over intervals of exactly half the length of those for V_{j−1} (and twice the length of those for V_{j+1}). This sequence of subspaces possesses the following properties:

1. ⋯ ⊂ V_{−2} ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ V_2 ⊂ ⋯;
2. ∩_{j∈ℤ} V_j = {0}, and the closure of ∪_{j∈ℤ} V_j is L²(ℝ);
3. f ∈ V_j if and only if f(2·) ∈ V_{j+1};
4. f ∈ V_0 implies f(· − k) ∈ V_0 for all k ∈ ℤ.

The third property demonstrates that each space V_j is a scaled version of the original space V_0. In the proof of the theorem, it was seen that each approximation can be written as the sum of a coarser approximation and a detail function. Using the notation P^j f to denote the projection of a function f
onto the space V_j (the "best" approximation to f in V_j), this is expressed

P^j f = P^{j−1} f + g^{j−1}.

The detail function g^{j−1} (representing the "residual" between two approximations) can be written in terms of dilated and translated wavelets, giving

P^j f = P^{j−1} f + Σ_{k∈ℤ} ⟨f, ψ_{j−1,k}⟩ ψ_{j−1,k}.   (1.9)
It is important to note that {ψ_{j,k}, k ∈ ℤ} is not a CONS in V_j, but is actually a set of functions that are orthogonal to each function in V_j. The significance of this will be discussed more later in this section. As in the previous section, the decomposition can be extended recursively:

P^j f = P^{J_0} f + Σ_{ℓ=J_0}^{j−1} g^ℓ = P^{J_0} f + Σ_{ℓ=J_0}^{j−1} Σ_k ⟨f, ψ_{ℓ,k}⟩ ψ_{ℓ,k}.
The concept of multiresolution analysis relates back to wavelets by noting that whenever there is a sequence of spaces (V_j)_{j∈ℤ} that satisfies the four properties above along with

5. There exists a function φ ∈ V_0 such that the set {φ_{0,k} = φ(· − k), k ∈ ℤ} constitutes an orthonormal basis for V_0,

then there exists a function ψ such that (1.9) is true! In the Haar case considered in this chapter, it is clear that one choice for φ is

φ(x) = 1_{[0,1)}(x),   (1.10)

where 1_A(·) is the indicator function of the set A. The function φ is called the scaling function since its dilates and translates constitute orthonormal bases for all V_j subspaces, which are simply scaled versions of V_0. The scaling function is also referred to as the father wavelet. Since this concept will be used quite a bit throughout the rest of this book, a formal definition is in order.

Definition 1.4  Closed subspaces (V_j)_{j∈ℤ} that satisfy properties (1)-(5) above are said to form a multiresolution analysis (MRA) of L²(ℝ).
If a function φ can be used to form spaces

V_j = span{φ_{j,k}, k ∈ ℤ}

such that (V_j)_{j∈ℤ} constitute an MRA, then the (scaling) function φ is said to generate a multiresolution analysis.
The Wavelet Representation

In the Haar example, it is clear that the subspace sequence (V_j)_{j∈ℤ} generated by the function φ defined in (1.10) satisfies properties (1)-(5) above. In fact, for any arbitrary (not necessarily piecewise constant) scaling function φ which generates a sequence of subspaces V_j satisfying the five properties above, it is possible to construct a "wavelet" function ψ so that (1.9) holds. To illustrate this fact, we will start with the Haar scaling function and show how the Haar wavelet can be derived from it. Here, we begin with the Haar scaling function as defined in (1.10), with

φ_{j,k}(x) = 2^{j/2} φ(2^j x − k),

defining each space V_j to be the span of the set of functions {φ_{j,k}, k ∈ ℤ}. By previous arguments, it is clear that properties (1)-(5) hold and that the spaces discussed in Section 1.2 exactly correspond with those just defined. The projection of an L²(ℝ) function onto an approximation space V_j is done by writing it in terms of the appropriately dilated and translated scaling functions:

P^j f = Σ_k c_{j,k} φ_{j,k}.   (1.11)

Since the set {φ_{j,k}, k ∈ ℤ} is an orthonormal basis for V_j, the scaling function coefficients can be computed:

c_{j,k} = ⟨f, φ_{j,k}⟩.

f(x) = lim_{λ→0} ( F(x + λ) − F(x − λ) ) / (2λ).

For a suitably chosen λ, a natural estimator of the density would result from replacing F in the expression above with the empirical distribution function and disregarding the limit:
f̂(x) = (1/(2λ)) ( F̂(x + λ) − F̂(x − λ) ) = (1/(2λn)) · #{ X_i ∈ (x − λ, x + λ] }.   (2.4)
We will refer to this expression as the "naive" estimator. Note that for any x, this estimator counts only the points that lie within a bandwidth λ of x. The naive estimator can be written in another form by defining a particular weight function or kernel function

K(x) = { 1/2,   −1 < x ≤ 1,        (2.5)
       { 0,     otherwise.

Using this kernel function, (2.4) can be written

f̂(x) = (1/(nλ)) Σ_{i=1}^n K( (x − X_i)/λ ).   (2.6)
The resolution of this estimator can be adjusted by changing the bandwidth λ, with a small choice of λ giving a narrow "window" with more localization, and a larger choice of λ giving a wider window and less localization. The idea of giving weight only to data points in the vicinity of x is a very natural one, but it needs to be refined slightly for the estimator (2.6) to be really useful. The problem with the naive estimator is that the "all-or-nothing" nature of the associated weight function makes for a jagged estimator: as the reference point x is moved continuously along the real line ℝ, data points are abruptly included or excluded from the domain of the weight function, giving sharp discontinuities in the resulting estimator f̂(x).
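A direct Python sketch of the kernel estimator (2.6) (our own code, not from the book; with the boxcar kernel (2.5) it reproduces the naive estimator (2.4)):

```python
import numpy as np

def boxcar(u):
    """The kernel (2.5): 1/2 on (-1, 1], zero elsewhere."""
    return np.where((u > -1) & (u <= 1), 0.5, 0.0)

def kde(x, data, lam, kernel=boxcar):
    """Kernel density estimate (2.6) at the points x with bandwidth lam."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - data[None, :]) / lam
    return kernel(u).sum(axis=1) / (len(data) * lam)

data = np.array([-0.4, -0.1, 0.0, 0.2, 1.3])
est = kde(0.0, data, lam=0.5)
print(est)  # [0.8]: 4 of the 5 points fall near 0, so 4/(2*0.5*5) = 0.8
```

Swapping in a smooth kernel for boxcar removes the jagged jumps described above, since the estimate is then a sum of smooth bumps.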
DENSITY ESTIMATION
This is easily remedied by considering smoother weight functions in place of the piecewise constant kernel (2.5). A suitable kernel function should satisfy

∫_{−∞}^{∞} K(x) dx = 1,

so it is common to use probability density functions as kernels, typically those that are symmetric about zero. The Gaussian kernel (normal pdf) is a popular choice:

K(x) = (1/√(2π)) e^{−x²/2}.

Other kernel functions in common use include the triangular kernel

K(x) = 1 − |x|,

the Epanechnikov kernel

K(x) = (3/4)(1 − x²),

and the biweight kernel

K(x) = (15/16)(1 − x²)²,

each of these last three defined to be zero outside of [−1, 1]. Figure 2.2 gives a plot of these four kernel functions. For kernel density estimation, it is not required that the kernel function be nonnegative. In fact, kernels that take on negative values in places are often used to reduce the asymptotic bias of the resulting estimator. Using such higher-order kernels was originally considered by Parzen (1962) and Bartlett (1963); an overview of these methods is given in Chapter 3 of Silverman (1986). In practice, such kernels are often avoided, since they can occasionally give density estimates that are negative in places.

As mentioned before, the naive estimator (2.4) was dismissed because of its jagged appearance. This jagged nature is inherited from the shape of its boxy weight function. In the same way, when a smoother kernel function is used, the resulting estimator is also smooth, since it is simply a sum of smooth functions. Analogous to the selection of a binwidth in constructing a histogram, the most important factor in determining the appearance of the density estimator is the choice of bandwidth.
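Each of these kernels integrates to one, which can be confirmed numerically; this quick check is our own Python, not code from the book.

```python
import numpy as np

kernels = {
    "gaussian":     lambda u: np.exp(-u * u / 2) / np.sqrt(2 * np.pi),
    "triangular":   lambda u: np.clip(1 - np.abs(u), 0, None),
    "epanechnikov": lambda u: 0.75 * np.clip(1 - u * u, 0, None),
    "biweight":     lambda u: (15 / 16) * np.clip(1 - u * u, 0, None) ** 2,
}

# Midpoint rule on [-10, 10]; the Gaussian's mass outside is negligible,
# and the other three kernels vanish outside [-1, 1].
h = 5e-5
u = -10 + (np.arange(400_000) + 0.5) * h
for name, K in kernels.items():
    print(name, round(K(u).sum() * h, 6))  # each kernel has total mass 1.0
```

The clipping at zero handles the "defined to be zero outside [−1, 1]" convention for the three compactly supported kernels.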
Basic Smoothing Techniques
Figure 5.15: Q-Q plots for empirical wavelet coefficients for the noisy blocky function and for white noise
Coefficients corresponding to significant signal typically lie off the line formed by the "small" coefficients. Figure 5.15 displays the Q-Q plots associated with the noisy blocky function in Figure 5.11 and the white noise sequence in Figure 5.13. It is clear from this plot that there is significant change in the function associated with the first data set, and the number of significant coefficients can be approximated by counting the number of points off the line. The coefficients from the second data set behave just as a random sample of normal random variables should, so there is no evidence of any significant wavelet coefficients.

Another standard method for graphical data analysis is the boxplot. Since the wavelet decomposition orders coefficients according to their dilation index, it is natural to consider the sets of coefficients grouped by scale. A set of side-by-side boxplots for the wavelet coefficients of the five highest levels associated with the noisy function in Figure 5.11 is plotted in Figure 5.16. For comparison purposes, the boxplots for the coefficients of the white noise sequence in Figure 5.13 are also plotted. The corresponding wavelet levels are listed below each boxplot. Recall that there are 2^j coefficients at level j. For a set of pure noise, all the coefficients should have the same distribution, which seems to hold for the right-hand plot in Figure 5.16. It is clear that this is not the case for the left-hand plot, however, as can be seen by the widely differing widths of the five boxplots. This is related to the sparsity of representation idea inherent in wavelet analysis: at high levels, there are many coefficients, but only a very few correspond to actual signal; the rest are just noise. At low levels, however, a larger proportion of coefficients is needed to represent the function adequately. This is illustrated well by the first plot in Figure 5.16.
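The pairs behind such a normal Q-Q plot are easy to compute directly. This sketch is our own Python (not code from the book), using the common plotting positions (i − 0.5)/n and the standard library's NormalDist for the theoretical quantiles.

```python
import numpy as np
from statistics import NormalDist

def qq_pairs(coefs):
    """Return (theoretical N(0,1) quantiles, sorted coefficients)."""
    y = np.sort(np.asarray(coefs, dtype=float))
    n = len(y)
    q = np.array([NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])
    return q, y

rng = np.random.default_rng(0)
noise = rng.standard_normal(512)
q, y = qq_pairs(noise)
# Pure noise hugs the line y = x; large signal coefficients sit far off it.
print(np.corrcoef(q, y)[0, 1])  # very close to 1
```

For coefficients containing signal, the few large values break away from this line, which is exactly what the left panel of Figure 5.15 shows.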
OTHER PLOTS FOR DATA ANALYSIS
f̂(ω) = Σ_{j=−(n−1)}^{n−1} W(j) R(j) cos(jω).

Here, the function W is called the lag window, since it controls the averaging over various lags of the covariance function. (This dual representation in both the time and the frequency domains is perhaps reminiscent of the discussion in Section 4.3.) The lag window and the spectral window are in fact a Fourier pair: the spectral window is the Fourier transform of the lag window, and the lag window is the inverse Fourier transform of the spectral window. Wei (1990) gives several examples of lag-spectral window pairs.

Since the spectral density is typically thought to be "mostly smooth," with the possible exception of one or more very sharp spikes, it is natural to desire a spatially adaptive procedure such as that offered by wavelet shrinkage. The application is not straightforward, however. Wahba (1980) suggests taking the log of the periodogram to stabilize the variance. Smoothing the log periodogram
Other Applications
Only those coefficients with absolute value exceeding the threshold will be nonzero after the shrinking. The computational effort involved with minimizing the SURE criterion is light: if the observations are re-ordered in order of increasing |x_k|, then the criterion function SURE(t; x) is strictly increasing between adjacent values of the |x_k|'s. It is also strictly increasing between 0 and the smallest |x_k|, as well as for t > max_k |x_k|, so the minimum must occur at 0 or at one of the |x_k|'s. Thus, the criterion must only be computed for d + 1 values of t, and, in practice, there is no need to order the |x_k|'s.

Figure 8.1 illustrates this method in action. This figure displays plots of √n |w̃_{j,k}| for levels 10, 9, and 8 for the blocky function shown in Figure 5.10, normalized to have signal-to-noise ratio 5 with n = 2048. Signal-to-noise ratio for a set of means μ_1, ..., μ_d with additive noise is defined to be the ratio of the standard deviation of the mean vector to the standard deviation of the noise. In the first column of plots, the absolute values of √n times the coefficients are plotted in increasing order. In the second column, the SURE criterion is plotted as a function of t, evaluated at each t = √n |w̃_{j,k}| for the current level. The dashed line in the first column of plots indicates the value of the threshold selected by the SURE criterion; all points below this line will be shrunk to zero, and all points above will be shrunk toward zero by that amount.
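This threshold search is sketched below in Python (our own implementation, not code from the book). We use the standard soft-thresholding form of Stein's unbiased risk estimate for unit noise variance, SURE(t; x) = d − 2·#{k : |x_k| ≤ t} + Σ_k min(x_k², t²), the criterion minimized in Donoho and Johnstone's SureShrink, and minimize over the candidates {0} ∪ {|x_k|} as described above.

```python
import numpy as np

def sure(t, x):
    """Stein's unbiased risk estimate for soft thresholding at t
    (unit noise variance)."""
    x = np.asarray(x, dtype=float)
    return (len(x) - 2.0 * np.sum(np.abs(x) <= t)
            + np.sum(np.minimum(np.abs(x), t) ** 2))

def sure_threshold(x):
    """Minimize SURE(t; x) over the candidate thresholds {0} and {|x_k|}."""
    cands = np.concatenate(([0.0], np.abs(x)))
    risks = np.array([sure(t, x) for t in cands])
    return cands[np.argmin(risks)]

x = np.array([0.3, -0.5, 0.2, 6.0, -0.1, 7.0, 0.4, -0.2])
t = sure_threshold(x)
soft = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
print(t)     # 0.5: the chosen threshold sits among the small |x_k|
print(soft)  # only the two large coefficients survive, shrunk toward zero by t
```

On this toy vector the small "noise" coefficients are set to zero while 6.0 and 7.0 are shrunk to 5.5 and 6.5, mirroring the dashed-line behavior in Figure 8.1.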
Data Adaptive Wavelet Thresholding

[Figure 8.1 panels: Level 10 coefficients; SURE(t; x) for level 10; Level 9 coefficients; SURE(t; x) for level 9.]
One orthonormal basis for V_j is {φ_{j,k}, k ∈ ℤ}. Another possible orthonormal basis for V_j is the set of wavelet packet functions

{ w^m_{0,k}(x), 0 ≤ m < 2^j, k ∈ ℤ }.
Since V_{j+1} = V_j ⊕ W_j, it can be seen that an orthonormal basis for the "detail space" W_j is

{ w^m_{0,k}(x), 2^j ≤ m < 2^{j+1}, k ∈ ℤ }.

The orthonormal bases we are familiar with for these spaces are written in terms of this new notation as {w^0_{j,k}, k ∈ ℤ} for V_j and {w^1_{j,k}, k ∈ ℤ} for W_j. In addition to these two examples of possible orthonormal bases, there are many others that can be used, the elements of which result from the appropriate choice of various combinations of the indices m, j, and k. To be more precise, a basis for L²(ℝ) can be formed by allowing k to range over ℤ, and choosing an index set I = {(m_0, j_0), (m_1, j_1), ...} such that the intervals [2^{j_i} m_i, 2^{j_i}(m_i + 1)) are disjoint and "cover" the entire interval [0, ∞):

∪_i [2^{j_i} m_i, 2^{j_i}(m_i + 1)) = [0, ∞).   (9.4)

This can be thought of as covering the entire time-frequency plane with windows of various shapes. It is easily shown that the usual wavelet basis forms such a cover: let (m_0, j_0) = (0, 0), and then set m_1 = m_2 = ⋯ = 1 and let j_i = i − 1, for i = 1, 2, ....
WAVELET PACKETS
Figure 9.5: Wavelet packet functions corresponding to the Haar system (panels for m = 0 through m = 7)
Generalizations and Extensions
Though the previous discussion was given in terms of the Haar basis, the same results hold for all sets of wavelet packet functions and their associated subspaces V_j and W_j for j ∈ ℤ. The collection of all wavelet packet functions {w^m_{j,k}, j, k ∈ ℤ, m = 0, 1, …} contains far too many elements to form an orthonormal basis. Care must be taken in choosing a subset of this collection in order to obtain a proper basis. Denoting by I a suitably chosen set of indices, the decomposition of an L²(ℝ) function f into its wavelet packet components is given by

f(x) = Σ_{(m,j)∈I} Σ_{k∈ℤ} a^m_{j,k} w^m_{j,k}(x),
where the coefficients are computed via the inner product a^m_{j,k} = ⟨f, w^m_{j,k}⟩.
Thus, wavelet packets offer an enormous amount of flexibility in possible sets of basis functions. The grouping of all possible bases is called a library of bases. For the idea of wavelet packets to be really useful in practical situations, there must be some good adaptive way to choose the most appropriate set of basis functions with which to represent a particular function. This is the aim of the best basis algorithm.
The Best Basis Algorithm

Just as we moved earlier in this text from the wavelet decomposition of continuous functions to the decomposition of discrete data, we do so now in our discussion of wavelet packets. This is perhaps a more natural setting in which to describe the main conceptual points of wavelet packets and the associated best basis algorithm. Recall from Section 4.1 that a decomposition algorithm exists to compute scaling function and wavelet coefficients at level j from the scaling function coefficients at level j + 1, specifically

c_{j,k} = Σ_{ℓ∈ℤ} h_{ℓ−2k} c_{j+1,ℓ},
d_{j,k} = Σ_{ℓ∈ℤ} (−1)^ℓ h_{−ℓ+2k+1} c_{j+1,ℓ}.
Recall also from Section 6.2 that this algorithm was begun by regarding the data values Y₁, …, Y_n as the highest level scaling function coefficients from which all lower-level coefficients are ultimately computed. In Section 4.2, we
noted that such a decomposition algorithm is a pair of filtering operations, and thus the two decomposition expressions above can be expressed

c_{j,·} = H c_{j+1,·},   d_{j,·} = G c_{j+1,·},

where H and G represent the low-pass and high-pass filters associated with the respective decomposition formulas. Thus, if the data points are regarded as the scaling function coefficients at level J, then all scaling function coefficients are obtained by repeated application of the filter H:

c_{J−i,·} = Hⁱ Y,   i = 1, 2, …,

where Y = (Y₁, …, Y_n)′. Similarly, wavelet coefficients are computed by applying the filter G after successive applications of H:

d_{J−i,·} = G H^{i−1} Y,   i = 1, 2, ….
By recalling the usual wavelet decomposition of data, we are better equipped to describe the organizational structure inherent in the wavelet packet decomposition.
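The one-level filtering relations can be sketched concretely for the Haar filter (h₀ = h₁ = 1/√2). This is a minimal illustration with periodic boundary handling; the function name and the direct double loop are my own choices, made for clarity rather than speed:

```python
import numpy as np

# Haar low-pass filter: h_0 = h_1 = 1/sqrt(2); all other h_l are zero.
h = {0: 1 / np.sqrt(2), 1: 1 / np.sqrt(2)}

def decompose_step(c):
    """One level of the pyramid algorithm:
    c_{j,k} = sum_l h_{l-2k} c_{j+1,l},  d_{j,k} = sum_l (-1)^l h_{-l+2k+1} c_{j+1,l}."""
    n = len(c)
    smooth = np.zeros(n // 2)
    detail = np.zeros(n // 2)
    for k in range(n // 2):
        for l in range(n):
            smooth[k] += h.get(l - 2 * k, 0.0) * c[l]
            detail[k] += (-1) ** l * h.get(-l + 2 * k + 1, 0.0) * c[l]
    return smooth, detail   # smooth = Hc, detail = Gc

# A step signal is smoothed by H and annihilated by G at the finest level:
Y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
c1, d1 = decompose_step(Y)    # c1 = HY, d1 = GY (all zeros for this signal)
c2, d2 = decompose_step(c1)   # c2 = H^2 Y, d2 = GHY
```

Repeating the call on each smooth output walks down the left-hand branch of the tree in Figure 9.6.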
[Tree: Y splits into HY and GY; HY splits into H²Y and GHY; H²Y splits into H³Y and GH²Y.]
Figure 9.6: Tree diagram of the usual wavelet decomposition algorithm
Generalizations and Extensions
179
The usual wavelet decomposition is displayed in a tree diagram in Figure 9.6. This idea is generalized to describe the wavelet packet decomposition. Each set of coefficients is subjected to both of the filters H and G. Computing the full wavelet packet decomposition involves applying both filters to the Y_i values and then recursively to each intermediate signal, giving the tree diagram in Figure 9.7. The decomposition of the signal at each node of the tree by applying the two filters is known as the splitting algorithm. Computing the full wavelet packet decomposition depicted in Figure 9.7 on a data vector Y with n = 2^J points for r resolution levels results in a group of 2 + 4 + 8 + ⋯ + 2^r = 2^{r+1} − 2 sets of coefficients. At each level, note that the downsampling inherent in the filtering ensures that there are n total coefficients among all the sets at that level. The total number of coefficients (including the original data values) is thus n(r + 1), which is obviously a highly redundant way to represent n data values. Choosing a particular basis of wavelet packet functions amounts to "pruning" the decomposition tree. In the usual wavelet decomposition algorithm shown in Figure 9.6, each right-hand node is "pruned," meaning that lower-level decompositions are not computed from the right-hand branches. The best basis algorithm, developed by Coifman and Wickerhauser (1992), consists of traveling down the tree structure, making a data-based decision at each node as to whether or not to split. The result (when all nodes are split as far
[Tree: Y splits into HY and GY; the second level contains H²Y, GHY, HGY, G²Y; the third level contains H³Y, GH²Y, HGHY, G²HY, H²GY, GHGY, HG²Y, G³Y.]
Figure 9.7: Tree diagram of the full wavelet packet decomposition algorithm
180
TRANSLATION INVARIANT WAVELET SMOOTHING
Figure 9.8: Schematic design of the result of the best basis algorithm
as they will be split) represents the "best" basis (according to the criterion in use) for representing the particular set of data in question. This tree-based algorithm automatically ensures that the resulting index set {(m₀, j₀), (m₁, j₁), …} will "cover" the interval [0, ∞) in the best possible way (see (9.4)), guaranteeing an orthonormal basis for L²(ℝ). This tree-based approach is illustrated in two figures, Figure 9.8 denoting an arbitrary wavelet packet basis and Figure 9.9 representing the usual wavelet basis. Note that the shaded boxes taken together should "cover" the entire width of the figure. In these figures, an unshaded box indicates that the box was split, so that it does not correspond directly to any element of the final basis. Of course, choosing a best basis begs the question as to what "best" means: the criterion function. Coifman and Wickerhauser (1992) focus primarily on the Shannon entropy measure. Other possibilities include counting the number of coefficients greater in absolute value than a given threshold.
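A compact sketch of the splitting algorithm together with the best basis search is given below. It uses the Haar filters and an additive entropy-type cost in the spirit of the Shannon entropy measure mentioned above; the function names and the recursive organization are my own, so this is an illustration of the idea rather than the authors' implementation:

```python
import numpy as np

def haar_split(c):
    # Splitting algorithm: one application of the Haar H and G filters.
    even, odd = c[0::2], c[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def entropy_cost(x):
    # Additive entropy-type cost -sum x_i^2 log x_i^2, with 0 log 0 taken as 0.
    sq = x[x != 0] ** 2
    return -np.sum(sq * np.log(sq))

def best_basis(c, max_level):
    """Descend the packet tree, splitting a node only when the combined cost
    of its two children is lower than the node's own cost."""
    cost = entropy_cost(c)
    if max_level == 0 or len(c) < 2:
        return cost, [c]
    lo, hi = haar_split(c)
    cost_lo, leaves_lo = best_basis(lo, max_level - 1)
    cost_hi, leaves_hi = best_basis(hi, max_level - 1)
    if cost_lo + cost_hi < cost:
        return cost_lo + cost_hi, leaves_lo + leaves_hi
    return cost, [c]

# For the step signal below, the search keeps splitting the low-pass branch
# until all of the signal's energy sits in a single coefficient.
Y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
cost, leaves = best_basis(Y, 3)
```

Because the cost is additive over coefficients and the filters are orthonormal, comparing a node's cost with the sum of its children's best costs yields the globally optimal pruning of the tree.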
9.3 Translation Invariant Wavelet Smoothing
One problem that wavelet bases have is the lack of translation invariance. To illustrate this point by example, consider the Haar basis decomposition of
Figure 9.9: Schematic design of the usual wavelet decomposition
the function f(x) = ψ(x) (the Haar wavelet). It is clear that d_{0,0} = 1 and all the other wavelet coefficients are identically zero. Now consider translating f to the right by a small amount δ: f(x) = ψ(x − δ). This new function is of course still in L²(ℝ) and thus can still be decomposed into its wavelet components, but the nice coefficient structure of the original decomposition is lost. For the shifted function, there are two non-zero coefficients at level 0: (d_{0,0}, d_{0,1}) = (1 − 3δ, −δ), three non-zero coefficients at level 1: (d_{1,0}, d_{1,1}, d_{1,2}) = √2(−δ, 2δ, −δ), and so on. If the shift δ is taken to be an integer, then the nice structure is preserved: now d_{0,δ} = 1 and all the other coefficients are zero, so the Haar wavelet decomposition is translation invariant under integral shifts, but not in general. No matter how an L² function f(x) is shifted, it is still in L²(ℝ), and it can be written in terms of its wavelet components. Furthermore, the Parseval identity for wavelets guarantees that the energy in the function is preserved in the total set of wavelet coefficients, regardless of how the energy is distributed among the coefficients. In this sense, the lack of translation invariance is not a real problem. It is a significant weakness, though, when applying wavelet methods to finite data sets, especially those with small or moderate sample sizes.
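The effect of a small shift on the Haar coefficients can be checked by direct numerical integration. In the sketch below, the grid resolution and the helper names are my own choices; it approximates d_{j,k} = ⟨f, ψ_{j,k}⟩ for the shifted Haar wavelet with δ = 0.1:

```python
import numpy as np

def haar_psi(x):
    # Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), zero elsewhere.
    pos = ((0.0 <= x) & (x < 0.5)).astype(float)
    neg = ((0.5 <= x) & (x < 1.0)).astype(float)
    return pos - neg

grid = np.arange(-1.0, 3.0, 1e-5)   # fine grid covering all supports involved
dx = 1e-5

def coef(f, j, k):
    # d_{j,k} = <f, psi_{j,k}>, approximated by a Riemann sum on the grid
    psi_jk = 2 ** (j / 2) * haar_psi(2 ** j * grid - k)
    return np.sum(f(grid) * psi_jk) * dx

delta = 0.1
f = lambda x: haar_psi(x - delta)    # the Haar wavelet shifted right by delta

d00 = coef(f, 0, 0)   # close to 1 - 3*delta
d10 = coef(f, 1, 0)   # close to -sqrt(2)*delta: no longer zero
```

Changing delta to an integer in the snippet restores the one-nonzero-coefficient structure, confirming invariance under integral shifts.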
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Figure 9.10: Wavelet reconstructions of translated versions of a data set
A simulated example of how this can affect a statistical estimator in practice is shown in Figure 9.10. The first data set consists of 64 N(2, 1) random variables followed by 64 N(−2, 1) observations (f(x) = 2ψ(x), the Haar wavelet, with a signal-to-noise ratio of 2). The other two sets of data are just translated versions of the first: Z_i = Y_{(i+h) mod n} for h = 42 (corresponding to a leftward shift of 1/3) and h = 32 (corresponding to a shift of 1/4). These versions of the original data are actually "wrapped-around" translations of the original data, as if we were applying periodic boundary handling. For all three data sets, the universal threshold was applied to all levels of coefficients using the hard thresholding operator. The resulting wavelet estimator is shown along with each data set. The errors (1/n) Σ_{i=1}^n (f̂(i/n) − f(i/n))² were computed for the three estimates, giving 0.031, 0.045, and 2.157, respectively. It should come as no
Figure 9.11: Translation-invariant wavelet estimator of simulated data using the Haar function, hard thresholding, and the universal threshold across all levels
surprise that the first estimate is quite good: the abrupt jump at 1/2 lines up exactly with the middle jump of the Haar wavelet ψ_{0,0}. The second estimate is not quite as good as the first, but is still quite good, since the pair of jumps at 1/4 and 3/4 line up exactly with the jumps in the wavelets ψ_{1,0} and ψ_{1,1}. The third estimate illustrates what can go wrong in general. Shifting by 1/3 ensures that none of the wavelets will line up exactly. The estimation procedure does the best it can, but fails miserably. For general wavelet bases, the same phenomenon holds. Though it is not disastrous in every case, the wavelet thresholding estimator does not perform well for arbitrary translations. Coifman and Donoho (1995) note this problem, and also that the lack of translation invariance causes various spurious artifacts in the reconstruction of functions, such as Gibbs-type phenomena (rapid oscillations of high amplitude, typical in Fourier reconstructions near jumps) near jump discontinuities. They propose an ingenious yet simple solution, which is described in this section. One possibility, of course, would be to impose a shift on the data set before the decomposition takes place (in order to align apparent features of the data to avoid the problem seen in the third plot in the example), then decompose, shrink, reconstruct, and inverse-shift. In practice, however, it would be quite difficult to know the exact amount to shift the data (if at all). Instead, Coifman and Donoho propose to compute the wavelet estimator for all possible shifts, then inverse-shift the estimates and take as the final estimate the average of the estimates resulting from all shift values. By "all possible shifts," we mean all n shifts of the data: considered on the unit interval, these are shifts by amounts i/n, for i = 1, …, n. Though, as in the example,
some shifts will likely give poor results, these can reasonably be expected to average themselves out over all possible shifts. This "Spin Cycle" algorithm is demonstrated in Figure 9.11 for the example data set from Figure 9.10. Note that since the second two data sets are just translated versions of the first, all three versions will give the same estimate under this translation-invariance scheme. The estimation scheme used to produce Figure 9.11 was the same as that used in Figure 9.10: the Haar wavelet basis with the universal threshold applied to all levels. The error for this example was 0.411, much worse than the 0.031 and 0.045 for the first two plots of Figure 9.10, but a great deal better than the third plot. Note that even though the Haar wavelet was used, averaging over 128 different estimators gave a fairly smooth final estimate. Figure 9.12 is a plot of the same data set, smoothed by applying the Spin Cycle algorithm with a smoother wavelet, soft thresholding, and the SURE threshold selection scheme. This estimator does a good job of picking up the jump at the middle (and the jump at the edges induced by periodic boundary handling), but is a little wavy through the "flat" parts of the data. The error for this estimate is 0.273.

Figure 9.12: Translation-invariant wavelet estimator of simulated data using Daubechies' N = 5 wavelet, soft thresholding, and the SURE thresholding scheme
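A minimal sketch of the cycle-spinning idea follows, assuming a Haar transform with periodic boundaries and the universal hard threshold applied to every detail level. All function names are my own, and the brute-force loop over shifts is for clarity; Coifman and Donoho give a much faster algorithm for the same average:

```python
import numpy as np

def haar_forward(y):
    # Full orthonormal Haar transform; len(y) must be a power of two.
    c = np.asarray(y, dtype=float)
    coeffs = []
    while len(c) > 1:
        s = (c[0::2] + c[1::2]) / np.sqrt(2)   # smooth (low-pass) part
        d = (c[0::2] - c[1::2]) / np.sqrt(2)   # detail (high-pass) part
        coeffs.append(d)
        c = s
    coeffs.append(c)                            # coarsest smooth coefficient
    return coeffs

def haar_inverse(coeffs):
    c = coeffs[-1].copy()
    for d in reversed(coeffs[:-1]):
        s = np.empty(2 * len(c))
        s[0::2] = (c + d) / np.sqrt(2)
        s[1::2] = (c - d) / np.sqrt(2)
        c = s
    return c

def spin_cycle(y, sigma=1.0):
    """Average the hard-thresholded Haar estimates over all n circular shifts."""
    n = len(y)
    lam = sigma * np.sqrt(2 * np.log(n))        # universal threshold
    est = np.zeros(n)
    for h in range(n):
        coeffs = haar_forward(np.roll(y, h))
        thr = [np.where(np.abs(d) > lam, d, 0.0) for d in coeffs[:-1]]
        thr.append(coeffs[-1])                  # never threshold the smooth term
        est += np.roll(haar_inverse(thr), -h)   # inverse-shift before averaging
    return est / n
```

Applied to simulated step data like that of Figure 9.10 (64 observations near 2 followed by 64 near −2), every circular shift of the data yields the same averaged estimate, which is the translation-invariance property the section describes.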
Appendix
This book is concerned with the L² function space, and while the notion of function spaces may not be familiar to the reader at first, it can be readily understood by relating it to vector spaces in linear algebra. (It is presupposed that the reader has had some exposure to linear algebra.) Most of the specific material on L² function space needed for this book is introduced as it is needed in Chapter 1. The following pages, while certainly not intended to be a complete discussion of vector spaces and function spaces, are devoted to briefly reviewing some basic concepts from linear algebra and then extending them to general Hilbert spaces.

In linear algebra, a vector in ℝ^k is an ordered k-tuple of real numbers x = (x₁, x₂, …, x_k) which is viewed as a directed line segment from 0 = (0, …, 0) to x. Two vectors x and y are said to be equal if x_i = y_i for each i = 1, …, k. Two fundamental algebraic operations may be applied to vectors. Vector addition is the elementwise sum of two k-tuples:

x + y = (x₁ + y₁, x₂ + y₂, …, x_k + y_k),

while the scalar multiplication of a vector x by a scalar a ∈ ℝ is

ax = (ax₁, ax₂, …, ax_k).

Note that both of these operations result in a new k-tuple. From these two basic properties, it is easily shown that addition of k-tuples is commutative and associative, and that various other algebraic properties hold. With these basic ideas, we turn next to the idea of vector spaces. The set of all ordered k-tuples is said to form the vector space ℝ^k. The space ℝ² is often represented by the usual x-y plane. Three-dimensional space corresponds with ℝ³, but higher-order vector spaces are difficult to visualize. Formally, a vector space is any set of vectors V which is closed under vector addition and scalar multiplication, i.e., for all x, y ∈ V and a ∈ ℝ,

x + y ∈ V and ax ∈ V.
These two operations must also satisfy a set of standard postulates, including commutativity, associativity, existence of a zero vector, etc. These postulates are listed in any basic linear algebra book. A subspace of a vector space V is a subset of vectors in V which is itself closed under addition and scalar multiplication. A subspace is also a vector space, so it must also include the zero vector and satisfy the other necessary postulates. In the vector space ℝ³, some examples of subspaces are the set consisting only of the zero vector; all vectors of the form (c, 0, 2c) for c ∈ ℝ; and in fact any plane or any line which passes through the origin. To discuss a basis for a vector space, we need a few preliminary definitions. A vector y is a linear combination of the vectors x₁, x₂, …, x_m if it can be expressed

y = a₁x₁ + a₂x₂ + ⋯ + a_mx_m.

A set of vectors {x₁, x₂, …, x_m} is said to be linearly dependent if the zero vector is a non-trivial linear combination of the x_i's (non-trivial means that not all the a_i's can be zero). Thus, if a set of non-zero vectors is linearly dependent, then at least one of the vectors can be written as a linear combination of the others. If a set of vectors is not linearly dependent, then it is linearly independent, which means that none of the vectors in the set can be written as a linear combination of the others. If every vector in a vector space V can be written as a linear combination of a set of vectors {x₁, x₂, …, x_n}, then it is said that these vectors span V. A set of vectors {x₁, x₂, …, x_m} is said to be a basis for a vector space V if the vectors are linearly independent and span V. The concept of a basis is essential to a discussion of linear algebra. For a particular basis x₁, x₂, …, x_m, each vector y in the space can be written in terms of the x_i's:

y = a₁x₁ + a₂x₂ + ⋯ + a_mx_m,

and furthermore, the representation is unique. There are many possible bases (infinitely many, in fact) for each non-trivial vector space. A simple example of a basis for ℝ^k is the standard basis: x₁ = (1, 0, 0, …, 0)′, x₂ = (0, 1, 0, …, 0)′, …, x_k = (0, 0, 0, …, 1)′. In fact, any set of k linearly independent vectors in ℝ^k constitutes a basis for ℝ^k, and every possible basis for ℝ^k will have exactly k vectors. The number of basis vectors for any vector space is known as the dimension of the space, with the dimension of the space {0} defined to be zero. A basis can be thought of geometrically as a set of coordinate axes. The standard basis is represented by the usual Euclidean axes. Any vector in the space has a unique representation in terms of these axes. In Euclidean geometry, the well-known formula for the squared length of
a vector x, a generalization of the Pythagorean theorem, is given by

‖x‖² = x₁² + x₂² + ⋯ + x_k².

Using the usual notation for the dot product (or scalar product) between two vectors x and y,

x · y = x₁y₁ + x₂y₂ + ⋯ + x_ky_k,

the angle θ between the vectors x and y can be computed according to

cos θ = (x · y) / (‖x‖ ‖y‖).    (9.5)

To allow ready extension to other types of vector spaces, we will use the term inner product in place of dot product and write, for example, for k-tuples x and y,

⟨x, y⟩ = x₁y₁ + x₂y₂ + ⋯ + x_ky_k.

In terms of the inner product, the length of a vector x, which we will henceforth refer to as the norm of the vector, is given by

‖x‖ = ⟨x, x⟩^{1/2} = √(x₁² + x₂² + ⋯ + x_k²).

From (9.5) it is seen that if two vectors have an inner product of zero, the angle between them is 90 degrees (π/2 radians), and the vectors are said to be perpendicular, or orthogonal. Orthogonality may be difficult to visualize in more than three dimensions, but it is a key concept for this book. A set of vectors {x₁, x₂, …, x_m} forms an orthogonal basis for a vector space V if the vectors are a basis for V and if each pair of basis vectors is orthogonal. If each vector of an orthogonal basis for V is normalized to have length (norm) one,

yᵢ = xᵢ / ‖xᵢ‖,   i = 1, …, m,
then the resulting set of vectors {y₁, y₂, …, y_m} constitutes an orthonormal basis for V. The notion of orthogonality extends to subspaces as well. Two subspaces V and W (both in the same vector space) are said to be orthogonal if every vector in V is orthogonal to every vector in W. If each vector of a basis for V is orthogonal to each vector of a basis for W, then this implies that the subspaces V and W are orthogonal.
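These definitions are easy to verify numerically. The sketch below (the example vectors are my own) checks orthogonality via the inner product, normalizes an orthogonal pair into an orthonormal one, and confirms the unique expansion of a vector in their span:

```python
import numpy as np

# Two orthogonal (but not unit-length) vectors in R^3:
x1 = np.array([1.0, 1.0, 0.0])
x2 = np.array([1.0, -1.0, 0.0])
assert np.dot(x1, x2) == 0.0             # inner product zero => orthogonal

# Normalizing each to norm one gives an orthonormal basis for their span:
y1 = x1 / np.linalg.norm(x1)
y2 = x2 / np.linalg.norm(x2)

# Any vector in the span has the unique expansion x = <x, y1> y1 + <x, y2> y2:
x = np.array([3.0, -1.0, 0.0])
recon = np.dot(x, y1) * y1 + np.dot(x, y2) * y2
assert np.allclose(recon, x)
```

The same expansion with a vector outside the span yields the projection onto the subspace, which is the notion discussed next.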
Every subspace W of a vector space V has an orthogonal complement in V, which consists of the set of all vectors in V that are orthogonal to W. It is straightforward to show that the orthogonal complement of a subspace is also a subspace. Given a vector x and a subspace W, the projection of x onto W is a vector y such that y ∈ W and x − y is in the orthogonal complement of W in V. The projection operation is denoted y = P_W x. The projection of a vector x onto a subspace W is the vector in W that is "closest" to x, in the sense that the magnitude of the "error" ‖x − y‖ is minimized when y = P_W x. From the vector space ℝ^k, we can extend to the infinite-dimensional space ℝ^∞, which contains all infinite-length vectors x = (x₁, x₂, x₃, …)′ with finite norm: ‖x‖² = Σ_{i=1}^∞ xᵢ² < ∞. Though infinite-dimensional vector space might be hard to conceptualize, ℝ^∞ defined this way does form a bona fide vector space, since adding any two vectors with finite norm, or multiplying one by a finite scalar, results in another vector with finite norm. It is possible now to move from the countably infinite-dimensional vector space to uncountably infinite-dimensional vector spaces, which are simply spaces of functions. An element of such a function space is a function f(x) defined on a continuous subset of the real line. The notions of inner product and norm extend to function space as well, where the summation in vector space is replaced by its continuous counterpart, the integral. The inner product of two functions is given by

⟨f, g⟩ = ∫ f(x) g(x) dx,    (9.6)
the range of the integration being determined by the definition of the particular space. The treatment of vector spaces and function spaces can be unified by considering the more general framework of Hilbert spaces. A Hilbert space is simply a complete¹ vector space (finite- or infinite-dimensional) on which an inner product is defined. This book is primarily concerned with the particular Hilbert space known as L² function space. With the inner product defined as in (9.6) (integration taking place over some specified interval I ⊂ ℝ), this function space consists of all functions that are square-integrable:

∫_I f²(x) dx < ∞.
Clearly, this space is closed under addition and scalar multiplication, and it is complete, so it is indeed a valid Hilbert space.

¹ Completeness is a closure condition on the space, requiring that all Cauchy sequences converge to a limit that is also in the space.
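The L² inner product (9.6) can be approximated by replacing the integral with a Riemann sum. The sketch below (the grid size and names are my own) checks that the Haar scaling function and mother wavelet each have unit norm on [0, 1] and are orthogonal there:

```python
import numpy as np

n = 200000
dx = 1.0 / n
x = (np.arange(n) + 0.5) * dx   # midpoints of n subintervals of [0, 1]

def inner(f, g):
    # Riemann-sum approximation of <f, g> = integral of f(x) g(x) over [0, 1]
    return np.sum(f(x) * g(x)) * dx

phi = lambda t: np.ones_like(t)               # Haar scaling function on [0, 1)
psi = lambda t: np.where(t < 0.5, 1.0, -1.0)  # Haar mother wavelet on [0, 1)

# phi and psi each have unit L2 norm, and they are orthogonal:
assert abs(inner(phi, phi) - 1.0) < 1e-6
assert abs(inner(psi, psi) - 1.0) < 1e-6
assert abs(inner(phi, psi)) < 1e-6
```

This is exactly the discrete inner product from the earlier part of the appendix, with the sum over coordinates replaced by a sum over grid points weighted by dx.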
All the concepts discussed earlier in terms of the usual k-tuple vector space extend to L² function space as well. The norm of a vector in L² space is defined to be
‖f‖² = ⟨f, f⟩ = ∫ f²(x) dx.
We can also speak of subspaces in L² function space. For example, the span of a set of L²(ℝ) functions {f₁, …, f_m} is a subspace of L², defined to be²

{f ∈ L²(ℝ) : f(x) = Σ_{i=1}^m aᵢ fᵢ(x) for some constants a₁, …, a_m}.    (9.7)
Other concepts that extend immediately to L² function space are orthogonality, bases, orthonormal bases, projections, etc.
² To be precise, the representation (9.7) of a function f in terms of a linear combination of other functions need hold only "almost everywhere" (a.e.), i.e., ‖f − Σᵢ aᵢ fᵢ‖ = 0.
References
Abramovich, F., and Benjamini, Y. (1995). Thresholding of wavelet coefficients as multiple hypotheses testing procedure. In Wavelets and Statistics. Antoniadis, A., and Oppenheim, G. (eds.). New York: Springer-Verlag. pp. 5-14.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory. Petrov, B. N., and Csaki, F. (eds.). Akademiai Kiado: Budapest.
Altman, N. S. (1990). Kernel smoothing of data with correlated errors. Journal of the American Statistical Association 85: 749-759.
Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley: New York.
Anderson, L., Hall, N., Jawerth, B., and Peters, G. (1993). Wavelets on closed subsets of the real line. In Recent Advances in Wavelet Analysis. Schumaker, L. L., and Webb, G. (eds.). Academic Press: New York.
Antoniadis, A., Gregoire, G., and McKeague, I. W. (1994). Wavelet methods for curve estimation. Journal of the American Statistical Association 89: 1340-1353.
Ariño, M. A., and Vidakovic, B. (1995). On wavelet scalograms and their applications in economic time series. Discussion Paper 95-21, ISDS, Duke University, Durham, North Carolina.
Auscher, P. (1989). Ondelettes fractales et applications. Ph.D. Thesis, Universite Paris-Dauphine, Paris.
Bartlett, M. S. (1963). Statistical estimation of density functions. Sankhya Series A 25: 245-254.
Battle, G. (1987). A block spin construction of ondelettes. Part I: Lemarie functions. Communications in Mathematical Physics 100: 601-615.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289-300.
Bloomfield, P. (1976). Fourier Analysis of Time Series: An Introduction. Wiley: New York.
Bock, M. E. (1992). Estimating functions with wavelets. Statistical Computing and Statistical Graphics Newsletter 4-8.
Bock, M. E., and Pliego, G. J. (1992). Estimating functions with wavelets Part II: Using a Daubechies wavelet in nonparametric regression. Statistical Computing and Statistical Graphics Newsletter 27-34.
Brigham, E. O. (1988). The Fast Fourier Transform and Its Applications. Prentice-Hall: Englewood Cliffs, New Jersey.
Cencov, N. N. (1962). Evaluation of an unknown distribution density from observations. Soviet Mathematics 3: 1559-1562.
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Wadsworth: Belmont, California.
Cheng, K. F., and Lin, P. E. (1981). Nonparametric estimation of a regression function. Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete 57: 223-233.
Chipman, H. A., Kolaczyk, E. D., and McCulloch, R. E. (1995). Adaptive Bayesian wavelet shrinkage. Technical Report, University of Chicago, Chicago, Illinois.
Chui, C. K. (1992). An Introduction to Wavelets. Academic Press: New York.
Chui, C. K., and Wang, J. Z. (1991). A cardinal spline approach to wavelets. Proceedings of the American Mathematical Society 113: 785-793.
Cohen, A., Daubechies, I., and Feauveau, J. C. (1992). Biorthogonal bases of compactly supported wavelets. Communications in Pure and Applied Mathematics 45: 485-560.
Cohen, A., Daubechies, I., and Vial, P. (1993). Wavelets on the interval and fast wavelet transforms. Applied and Computational Harmonic Analysis 1: 54-81.
Cohen, A., Daubechies, I., Jawerth, B., and Vial, P. (1993). Multiresolution analysis, wavelets and fast algorithms on an interval. Comptes Rendus des Seances de l'Academie des Sciences, Serie I 316: 417-421.
Coifman, R. R., and Donoho, D. L. (1995). Translation-invariant de-noising. In Wavelets and Statistics. Antoniadis, A., and Oppenheim, G. (eds.). New York: Springer-Verlag. pp. 125-150.
Coifman, R. R., and Meyer, Y. (1991). Remarques sur l'analyse de Fourier a fenetre. Comptes Rendus des Seances de l'Academie des Sciences, Serie I 312: 259-261.
Coifman, R. R., and Wickerhauser, M. V. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory 38: 713-718.
Coifman, R., Meyer, Y., and Wickerhauser, M. V. (1994). Wavelet analysis and
signal processing. In Wavelets and Their Applications. Ruskai, M. B., Beylkin, G., Coifman, R., Daubechies, I., Mallat, S., Meyer, Y., and Raphael, L. (eds.). Jones and Bartlett: Boston.
Coifman, R. R., Meyer, Y., Quake, S., and Wickerhauser, M. V. (1994). Signal processing and compression with wavelet packets. In Wavelets and Their Applications. Byrnes, J. S., Byrnes, J. L., Hargreaves, K. A., and Berry, K. (eds.). Kluwer Academic Publishers: Dordrecht, The Netherlands.
Collineau, S. (1994). Some remarks about the scalograms of wavelet transform coefficients. In Wavelets and Their Applications. Byrnes, J. S., Byrnes, J. L., Hargreaves, K. A., and Berry, K. (eds.). Kluwer Academic Publishers: Dordrecht, The Netherlands.
Cooley, J. W., and Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19: 297-301.
Craven, P., and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik 31: 377-403.
Csorgo, M., and Horvath, L. (1988). Nonparametric methods for changepoint problems. In Handbook of Statistics, Volume 7. Krishnaiah, P. R., and Rao, C. R. (eds.). Elsevier: Amsterdam.
Daniel, C. (1959). Use of half-normal plots in interpreting factorial two-level experiments. Technometrics 1: 311-341.
Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Communications in Pure and Applied Mathematics 41: 909-996.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM: Philadelphia.
Daubechies, I. (1993). Orthonormal bases of compactly supported wavelets II. Variations on a theme. SIAM Journal on Mathematical Analysis 24: 499-519.
Daubechies, I., and Lagarias, J. (1991). Two-scale difference equations I. Existence and global regularity of solutions. SIAM Journal on Mathematical Analysis 22: 1388-1410.
Daubechies, I., and Lagarias, J. (1992). Two-scale difference equations II. Local regularity, infinite products of matrices and fractals. SIAM Journal on Mathematical Analysis 23: 1031-1079.
Delacroix, M. (1983). Histogrammes et Estimation de la Densite. Que sais-je? #2055. Presses Universitaires de France: Paris.
de Boor, C. (1978). A Practical Guide to Splines. Applied Mathematical Sciences, Volume 27. Springer-Verlag: London.
DeVore, R. A., and Lucier, B. J. (1992). Fast wavelet techniques for near-optimal processing. In Proceedings of the IEEE Military Communications Conference 48.3.1-48.3.7. New York.
Donoho, D. L. (1993). Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data. Proceedings of Symposia in Applied Mathematics 47: 173-205.
Donoho, D. L., and Johnstone, I. M. (1992). Nonlinear solution of linear inverse problems by wavelet-vaguelette decomposition. Technical Report 403, Stanford University Department of Statistics, Stanford, California.
Donoho, D. L., and Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81: 425-455.
Donoho, D. L., and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association 90: 1200-1224.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1993). Density estimation by wavelet thresholding. Technical report, Stanford University Department of Statistics, Stanford, California.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B 57: 301-369.
Doukhan, P., and Leon, J. (1990). Deviation quadratique d'estimateur de densite par projection orthogonale. Comptes Rendus des Seances de l'Academie des Sciences, Serie I 310: 424-430.
Dutilleux, P. (1989). An implementation of the "algorithme a trous" to compute the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space. Combes, J. M., Grossman, A., and Tchamitchian, Ph. (eds.). Springer-Verlag: New York.
Dym, H., and McKean, H. P. (1972). Fourier Series and Integrals. Academic Press: New York.
Engel, J. (1990). Density estimation with Haar series. Statistics and Probability Letters 9: 111-117.
Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker: New York.
Fan, J., Hall, P., Martin, M., and Patil, P. (1996). On local smoothing of nonparametric curve estimators. Journal of the American Statistical Association 91: 258-266.
Gabor, D. (1946). Theory of communication. Journal of the Institution of Electrical Engineers, London III 93: 429-457.
Gao, H.-Y. (1993). Choice of threshold for wavelet estimation of the log spectrum. Technical Report 438, Stanford University Department of Statistics, Stanford, California.
Gasser, Th., and Muller, H. G. (1979). Kernel estimation of regression functions. In Smoothing Techniques for Curve Estimation. Gasser, Th., and Rosenblatt, M. (eds.). Springer: Heidelberg.
Gasser, Th., Muller, H. G., and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, Series B 47: 238-252.
Good, I. J. (1958). The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society, Series B 20: 361-372.
Graps, A. (1995). An introduction to wavelets. IEEE Computational Science and Engineering 2.
Haar, A. (1910). Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69: 331-371.
Hart, J. D. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. Journal of the Royal Statistical Society, Series B 56: 529-542.
Hu, Y.-S. (1994). Wavelet approach to change-point detection with application to density estimation. Ph.D. thesis, Texas A&M University, College Station, Texas.
Janssen, A. J. E. M. (1992). The Smith-Barnwell condition and non-negative scaling functions. IEEE Transactions on Information Theory 38: 884-886.
Jawerth, B., and Sweldens, W. (1994). An overview of wavelet based multiresolution analyses. SIAM Review 36: 377-412.
Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1992). Estimation d'une densite de probabilite par methode d'ondelettes. Comptes Rendus des Seances de l'Academie des Sciences, Serie I 315: 211-216.
Johnstone, I. M., and Silverman, B. W. (1995). Wavelet threshold estimators for data with correlated noise. Technical report, Stanford University Department of Statistics, Stanford, California.
Kaiser, G. (1994). A Friendly Guide to Wavelets. Birkhauser: Boston.
Karlin, S., and Taylor, H. (1975). A First Course in Stochastic Processes, 2nd Edition. Academic Press: New York.
Kerkyacharian, G., and Picard, D. (1992). Density estimation in Besov spaces. Statistics and Probability Letters 13: 14-24.
Kerkyacharian, G., and Picard, D. (1993). Density estimation by kernel and wavelets methods: Optimality of Besov spaces. Statistics and Probability Letters 18: 327-336.
Lemarie, P. G. (1988). Une nouvelle base d'ondelettes de L²(ℝⁿ). Journal de Mathematiques Pures et Appliquees 67: 227-236.
Li, K. C. (1985). From Stein's unbiased risk estimates to the method of generalized cross-validation. Annals of Statistics 13: 1352-1377.
Li, K. C., and Hwang, J. (1984). The data-smoothing aspect of Stein estimates. Annals of Statistics 12: 887-897.
Lombard, F. (1988). Detecting change points by Fourier analysis. Technometrics 30: 305-310.
Mallat, S. G. (1989a). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11: 674-693.
Mallat, S. G. (1989b). Multifrequency channel decomposition of images and wavelet models. IEEE Transactions on Acoustics, Speech, and Signal Processing 37: 2091-2110.
Messiah, A. (1961). Quantum Mechanics. North-Holland: Amsterdam.
Meyer, Y. (1985). Principe d'incertitude, bases hilbertiennes et algèbres d'opérateurs. Séminaire Bourbaki, 1985-1986, No. 662.
Meyer, Y. (1990). Ondelettes et Opérateurs I: Ondelettes. Hermann: Paris.
Meyer, Y. (1992). Ondelettes sur l'intervalle. Revista Matemática Iberoamericana 7: 115-133.
Meyer, Y. (1993). Wavelets: Algorithms and Applications. SIAM: Philadelphia.
Moulin, P. (1993a). A wavelet regularization method for diffuse radar-target imaging and speckle-noise reduction. Journal of Mathematical Imaging and Vision, Special Issue on Wavelets 3: 123-134.
Moulin, P. (1993b). Wavelet thresholding techniques for power spectrum estimation. IEEE Transactions on Signal Processing 42: 3126-3136.
Müller, H.-G., and Stadtmüller, U. (1987). Variable bandwidth kernel estimators of regression curves. Annals of Statistics 15: 182-201.
Nason, G. (1994). Wavelet regression by cross-validation. Technical Report 447, Department of Statistics, Stanford University, Stanford, California.
Nason, G. P. (1995). Choice of the threshold parameter in wavelet function estimation. In Wavelets and Statistics. Antoniadis, A., and Oppenheim, G. (eds.). New York: Springer-Verlag. pp. 261-280.
Nason, G. (1996). Wavelet shrinkage using cross-validation. Journal of the Royal Statistical Society, Series B 58: 463-479.
Ogden, R. T. (1994). Wavelet thresholding in nonparametric regression with change-point applications. Ph.D. thesis, Texas A&M University, College Station, Texas.
Ogden, R. T. (1997). On preconditioning for the discrete wavelet transform when the sample size is not a power of two. Communications in Statistics B: Simulation and Computation, to appear.
Ogden, R. T., and Parzen, E. (1996a). Change-point approach to data analytic wavelet thresholding. Statistics and Computing 6: 93-99.
Ogden, R. T., and Parzen, E. (1996b). Data dependent wavelet thresholding in nonparametric regression with change-point applications. Computational Statistics and Data Analysis 22: 53-70.
Ogden, R. T., and Richwine, J. (1996). Wavelets in Bayesian change-point analysis. Technical report, University of South Carolina, Columbia, South Carolina.
Page, E. S. (1954). Continuous inspection schemes. Biometrika 41: 100-115.
Page, E. S. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika 42: 523-526.
Parzen, E. (1962). On estimation of a probability density function. Annals of Mathematical Statistics 33: 1065-1076.
Parzen, E. (1974). Some recent advances in time series modelling. IEEE Transactions on Automatic Control 19: 723-729.
Prakasa Rao, B. L. S. (1983). Nonparametric Functional Estimation. Academic Press: New York.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C: The Art of Scientific Computing, 2nd edition. Cambridge University Press: Cambridge.
Priestley, M. B. (1981). Spectral Analysis and Time Series. Academic Press: New York.
Richwine, J. (1996). Bayesian estimation of change-points using Haar wavelets. Master's thesis, University of South Carolina Department of Statistics, Columbia, South Carolina.
Rioul, O., and Vetterli, M. (1991). Wavelets and signal processing. IEEE Signal Processing Magazine: 14-38.
Ross, S. (1983). Stochastic Processes. Wiley: New York.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics 9: 65-78.
Schumaker, L. L. (1981). Spline Functions: Basic Theory. Wiley-Interscience: New York.
Shensa, M. J. (1992). The discrete wavelet transform: Wedding the à trous and Mallat algorithms. IEEE Transactions on Signal Processing 40: 2464-2482.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall: London.
Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics 9: 1135-1151.
Stone, M. (1978). Cross-validation: A review. Statistics 9: 127-140.
Strang, G., and Nguyen, T. (1996). Wavelets and Filter Banks. Wellesley-Cambridge Press: Wellesley, MA.
Strömberg, J.-O. (1982). A modified Franklin system and higher order spline systems on ℝⁿ as unconditional bases for Hardy spaces. In Conference in Honor of A. Zygmund, Vol. II. Beckner, A., et al. (eds.). Wadsworth Mathematics Series, pp. 475-493.
Taniguchi, M. (1979). On estimation of parameters of Gaussian stationary processes. Journal of Applied Probability 16: 575-591.
Taniguchi, M. (1980). On estimation of the integrals of certain functions of spectral density. Journal of Applied Probability 17: 73-83.
Tchamitchian, Ph. (1987). Biorthogonalité et théorie des opérateurs. Revista Matemática Iberoamericana 3: 163-189.
Unser, M. (1996). A practical guide to the implementation of the wavelet transform. In Wavelets in Medicine and Biology. Aldroubi, A., and Unser, M. (eds.). CRC Press: Boca Raton, Florida.
Vidakovic, B. (1994). Nonlinear wavelet shrinkage with Bayes rules and Bayes factors. Discussion Paper 94-A-24, ISDS, Duke University, Durham, North Carolina.
Vidakovic, B., and Müller, P. (1994). Wavelets for kids: A tutorial introduction. Discussion Paper 94-A-13, ISDS, Duke University, Durham, North Carolina.
Wahba, G. (1980). Automatic smoothing of the log periodogram. Journal of the American Statistical Association 75: 122-132.
Walter, G. G. (1992). Approximation of the delta function by wavelets. Journal of Approximation Theory 71: 329-343.
Walter, G. G. (1994). Wavelets and Other Orthogonal Systems With Applications. CRC Press: Boca Raton, Florida.
Wang, Y. (1995). Jump and sharp cusp detection by wavelets. Biometrika 82: 385-397.
Wang, Y. (1996). Function estimation via wavelet shrinkage for long-memory data. Annals of Statistics, to appear.
Weaver, J. B., Yansun, X., Healy, D. M., Jr., and Cromwell, L. D. (1991). Filtering noise from images with wavelet transforms. Magnetic Resonance in Medicine 24: 288-295.
Wei, W. W. S. (1990). Time Series Analysis: Univariate and Multivariate Methods. Addison-Wesley: Redwood City, California.
Wertz, W. (1978). Statistical Density Estimation: A Survey. Vandenhoeck and Ruprecht: Göttingen.
Weyrich, N., and Warhola, G. T. (1994). De-noising using wavelets and cross-validation. Technical Report AFIT/EN/TR/94-01, Department of Mathematics and Statistics, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio.
Wickerhauser, M. V. (1994). Adapted Wavelet Analysis: From Theory to Software. AK Peters: Boston.
Glossary of Notation
ℝ  the set of real numbers (−∞, ∞).
ℤ  the set of integers: ℤ = {…, −1, 0, 1, …}.
L²(I)  the set of square-integrable functions on the interval I: {f : ∫_I f²(x) dx < ∞}.
⟨f, g⟩  the L² inner product: ⟨f, g⟩ = ∫ f(x)g(x) dx.
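As a quick illustration of the last two definitions (this example is not part of the original glossary): take I = [0, 1], f(x) = x, and g(x) = 1. Then f belongs to L²([0, 1]) because its squared integral is finite, and its inner product with g can be computed directly:

```latex
\[
  \int_0^1 x^2 \, dx = \frac{1}{3} < \infty ,
  \qquad
  \langle f, g \rangle = \int_0^1 x \cdot 1 \, dx = \frac{1}{2} .
\]
```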