R. Todd Ogden
Essential Wavelets for Statistical Applications and Data Analysis
Birkhauser Boston • Basel • Berlin
R. Todd Ogden Department of Statistics University of South Carolina Columbia, SC 29208
Library of Congress Cataloging-in-Publication Data
Ogden, R. Todd, 1965-
Essential wavelets for statistical applications and data analysis / R. Todd Ogden.
p. cm. Includes bibliographical references (p. 191-198) and index.
ISBN 0-8176-3864-4 (hardcover: alk. paper). -- ISBN 3-7643-3864-4 (hardcover: alk. paper)
1. Wavelets (Mathematics) 2. Mathematical statistics I. Title.
QA403.3.O43 1997 519.5--dc20 97-27379 CIP
Printed on acid-free paper © 1997 Birkhauser Boston
Copyright is not claimed for works of U.S. Government employees. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior permission of the copyright owner. Permission to photocopy for internal or personal use of specific clients is granted by Birkhauser Boston for libraries and other users registered with the Copyright Clearance Center (CCC), provided that the base fee of $6.00 per copy, plus $0.20 per page, is paid directly to CCC, 222 Rosewood Drive, Danvers, MA 01923, U.S.A. Special requests should be addressed directly to Birkhauser Boston, 675 Massachusetts Avenue, Cambridge, MA 02139, U.S.A. ISBN 0-8176-3864-4 ISBN 3-7643-3864-4 Typeset in LaTeX by ShadeTree Designs, Minneapolis, MN. Cover design by Spencer Ladd, Somerville, MA. Printed and bound by Maple-Vail, York, PA. Printed in the U.S.A. 9 8 7 6 5 4 3 2 1
To Christine
Contents
Preface  ix
Prologue: Why Wavelets?  xiii

1  Wavelets: A Brief Introduction  1
   1.1  The Discrete Fourier Transform  1
   1.2  The Haar System  7
        Multiresolution Analysis  14
        The Wavelet Representation  16
        Goals of Multiresolution Analysis  22
   1.3  Smoother Wavelet Bases  23

2  Basic Smoothing Techniques  29
   2.1  Density Estimation  29
        Histograms  31
        Kernel Estimation  32
        Orthogonal Series Estimation  35
   2.2  Estimation of a Regression Function  38
        Kernel Regression  39
        Orthogonal Series Estimation  42
   2.3  Kernel Representation of Orthogonal Series Estimators  45

3  Elementary Statistical Applications  49
   3.1  Density Estimation  49
        Haar-Based Histograms  49
        Estimation with Smoother Wavelets  52
   3.2  Nonparametric Regression  54

4  Wavelet Features and Examples  59
   4.1  Wavelet Decomposition and Reconstruction  59
        Two-Scale Relationships  60
        The Decomposition Algorithm  62
        The Reconstruction Algorithm  63
   4.2  The Filter Representation  66
   4.3  Time-Frequency Localization  69
        The Continuous Fourier Transform  69
        The Windowed Fourier Transform  72
        The Continuous Wavelet Transform  74
   4.4  Examples of Wavelets and Their Constructions  79
        Orthogonal Wavelets  81
        Biorthogonal Wavelets  83
        Semiorthogonal Wavelets  87

5  Wavelet-based Diagnostics  89
   5.1  Multiresolution Plots  89
   5.2  Time-Scale Plots  92
   5.3  Plotting Wavelet Coefficients  95
   5.4  Other Plots for Data Analysis  100

6  Some Practical Issues  103
   6.1  The Discrete Fourier Transform of Data  104
        The Fourier Transform of Sampled Signals  104
        The Fast Fourier Transform  105
   6.2  The Wavelet Transform of Data  107
   6.3  Wavelets on an Interval  110
        Periodic Boundary Handling  111
        Symmetric and Antisymmetric Boundary Handling  112
        Meyer Boundary Wavelets  113
        Orthogonal Wavelets on the Interval  114
   6.4  When the Sample Size is Not a Power of Two  115

7  Other Applications  119
   7.1  Selective Wavelet Reconstruction  119
        Wavelet Thresholding  124
        Spatial Adaptivity  126
        Global Thresholding  128
        Estimation of the Noise Level  131
   7.2  More Density Estimation  132
   7.3  Spectral Density Estimation  133
   7.4  Detection of Jumps and Cusps  140

8  Data Adaptive Wavelet Thresholding  143
   8.1  SURE Thresholding  144
   8.2  Threshold Selection by Hypothesis Testing  149
        Recursive Testing  151
        Minimizing False Discovery  154
   8.3  Cross-Validation Methods  156
   8.4  Bayesian Methods  161

9  Generalizations and Extensions  167
   9.1  Two-Dimensional Wavelets  167
   9.2  Wavelet Packets  173
        Wavelet Packet Functions  174
        The Best Basis Algorithm  177
   9.3  Translation Invariant Wavelet Smoothing  180

Appendix  185
References  191
Glossary of Notation  199
Glossary of Terms  201
Index  205
Preface I once heard the book by Meyer (1993) described as a "vulgarization" of wavelets. While this is true in one sense of the word, that of making a subject popular (Meyer's book is one of the early works written with the nonspecialist in mind), the implication seems to be that such an attempt somehow cheapens or coarsens the subject. I have to disagree that popularity goes hand-in-hand with debasement. While there is certainly a beautiful theory underlying wavelet analysis, there is plenty of beauty left over for the applications of wavelet methods. This book is also written for the non-specialist, and therefore its main thrust is toward wavelet applications. Enough theory is given to help the reader gain a basic understanding of how wavelets work in practice, but much of the theory can be presented using only a basic level of mathematics. Only one theorem is formally stated in this book, with only one proof. And these are only included to introduce some key concepts in a natural way.
Aim and Scope This book was written to be the reference that I wanted when I began my own study of wavelets. I had books and papers, I studied theorems and proofs, but no single one of these sources by itself answered the specific questions I had: In order to apply wavelets successfully, what do I need to know? And why do I need to know it? It is my hope that this book will answer these questions for others in the same situation. In keeping with the title of this book, I have attempted to pare down the topics covered to just the essentials required for statistical applications and analysis of data. New statistical applications are being developed quickly, so due to the combination of careful choice of topics and natural delays in writing and printing, this book is necessarily incomplete. It is hoped, however, that the introduction provided in this text will provide a suitable foundation for readers to jump off into other wavelet-related topics. I am of the opinion that basic wavelet methods of smoothing functions, for example, should be as widely understood as standard kernel methods are now. Admittedly, understanding wavelet methods requires a substantial amount of overhead, in terms of time and effort, but the richness of wavelet
applications makes such an investment well worth it. This modest work is thus put forward to widen the circle of wavelet literacy. It is important to point out that I am not at all advocating the complete abandonment of all other methods. In a recent article, Fan et al. (1996) discuss local versions of some standard smoothing techniques and show that they provide a good alternative to wavelet methods, and in fact may be preferred in many applications because of their familiarity. This book was written primarily to increase the familiarity of wavelets in data analysis: wavelets are simply another useful tool in the toolbag of applied statisticians and data analysts. The treatment of topics in this book assumes only that the reader is familiar with calculus and linear algebra, with a basic understanding of elementary statistical theory. With this background, this book is essentially self-contained, with other topics (Fourier analysis, L² function space, function estimation, etc.) treated when introduced. A brief overview of L² function space is given as an appendix, along with glossaries of notation and terms. Thus, the material is accessible to a wide audience, including graduate students and advanced undergraduates in mathematics and statistics, as well as those in other disciplines interested in data analysis. Mathematically sophisticated readers can use this reference as quick reading to gain a basic understanding of how wavelets can be used.
Chapter Synopses The Prologue gives a basic overview of the topic of wavelets and describes their most important features in nonmathematical language. Chapter 1 provides a fundamental introduction to what wavelets are, with brief hints as to how they can be used in practice. Though the results of this chapter apply to general orthogonal wavelets, the material is presented primarily in terms of the simplest case of wavelet: the Haar basis. This greatly simplifies the treatment in introducing wavelet features, and once the basic Haar framework is understood, the ideas are readily extended to smoother wavelet bases. Leaving the treatment of wavelets momentarily, Chapter 2 gives a general introduction to fundamental methods of statistical function estimation in such a way that will lead naturally to basic applications of wavelets. This will of course be review material for readers already familiar with kernel and orthogonal series methods; it is included primarily for the non-specialist. Chapter 3 treats the wavelet versions of the smoothing methods described in Chapter 2, applied to density estimation and nonparametric regression. Chapter 4 returns to describing wavelets, continuing the coverage of Chapter 1. It covers more details of the earlier introduction to wavelets, and treats wavelets in more generality, introducing some of the fundamental properties of wavelet methods: algorithms, filtering, wavelet extension of the Fourier transform, and examples of wavelet families. This chapter is not,
strictly speaking, essential for applying wavelet methods, but it provides the reader with a better understanding of the principles that make wavelets work well in practice. Chapters 5-9 deal with applying wavelet methods to various statistical problems. Chapter 5 describes diagnostic methods essential to a complete data analysis. Chapter 6 discusses the important practical issues that arise in wavelet analysis of real data. Chapter 7 extends and enhances the basic wavelet methods of Chapter 3. Chapter 8 gives an overview of current research in data dependent wavelet threshold selection. Finally, Chapter 9 provides a basic background in wavelet-related methods which are not explicitly treated in earlier chapters. The information in this book could have been arranged in a variety of orders. If it were intended strictly as a reference book, a natural way to order the information might be to place the chapters dealing primarily with the mathematics of wavelets at the beginning, followed by the statistical application chapters, with the diagnostic chapter last, the smoothing chapter being included as an appendix. Instructors using this book in a classroom might cover the topics roughly in the order given, but with the miscellaneous topics in Chapter 4 distributed strategically within subsequent applications chapters. The current order was carefully selected so as to provide a natural path through wavelet introduction and application to facilitate the reader's first learning of the subject, but with like topics grouped sufficiently close together so that the book will have some value for subsequent reference.
Supplements on the World Wide Web The figures in this book were mostly generated using the commercial S-Plus software package, some using the S-Plus Wavelet Toolkit, and some using the freely available set of S-Plus wavelet subroutines by Guy Nason, available through StatLib (http://lib.stat.cmu.edu/). To encourage readers' experimentation with wavelet methods and facilitate other applications, I have made available the S-Plus functions for generating most of the pictures in this book over the World Wide Web (this is in lieu of including source code in the text). These will be located both on Birkhauser's web site (http://www.birkhauser.com/books/isbn/0-8176-3864-4/), and as a link from my personal home page (http://www.stat.sc.edu/~ogden/), which will also contain errata and other information regarding this book. As they become available, new routines for wavelet-based analysis will be included on these pages as well. Though I have only used the S-Plus software, there are many other software packages available, such as WaveLab, an extensive collection of MATLAB-based routines for wavelet analysis which is available free from Stanford's Statistics Department WWW site. Vast amounts of wavelet-related material are available through the Web,
including technical reports, a wavelet newsletter, Java applets, lecture notes, and other forms of information. The web pages for this book, which will be updated periodically, will also describe and link relevant information sites.
Acknowledgments This book represents the combination of efforts of many different people, some of whom I will acknowledge here. Thanks are due to Manny Parzen and Charles Chui for their kind words of encouragement at the outset of this project. I gratefully acknowledge Andrew Bruce, Hong-Ye Gao and others at StatSci for making available their S-PLUS Wavelet software. The suggestions and comments by Jon Buckheit, Christian Cenker, Cheng Cheng, and Webster West were invaluable in improving the presentation of the book and correcting numerous errors. I am deeply indebted to each of them. Mike Hilton and Wim Sweldens have the ability to explain difficult concepts in an easily understandable way-my writing of this book has been motivated by their examples in this regard. Carolyn Artin read the entire manuscript and made countless excellent suggestions on grammar and wording. Joe Padgett, John Spurrier, Jim Lynch, and my other colleagues at the University of South Carolina have been immensely supportive and helpful; I thank them as well. Thanks are also due to Wayne Yuhasz and Lauren Lavery at Birkhauser for their support and encouragement of the project. Finally, my deepest thanks go to my family: my wife Christine and daughter Caroline, who stood beside me every word of the way.
PROLOGUE
Why Wavelets? The development of wavelets is fairly recent in applied mathematics, but wavelets have already had a remarkable impact. A lot of people are now applying wavelets to a lot of situations, and all seem to report favorable results. What is it about wavelets that makes them so popular? What is it that makes them so useful? This prologue will present an overview in broad strokes (using descriptions and analogies in lieu of mathematical formulas). It is intended to be a brief preview of topics to be covered in more detail in the chapters. It might be useful for the reader to refer back to the prologue from time to time, to prevent the possibility of getting bogged down in mathematical detail to the extent that the big picture is lost. The prologue describes the forest; the trees are the subjects of the chapters. Broadly defined, a wavelet is simply a wavy function carefully constructed so as to have certain mathematical properties. An entire set of wavelets is constructed from a single "mother wavelet" function, and this set provides useful "building block" functions that can be used to describe any in a large class of functions. Several different possibilities for mother wavelet functions have been developed, each with its associated advantages and disadvantages. In applying wavelets, one only has to choose one of the available wavelet families; it is never necessary to construct new wavelets from scratch, so there is little emphasis placed on construction of specific wavelets. Roughly speaking, wavelet analysis is a refinement of Fourier analysis. The Fourier transform is a method of describing an input signal (or function) in terms of its frequency components. Consider a simple musical analogy, following Meyer (1993) and others. Suppose someone were to play a sustained three-note chord on an organ. The Fourier transform of the resulting digitized acoustic signal would be able to pick out the exact frequencies of the three component notes, and the chord could be analyzed by studying the relationships among the frequencies. Suppose the organist plays the same chord for a measure, then abruptly changes to a different chord and sustains that for another measure. Here, classical Fourier analysis becomes confused. It is able to determine the frequencies of all the notes in either chord, but it is unable to distinguish which frequencies belong to the first chord and which are part of the second. Essentially, the frequencies are averaged over the two measures, and the
Fourier reconstruction would sound all frequencies simultaneously, possibly sounding quite dissonant. While usual Fourier methods do a very good job at picking out frequencies from a signal consisting of many frequencies, they are utterly incapable of dealing properly with a signal that is changing over time. This fact has been well-known for years. To increase the applicability of Fourier analysis, various methods such as "windowed Fourier transforms" have been developed to adapt the usual Fourier methods to allow analysis of the frequency content of a signal at each time. While some success has been achieved, these adaptations to the Fourier methods are not completely satisfactory. Windowed transforms can localize simultaneously in time and in frequency, but the amount of localization in each dimension remains fixed. With wavelets, the amount of localization in time and in frequency is automatically adapted, in that only a narrow time-window is needed to examine high-frequency content, but a wide time-window is allowed when investigating low-frequency components. This good time-frequency localization is perhaps the most important advantage that wavelets have over other methods. It might not be immediately clear, however, how this time-frequency localization is helpful in statistics. In statistical function estimation, standard methods (e.g., kernel smoothers or orthogonal series methods) rely upon certain assumptions about the smoothness of the function being estimated. With wavelets, such assumptions are relaxed considerably. Wavelets have a built-in "spatial adaptivity" that allows efficient estimation of functions with discontinuities in derivatives, sharp spikes, and discontinuities in the function itself. Thus, wavelet methods are useful in nonparametric regression for a much broader class of functions. Wavelets are intrinsically connected to the notion of "multiresolution analysis." That is, objects (signals, functions, data) can be examined using widely varying levels of focus. As a simple analogy, consider looking at a house. The observation can be made from a great distance, at which the viewer can discern only the basic shape of the structure: the pitch of the roof, whether or not it has an attached garage, etc. As the observer moves closer to the building, various other features of the house come into focus. One can now count the number of windows and see where the doors are located. Moving closer still, even smaller features come into clear view: the house number, the pattern on the curtains. Continuing, it is possible even to examine the pattern of the wood grain on the front door. The basic framework of all these views is essentially the same using wavelets. This capability of multiresolution analysis is known as the "zoom-in, zoom-out" property. Thus, frequency analysis using the Fourier decomposition becomes "scale analysis" using wavelets. This means that it is possible to examine features of the signal (the function, the house) of any size by adjusting a scaling parameter in the analysis. Wavelets are regarded by many as primarily a new subject in pure mathematics.
Indeed, many papers published on wavelets contain esoteric-looking theorems with complicated proofs. This type of paper might scare away people who are primarily interested in applications, but the vitality of wavelets lies in their applications and the diversity of these applications. The objective of this book is to introduce wavelets with an eye toward data analysis, giving only the mathematics necessary for a good understanding of how wavelets work and a knowledge of how to apply them. Since no wavelet application exists in complete isolation (in the sense that substantial overlap can be found among virtually all applications), we review here some of the ways wavelets have been applied in various fields and consider how specific advantages of wavelets in these fields can be exploited in statistical analysis as well. Certainly, wavelets have an "interdisciplinary" flavor. Much of the early development of the foundations of what is now known as wavelet analysis was led by Yves Meyer, Jean Morlet, and Alex Grossman in France (a mathematician, a geophysicist, and a theoretical physicist, respectively). With their common interest in time-frequency localization and multiresolution analysis, they built a framework and dubbed their creation ondelette (little wave), which became "wavelet" in English. The subject really caught on with the innovations of Ingrid Daubechies and Stephane Mallat, which had direct applicability to signal processing, and a veritable explosion of activity in wavelet theory and application ensued.
What are Wavelets Used For? Here, we describe three general fields of application in which wavelets have had a substantial impact, then we briefly explore the relationships these fields have with statistical analysis.
1. Signal processing Perhaps the most common application of wavelets (and certainly the impetus behind much of their development) is in signal processing. A signal, broadly defined, is a sequence of numerical measurements, typically obtained electronically. This could be weather readings, a radio broadcast, or measurements from a seismograph. In signal processing, the interest lies in analyzing and coding the signal, with the eventual aim of transmitting the encoded signal so that it can be reconstructed with only minimal loss upon receipt. Signals are typically contaminated by random noise, and an important part of signal processing is accounting for this noise. A particular emphasis is on denoising, i.e., extracting the "true" (pure) signal from the noisy version actually observed. This endeavor is precisely the goal in statistical function estimation as well: to "smooth" the noisy data points to obtain an estimate of the underlying function. Wavelets have performed admirably in both of these fields. Signal processors now have new, fast tools at their disposal that are
well-suited for denoising signals, not only those with smooth, well-behaved natures, but also those signals with abrupt jumps, sharp spikes, and other irregularities. These advantages of wavelets translate directly over to statistical data analysis. If signal processing is to be done in "real time," i.e., if the signals are treated as they are observed, it is important that fast algorithms are implemented. It doesn't matter how well a particular de-noising technique works if the algorithm is too complex to work in real time. One of the key advantages that wavelets have in signal processing is the associated fast algorithms-faster, even, than the fast Fourier transform.
2. Image analysis Image analysis is actually a special case of signal processing, one that deals with two-dimensional signals representing digital pictures. Again, typically, random noise is included with the observed image, so the primary goal is again denoising. In image processing, the denoising is done with a specific purpose in mind: to transform a noisy image into a "nice-looking" image. Though there might not be widespread agreement as to how to quantify the "niceness" of a reconstructed image, the general aim is to remove as much of the noise as possible, but not at the expense of fine-scale details. Similarly, in statistics, it is important to those seeking analysis of their data that estimated regression functions have a nice appearance (they should be smooth), but sometimes the most important feature of a data set is a sharp peak or abrupt jump. Wavelets help in maintaining real features while smoothing out spurious ones, so as not to "throw out the baby with the bathwater."
3. Data compression Electronic means of data storage are constantly improving. At the same time, with the continued gathering of extensive satellite and medical image data, for example, amounts of data requiring storage are increasing too, placing a constant strain on current storage facilities. The aim in data compression is to transform an enormous data set, saving only the most important elements of the transformed data, so that it can be reconstructed later with only a minimum of loss. As an example, Wickerhauser (1994) reports that the United States Federal Bureau of Investigation (FBI) has collected 30 million sets of fingerprints. For these to be digitally scanned and stored in an easily accessible form would require an enormous amount of space, as each digital fingerprint requires about 0.6 megabytes of storage. Wavelets have proven extremely useful in solving such problems, often requiring less than 30 kilobytes of storage space for an adequate representation of the original data, an impressive compression ratio of 20: 1. How does this relate to problems in statistics? To quote Manny Parzen, "Statistics is like art is like dynamite: The goal is compression." In multiple
linear regression, for example, it is desired to choose the simplest model that represents the data adequately, to achieve a parsimonious representation. With wavelets, a large data set can often be summarized well with only a relatively small number of wavelet coefficients. To summarize, there are three main answers to the question "Why wavelets?":
1. good time-frequency localization,
2. fast algorithms,
3. simplicity of form.
This chapter has spent some time covering Answer 1 and how it is important in statistics. Answer 2 is perhaps more important in pure signal processing applications, but it is certainly valuable in statistical analysis as well. Some brief comments on Answer 3 are in order here. An entire set of wavelet functions is constructed by means of two simple operations on a single prototype function (referred to earlier as the "mother wavelet"): dilation and translation. The prototype function need never be computed when taking the wavelet transform of data. Just as the Fourier transform describes a function in terms of simple functions (sines and cosines), the wavelet transform describes a function in terms of simple wavelet component functions. The nature of this book is expository. Thus, it consists of an introduction to wavelets and descriptions of various applications in data analysis. For many of the statistical problems treated, more than one methodology is discussed. While some discussion of relative advantages and disadvantages of each competing method is in order, ultimately, the specific application of interest must guide the data analyst to choose the method best suited for his/her situation. In statistics and data analysis, there is certainly room for differences of opinion as to which method is most appropriate for a given application, so the discussion of various methods in this book stops short of making specific recommendations on which method is "best," leaving this entirely to the reader to determine. With the basic introduction of wavelets and their applications in this text, readers will gain the necessary background to continue their study of other applications and more advanced wavelet methods. As increasingly more researchers become interested in wavelet methods, the class of problems to which wavelets have application is rapidly expanding. The References section at the end of this book lists several articles not covered in this book that provide further reading on wavelet methods and applications. There are many good introductory papers on wavelets. Rioul and Vetterli (1991) give a basic introduction focusing on the signal processing uses
of wavelets. Graps (1995) describes wavelets for a general audience, giving some historical background and describing various applications. Jawerth and Sweldens (1994) give a broad overview of practical and mathematical aspects of wavelet analysis. Statistical issues pertaining to the application of wavelets are given in Bock (1992), Bock and Pliego (1992), and Vidakovic and Muller (1994). There have been many books written on the subject of wavelets as well. Some good references are Daubechies (1992), Chui (1992), and Kaiser (1994); these are all at a higher mathematical level than this book. The book by Strang and Nguyen (1996) provides an excellent introduction to wavelets from an engineering/signal processing point of view. Echoing the assertion of Graps (1995), most of the work in developing the mathematical foundations of wavelets has been completed. It remains for us to study their applications in various areas. We now embark upon an exploration of wavelet uses in statistics and data analysis.
CHAPTER
ONE
Wavelets: A Brief Introduction
This chapter gives an introductory treatment of the basic ideas concerning wavelets. The wavelet decomposition of functions is related to the analogous Fourier decomposition, and the wavelet representation is presented first in terms of its simplest paradigm, the Haar basis. This piecewise constant Haar system is used to describe the concepts of the multiresolution analysis, and these ideas are generalized to other types of wavelet bases. This treatment is meant to be merely an introduction to the relevant concepts of wavelet analysis. As such, this chapter provides most of the background for the rest of this book. It is important to stress that this book covers only the essential elements of wavelet analysis. Here, we assume knowledge of only elementary linear algebra and calculus, along with a basic understanding of statistical theory. More advanced topics will be introduced as they are encountered.
1.1 The Discrete Fourier Transform
Transformation of a function into its wavelet components has much in common with transforming a function into its Fourier components. Thus, an introduction to wavelets begins with a discussion of the usual discrete Fourier transform. This discussion is not by any means intended to be a complete treatment of Fourier analysis, but merely an overview of the subject to highlight the concepts that will be important in the development of wavelet analysis. While studying heat conduction near the beginning of the nineteenth century, the French mathematician and physicist Jean-Baptiste Fourier discovered that he could decompose any of a large class of functions into component functions constructed of only standard periodic trigonometric func-
tions. Here, we will only consider functions defined on the interval [-π, π]. (If a particular function of interest g is defined instead on a different finite interval [a, b], it can be rescaled to [-π, π] by the linear change of variable f(x) = g((b - a)x/(2π) + (a + b)/2).) The sine and cosine functions are defined on all of ℝ and have period 2π, so the Fourier decomposition can be thought of either as representing all such periodic functions, or as representing functions defined only on [-π, π] by simply restricting attention to only this interval. Here, we will take the latter approach. The Fourier representation applies to square-integrable functions. Specifically, we say that a function f belongs to the square-integrable function space L²[a, b] if

\int_a^b f^2(x)\,dx < \infty.
Fourier's result states that any function f ∈ L²[-π, π] can be expressed as an infinite sum of dilated cosine and sine functions:

f(x) = \frac{1}{2} a_0 + \sum_{j=1}^{\infty} \left( a_j \cos(jx) + b_j \sin(jx) \right),    (1.1)
for an appropriately computed set of coefficients {a_0, a_1, b_1, ...}. A word of caution is in order about the representation (1.1). The equality is only meant in the L² sense, i.e.,

\int_{-\pi}^{\pi} \left[ f(x) - \left( \frac{1}{2} a_0 + \sum_{j=1}^{\infty} \left( a_j \cos(jx) + b_j \sin(jx) \right) \right) \right]^2 dx = 0.

It is possible that f and its Fourier representation differ on a few points (and
this is, in fact, the case at discontinuity points). Since this book is concerned primarily with analyzing functions in L² space, this point will usually be neglected hereafter in similar representations. It is important to keep in mind, however, that such an expression does not imply pointwise convergence. The summation in (1.1) is up to infinity, but a function can be well-approximated (in the L² sense) by a finite sum with upper summation limit index J:

S_J(x) = \frac{1}{2} a_0 + \sum_{j=1}^{J} \left( a_j \cos(jx) + b_j \sin(jx) \right).    (1.2)
[Figure 1.1: The first three sets of basis functions for the discrete Fourier transform]

This Fourier series representation is extremely useful in that any L² function can be written in terms of very simple building block functions: sines and cosines. This is due to the fact that the set of functions {sin(j·), cos(j·), j = 1, 2, ...}, together with the constant function, forms a basis for the function space L²[-π, π]. We now examine the appearance of some of these basis functions and how they combine to reconstruct an arbitrary L² function. Figure 1.1 plots the first three pairs of Fourier basis elements (not counting the constant function): sine and cosine functions dilated by j for j = 1, 2, 3. Increasing the dilation index j has the effect of increasing the function's frequency (and thus decreasing its period). Next, we examine the finite-sum Fourier representation of a simple example function, as this will lead into the discussion of wavelets in the next section.
[Figure 1.2: An example function and its Fourier sum representations (panels: the example function and its reconstructions with J = 1, 2, and 3)]
The truncated Fourier series representations (1.2) for J = 1, 2, and 3 are displayed in Figure 1.2 for the piecewise linear function

f(x) = \begin{cases} x + \pi, & -\pi \le x < -\pi/2, \\ \pi/2, & -\pi/2 \le x < \pi/2, \\ \pi - x, & \pi/2 \le x \le \pi. \end{cases}    (1.3)

The Fourier coefficients can be computed by taking the inner product of the function f and the corresponding basis functions:

a_j = \frac{1}{\pi} \langle f, \cos(j\cdot) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(jx)\,dx, \quad j = 0, 1, \ldots,    (1.4)

b_j = \frac{1}{\pi} \langle f, \sin(j\cdot) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(jx)\,dx, \quad j = 1, 2, \ldots.    (1.5)
The coefficients a_j and b_j are said to measure the "frequency content" of the function f at the level of resolution j. Examining the set of Fourier coefficients can aid in understanding the nature of the corresponding function. The coefficients in (1.4) and (1.5) are given in terms of the L² inner product of two functions:

\langle f, g \rangle = \int f(x)\, g(x)\,dx,

where the integral is taken over the appropriate subset of ℝ. The L² norm of a function is defined to be

\| f \| = \sqrt{\langle f, f \rangle} = \sqrt{\int f^2(x)\,dx}.
Let us return to our earlier example and look at some of the coefficients, which are given in Table 1.1. First, note that all the b_j's (corresponding to the sine basis functions) are zero. The reason for this is that the example function is an even function, so the inner product of f with each of the odd sine functions is zero. From inspection of Table 1.1, we note that the even-index cosine coefficients are also zero (for j ≥ 4) and that odd-index coefficients are given by a_j = 2/(j²π), with the coefficients a_j becoming small quickly as j gets large. This indicates that most of the frequency content of this example function is concentrated at low frequencies, which can be seen in the reconstructions in Figure 1.2. The only relatively large coefficients are a_0, a_1, a_2, and a_3, so the third reconstruction (J = 3) does a very good job at piecing f back together. By increasing J further, the approximation will only improve (in the L² sense), but the amount of the improvement will be smaller.
Table 1.1: Fourier coefficients for the example function.

  j    a_j        b_j        j    a_j        b_j
  0    3π/4       --         5    2/(25π)    0
  1    2/π        0          6    0          0
  2    -1/π       0          7    2/(49π)    0
  3    2/(9π)     0          8    0          0
  4    0          0          9    2/(81π)    0
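The entries of Table 1.1 are easy to check numerically from (1.3)-(1.5). The short Python sketch below is not from the book (the original figures were produced in S-Plus, with code distributed on the web pages mentioned in the Preface); it is a hypothetical illustration that assumes numpy and scipy are available, and it can also be used to evaluate the partial sums (1.2) plotted in Figure 1.2.

```python
import numpy as np
from scipy.integrate import quad

def f(x):
    """The piecewise linear example function in (1.3)."""
    if x < -np.pi / 2:
        return x + np.pi
    elif x < np.pi / 2:
        return np.pi / 2
    return np.pi - x

def a(j):
    # a_j = (1/pi) * integral of f(x) cos(jx) over [-pi, pi], as in (1.4)
    return quad(lambda x: f(x) * np.cos(j * x), -np.pi, np.pi)[0] / np.pi

def b(j):
    # b_j = (1/pi) * integral of f(x) sin(jx) over [-pi, pi], as in (1.5)
    return quad(lambda x: f(x) * np.sin(j * x), -np.pi, np.pi)[0] / np.pi

def S(x, J):
    """Truncated Fourier sum (1.2) evaluated at a single point x."""
    return a(0) / 2 + sum(a(j) * np.cos(j * x) + b(j) * np.sin(j * x)
                          for j in range(1, J + 1))

for j in range(4):
    print(j, round(a(j), 6), round(b(j), 6))
```

Running the loop reproduces a_0 = 3π/4 ≈ 2.356, a_1 = 2/π ≈ 0.637, a_2 = -1/π ≈ -0.318, and a_3 = 2/(9π) ≈ 0.071, with all b_j equal to zero, in agreement with the table.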
The representation (1.1) holds uniformly for all x ∈ [-π, π] under certain restrictions on f (for instance, if f has one continuous derivative, f(π) = f(-π), and f'(π) = f'(-π)); see, e.g., Dym and McKean (1972). The example function in Figure 1.2 has discontinuities in its derivative, but the Fourier representation will converge at all other points. For any L²[-π, π] function, the truncated representation (1.2) converges in the L² sense:

\| f - S_J \| \to 0

as J → ∞. In practical terms, this means that many functions can be described using only a handful of coefficients. The extension of this to wavelets will become clear in the following section. Though not mentioned previously, the Fourier basis has an important property: it is an orthogonal basis.

Definition 1.1 Two functions f_1, f_2 ∈ L²[a, b] are said to be orthogonal if ⟨f_1, f_2⟩ = 0.
The orthogonality of the Fourier basis can be seen through orthogonality properties inherent in the sine and cosine functions:

\langle \sin(m\cdot), \sin(n\cdot) \rangle = \int_{-\pi}^{\pi} \sin(mx)\sin(nx)\,dx = \begin{cases} 0, & m \ne n, \\ \pi, & m = n > 0, \end{cases}

\langle \cos(m\cdot), \cos(n\cdot) \rangle = \int_{-\pi}^{\pi} \cos(mx)\cos(nx)\,dx = \begin{cases} 0, & m \ne n, \\ \pi, & m = n > 0, \\ 2\pi, & m = n = 0, \end{cases}

\langle \sin(m\cdot), \cos(n\cdot) \rangle = \int_{-\pi}^{\pi} \sin(mx)\cos(nx)\,dx = 0 \quad \text{for all } m, n \ge 0.
The three expressions can be verified easily by applying the standard trigonometric identities for sin α sin β, cos α cos β, and sin α cos β. A minor modification of the sine and cosine functions will yield an orthonormal basis with another important property.

Definition 1.2 A sequence of functions {f_j} is said to be orthonormal if the f_j's are pairwise orthogonal and ‖f_j‖ = 1 for all j.
The orthogonality requirement is already satisfied with the sine and cosine functions. Defining g_j(x) = π^{-1/2} sin(jx) for j = 1, 2, ... and h_j(x) = π^{-1/2} cos(jx) for j = 1, 2, ..., with the constant function h_0(x) = 1/√(2π) on x ∈ [-π, π], makes the set of functions {h_0, g_1, h_1, ...} orthonormal as well. Normalizing the basis in this manner allows us to write the Fourier representation (1.1), along with the expressions for computing the coefficients (1.4) and (1.5), as

f(x) = \langle f, h_0 \rangle h_0(x) + \sum_{j=1}^{\infty} \left( \langle f, g_j \rangle g_j(x) + \langle f, h_j \rangle h_j(x) \right).
Definition 1.3 A sequence of functions {f_j} is said to be a complete orthonormal system (CONS) if the f_j's are pairwise orthogonal, ‖f_j‖ = 1 for each j, and the only function orthogonal to each f_j is the zero function.

Thus defined, the set {h_0, g_j, h_j : j = 1, 2, ...} is a complete orthonormal system for L²[-π, π]. The Fourier basis is not the only CONS for intervals. Others include Legendre polynomials and wavelets, the latter to be studied in detail.
1.2 The Haar System
The extension from Fourier analysis to wavelet analysis will be made via the Haar basis. The Haar function is a bona fide wavelet, though it is not used much in current practice. The primary reason for this will become apparent. Nevertheless, the Haar basis is an excellent place to begin a discussion of wavelets. This section will begin with a definition of the Haar wavelet and go on to derive the Haar scaling function. Following this development, we will begin with the Haar scaling function and then rederive the Haar wavelet. Of course, terms like "wavelet" and "scaling function" have not yet been defined. Their meaning will become clear as we progress through a discussion of issues associated with wavelets. The Haar wavelet system provides a paradigm for all wavelets, so it is important to keep in mind that the simple developments in this chapter have much broader application: all the principles discussed in this chapter pertaining to the Haar wavelet hold generally for all orthogonal wavelets. The Haar wavelet is nothing new, having been developed in 1910 (Haar, 1910), long before anyone began speaking of "wavelets." The Haar function is given by
\psi(x) = \begin{cases} 1, & 0 \le x < 1/2, \\ -1, & 1/2 \le x < 1, \\ 0, & \text{otherwise}. \end{cases}
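Before moving on, it may help to see the Haar wavelet in computational form. The following Python sketch is my own illustration, not part of the original text; it evaluates ψ and its dyadic dilations and translations ψ_{j,k}(x) = 2^{j/2} ψ(2^j x - k), which are the building blocks used throughout this chapter.

```python
import numpy as np

def haar_psi(x):
    """Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where((0 <= x) & (x < 0.5), 1.0,
                    np.where((0.5 <= x) & (x < 1.0), -1.0, 0.0))

def haar_psi_jk(x, j, k):
    """Dilated and translated Haar wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2.0 ** (j / 2) * haar_psi(2.0 ** j * np.asarray(x) - k)

# psi_{1,0} and psi_{1,1} are supported on [0, 1/2) and [1/2, 1) respectively,
# and the 2^{j/2} factor keeps the L2 norm equal to one.
xs = np.linspace(0, 1, 1024, endpoint=False)
print(np.trapz(haar_psi_jk(xs, 1, 0) ** 2, xs))   # approximately 1
```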
[Figure 3.4: Wavelet estimates of a simulated data set for varying values of J]
It is not apparent that the expression (3.13) depends explicitly on the choice of a bandwidth in the same way that other kernel estimators do. The user still has control over the resulting smoothness of the estimator, but, as was seen in the kernel representation of the Fourier series, this control comes in the form of choosing the level J. As in other orthogonal series estimators, increasing J amounts to decreasing the amount of smoothing. In the same way, the kernel E_J(u, v) becomes narrower for larger J, affecting the estimate the same way as using a smaller bandwidth in a "standard" kernel estimator. The wavelet estimator (3.13) has distinct advantages over classical nonparametric regression techniques. One of these is that the asymptotic rates of convergence hold under weaker conditions on the underlying function than must be assumed in obtaining similar results for other types of smoothing. Figure 3.4 displays a simulated data set and the resulting wavelet estimator for four choices of J. The mean function is the same as that used in the examples in Figure 2.6: a linear trend upward, a linear trend downward, and a flat portion. Like standard kernel and orthogonal series methods, this wavelet estimator tends to undershoot the peak for small values of J (corresponding to large bandwidth). The development of this estimator made it clear that it is at the same time both an orthogonal series and a kernel estimator. Viewed as a series estimator,
this method captures the essence of the multiresolution analysis in a function estimation framework. The estimator f̂_J represents the projection of the function f onto the approximating space V_J as defined in Chapter 1. Analogous to nonparametric regression with orthogonal series, increasing the smoothing parameter J allows additional detail in the estimated reconstruction (at the expense of greater variability in the resulting estimator). Though the simple situation considered in this chapter requires that the design points be fixed and equally spaced, analogous estimators can be constructed under more general conditions. Estimators similar to (3.13) with non-equally spaced x_i's and random design points are considered in the paper by Antoniadis, Gregoire, and McKeague (1994).
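The estimator (3.13) itself is stated in the text in kernel/orthogonal-series form. As a minimal sketch of the "projection onto V_J" interpretation in the simplest case (a hypothetical Haar-based illustration, not the book's code), note that with n = 2^K equally spaced design points the level-J Haar projection is nothing more than a block average of the responses:

```python
import numpy as np

def haar_projection_estimate(y, J):
    """Project a data vector of length n = 2^K onto the Haar space V_J.

    For the Haar basis this projection is a block average: each block of
    2^(K-J) consecutive observations is replaced by its mean, giving the
    fitted values of the level-J Haar smoother at those design points.
    """
    n = len(y)
    K = int(np.log2(n))
    block = 2 ** (K - J)
    means = np.asarray(y, dtype=float).reshape(2 ** J, block).mean(axis=1)
    return np.repeat(means, block)

rng = np.random.default_rng(0)
x = np.arange(64) / 64
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(64)
fit = haar_projection_estimate(y, J=3)   # averages over blocks of 8 points
```

Choosing a small J corresponds to heavy smoothing (wide blocks), while J = K reproduces the data exactly, mirroring the role of J in Figure 3.4.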
CHAPTER
FOUR
Wavelet Features and Examples
Chapter 1 presented the bare necessities for understanding the basic principles of wavelet analysis, presenting the concepts through the simplest example, the Haar system. With a good understanding of these principles, it is possible to skip forward to later chapters dealing with statistical analysis, but before treating more advanced statistical applications of wavelet analysis, a more thorough and general treatment of wavelets is useful. This chapter will give more insight into some of the advantages inherent in wavelet analysis, describing basic algorithms, and time-frequency localization concepts. It finishes up with a more complete development of the wavelet examples mentioned in Section 1.3. It should be emphasized here that while perhaps none of the topics in this chapter are essential to applying wavelets in data analysis, a good working knowledge of the relevant concepts will greatly aid in appreciation and understanding. Section 4.4 describes a few example wavelet bases, including those in the more general class of biorthogonal wavelets. The first three sections of this chapter (and most of the rest of this book) are concerned only with orthogonal wavelets.
4.1 Wavelet Decomposition and Reconstruction
In Section 1.2, it was seen that both the Haar scaling function φ(x) and the Haar wavelet ψ(x) can be written in terms of Haar scaling functions at level 1: φ_{1,0}(x) and φ_{1,1}. (See equations (1.12) and (1.13).) This provided a simple example of the two-scale relationships of wavelets. Later in the same section, it was shown that, at least in the Haar case, a decomposition algorithm existed (equations (1.16) and (1.18)) that allows us to express the wavelet and scaling function coefficients at any level of resolution in terms of the scaling function coefficients at the next higher level. At that time, it was hinted that the two concepts (two-scale relationships and decomposition algorithms) were
related, and that they existed in some form for all sets of wavelets. This section, tracing some of the work in Mallat (1989a), will develop these ideas further and with more generality.
Two-Scale Relationships

We begin with an MRA (see properties (1)-(5) in Section 1.2) consisting of spaces {V_j, j ∈ ℤ}, with each V_j having orthonormal basis {φ_{j,k}, k ∈ ℤ} where, as before, φ_{j,k}(x) = 2^{j/2} φ(2^j x - k). From this, we will present an expression for a wavelet function ψ, define W_j spaces based on ψ, and show that this leads to a CONS for L²(ℝ). Note that φ ∈ V_0 and therefore also φ ∈ V_1 since V_0 ⊂ V_1. Since {φ_{1,k}, k ∈ ℤ} is an orthonormal basis for V_1, there exists a sequence {h_k} such that

\phi(x) = \sum_{k \in \mathbb{Z}} h_k\, \phi_{1,k}(x)    (4.1)

and the sequence elements may be written

h_k = \langle \phi, \phi_{1,k} \rangle.    (4.2)

This sequence {h_k} is a square-summable sequence: we say {h_k} ∈ ℓ²(ℤ) if Σ_{k∈ℤ} h_k² < ∞. The two-scale relationship (4.1), relating functions with differing scaling factors, is also known as the dilation equation or the refinement equation. For the Haar basis, it was seen in (1.12) that this sequence is

h_k = \begin{cases} 1/\sqrt{2}, & k = 0, 1, \\ 0, & \text{otherwise}. \end{cases}    (4.3)
In this multiresolution context, this same sequence of h_k's that relates scaling functions at two levels can be used to define the mother wavelet:

\psi(x) = \sum_{k \in \mathbb{Z}} (-1)^k h_{-k+1}\, \phi_{1,k}(x).    (4.4)

A special case of this construction was seen in (1.13) for the Haar wavelet. The reason for such a construction is to ensure that the scaling function and wavelet will be orthogonal:
A special case of this construction was seen in (1.13) for the Haar wavelet. The reason for such a construction is to ensure that the scaling function and wavelet will be orthogonal:
Wavelet Features and Examples
('ljJ, ¢)
=
J
J(~(
'ljJ(X )¢(X) dx
61
-l)kh_k+l¢l,k(X)) ¢(X) dx
2:(-l)kh_k+l k
J
¢1,k(X)¢(X) dx
~(-l)kh_k+lhk k
0.
The last step follows since the summand for k is the opposite of the summand for 1 - k, so each term is negated, convergence holding since {hk} E £2(72:). It can be seen similarly that each integer translation of the mother wavelet 'ljJ is also orthogonal to ¢:
J J
'ljJ(x - k)¢(x) dx 2:(-l)fh_f+l¢l,f(X - k)¢(x) dx fEZ
2:(-1)fh_ f + 1 fEZ
J
¢1,2k+f(X)¢(x)dx
2:( -1)fh_f+l h2k+f fEZ 0,
the last step following because the summands for £ and for 1 - £ - 2k cancel, and convergence holds because of the square summability of the sequence {h k }. A straightforward extension of this argument will show that 'ljJo,k .1 ¢O,f for all k, £ E 72: and, further, that'ljJj,k .1 ¢j,f for all j, k, £ E 72:. Thus, if we define the space W o to be the span of the set of wavelets {'ljJo,k' k E 72:}, then it is clear that Vo .1 W o, and it follows readily that Yj .1 W j for all j E 72:. Now, to show that V() .1 WI, we must first show that, for each k, £ E 72:, ¢O,k .1 'ljJl,f. This is straightforward given what we know already: The result follows by using (4.1) to express ¢O,k in terms of ¢l,m'S and applying earlier results. This argument can be extended recursively to show that Vi .1 W j + m for all m > O,j E 72:. From this, it can be seen that
'ljJj,k .1 'ljJj' ,k' for all j, j', k, k' E 72:, j =I- j', k =I- k' (write either wavelet according to (4.4), and apply known results), so that the wavelet spaces {Wj, j E 72:} are mutually orthogonal as well. Thus, from the MRA structure of the Yj 's and from the results derived in this section, we have
62
WAVELET DECOMPOSITION AND RECONSTRUCTION
shown that the set of all wavelets {'ljJj,k, j, k E 7Z} is a complete orthonormal system for L 2 (IR).
The Decomposition Algorithm In Section 1.2, the two-scale relationships (1.12) and (1.13) were converted to decomposition algorithms (1.16) and (1.18). This is now accomplished in more generality. As before, let {Cj,k, j,k E 7Z} and {dj,k, j,k E 7Z} represent the scaling function and wavelet coefficients of a function f respectively. These can be computed as Cj,k
=
J
(4.5)
dj,k
=
J
(4.6)
f(X)¢j,k(X) dx
and f(x)'ljJj,k(x) dx
as before. Section 6.2 will discuss computation of scaling function and wavelet coefficients that doesn't involve integration, but regarding these coefficients conceptually as being computed according to (4.5) and (4.6) will help in their interpretation. Given the two-scale relationship (4.1), an expression for any scaling function in terms of higher-level scaling functions can be derived: since φ_{j,k}(x) = 2^{j/2} φ(2^j x - k),

\phi_{j,k}(x) = \sum_{\ell \in \mathbb{Z}} h_{\ell}\, 2^{j/2}\, \phi_{1,\ell}(2^j x - k)
= \sum_{\ell \in \mathbb{Z}} h_{\ell}\, 2^{(j+1)/2}\, \phi(2^{j+1} x - 2k - \ell)
= \sum_{\ell \in \mathbb{Z}} h_{\ell}\, \phi_{j+1,\ell+2k}(x)
= \sum_{\ell \in \mathbb{Z}} h_{\ell-2k}\, \phi_{j+1,\ell}(x).    (4.7)
Substituting this result into the definitional formula (4.5) for c_{j,k}, then interchanging the sum and the integral, gives the general decomposition algorithm for scaling function coefficients:

c_{j,k} = \sum_{\ell} h_{\ell-2k}\, c_{j+1,\ell}.    (4.8)
Similarly, a two-scale relationship relating any ψ_{j,k} to the φ_{j+1,ℓ}'s can be
[Figure 4.1: Schematic representation of the decomposition algorithm: c_{J-M+1,·} ← ··· ← c_{J-2,·} ← c_{J-1,·} ← c_{J,·}]
derived using (4.4), which leads to the wavelet coefficient portion of the decomposition algorithm:

d_{j,k} = \sum_{\ell \in \mathbb{Z}} (-1)^{\ell} h_{-\ell+2k+1}\, c_{j+1,\ell}.    (4.9)
Thus, given scaling coefficients at any level J, all lower-level scaling function coefficients for j < J can be computed recursively using (4.8), and all lower-level wavelet coefficients (j < J) can be computed from the scaling function coefficients using (4.9). Defining c_{j,·} and d_{j,·} to represent the sets of scaling function and wavelet coefficients at level j respectively, this decomposition algorithm is represented schematically in Figure 4.1. The arrows represent the decomposition computations: c_{J-2,·} and d_{J-2,·} can be computed using only the coefficients c_{J-1,·}, for instance. The algorithms given in (4.8) and (4.9) share an interesting feature. Note that in either equation, if the dilation index k is increased by one, the indices of the {h_ℓ} sequence are all offset by two. Thus, in computing either decomposition, there is an inherent down-sampling of coefficients. Roughly speaking, this means that if there are only finitely many non-zero elements in the {h_ℓ} sequence, then applying the decomposition algorithm to a set of nonzero scaling function coefficients at level j + 1 will yield only half as many nonzero scaling function coefficients at level j. Similarly, there will only be half as many non-zero wavelet coefficients at level j. Computing the decomposition algorithm recursively yields fewer coefficients at each level. This structure led Mallat (1989a) to refer to the decomposition of coefficients as the "pyramid algorithm"; Daubechies (1992) terms it the "cascade algorithm." This downsampling is the key to the fast wavelet algorithms, which will be discussed thoroughly in Chapter 6.
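The down-sampling is easiest to see by coding one step of (4.8) and (4.9) directly. The sketch below is a hypothetical Python illustration rather than production wavelet software; it assumes a finite filter h = (h_0, ..., h_{L-1}) and simply drops filter taps that fall outside the observed sequence (boundary handling is taken up in Chapter 6), which loses nothing for the Haar filter.

```python
import numpy as np

def decompose_step(c_fine, h):
    """One level of the pyramid algorithm.

    Implements  c_{j,k} = sum_l h_{l-2k} c_{j+1,l}            (4.8)
    and         d_{j,k} = sum_l (-1)^l h_{-l+2k+1} c_{j+1,l}  (4.9),
    keeping only filter taps that land inside the finite sequence.
    The outputs are half as long as the input: the inherent down-sampling.
    """
    n, L = len(c_fine), len(h)
    c_coarse = np.zeros(n // 2)
    d_coarse = np.zeros(n // 2)
    for k in range(n // 2):
        for l in range(n):
            if 0 <= l - 2 * k < L:
                c_coarse[k] += h[l - 2 * k] * c_fine[l]
            if 0 <= -l + 2 * k + 1 < L:
                d_coarse[k] += (-1) ** l * h[-l + 2 * k + 1] * c_fine[l]
    return c_coarse, d_coarse

# Haar filter (4.3): the step reduces to scaled pairwise sums and differences.
h = np.array([1.0, 1.0]) / np.sqrt(2)
c, d = decompose_step(np.array([4.0, 2.0, 5.0, 7.0]), h)
# c is [6/sqrt(2), 12/sqrt(2)]; d is [2/sqrt(2), -2/sqrt(2)]
```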
The Reconstruction Algorithm

Though perhaps not as apparent, it is also possible to move back up the ladder, starting with low-level coefficients and computing higher-level coefficients. This is known as the reconstruction algorithm, which is derived here for general orthogonal wavelet bases.
Again, start with an MRA with {φ_{j,k}, k ∈ ℤ} and {ψ_{j,k}, k ∈ ℤ} forming orthonormal bases for V_j and W_j, respectively, and the W_j spaces representing the orthogonal detail spaces as in Section 1.2. Since φ_{1,0} ∈ V_1 and since V_1 = V_0 ⊕ W_0, we know that φ_{1,0} can be written as a linear combination of the φ_{0,k}'s (basis for V_0) and the ψ_{0,k}'s (basis for W_0). The same can be said for φ_{1,1}. As before, the coefficients of the linear combinations will be computed by taking inner products. To allow a later unification of treatment, we adopt an unusual numbering scheme, and we define, for k ∈ ℤ,

a_{2k} = \langle \phi_{1,0}, \phi_{0,k} \rangle, \quad b_{2k} = \langle \phi_{1,0}, \psi_{0,k} \rangle, \quad a_{2k-1} = \langle \phi_{1,1}, \phi_{0,k} \rangle, \quad b_{2k-1} = \langle \phi_{1,1}, \psi_{0,k} \rangle.

Thus,

\phi_{1,0}(x) = \sum_{k \in \mathbb{Z}} \left( a_{2k}\, \phi_{0,k}(x) + b_{2k}\, \psi_{0,k}(x) \right)    (4.10)

and

\phi_{1,1}(x) = \sum_{k \in \mathbb{Z}} \left( a_{2k-1}\, \phi_{0,k}(x) + b_{2k-1}\, \psi_{0,k}(x) \right).    (4.11)
Then using (4.10) and (4.11), we can write a similar expression for any φ_{1,k}. We derive first the formula for even k:

\phi_{1,k}(x) = \phi_{1,0}\!\left(x - \tfrac{k}{2}\right)
= \sum_{\ell \in \mathbb{Z}} \left( a_{2\ell}\, \phi_{0,\ell}\!\left(x - \tfrac{k}{2}\right) + b_{2\ell}\, \psi_{0,\ell}\!\left(x - \tfrac{k}{2}\right) \right)
= \sum_{\ell \in \mathbb{Z}} \left( a_{2\ell}\, \phi_{0,\frac{k}{2}+\ell}(x) + b_{2\ell}\, \psi_{0,\frac{k}{2}+\ell}(x) \right)
= \sum_{\ell \in \mathbb{Z}} \left( a_{2\ell-k}\, \phi_{0,\ell}(x) + b_{2\ell-k}\, \psi_{0,\ell}(x) \right).    (4.12)

Working through a formula for odd k gives precisely the same formula, so (4.12) holds for all k ∈ ℤ. For odd (even) k, only the odd-indexed (even-indexed) elements of the sequences {a_k} and {b_k} are accessed.
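The general reconstruction formulas are assembled from these relationships in what follows. Purely as an illustration of "moving back up the ladder," here is a hypothetical Python sketch of one reconstruction step in the Haar case (my own example, not the book's code), which exactly inverts the Haar decomposition step by recovering the level-(j+1) scaling coefficients from the level-j scaling and wavelet coefficients.

```python
import numpy as np

def haar_reconstruct_step(c_coarse, d_coarse):
    """Invert one Haar decomposition step.

    For the Haar filter, c_{j+1,2k}   = (c_{j,k} + d_{j,k}) / sqrt(2)
    and                  c_{j+1,2k+1} = (c_{j,k} - d_{j,k}) / sqrt(2),
    which undoes the pairwise sum/difference of the decomposition step.
    """
    c_fine = np.empty(2 * len(c_coarse))
    c_fine[0::2] = (np.asarray(c_coarse) + np.asarray(d_coarse)) / np.sqrt(2)
    c_fine[1::2] = (np.asarray(c_coarse) - np.asarray(d_coarse)) / np.sqrt(2)
    return c_fine

# Round trip: decomposing and then reconstructing returns the original
# fine-level coefficients (up to floating point error).
```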
[Figure 4.2: Schematic representation of the reconstruction algorithm: c_{J,·} → c_{J+1,·} → ··· → c_{J+M-1,·} → c_{J+M,·}]
Following similar arguments, an expression relating each scaling function ... and the second is generated by a "dual" scaling function φ̃. As in the orthogonal case, each of these sequences of approximation spaces has a sequence of successive detail spaces: (W_j)_{j∈ℤ} and (W̃_j)_{j∈ℤ},
respectively. Furthermore, ... defined in (1.10). Dilates and translates of these two functions also fit nicely into the unit interval: combined, the supports of ψ_{1,0} and ψ_{1,1} also make up the interval [0, 1); four wavelets at level 2 are required, and so on. Thus, the set of wavelets that forms a CONS for the interval [0, 1) is

{ψ_{j,k} : k = 0, ..., 2^j - 1, j = 0, 1, ...}.
Taking the discrete Haar transform of a set of n = 2^J (where J is a positive integer) data values gives wavelet coefficients at levels 0, 1, ..., J - 1, with 2^j wavelet coefficients at level j. This gives a total of 1 + 2 + 4 + ··· + 2^{J-1} = n - 1 wavelet coefficients. Including the lowest level scaling function coefficient c_{0,0} along with the wavelet coefficients gives an orthogonal transformation of the original data. Note that the coefficient c_{0,0} represents a "final smoothing" of the data; in the Haar case, this is just the sample mean. Adapting other wavelets to the interval is a little more difficult. Certainly, if a function f has domain [0, 1], techniques discussed previously could be applied directly if the domain of f were extended to include all of ℝ, with the expanded version of f defined to be zero outside of [0, 1]. Since this can introduce artificial discontinuities at the endpoints of the interval, this approach is not entirely satisfactory. Various other techniques have been proposed, and these will each be explored briefly. Some good discussions of this adaptation are given in Unser (1996), Jawerth and Sweldens (1994), and Chapter 10 of Daubechies (1992), among other sources. The discussion here begins with two very straightforward (and quite useful) solutions: periodic and symmetric boundary handling.
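The coefficient count 1 + 2 + ··· + 2^{J-1} = n - 1 can be seen concretely by sketching the full discrete Haar transform in a few lines of Python (again a hypothetical illustration, not the book's S-Plus code): each pass produces the wavelet coefficients at one level and halves the number of scaling coefficients.

```python
import numpy as np

def discrete_haar_transform(y):
    """Discrete Haar transform of a data vector of length n = 2^J.

    Returns (c00, details), where details[j] holds the 2^j wavelet
    coefficients at level j, so the total number of wavelet coefficients
    is 1 + 2 + ... + 2^(J-1) = n - 1.  With this orthonormal scaling,
    c00 is proportional to the sample mean (it equals sqrt(n) times the
    mean), the "final smoothing" of the data.
    """
    c = np.asarray(y, dtype=float)
    J = int(np.log2(len(c)))
    details = {}
    for j in range(J - 1, -1, -1):
        details[j] = (c[0::2] - c[1::2]) / np.sqrt(2)  # 2^j wavelet coefficients
        c = (c[0::2] + c[1::2]) / np.sqrt(2)           # coarser scaling coefficients
    return c[0], details

y = np.arange(8.0)                          # n = 2^3
c00, details = discrete_haar_transform(y)
assert sum(len(d) for d in details.values()) == len(y) - 1   # 7 = n - 1
```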
Periodic Boundary Handling

A very useful approach to adapting wavelet analysis to the unit interval is accomplished through periodic boundary handling. The function f defined on [0, 1] could be expanded to live on the real line by regarding it as a periodic function with period one: f(x) = f(x - [x]) for x ∈ ℝ. As might be expected, this approach introduces a periodicity in the wavelet coefficients as well. Computing a coefficient d_{j,k} with j > 0 is accomplished via

d_{j,k} = \int_{-\infty}^{\infty} f(x)\, \psi_{j,k}(x)\,dx = \int_{-\infty}^{\infty} f(x)\, 2^{j/2}\, \psi(2^j x - k)\,dx.

Applying a change of variables and the fact that f(x) = f(x - 1) for all x ∈ ℝ gives further that

d_{j,k} = \int_{-\infty}^{\infty} f(x - 1)\, 2^{j/2}\, \psi(2^j x - k - 2^j)\,dx = \int_{-\infty}^{\infty} f(x)\, \psi_{j,k+2^j}(x)\,dx = d_{j,k+2^j}.
Thus, there are only 2^j unique wavelet coefficients at any level j > 0, so the set of wavelet coefficients for j ≥ 0 can be indexed the same as the Haar
basis on the unit interval. Restricting j to be nonnegative provides only for the detail at higher resolution than the approximation space V_0. By forcing the periodicity of f and considering the resulting multiresolution analysis on [0, 1], we are actually adapting the scaling functions and wavelets to live on the unit interval as well. In essence, these functions are "wrapped around" as well, so that the portions of the functions on intervals [j, j + 1) are all combined together. Thus, the appearance of these functions will be altered considerably, but the ideas of the multiresolution analysis are the same. This style of boundary handling is somewhat problematic in that, unless the function is truly periodic, it introduces artificial singularities by pasting the function together at the interval's endpoints. This can result in several large coefficients for wavelets centered near the boundaries that have no real interpretation in terms of the function f. Applying this transform to computing the discrete wavelet transform from data Y_1, ..., Y_n involves "wrapping around" the original sequence (for i = n + 1, ..., 2n, define Y_i = Y_{i-n}, with a similar treatment for negative i) and applying the usual filters to the expanded data. The decomposition relation can be applied in the same way, keeping in mind the imposed periodicity of the coefficients. This method has been used a great deal in statistical applications. This is true partly because the implementation is very straightforward, and partly because the resulting empirical wavelet coefficients are independent with identical variances whenever an orthogonal wavelet family is used and the noise is Gaussian (see Chapter 7 for more details).
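The "wrapping around" of the data amounts to taking filter indices modulo n. The following hypothetical Python sketch (consistent with the decomposition formulas (4.8) and (4.9), but not taken from the book's software) shows one periodized decomposition step for an arbitrary finite filter h; with the Haar filter it reproduces the pairwise sums and differences with no boundary effects at all.

```python
import numpy as np

def periodic_decompose_step(c, h):
    """One decomposition step with periodic (wrap-around) boundary handling.

    The input c (even length n) is treated as one period of an infinite
    periodic sequence, so every index is taken modulo n and no
    coefficients are lost at the boundaries.
    """
    n = len(c)
    c_out = np.zeros(n // 2)
    d_out = np.zeros(n // 2)
    for k in range(n // 2):
        # scaling part: c_{j,k} = sum_m h_m c_{j+1, 2k+m}   (from (4.8))
        for m in range(len(h)):
            c_out[k] += h[m] * c[(2 * k + m) % n]
        # wavelet part: d_{j,k} = sum_m (-1)^m h_{1-m} c_{j+1, 2k+m}   (from (4.9))
        for m in range(2 - len(h), 2):
            d_out[k] += (-1) ** m * h[1 - m] * c[(2 * k + m) % n]
    return c_out, d_out

h = np.array([1.0, 1.0]) / np.sqrt(2)   # Haar filter; longer filters wrap around
```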
Symmetric and Antisymmetric Boundary Handling
Another technique, used extensively in kernel function estimation, involves reflecting the function of interest about the boundaries. A symmetric reflection would require extending the domain of the function beyond [0,1] and defining f(x) = f(-x) for x in [-1,0) and f(x) = f(2-x) for x in (1,2]. This has an advantage over periodic boundary handling in that it preserves the continuity of the function, though discontinuities in the derivative of f may be introduced. Another way in which the function can be reflected about the endpoints is antisymmetrically: f(x) = 2f(0) - f(-x) for x in [-1,0) and f(x) = 2f(1) - f(2-x) for x in (1,2]. This preserves continuity in both the function and its first derivative (and all odd derivatives). As with periodic boundary handling, these methods impose their own alterations of the usual multiresolution analysis. Applying the reflected boundary handling method to the discrete wavelet transform of data would require reflecting the data in the appropriate way, and then applying the decomposition filter directly. This method will result in a few more than 2^j wavelet coefficients at level j, since a few extra
coefficients (the number depending on the length of the filter) are kept at the ends of the usual set of 2^j coefficients. Implementing this procedure is a little more involved than implementing the periodic boundary handling scheme. Having more than 2^j wavelet coefficients at each level gives more wavelet coefficients than data points, so some dependencies are introduced. Unser (1996) discusses some of the practical problems that arise when implementing these boundary conditions. Care must be given in coding the reconstruction algorithm to ensure that the original data can be recovered exactly. These three methods are quite useful and produce the desired effect: coercing the multiresolution analysis built for L^2(R) to live on the unit interval. It is possible, however, to construct wavelets specifically for L^2[0,1]. Several approaches to this are briefly presented here.
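As a rough sketch (not from the book; the function names and the amount of padding are illustrative), the symmetric and antisymmetric reflections described above might be applied to a data vector before filtering as follows:

import numpy as np

def reflect_symmetric(y, pad):
    # Mirror the data about both endpoints, the discrete analogue of
    # f(-x) = f(x) and f(2 - x) = f(x); pad must be smaller than len(y).
    y = np.asarray(y, dtype=float)
    left = y[1:pad + 1][::-1]
    right = y[-pad - 1:-1][::-1]
    return np.concatenate([left, y, right])

def reflect_antisymmetric(y, pad):
    # Antisymmetric reflection: 2 f(0) - f(-x) on the left, 2 f(1) - f(2 - x) on the right.
    y = np.asarray(y, dtype=float)
    left = 2.0 * y[0] - y[1:pad + 1][::-1]
    right = 2.0 * y[-1] - y[-pad - 1:-1][::-1]
    return np.concatenate([left, y, right])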
Meyer Boundary Wavelets
As described by Meyer (1992), usual wavelet functions can be adapted so as to have their natural domain on an interval, thereby giving an orthonormal basis for L^2[0,1]. This development is outlined here without going into much detail. As an example, let us consider applying this adaptation to the Daubechies orthogonal wavelets with compact support. Recall that as the dilation index j gets larger, the supports of the corresponding wavelets and scaling functions diminish, so that for large enough j, each function phi_{j,k} will have support including at most one of the endpoints 0 and 1. Thus for sufficiently large j_0, it is possible to label each of the scaling functions phi_{j_0,k} as being a "0-intersecting," an "interior" (meaning that the support of the function is entirely contained within [0,1]), or a "1-intersecting" scaling function. If we consider only the restriction of these scaling functions to [0,1], they will form a basis for a subspace V_{j_0} of L^2[0,1]. If we consider making a similar restriction for j > j_0, we can similarly create another approximation space V_j. These spaces will naturally inherit the multiresolution structure generated on the real line by the original scaling functions, so that an analogous form of Definition 1.4 holds for L^2[0,1], with the added restriction that j > j_0. Thus a basis for the spaces V_j can be formed simply by restricting the support of the original scaling functions, but some work remains to make this basis orthogonal. Note that for any j > j_0, the interior scaling functions are still mutually orthogonal, and, in fact, each interior scaling function is orthogonal to each 0-intersecting and each 1-intersecting scaling function. To see this latter fact, note that for any interior scaling function phi_{j,k} and any 0-intersecting or 1-intersecting scaling function phi_{j,k'}, the L^2[0,1] inner product between the two is

<phi_{j,k}, phi_{j,k'}> = Int_0^1 phi_{j,k}(x) phi_{j,k'}(x) dx = Int_{-inf}^{inf} phi_{j,k}(x) phi_{j,k'}(x) dx = 0,
since the original version of the interior wavelet vanishes outside of [0,1]. By a similar argument, it can be shown that for j > j_0, each 0-intersecting scaling function is orthogonal to each 1-intersecting scaling function. Thus, it remains only to orthogonalize separately the set of 0-intersecting and the set of 1-intersecting scaling functions, and this can be accomplished using the Gram-Schmidt procedure for functions. Detail spaces and wavelets are added to the construction by noting that a set of similarly restricted versions of the original (defined on R) wavelets will span a space W_j of L^2[0,1], which satisfies V_{j+1} = V_j (+) W_j.
There are more wavelet functions that have support overlapping [0,1] than the dimension of W_j, however, so these restricted wavelets do not form a basis. Further, these wavelets are not all mutually orthogonal, and they are not all orthogonal to the basis functions of V_j. It is necessary then to work out the linear dependence among the restricted wavelets to reduce the number of functions to obtain a basis, then to apply the Gram-Schmidt procedure again to orthogonalize the set of 0-intersecting and 1-intersecting wavelets and to force them to be orthogonal to each function in V_j. This useful construction suffers from a problem of numerical instability inherent in the orthogonalization procedure, which makes it difficult to implement. Other problems with this procedure are pointed out by Cohen, Daubechies, and Vial (1993), as they introduce an alternative solution to the construction of orthogonal wavelet bases on the interval.
Orthogonal Wavelets on the Interval
Another approach to adapting wavelets to live on the interval [0,1] was proposed independently by Anderson, et al. (1993) and Cohen, Daubechies, and Vial (1993). A short description of the method was given in Cohen, et al. (1993). This section will describe only the main idea of this construction, with the interested reader referred to any of these articles for more details on the construction. Using the Daubechies family of compactly supported wavelets with filter length 2N (see Section 4.4), this scheme uses the fact that polynomials of degree up to N - 1 can be expressed only in terms of the appropriately translated scaling functions. Certainly, then, the restriction of all polynomials of degree up to N - 1 will be in V_j of L^2[0,1] for any j. As in Meyer's construction, we begin by choosing some integer j_0 large enough so that no scaling functions phi_{j,k}
have support overlapping both endpoints and construct nested approximation spaces V_{j_0} in V_{j_0+1} in ... in L^2[0,1]. The "interior" scaling functions will be kept intact, but the boundary scaling functions are altered in a manner quite different from that used in the previous subsection. For any j > j_0, an edge function can be created for any polynomial of degree l by writing the monomial x^l in terms of its scaling function representation:

x^l = Sum_k <x^l, phi_{j,k}> phi_{j,k}(x).
A "left-edge" function ¢>l(x) is given by the above expression restricted to [0, 1] except that the sum is only over the k's that correspond to O-intersecting scaling functions. The set of left-edge functions for < I! < N - 1 is already mutually orthogonal. (The same is true of the set of "right-edge" functions, created in a similar manner.) Each of these sets is also orthogonal to the set of interior scaling functions, so it is only necessary to orthogonalize the functions in each edge set using Gram-Schmidt. Thus constructed, the union of these sets forms an orthonormal basis for Yj, and the construction is numerically stable. The W j detail space is still defined as the difference in successive approximation spaces, and a basis for it is made up of the corresponding "interior wavelets," along with a few edge wavelets that are constructed similarly to the edge scaling functions. A nice feature of this construction (not shared by Meyer's method) is that there are 2j basis functions for each Yj and also 2j for each Wj, just as in the simple periodic MRA of described earlier in this section. Implementing the discrete wavelet transform using either the method of this section or that of Meyer is considerably more involved than implementing the transform resulting from imposing the simpler periodic or symmetric conditions. The filters are different from level to level, so rather than just using a handful of coefficients, substantial tables of edge coefficients must be stored. Examples of such tables are given in Cohen, Daubechies, and Vial (1993). Still, in many applications, having an orthogonal wavelet transform with no unpleasant edge effects is well worth the additional effort in implementation.
°
6.4 When the Sample Size is Not a Power of Two
In our discussion of adjusting the wavelets to live on a finite interval, we have been assuming that the sample size of the data set is a power of two, i.e., n = 2^J for some positive integer J. This is not a particularly restrictive assumption in fields such as signal processing and image analysis, as such sample sizes are often due to the natural sampling rate. In statistics, however,
we often have no control over the size of the sample, thus it is only in relatively rare situations that n is a power of two. Adapting the usual discrete wavelet decomposition and reconstruction algorithms with their inherent 2^j structure to arbitrary-length data sets is not trivial. In considering taking the discrete wavelet transform of an ordered data set Y_1, ..., Y_n, there are a number of properties we would want the transform to have. Among the most important might be the following:
1. ease of implementation,
2. orthogonality,
3. adaptation to a finite interval.
The orthogonality of the transform is perhaps more important in statistical applications than in other fields. Orthogonality ensures mutual independence of empirical coefficients (scaling function and wavelet) when the original data are independent with Gaussian errors. (This is discussed at length in Chapter 7.) There are, of course, other considerations, such as computational efficiency, exact reconstruction, and vanishing moments, but the three listed above are perhaps most important for typical applications to statistics. We might be willing to compromise somewhat on these conditions, but for statistical application, it is imperative that a wavelet transform be available for arbitrary sample sizes. The paper by Cohen, Daubechies, and Vial (1993) develops orthogonal wavelets on the unit interval. It concentrates its treatment on the 2^j case (which is of primary interest to engineers), but the general methodology is also given to compute the discrete wavelet transform on the interval for any sample size. This satisfies Properties 2 and 3 above perfectly, but the implementation would be quite involved. Though this is certain to appear in standard wavelet software packages in the future, at the time of this writing it has not yet been implemented for arbitrary sample sizes either in the S+Wavelets module or in WaveLab. When the sample size is a power of two, a wide variety of boundary treatments and wavelet families is offered by virtually all currently available wavelet software packages. A natural approach would be to precondition the original data set somehow to get a set of values with length 2^J for some positive integer J. The resulting preconditioned data could then be plugged directly into any standard discrete wavelet transform routines. A comparison of some of these methods is given in Ogden (1997). A brief discussion is given here as well. In Section 6.1, it was noted that the fast Fourier transform algorithm is computationally efficient whenever the sample size n is highly composite. In practice, a common way to precondition the data when this is not true is to "pad with zeroes," i.e., to increase the size of the data set to the next larger
power of two (or some other highly composite number), and then apply the FFT algorithm. This has been used in applying the discrete wavelet transform as well. This is certainly a reasonable solution, but it is somewhat problematic in that, to some extent, it "dilutes" the signal near the end of the original data set, since coefficients will have zeroes averaged into their computation. Also, since the filters are not applied evenly (multiplying by a signal element constrained to have magnitude zero is equivalent to omitting the filter coefficient), the orthogonality of the transform is not strictly maintained. Another possibility would be to regard the data as being observations of a function with domain [0,1] with equal spacing 1/n, then to interpolate the function to a grid with spacing 1/2^J for some positive integer J. The usual discrete wavelet transform would then be applied to the interpolated data points. This approach also seems reasonable, and it works tolerably well in practice, but it has some of the same problems as padding with zeroes. Orthogonality of the transform is lost, so correlations between empirical coefficients are introduced (which do not go to zero even asymptotically). A third possibility is to abandon the fast algorithms discussed in Section 6.2 and compute the top-level empirical scaling function coefficients by numerical integration, as in (6.8). This is related to the a trous ("with holes") algorithm discussed by Dutilleux (1989) and Shensa (1992). Again, this procedure gives reasonable results, but the strict orthogonality is lost. For any finite sample size, this approach will introduce autocorrelations between coefficients, but the amount of autocorrelation becomes small as n gets large. Rather than manipulating the data set and applying a standard orthogonal transform to the result, one might apply the decomposition corresponding to a "nearly orthogonal" biorthogonal wavelet transform. This is discussed by Cohen, Daubechies, and Feauveau (1992) and in Chapter 8 of Daubechies (1992). Such biorthogonal wavelet transforms are readily adapted to the interval, and the discrete wavelet transform can be applied to data of any length. The resulting coefficients will not be strictly uncorrelated, but they may be treated as such for some applications.
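As a rough sketch (not from the book; the function name is made up), the zero-padding preconditioning mentioned above simply extends the data to the next power of two before a standard transform routine is applied:

import numpy as np

def pad_to_power_of_two(y):
    # Extend y with zeroes so its length becomes 2**J for the smallest adequate J.
    y = np.asarray(y, dtype=float)
    n = len(y)
    J = int(np.ceil(np.log2(n)))
    padded = np.zeros(2 ** J)
    padded[:n] = y          # original data first, zeroes appended at the end
    return padded, n        # n is kept so the padding can be discarded after reconstruction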
CHAPTER SEVEN
Other Applications
With a basic understanding of wavelet theory and a knowledge of the practical issues involved in applying wavelets to observed data, we are now ready to extend the basic methods of Chapter 3 to more sophisticated techniques on a wide variety of applications. Perhaps the most common wavelet application in statistics is nonparametric regression, which is covered in some depth in Section 7.1. This will serve as a groundwork for other applications treated later in this chapter: density estimation, estimation of the spectral density in time series, and the general change-point problem. Extensions of these methods will be given in the context of nonparametric regression in Chapter 8.
7.1 Selective Wavelet Reconstruction
As stated in Section 3.3, the approach considered by Antoniadis, Gregoire, and McKeague (1994) consists of projecting the raw estimator f-hat onto the approximating space V_J for any choice of the smoothing parameter J, which represents a linear estimation procedure (in that it operates linearly on the data). In contrast to this, David Donoho and Iain Johnstone, in a seminal series of papers (see References), offer a non-linear wavelet-based approach to nonparametric regression. This type of approach has received a great deal of attention from statisticians, signal processors, and image analysts alike. The Donoho-Johnstone approach begins with computing the discrete wavelet transform of the data Y_1, ..., Y_n, thereby creating a new data set of (noisy) empirical wavelet coefficients as outlined in Chapter 6. The basic idea behind selective wavelet reconstruction is to choose a relatively small number of wavelet coefficients with which to represent the underlying regression function f. It will be assumed throughout the treatment of this subject that the function f has been scaled to live on the unit interval [0,1] and that the wavelet coefficients are computed according to some family of orthogonal wavelets with an appropriate method for dealing with the boundaries. For simplicity, it may be assumed that there are 2^J equally spaced data values for some positive integer J and that periodic boundary handling is used. More details about the discrete wavelet transform under these (and more general) conditions are included in Chapter 6.
This selective reconstruction idea is based on the premise that virtually any regression function f can be well represented in terms of only a relatively small number of wavelet components at various resolution levels, the same general idea that drives the use of wavelets in data compression. A more precise definition of what is meant by virtually all is given later in this chapter. A heuristic justification of this claim is presented here, along with an example to illustrate the "sparsity of representation" property of wavelets. Any smooth function f can be approximated well (in the L^2 sense) by its projection onto a V_j space for relatively small j, requiring only a small number of coefficients. All the higher-level wavelet coefficients in the discrete wavelet decomposition could be regarded as being set to zero. Adding some unusual feature (such as a discontinuity in f' or in the function itself) would require additional higher-level wavelet components to represent the function f well. Such a localized phenomenon, however, would require only a small number of additional coefficients, the number of needed coefficients at level j decreasing as j increases. Adding any finite number of such unusual features would still give only a relatively small number of significantly nonzero coefficients, illustrating the idea of the sparse wavelet representation of functions. The justification of this approach begins with the decomposition of the function f into its wavelet components, as described in Chapter 1. If the function f were known, its wavelet coefficients could be computed according to

theta_{j,k} = Int_0^1 f(u) psi_{j,k}(u) du.

The notation theta_{j,k} is used in this statistical context to emphasize that each wavelet coefficient can be regarded as a parameter. Donoho and Johnstone (1994) point out that coefficients computed in such a manner can be used to answer the question "Is there a significant change in the function near t?" for t close to 2^{-j}k. If there is a large change, it will be manifested by a large (positive or negative) value of theta_{j,k}; if the function is nearly flat near 2^{-j}k, then the coefficient will be near zero. As the level j increases, the information about f that is contained in the wavelet coefficients becomes more localized, allowing one to do a better job of pinpointing exactly where the change is taking place. The wavelet components corresponding to coefficients that are close to zero can be neglected in the reconstruction of the function, with only a negligible loss of information. The notion of selective wavelet reconstruction is illustrated in Figure 7.1 using the example function from Chapter 1. The example function is approximated using four different reconstructions, each using only a relatively small number of the largest (in absolute value) wavelet coefficients.
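As a rough sketch of this idea (not from the text; the function name is made up), the selection rule keeps only the m largest coefficients in absolute value and zeroes the rest; the truncated vector would then be handed to whatever inverse wavelet transform routine is in use:

import numpy as np

def keep_largest(coeffs, m):
    # Zero all coefficients except the m largest in absolute value.
    coeffs = np.asarray(coeffs, dtype=float)
    kept = np.zeros_like(coeffs)
    order = np.argsort(np.abs(coeffs))[::-1]     # indices by decreasing magnitude
    kept[order[:m]] = coeffs[order[:m]]
    return kept

The reconstructions in Figure 7.1 correspond to m = 4, 8, 12, and 16.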
Figure 7.1: Selective wavelet reconstructions of the example function in Figure 1.2, using only the 4, 8, 12, and 16 largest coefficients in absolute value
The coefficients included in the reconstruction are chosen irrespective of their resolution level; magnitude is the only criterion. Generally speaking, the statistical methods of function estimation parallel this deterministic approach to function approximation. It is informative to compare Figure 7.1 with Figure 1.2 (the Fourier reconstruction) and also with Figure 1.13 (the projection of the example function onto spaces of varying resolution). In nonparametric regression, the parameter values {theta_{j,k}} are unknown, so they must be estimated from the data. Define the empirical wavelet coefficient corresponding to the true coefficient theta_{j,k} as

w^{(n)}_{j,k} = Int_0^1 f-hat(u) psi_{j,k}(u) du,      (7.1)

where f-hat is the raw estimator taking the value Y_i on ((i-1)/n, i/n]. (In practice, the empirical coefficients are computed according to the algorithms discussed in Section 6.2, but they are written here in terms of (7.1) to emphasize the correspondence between the empirical and true coefficients.) Given data values Y_1, ..., Y_n, which are distributed independently as

Y_i ~ N(f(i/n), sigma^2),  i = 1, ..., n,
it is relatively straightforward to derive the approximate distribution of w^{(n)}_{j,k} from (7.1). In particular, w^{(n)}_{j,k} is normal with mean

E[w^{(n)}_{j,k}] = E[ Int_0^1 f-hat(u) psi_{j,k}(u) du ]
               = Sum_{i=1}^{n} E[Y_i] Int_{(i-1)/n}^{i/n} psi_{j,k}(u) du
               = (1/n) Sum_{i=1}^{n} f(i/n) psi_{j,k}(i/n) + O(1/n^2)
               = Int_0^1 f(u) psi_{j,k}(u) du + O(1/n)
               = theta_{j,k} + O(1/n).

The variance of w^{(n)}_{j,k} can be computed in a similar way:

Var[w^{(n)}_{j,k}] = Var[ Int_0^1 f-hat(u) psi_{j,k}(u) du ]
                 = Sum_{i=1}^{n} Var[Y_i] ( Int_{(i-1)/n}^{i/n} psi_{j,k}(u) du )^2
                 = (sigma^2/n) Int_0^1 psi^2_{j,k}(u) du + O(1/n^2)
                 = sigma^2/n + O(1/n^2).
The above results hold true provided that the wavelet psi_{j,k} and the function f are sufficiently smooth. In particular, this result holds if the mother wavelet has one continuous derivative and if f is piecewise continuous and piecewise smooth with a finite number of discontinuities. (This smoothness assumption on the wavelets is satisfied, for example, by the Daubechies wavelets with sufficiently large N. This can be seen by applying results in Daubechies and Lagarias (1991, 1992) to the Daubechies families of wavelets.) These conditions could be relaxed and similar results would hold, but the assumptions given above are sufficient for the purposes of this discussion. See Ogden (1994) for more details. In a manner similar to the two computations given above and under the same assumptions, it can be shown that the empirical wavelet coefficients are (at least asymptotically as n -> infinity) independent.
Alternatively, the distribution of the empirical wavelet coefficients can be derived using a matrix representation of the discrete wavelet transform of data. This more closely represents the way the decomposition is done in practice; the above results are included to give a more intuitive idea of computing the wavelet transform of data. Let W_n represent the n x n orthogonal matrix associated with the orthonormal wavelet system of choice. Let Y (no subscript) denote the vector of data values: Y = (Y_1, ..., Y_n)'. Then we can write

w = (1/sqrt(n)) W_n Y,      (7.2)

in which the vector w (no subscripts) is the vector of wavelet coefficients

w = ( w^{(n)}_{-1,0}, w^{(n)}_{0,0}, w^{(n)}_{1,0}, w^{(n)}_{1,1}, ..., w^{(n)}_{J-1,2^{J-1}-1} )'.
Here the extra coefficient denoted w^{(n)}_{-1,0} is actually the lowest-level scaling function coefficient. As noted in Section 6.3, this represents a "final smoothing" of the data. It is included here to allow an invertible n x n transformation. The factor 1/sqrt(n) is included in (7.2) so as to unify the two representations of the discrete wavelet transform (7.1) and (7.2). Due to the orthogonality of the matrix W_n, it is clear from (7.2) that the vector w will have a multivariate normal distribution with variance-covariance matrix sigma^2 I_n / n, where I_n represents the n x n identity matrix. The mean of the vector w is the vector that would result from applying the transform in (7.2) to the mean vector (f(1/n), f(2/n), ..., f(1))'. This vector is approximately equal to the vector of theta_{j,k}'s, indexed the same way as w, the approximation error being of order O(1/n). Since these two sets of means for the w vector are essentially equivalent, theta_{j,k} will be used to indicate the result of either decomposition interchangeably. Since W_n is an orthonormal transform, the data vector can be reconstructed exactly via the inverse transform

Y = sqrt(n) W_n' w.

The empirical coefficient w^{(n)}_{j,k} is thus an estimator of the true coefficient theta_{j,k}, converging at the usual parametric rate:

w^{(n)}_{j,k} ~ N(theta_{j,k}, sigma^2/n) (approximately),      (7.3)

which can be expressed

w^{(n)}_{j,k} = theta_{j,k} + (1/sqrt(n)) Z_{j,k},
where the Z_{j,k}'s are a set of (unobservable) n independent N(0, sigma^2) random variables. Thus, as pointed out by Donoho and Johnstone (1994), each empirical wavelet coefficient consists of a certain amount of noise, but only relatively few consist of significant signal. The noise in the original sequence Y_1, ..., Y_n is spread out uniformly among all empirical wavelet coefficients. The natural question to ask is, "Which of the coefficients contain significant signal, and which are mostly noise?" Then, once we have chosen the set of coefficients containing significant signal, some attempt might be made to remove the noise component from each empirical coefficient. This is the heuristic idea underlying the Donoho-Johnstone method. Large "true" coefficients theta_{j,k} will typically have large corresponding empirical coefficients w^{(n)}_{j,k}, and so it is natural to reconstruct the function using only the largest empirical coefficients in an attempt to estimate f. The idea of wavelet thresholding represents a very useful method for selective wavelet reconstruction using only noisy (empirical) coefficients.
Wavelet Thresholding
A technique for selective wavelet reconstruction similar to the general approach presented here was proposed by Weaver, et al. (1991) to remove random noise from magnetic resonance images. Donoho and Johnstone (1994) develop the technique from a rigorous statistical point of view, by considering selective wavelet reconstruction as a problem in multivariate normal decision theory. DeVore and Lucier (1992) also developed the same approach independently. Since the largest "true" coefficients are the ones that should be included in a selective reconstruction, in estimating an unknown function it is natural to include only coefficients larger than some specified threshold value. Here (and throughout this chapter), a "large" coefficient is taken to mean one that is large in absolute value. For a given threshold value lambda, such an estimator can be written

f-hat_lambda(x) = Sum_{j,k} w^{(n)}_{j,k} I_[ |w^{(n)}_{j,k}| > lambda ] psi_{j,k}(x),      (7.4)
where I_A represents the indicator function of the set A. This represents a "keep or kill" wavelet reconstruction, where the large coefficients (relative to the threshold lambda) are kept intact and the small coefficients are set to zero. This thresholding can be thought of as a nonlinear operator on the vector of coefficients, resulting in a vector theta-hat of estimated coefficients that are then plugged into the inverse transform algorithm. Such a thresholding scheme is designed to distinguish between empirical coefficients that belong in the reconstruction (corresponding, one would
hope, to true coefficients which contribute significant signal) and those that do not belong (corresponding to negligibly small true coefficients). In making this decision, we should account for the two factors that affect the precision of the estimators: the sample size n and the noise level sigma^2. All other things being held equal, a coefficient is a strong candidate for inclusion if the sample size is large and/or if the noise level is small. Based on the result in (7.3), the thresholding will be performed on sqrt(n) w^{(n)}_{j,k} / sigma, since this quantity is normally distributed with variance one for all values of n and sigma. The thresholding estimator of the true coefficient theta_{j,k} can thus be written

theta-hat_{j,k} = (sigma/sqrt(n)) delta_lambda( sqrt(n) w^{(n)}_{j,k} / sigma ),      (7.5)
where the function delta_lambda in (7.5) is the hard thresholding function

delta^H_lambda(x) = x,  if |x| > lambda,
                    0,  otherwise.      (7.6)
This "keep or kill" hard thresholding operation is not the only reasonable way to estimate wavelet coefficients. Recognizing that each empirical coefficient consists of both a signal portion and a noise portion, it might be desired to attempt to isolate the signal portion by removing the noisy part. This idea leads to the soft thresholding function also considered by Donoho and Johnstone (1994): X
8f(x) =
0, {
-,x '
x+,x,
if x > 'x, if Ixl < ,x, if x < -,x,
0.7)
which can also be used in (7.5). When the soft thresholding operator is applied to a set of empirical wavelet coefficients, only coefficients greater than the threshold (in absolute value) are included in the reconstruction, but their values are "shrunk" toward zero by an amount equal to the threshold lambda. These two thresholding functions are displayed in Figure 7.2. Clearly, in using either type of wavelet thresholding, the choice of a threshold is a fundamental issue. Choosing a very large threshold will make it very difficult for a coefficient to be judged significant and included in the reconstruction, consequently resulting in an oversmoothing. Conversely, choosing a very small threshold value will allow many coefficients to be included in the reconstruction, giving a wiggly, undersmoothed estimate. The proper choice of threshold involves a careful balance of these principles.
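As a rough sketch (not from the text; the function names are made up), the hard and soft thresholding rules (7.6) and (7.7) applied elementwise to a vector of coefficients are:

import numpy as np

def hard_threshold(x, lam):
    # "Keep or kill": coefficients with |x| <= lam are set to zero, the rest kept intact.
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > lam, x, 0.0)

def soft_threshold(x, lam):
    # Kill small coefficients and shrink the survivors toward zero by lam.
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)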
Figure 7.2: The hard and soft thresholding functions
Figure 7.3 illustrates the effect of varying the threshold value on the resulting estimator. The underlying function for the simulated data in the left-hand column is a simple sine curve; that on the right-hand side is a piecewise constant function with jumps at 0.25 and 0.75. For each data set, three values of lambda were used as the hard thresholding operator was applied to wavelet coefficients at all levels.
Spatial Adaptivity
One drawback of some of the standard methods for function estimation discussed in Chapter 2 is that they are not spatially adaptive. Some functions might require a greater amount of smoothing in some portions of the domain (where f is relatively flat, for example), and less smoothing in other places (where f has one or more finer-scale features). A spatially adaptive estimator is one with the ability to discern from the data where more smoothing is needed, and where less will suffice, and then to apply the needed amount of smoothing. Variations of some of the methods discussed earlier have been developed to make them more spatially adaptive. For a kernel regression estimator, it is common to use a variable bandwidth, replacing the fixed bandwidth lambda in (2.16) with one that varies with u: lambda(u). For a point u at which f(u) is fairly smooth, the bandwidth should be relatively large, allowing a greater amount of averaging. Similarly, at locations where f(u) is less regular, a smaller bandwidth is appropriate.
[Figure 7.3 panels: Threshold = 1.75; Threshold = 2.75]
f-hat(x) = Sum_k c_{j_0,k} phi_{j_0,k}(x) + Sum_{j'=j_0+1}^{J} Sum_k delta_lambda(d_{j',k}) psi_{j',k}(x),
where delta_lambda represents either the hard or the soft thresholding function (7.6) or (7.7). The question remains of choosing the threshold lambda. Though on the surface this threshold selection problem has much to do with the corresponding thresholding problem in nonparametric regression, they are quite different in nature. Though there is some common ground, the problems should be considered separately, with threshold selection procedures only applicable for use in the problem for which they are designed. The global thresholding results of Section 7.1 are based upon each empirical coefficient w^{(n)}_{j,k} having an independent normal distribution with mean equal to the corresponding "true" coefficient theta_{j,k} and variance sigma^2/n, which results from applying the signal-plus-noise model to the data: Y_i ~ N(f(i/n), sigma^2), i = 1, ..., n. This distributional assumption is, of course, not satisfied for the empirical wavelet coefficients computed according to (3.6) and (3.7), so the thresholding problem must be reevaluated in this context.
In addition to choosing a threshold lambda, another issue to be considered in wavelet density estimation is the choice of J, the highest level of resolution to be considered. With regression data, the discrete wavelet transform (DWT) algorithm yields coefficients at resolution only as high as level J - 1 = log_2 n - 1, so a natural upper limit is in place. One approach to selection of the threshold and the maximum level of resolution is proposed by Donoho, et al. (1995). They suggest choosing J = [log_2 n] - 1 and applying the threshold

lambda = 2 C log n / sqrt(n)

to the empirical coefficients, where C = sup_{j,k} sup_x 2^{-j/2} |psi_{j,k}(x)|. Another method is due to Donoho, et al. (1993). They propose using J = [log_2 n - log_2(log n)] as the maximum level of resolution to be considered, and a level-dependent threshold lambda_j = K sqrt(j/n), for a suitably chosen constant K. Level-dependent thresholding will be treated at length for nonparametric regression in Chapter 8. It is possible to modify the density estimation problem to more closely coincide with Section 7.1 results by binning the data. This approach is also described by Donoho (1993), with the theoretical arguments put forth in Donoho, et al. (1993). Assume that the data X_1, ..., X_n are a random sample from a density f on [0,1]. The unit interval is partitioned into M = 2^{[log_2 n]-2} equally spaced intervals (actually, the number of intervals should be about n/4; taking M to be a power of two simplifies the computation), and N_i is the number of observations falling into interval i, i = 1, ..., M. Setting
gives that the Y_i's are approximately (for large n) independent with approximate distribution
(see Donoho, et al. (1993)). By making this approximation, we can estimate the square root of the density by applying results from Section 7.1, then square the result to get a final estimate for f.
7.3 Spectral Density Estimation
Another basic area of statistical estimation is spectral density estimation in time series analysis. Here, we will give a brief description of the problem, then describe the applications of wavelets to this problem. Good references
on general time series analysis include Anderson (1971), Priestley (1981), and Wei (1990). The model for this section involves data Y1 , ... , Yn , which represent univariate observations that are equally spaced over time (or space). This type of data differs from the nonparametric regression data considered in Section 7.1 in that it is assumed here that the data have a constant mean, and the earlier assumption of independence is dropped. Thus, we consider here only weakly stationary time series, which means that E[Yi] = J-L for all i and that there exists a function R(i) such that
Cov(Y_i, Y_j) = R(|i - j|).

That is, stationarity requires that the covariance of any pair of observations depends only on the time between the observations. The function R is known as the autocovariance function (or, simply, the covariance function) of the time series. Note that R(l) = R(-l) for all l in Z. Often, we focus attention instead on the autocorrelation function of the time series, defined as

rho(l) = Corr(Y_i, Y_{i+l}).

This can be computed from the covariance function by

rho(l) = R(l) / R(0).
Note that R(0) is simply the variance of a single observation and also that rho(0) = 1. In time series analysis, the primary interest is often to study the periodic behavior of the data. For instance, economic time series often show significant seasonal (yearly) cycles. The Fourier transform has thus become a common tool in time series analysis. The frequency content of a time series can be analyzed through the spectral density function, which results from regarding the covariance function values ..., R(-1), R(0), R(1), ... as a set of Fourier coefficients. The Fourier representation (4.21) gives an expression for the spectral density function:

f(omega) = Sum_{j=-inf}^{inf} R(j) e^{i j omega}
         = Sum_{j=-inf}^{inf} R(j) cos(j omega) + i Sum_{j=-inf}^{inf} R(j) sin(j omega)
         = Sum_{j=-inf}^{inf} R(j) cos(j omega)
         = R(0) + 2 Sum_{j=1}^{inf} R(j) cos(j omega),      (7.9)
the imaginary term dropping out since R(j) = R(-j) and sin(j omega) = -sin(-j omega) for all j in Z. Note that f(omega) is defined for omega in R, but that it is periodic with period 2 pi, so we need only consider f on, say, the interval [-pi, pi]. Note further that f is symmetric about zero: f(omega) = f(-omega). Therefore, the spectral density is usually only considered on the interval [0, pi]. The argument omega indicates the frequency value, thus the spectral density at any value omega analyzes the frequency content of the time series at frequency omega. Note that if the data are independent, then R(l) = 0 for l != 0, so the spectral density is flat: f(omega) = sigma^2. Very wiggly time series are characterized by mostly high-frequency components, which is manifested by a spectrum that is small for small omega and large for large omega. Conversely, a very smooth time series will contain an abundance of low-frequency components and an absence of high-frequency content. Periodicities manifest themselves as one or more spikes in the spectral density. The locations of the spikes give information about the periodicities present in the time series. Four typical spectral density functions are plotted in Figure 7.6. The upper left-hand plot corresponds to a pure white noise (uncorrelated) time series; the others correspond to an autoregressive process. Briefly, a time series is autoregressive with order p, denoted AR(p), if it follows the model

Y_i = r_1 Y_{i-1} + r_2 Y_{i-2} + ... + r_p Y_{i-p} + epsilon_i,
in which the epsilon_i's are independent N(0, sigma^2) random variables. Wei (1990) gives expressions for the spectral density associated with general AR processes. The next two spectral densities in Figure 7.6 correspond respectively to an AR(1) process with positive r_1 (an abundance of low-frequency content) and an AR(1) process with negative r_1 (an abundance of high frequency). The last process corresponds to a seasonal time series with period 12. The spikes correspond to the seasonal harmonic frequencies omega = 2 pi j / 12 for j = 1, 2, 3, 4, 5, 6. Corresponding to Figure 7.6 are sets of simulated time series data, plotted in Figure 7.5. The high-frequency and low-frequency content of the respective AR(1) processes are somewhat evident from the time series plots. The seasonality of the last time series is somewhat obscured by the noise. The spectral density f(omega) represents the complete theoretical information about a time series. In statistical practice, this must be estimated from the data. Given an estimate of the covariance function, a raw estimator of f(omega)
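As a rough illustration (not from the book; the coefficient, noise level, and function name are illustrative choices), an AR(1) series like those in Figure 7.5 can be generated directly from the recursion above:

import numpy as np

def simulate_ar1(n, r1, sigma=1.0, seed=0):
    # Y_i = r1 * Y_{i-1} + eps_i with independent N(0, sigma^2) errors.
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n)
    y = np.zeros(n)
    for i in range(1, n):
        y[i] = r1 * y[i - 1] + eps[i]
    return y

series = simulate_ar1(128, r1=0.5)    # positive r1: low-frequency-dominated series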
Figure 7.5: Four simulated time series corresponding to the spectral densities in Figure 7.6: pure white noise, AR(1) with r_1 = 0.5, AR(1) with r_1 = -0.5, seasonal time series with period 12.
can be obtained by plugging R-hat(j) in place of R(j) in (7.9). A natural way to estimate R(j) = Cov(Y_i, Y_{i+j}) is simply

R-hat(j) = (1/n) Sum_{i=1}^{n-j} (Y_i - Ybar)(Y_{i+j} - Ybar),   j = 0, 1, ..., n - 1.      (7.10)
Defining R-hat(j) this way, the estimate of the autocorrelation function rho(j) is

rho-hat(j) = R-hat(j) / R-hat(0).

Note that this is essentially the standard sample correlation coefficient computed on (Y_1, Y_{1+j}), (Y_2, Y_{2+j}), ..., (Y_{n-j}, Y_n). Also, note that R-hat(0) = sigma-hat^2 is the usual maximum likelihood estimator of the variance under the AR model. A plot of rho-hat(j) vs. j is known as the correlogram of the data. The sample spectral density function is obtained by plugging the estimate (7.10) of R(j) into the definition of f(omega) in (7.9):
f-hat(omega) = R-hat(0) + 2 Sum_{j=1}^{n-1} R-hat(j) cos(j omega).
An alternative form of the sample spectrum is given in terms of the discrete Fourier transform of the original data (see Bloomfield, 1976):
f-hat(omega) = (1/n) | Sum_{l=1}^{n} Y_l e^{-i (l-1) omega} |^2.
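As a rough sketch (not from the text; the function name is made up), this form of the sample spectrum can be computed with a fast Fourier transform; the function below evaluates it at the frequencies 2 pi j / n, j = 0, ..., [n/2], which are the "natural frequencies" discussed next:

import numpy as np

def periodogram(y):
    # |DFT of the data|^2 / n, kept only for frequencies between 0 and pi.
    y = np.asarray(y, dtype=float)
    n = len(y)
    dft = np.fft.fft(y)                      # sum_l Y_l * exp(-i * (l-1) * omega_j)
    m = n // 2 + 1
    freqs = 2.0 * np.pi * np.arange(m) / n
    return freqs, (np.abs(dft) ** 2 / n)[:m]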
The sample spectral density function is typically computed only at the "natural frequencies" omega_j = 2 pi j / n, j = 0, ..., [n/2]. A plot of f-hat(omega_j) vs. omega_j is known as the periodogram. The periodogram is, first of all, a raw estimate of the spectral density. Its initial purpose was to search for "hidden periodicities" in the data: if there is a (deterministic) periodic component in the data, perhaps obscured with noise, this will be manifested in the form of a sharp spike in the true spectral density at the frequency of the periodic component. It would be hoped that this spike would also appear in the periodogram. The periodograms that correspond to the example time series plotted in Figure 7.5 are plotted in Figure 7.7. The periodogram by itself is a useful and interesting diagnostic tool for time series data analysis. By comparing Figure 7.7 with Figure 7.6, it is possible to see a correspondence, but the very wiggly nature of the periodogram renders it unfit for estimation of the true spectral density function, which is typically assumed to be mostly smooth, possibly with some sharp spikes. Priestley (1981) calls the periodogram "an extremely poor (if not useless) estimate of the spectral density function." It is not consistent (as n gets large), and its peculiar covariance structure ensures that the periodogram will have a wildly erratic behavior. In order to get a more appropriate estimate of the spectral density, many methods have been proposed to "smooth" the periodogram. One method, from Parzen (1974), involves fitting a parametric autoregressive model to the data and estimating the parameters. The resulting estimate is just the spectral density for an AR process, with the estimated parameter values plugged in. By applying the general methods described in Chapter 2, model-free estimates of the spectrum can be obtained. In particular, a kernel function K can be applied to smooth f-hat(omega) by averaging over neighboring frequencies, just as was done for smoothing regression functions in (2.17):
f-tilde(omega) = Int_0^pi (1/(pi lambda)) K( (omega - tau)/(pi lambda) ) f-hat(tau) dtau.

The kernel K in the above expression is known as a spectral window, since the averaging is being done in the frequency domain. This smoothed spectral ...
[Figure 7.6 panels: White noise; AR(1), positive r; AR(1), negative r; Seasonal time series]
" t, -t < Xk < t, Xk < -t, sollg(X)11 2 = E%=lmin2(IXk l,A). Notealsothat\7'g = - E%=II[-A,Aj(Xk), so that Stein's estimate of risk applied to this situation can be written for any set of observed data x = (Xl, ... ,Xd)':
SURE(lambda; x) = d - 2 #{k : |x_k| < lambda} + Sum_{k=1}^{d} min^2(|x_k|, lambda)
                = -d + 2 #{k : |x_k| > lambda} + Sum_{k=1}^{d} min^2(|x_k|, lambda),      (8.1)
where #S for a set S denotes the cardinality of the set. Here, E_mu || mu-hat^{(lambda)}(X) - mu ||^2 = E_mu SURE(lambda; X). The threshold level is set so as to minimize the estimate of risk for the given data x_1, ..., x_d:

lambda-hat = arg min_{t >= 0} SURE(t; x).

Such a method can reasonably be expected to do well in terms of minimizing risk, since for large sample sizes the Law of Large Numbers will guarantee that the SURE criterion is close to the true risk. The SURE criterion is written in the form (8.1) to show its relation to Akaike's Information Criterion (AIC), introduced by Akaike (1973) for time series modeling: it consists of a function to be minimized (Sum_{k=1}^{d} min^2(|x_k|, lambda)) and a penalty term consisting of twice the number of estimated parameters included in the reconstruction (only the observations with |x_k| > lambda will be nonzero after the shrinking). The computational effort involved with minimizing the SURE criterion is light: if the observations are re-ordered in order of increasing |x_k|, then the criterion function SURE(t; x) is strictly increasing between adjacent values of the |x_k|'s. It is also strictly increasing between 0 and the smallest |x_k|, as well as for t > max_k |x_k|, so the minimum must occur at 0 or at one of the |x_k|'s. Thus, the criterion must only be computed for d + 1 values of t, and, in practice, there is no need to order the |x_k|'s. Figure 8.1 illustrates this method in action. This figure displays plots of sqrt(n) |w^{(n)}_{j,k}| for levels 10, 9, and 8 for the blocky function shown in Figure 5.10, normalized to have signal-to-noise ratio 5 with n = 2048. The signal-to-noise ratio (SNRatio) for a set of means mu_1, ..., mu_d with additive noise is defined to be the ratio of the standard deviation of the mean vector to the standard deviation of the noise. In the first column of plots, the absolute values of sqrt(n) times the coefficients are plotted in increasing order. In the second column, the SURE criterion is plotted as a function of t, evaluated at each t = sqrt(n) |w^{(n)}_{j,k}| at the current level. The dashed line in the first column of plots indicates the value of the threshold selected by the SURE criterion; all points below this line will be shrunk to zero, and all points above will be shrunk toward zero by that amount.
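As a rough sketch (not from the text; the function names are made up), the criterion (8.1) and its minimizer can be computed as follows, with x taken to be the vector of coefficients at one level, rescaled to have unit noise variance:

import numpy as np

def sure(t, x):
    # SURE(t; x) = -d + 2*#{k : |x_k| > t} + sum_k min(|x_k|, t)^2, as in (8.1).
    x = np.asarray(x, dtype=float)
    return (-len(x) + 2.0 * np.sum(np.abs(x) > t)
            + np.sum(np.minimum(np.abs(x), t) ** 2))

def sure_threshold(x):
    # The minimum occurs at 0 or at one of the |x_k|, so only d + 1 values need be checked.
    x = np.asarray(x, dtype=float)
    candidates = np.concatenate(([0.0], np.abs(x)))
    values = np.array([sure(t, x) for t in candidates])
    return candidates[np.argmin(values)]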
Figure 8.1: Plots of wavelet coefficients and the SURE function for levels 8-10, from the blocky function with SNRatio = 5 and n = 2048.
The global thresholding procedures discussed in Chapter 7 applied thresholding only to higher-level coefficients, preferring to leave the lower-level coefficients (which would correspond to "macro" features of the function) intact. The data adaptive scheme described in this section is often applied to the coefficients at all levels, allowing the coefficients themselves to determine if any shrinking is needed. Indeed, for many examples, the SURE criterion chooses 0 as the best threshold for the low-level coefficients. Looking at the plots on the left-hand side of Figure 8.1, a data analyst might point out that each level of wavelet coefficients consists of a few "large" coefficients and many "small" ones, and that the cut-off point may be a bit low, in the sense that there are quite a few seemingly "small" coefficients that will be included in the reconstruction. A related observation might result from
looking at the right-hand column of plots and noticing that (especially for j = 10) the SURE(t; x) function is relatively flat near the place where it achieves its minimum. This would indicate that there is a wide range of possible thresholds, the choice of which would make relatively little difference quantitatively (in terms of estimated risk), but may make a significant difference qualitatively (in terms of the relative smoothness of the resulting estimator). This apparent problem is also addressed by Donoho and Johnstone (1995), noting that the SURE method does not perform well in cases where the wavelet representation at any level is very sparse, i.e., when the vast majority of coefficients are (essentially) zero. This is due to the noise from the essentially zero coefficients overwhelming what little signal is contributed by the nonzero coefficients. Thus, Donoho and Johnstone suggest a hybrid scheme to get around this issue. The heuristic idea behind this hybrid method is to test the coefficients for sparsity at each level. If the set of coefficients is judged to be sparsely represented, then the hybrid scheme defaults to the universal threshold sqrt(2 log d); otherwise the SURE criterion is used to select a threshold value. The criterion used is related to the usual sample variance of the data if the true mean were known to be zero:

s^2_d = (1/d) Sum_{k=1}^{d} x_k^2.

The representation at the current level is judged to be sparse if

s^2_d <= 1 + (log_2 d)^{3/2} / sqrt(d);

otherwise, the threshold is selected by SURE. Originally, to aid in proving the relevant theorems, the hybrid method proposed by Donoho and Johnstone (1995) broke the data x_1, ..., x_d randomly into two subsets of equal size, and each half-sample was used to choose a threshold for the other half-sample. Donoho and Johnstone made a note in the manuscript proofing stage that this subsampling is unnecessary, and that the theory holds when the threshold is chosen from all coefficients at the current level. The result of applying this hybrid scheme is to get around the problem noted previously of allowing too many coefficients into the reconstruction and thereby producing an estimate that is far too noisy. When there are only a very few non-zero coefficients at a particular level, the scheme detects this and applies the universal threshold, giving a much less noisy-looking reconstruction and simultaneously maintaining good MSE performance.
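As a rough sketch (not from the text; it reuses the sure_threshold function from the sketch above, and all names are made up), the hybrid rule defaults to the universal threshold when the level looks sparse and otherwise minimizes SURE:

import numpy as np

def hybrid_sure_threshold(x):
    x = np.asarray(x, dtype=float)
    d = len(x)
    s2 = np.sum(x ** 2) / d                            # sample variance with mean taken as zero
    if s2 <= 1.0 + np.log2(d) ** 1.5 / np.sqrt(d):     # sparsity test
        return np.sqrt(2.0 * np.log(d))                # universal threshold
    return sure_threshold(x)                           # otherwise choose the SURE minimizer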
Figure 8.2: The blocky function with simulated data, n = 2048, and SNRatio = 5, estimated by both the SURE and the hybrid SURE methods
Examples of the SURE-based thresholding procedures are shown in Figure 8.2 and Figure 8.3 for two example functions: the piecewise constant blocky function (with n = 2048 and a signal-to-noise ratio of 5) and a sine function (with n = 512 and a signal-to-noise ratio of 2). The differences between the regular SURE estimator and the hybrid are clearly demonstrated in the second example: since the function is smooth, the true coefficients at higher levels of resolution are essentially zero, so the empirical coefficients there consist almost entirely of noise. The hybrid method recognizes this and thus shrinks most of these coefficients to zero.
8.2 Threshold Selection by Hypothesis Testing
The primary goal in data dependent threshold selection is the division of wavelet coefficients into a group of "small" coefficients (those consisting primarily of noise) and one of "large" coefficients (those containing significant signal). A reasonable way for a statistician or a data analyst to go about this is to utilize statistical tests of hypotheses, the large coefficient group consisting only of coefficients that "pass the test" of significance. This is the general approach taken by Abramovich and Benjamini (1995) and Ogden and Parzen (1996a, b).
Figure 8.3: A sine function with simulated data, n = 512, and SNRatio = 2, estimated by both the SURE and the hybrid SURE methods
For a data set of length n, we consider a set of n parameters {theta_{j,k}, j = -1, ..., J - 1, k = 0, ..., 2^j - 1}, most of which are thought to be (essentially) zero. The maximum likelihood estimator for each theta_{j,k} is the corresponding empirical coefficient w^{(n)}_{j,k}, and we saw in Section 7.1 that

w^{(n)}_{j,k} ~ N(theta_{j,k}, sigma^2/n) (approximately),

and that the w^{(n)}_{j,k}'s are mutually independent when an orthonormal wavelet basis is used on Gaussian data. A test for the hypotheses
H_0: theta_{j,k} = 0    vs.    H_a: theta_{j,k} != 0

for any fixed j and k would naturally recommend rejecting the null hypothesis if sqrt(n) |w_{j,k}| / sigma > z_alpha, where z_alpha represents the upper-alpha critical point of the standard normal distribution.
The above test is certainly appropriate for any fixed choices of j and k, but it would be problematic to apply such a testing procedure to all n coefficients: if all of the null hypotheses are true (the true function f is identically zero on [0,1]), we could expect n alpha of the coefficients to be falsely declared significantly different from zero, where alpha is the common level of the tests. The result of this will be too many coefficients included in the reconstruction, giving an undersmoothed estimate of f. This is precisely the problem that one faces in multiple comparisons in an ANOVA setting, except that in the wavelet case, all tests are independent. One way to account for this would be to apply the standard Bonferroni correction, and adjust the level of the tests so as to control the probability that any of the zero coefficients are included in the reconstruction. For even moderate n, however, this will be overly conservative, making it very difficult for any coefficients to be judged significant and typically resulting in an oversmoothing. While neither of these extremes is particularly useful in itself, there are a number of ways to strike a compromise.
Recursive Testing
The general approach taken by Ogden and Parzen (1996a, b) operates on a level-by-level basis, as does the SURE approach. At any particular level, a single test is performed to determine whether the set of coefficients at that level behaves as white noise or whether there is significant signal present. If it is determined that there is signal present, the most likely candidate (the largest coefficient in absolute value) is removed from consideration, and the test is repeated. Continuing recursively, at each level one will be left with two sets of coefficients: "large" coefficients thought to contain significant signal, and a set of "small" coefficients which is indistinguishable from pure white noise. More precisely, let X_1, ..., X_d represent the empirical wavelet coefficients at level j = log_2 d, as in Section 8.1. Suppose further that these coefficients have means mu_1, ..., mu_d, respectively. Initially, interest is in testing the null hypothesis that all the means are zero vs. a general alternative that some of the mu_i's are non-zero. Specifically, let I_d represent a non-empty subset of the indices {1, ..., d}. Then the hypotheses could be expressed as
H_0: mu_1 = ... = mu_d = 0      (8.2)
H_a: mu_i != 0 for all i in I_d;  mu_i = 0 for all i not in I_d.
A fundamental question that must be addressed in this approach is how to test the above set of hypotheses. The approach of Ogden and Parzen (1996b) proceeds as follows: if the cardinality of the set I_d is not known, the standard likelihood ratio test for
these hypotheses would be based on the test statistic Sum_{i=1}^{d} X_i^2, which has a chi-square distribution with d degrees of freedom when the null hypothesis is true. Note that this is also the test statistic that would be used if it were known that I_d = {1, ..., d}. This is not the most appropriate test statistic for this situation, especially because it is usually believed that very few, if any, of the mu_i's are non-zero. The result of applying this test statistic would be poor power of detection when I_d contains only a few coefficients, since the noise of the zero coefficients will tend to overwhelm the signal of the non-zero coefficients. If the cardinality of the set I_d were known to be, say, m, then the standard likelihood ratio test statistic would be the sum of squares of the m largest X_i's in absolute value. In practice, m is not known, so the Ogden-Parzen approach consists of a recursive testing procedure for I_d containing only one element each time. Thus, the appropriate test statistic is the largest of the squared X_i's. The alpha-critical point of this distribution is worked out to be
x_{d,alpha} = [ Phi^{-1}( (1 + (1 - alpha)^{1/d}) / 2 ) ]^2.
.20 ell > .... c 0.
• .................................. ~ •. ¥ ..•••.••••.-
~ •....•••••••....•••, ••.•..••••..•
•
0>0
'0
•
:e0> o
•
•
t)
q
,....,
• 0.0
0.2
0.4
0.6
0.8
1.0
Figure 8.4: Coefficients at level 5 of a function (no noise added) with jumps at 0.25 and 0.75

Rather than there being a single large coefficient per jump, they are flanked on both sides by other "large" coefficients. The approach of Ogden and Parzen (1996a) adapts standard change-point methods (which are closely related to classical goodness-of-fit techniques) to test the hypotheses given in (8.2). In the change-point problem with data X_1, ..., X_d, nonparametric test statistics for the general hypotheses
H_0: E[X_1] = E[X_2] = ... = E[X_d]
H_a: E[X_1] = ... = E[X_m] != E[X_{m+1}] = ... = E[X_d]

are based on the mean-corrected cumulative sum (CUSUM) process (see Csorgo and Horvath (1988) for a review of nonparametric change-point procedures). Typical functionals of this cumulative sum process are the maximum of the absolute value (Kolmogorov-Smirnov), the integral of the square (Cramer-von Mises), and a weighted integral of the squared process (Anderson-Darling). These test statistics in the change-point problem are the same as those used in goodness-of-fit situations. These tests are examples of omnibus tests that can be used to test the null hypothesis of equal means vs. a very wide variety of possible alternatives. The generality of the alternative hypothesis in this wavelet thresholding situation suggests that such an omnibus test would be appropriate. Thus, the approach of Ogden and Parzen (1996a) is to base the test for the hypotheses in (8.2) on the following process, which depends on the choice of a univariate function g:
..., 0 <= u <= 1.