ADVANCES IN IMAGING AND ELECTRON PHYSICS VOLUME 117
EDITOR-IN-CHIEF
PETER W. HAWKES CEMESÑCentr e National de la Rec...
93 downloads
269 Views
3MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
ADVANCES IN IMAGING AND ELECTRON PHYSICS VOLUME 117
EDITOR-IN-CHIEF
PETER W. HAWKES CEMESÑCentr e National de la Recherche ScientiÞque Toulouse, France
ASSOCIATE EDITORS
BENJAMIN KAZAN Xerox Corporation Palo Alto Research Center Palo Alto, California
TOM MULVEY Department of Electronic Engineering and Applied Physics Aston University Birmingham, United Kingdom
Advances in
Imaging and Electron Physics EDITED BY
PETER W. HAWKES CEMESÑCentr e National de la Recherche ScientiÞque Toulouse, France
VOLUME 117
San Diego
San Francisco New York London Sydney Tokyo
Boston
∞ This book is printed on acid-free paper. C 2001 by ACADEMIC PRESS Copyright
All Rights Reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the Publisher. The appearance of the code at the bottom of the Þrst page of a chapter in this book indicates the PublisherÕs consent that copies of the chapter may be made for personal or internal use of speciÞc clients. This consent is given on the condition, however, that the copier pay the stated per copy fee through the Copyright Clearance Center, Inc. (222 Rosewood Drive, Danvers, Massachusetts 01923), for copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Law. This consent does not extend to other kinds of copying, such as copying for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. Copy fees for pre-2001 chapters are as shown on the title pages. If no fee code appears on the title page, the copy fee is the same as for current chapters. 1076-5670/01 $35.00 Explicit permission from Academic Press is not required to reproduce a maximum of two Þgures or tables from an Academic Press chapter in another scientiÞc or research publication provided that the material has not been credited to another source and that full credit to the Academic Press chapter is given.
Academic Press A Harcourt Science and Technology Company 525 B Street, Suite 1900, San Diego, California 92101-4495, USA http://www.academicpress.com
Academic Press Harcourt Place, 32 Jamestown Road, London NW1 7BY, UK http://www.academicpress.com International Standard Serial Number: 1076-5670 International Standard Book Number: 0-12-014759-9 PRINTED IN THE UNITED STATES OF AMERICA 01 02 03 04 QW 9 8 7 6 5 4 3 2 1
CONTENTS
CONTRIBUTORS . . . . . . . . . . . . . . . . . . . . . . . . . . PREFACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . FUTURE CONTRIBUTIONS . . . . . . . . . . . . . . . . . . . . . .
vii ix xi
Optimal and Adaptive Design of Logical Granulometric Filters EDWARD R. DOUGHERTY AND YIDONG CHEN
I. II. III. IV. V. VI. VII.
Introduction . . . . . . . . . . . . . . . . . . . . . Euclidean Granulometries . . . . . . . . . . . . . . . Logical Granulometries . . . . . . . . . . . . . . . . Adaptive Single-Parameter Disjunctive Granulometric Filters Adaptation in a Multiparameter Disjunctive Model. . . . . Granulometric Bandpass Filters. . . . . . . . . . . . . Logical Structural Filters . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
1 3 6 12 24 39 56 69
Introduction . . . . . . . . . . . . . . . . . . . . . . Warped Wavelets in Brief . . . . . . . . . . . . . . . . Multiresolution Approximation . . . . . . . . . . . . . . From WMRA and Warped Scaling Functions to Warped QMFs From Warped QMF to Warped Scaling Functions and WMRA . Warped Wavelets . . . . . . . . . . . . . . . . . . . . Construction of Iterated Warping Maps . . . . . . . . . . Computation of the Dyadic Warped Wavelet Transform . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
74 77 79 90 104 118 129 161 169 170
. . . . .
. . . . .
. . . . .
174 175 179 183 189
Dyadic Warped Wavelets GIANPAOLO EVANGELISTA
I. II. III. IV. V. VI. VII. VIII. IX.
Recent Developments in Stack Filtering and Smoothing JOSE« L. PAREDES AND GONZALO R. ARCE
I. II. III. IV. V.
Introduction . . . . . . . . . . . . . . . . . . Threshold Decomposition and Stack Smoothers . . . Mirrored Threshold Decomposition and Stack Filters. Integer Domain Filters of Linearly Separable PBFs . Analysis of WM Filters Using Threshold Logic . . .
v
. . . . .
. . . . .
. . . . .
. . . . .
vi
CONTENTS
VI. Recursive Weighted Median Filters and Their Nonrecursive WM Filter Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . VII. Weighted Median Filters with N Weights. . . . . . . . . . . . . VIII. Stack Filter Optimization . . . . . . . . . . . . . . . . . . . IX. Applications of Stack Filters . . . . . . . . . . . . . . . . . . X. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . .
195 200 205 223 237 238
Resolution ReconsideredÑCon ventional Approaches and an Alternative A. VAN DEN BOS AND A. J. DEN DEKKER
I. II. III. IV. V. VI. VII. VIII.
Introduction . . . . . . . . . . . Classical Resolution Criteria . . . . Other Resolution Criteria . . . . . Modeling and Parameter Estimation . Elements of Singularity Theory . . . Singularity of Likelihood Functions . Singularity and Resolution . . . . . Summary and Conclusions . . . . . References . . . . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
242 245 248 264 277 292 325 353 355
INDEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
361
CONTRIBUTORS
Numbers in parentheses indicate the pages on which the authorsÕcontributions begin.
GONZALO R. ARCE (173), Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716 A. VAN DEN BOS (241), Department of Physics, Delft University of Technology, 2600 GA Delft, The Netherlands YIDONG CHEN (1), National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892 A. J. DEN DEKKER (241), Department of Physics, Delft University of Technology, 2600 GA Delft, The Netherlands EDWARD R. DOUGHERTY (1), Department of Electrical Engineering, Texas A & M University, College Station, Texas 77843 GIANPAOLO EVANGELISTA (73), Audio Visual Communications Laboratory, Federal Institute of Technology, EPFL, Ecublens, CH-1015 Lausanne, Switzerland, and Department of Physical Sciences, University of Naples ÒFedericoIIÓ,Complesso Universitario MSA, I-80126 Naples, Italy JOSE« L. PAREDES (173), Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, and Department of Electrical Engineering, University of Los Andes, M« erida, Venezuela
vii
This Page Intentionally Left Blank
PREFACE
The four articles that make up this latest addition to this series deal with very new developments in signal and image processing and recent thinking about the notion of resolution. We begin with a long-standing concern of image analysis, namely, granulometry. A common problem is to measure properties of individual objects in a sceneÑthe number of such objects, their areas, their peripheries, and other such characteristics. In practice, this is often not straightforward: the shapes of the objects may vary, some objects may overlap, and the contrast may be different from one object to another. Granulometric Þltering was one of the earliest preoccupations of the founders of mathematical morphology, a subject to which E. R. Dougherty has contributed extensively. Here, he and Y.-d. Chen describe at length a procedure for the optimal and adaptive design of logical granulometric Þlters, with examples from characterrecognition and blood-cell analysis. Wavelets have attracted enormous attention since they entered the signal processing toolbox some years ago. At Þrst, it was not obvious that the orthogonal wavelets were fundamentally different from any other orthonormal basis in terms of which functions could be expanded. After all, Haar functions had been known since the beginning of the twentieth century. However, it soon became apparent that such wavelets possessed properties that made them very attractive in certain situations. It also became gradually apparent that the wavelets that were most studied had certain drawbacks as well, and it was to circumvent these that dyadic warped wavelets were introduced. G. Evangelista, to whom we owe many important developments on this subject, here presents this class of wavelets in full detail, thereby making these new ideas accessible to a wide audience. This long chapter forms a complete monograph on the subject. The median Þlter was Þrst used empirically in image processing in an attempt to smooth noisy images without undue blurring of sharp contrast changes (edges in particular). Theoretical studies of the median Þlter rapidly generalized this type of operation to the stack Þlter, the weighted median Þlter, and other more exotic extensions of the basic idea of replacing a pixel value by that of one of its neighbors. G. R. Arce has been very active in elucidating the underlying structure of these Þlters, and here he and J. L. Paredes describe in detail the principles of this family of Þlters and, in particular, recent developments in Þltering and smoothing based on threshold decomposition. The literature on these Þlters is very rich and scattered and this connected account is therefore especially welcome. ix
x
PREFACE
The Þnal chapter is a major contribution to our understanding of the notion of resolution. For most of us, this is dominated by the ideas of Ernst Abbe and of Lord Rayleigh, whose ÒcriterionÓis still the starting point for most accounts of optical resolution. For some years, however, it has been clear that excellent though the Rayleigh criterion may be as a starting point, it leaves several questions unanswered, and there have been many articles on the number of degrees of freedom of images and on related topics. A. van den Bos and A. J. van Dekker have examined these difÞculties with great care, and an alternative approach to resolution has emerged from their work, based on parametric models of the observation process and maximum likelihood methods of parameter estimation. This in turn leads them into singularity theory, and their concluding section discusses the relation between singularity and resolution. In this very complete account, all the necessary background mathematics is recapitulated, which ensures that this long chapter is a selfcontained presentation of this new and important material. In conclusion, let me thank all the contributors for the care they have taken to make their material accessible to a wide readership. A list of articles to appear in forthcoming volumes of these Advances follows. Peter Hawkes
FUTURE CONTRIBUTIONS
G. Abbate New developments in liquid-crystal-based photonic devices D. Antzoulatos Use of the hypermatrix M. Barnabei and L. Montefusco (vol. 119) Algebraic aspects of signal and image processing L. Bedini, E. Salerno, and A. Tonazzini (vol. 119) Discontinuities and image restoration I. Bloch Fuzzy distance measures in image processing R. D. Bonetto Characterization of texture in scanning electron microscope images G. Borgefors Distance transforms Y. Cho Scanning nonlinear dielectric microscopy R. G. Forbes Liquid metal ion sources E. F¬ orster and F. N. Chukhovsky X-ray optics A. Fox The critical-voltage effect L. Frank and I. Mullerov« ¬ a Scanning low-energy electron microscopy P. Hartel, D. Preikszas, R. Spehr, H. Mueller, and H. Rose (vol. 119) Design of a mirror corrector for low-voltage electron microscopes P. W. Hawkes Electron optics and electron microscopy: conference proceedings and abstracts as source material
xi
xii
FUTURE CONTRIBUTIONS
M. I. Herrera The development of electron microscopy in Spain K. Ishizuka Contrast transfer and crystal images I. P. Jones (vol. 119) ALCHEMI W. S. Kerwin and J. Prince (vol. 119) The kriging update model G. K¬ ogel Positron microscopy W. Krakow Sideband imaging C. L. Matson Back-propagation through turbid media J. C. McGowan (vol. 118) Magnetic transfer imaging S. Mikoshiba and F. L. Curzon Plasma displays K. A. Nugent, A. Barty, and D. Paganin (vol. 118) Noninterferometric propagation-based techniques E. Oesterschulze (vol. 118) Scanning tunneling microscopy M. A. OÕKeefe Electron image simulation N. Papamarkos and A. Kesidis The inverse Hough transform C. Passow Geometric methods of treating energy transport phenomena E. Petajan HDTV F. A. Ponce Nitride semiconductors for high-brightness blue and green light emission
FUTURE CONTRIBUTIONS
H. de Raedt, K. F. L. Michielsen, and J. T. M. Hosson Aspects of mathematical morphology H. Rauch The wave-particle dualism D. Saad, R. Vicente, and A. Kabashima Error-correcting codes G. Schmahl X-ray microscopy S. Shirai CRT gun design methods T. Soma Focus-deßection systems and their applications I. Talmon (vol. 119) Study of complex ßuids by transmission electron microscopy I. R. Terol-Villalobos (vol. 118) Morphological image enhancement and segmentation M. Tonouchi Terahertz radiation imaging T. Tsutsui and Z. Dechun Organic electroluminescenece, materials and devices Y. Uchikawa Electron gun optics D. van Dyck Very high resolution electron microscopy C. D. Wright and E. W. Hill Magnetic force microscopy M. Yeadon (vol. 119) Instrumentation for surface studies
xiii
This Page Intentionally Left Blank
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 117
Optimal and Adaptive Design of Logical Granulometric Filters EDWARD R. DOUGHERTY1 and YIDONG CHEN2 1
Department of Electrical Engineering, Texas A&M University, College Station, Texas 77843 2 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892
I. II. III. IV.
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . Euclidean Granulometries . . . . . . . . . . . . . . . . . . . Logical Granulometries . . . . . . . . . . . . . . . . . . . . Adaptive Single-Parameter Disjunctive Granulometric Filters. . . . . A. Transition Probabilities . . . . . . . . . . . . . . . . . . . B. Steady-State Distribution . . . . . . . . . . . . . . . . . . C. Comparison of Optimal and Adaptive Filters in a Homothetic Model V. Adaptation in a Multiparameter Disjunctive Model . . . . . . . . . A. State Transition Probability Equations . . . . . . . . . . . . . 1. Type-[I, 0] Model . . . . . . . . . . . . . . . . . . . . 2. Type-[I, 1] Model . . . . . . . . . . . . . . . . . . . . 3. Type-[II, 0] Model . . . . . . . . . . . . . . . . . . . . 4. Type-[II, 1] Model . . . . . . . . . . . . . . . . . . . . B. Numerical Analysis of Steady-State Behavior . . . . . . . . . . VI. Granulometric Bandpass Filters . . . . . . . . . . . . . . . . . A. Granulometric Spectral Theory . . . . . . . . . . . . . . . . B. Granulometric Spectral Theory for Univariate Disjunctive Granulometries . . . . . . . . . . . . . . . . . C. Adaptive Bandpass Filters Given a Known Point in the Passband . . D. Adaptive Bandpass Filters Given No Known Point in the Passband . VII. Logical Structural Filters. . . . . . . . . . . . . . . . . . . . A. Filter Representation . . . . . . . . . . . . . . . . . . . . B. Design of LSFs . . . . . . . . . . . . . . . . . . . . . . References. . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
1 3 6 12 13 14 21 24 25 25 28 29 30 31 39 40
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
43 49 52 56 57 59 69
I. Introduction This article concerns statistical design of Þlters for sieving granular random sets. The random set consists of a union of disjoint compact grains, and some of the grains are to be passed, while others are to be eliminated. There is a signal random set S and a noise random set N. Both signal and noise are composed of disjoint grains, and signal and noise are disjoint. The observed random set is S ∪ N . In fact, if the random sets originate as binary images, then the captured images may not Þt the model; however, upon application of 1 Volume 117 ISBN 0-12-014759-9
C 2001 by Academic Press ADVANCES IN IMAGING AND ELECTRON PHYSICS Copyright All rights of reproduction in any form reserved. ISSN 1076-5670/01 $35.00
2
EDWARD R. DOUGHERTY AND YIDONG CHEN
some segmentation algorithm they Þt the model. The segmentation algorithm is not part of the granulometric theory as such. That theory begins with the random sets that result from the segmentation procedure. The goal is to design a Þlter for which (S ∪ N ) provides an estimator of S. The goodness of the estimator is measured by some probabilistic error criterion for (S ∪ N ) as an estimator of S. The methods we employ are morphological: the decision whether to pass a grain is based on whether certain probes Þt into the grain. The task is to automatically design optimal (or at least good) probes. This will be done by parameterizing a family of probes and Þnding suitable parameters by either statistical optimization or adaptation. The germ for granulometric Þltering is in the work of Matheron (1975), who provided axioms to characterize certain families of sieving operators. Properties are postulated and these lead to a representation theory for the class of Euclidean granulometries. A granulometry is an operator family parameterized by a nonnegative real scalar t. Increasing amounts of the set are removed with increasing t. Measuring the volume removed by the operator as a function of t produces an increasing function known as a size distribution. Because sets are random, size distributions are increasing random functions. Upon normalization by total set volume, the size distribution becomes a probability distribution function called the pattern spectrum of the random set. Along with its random moments, the pattern spectrum is used for both shape and texture classiÞcation, especially the latter (Chakravarthy et al., 1993; Dougherty et al., 1992b, 1992c; Dougherty and Pelz, 1991; Dougherty and Sand, 1995; Maragos, 1989; Sand and Dougherty, 1992, 1998; Theera-Umpon and Gader, 2000; Vincent and Dougherty, 1994). Both theory and texture classiÞcation can be extended to gray-scale images (random functions) (Baeg et al., 1999; Chen and Dougherty, 1994; Chen et al., 1993; Dougherty, 1992; Kraus et al., 1993). Our interest here concerns Þltering in the sense of statistical estimationÑ in particular, the automatic design of Þlters (random-set estimators). Granulometries, as applied to random sets, are based on the morphological opening operator. Given an input set and a structuring element (probe), which is a deterministic set, a point is in the output of the opening if and only if there exists a translate of the structuring element that contains the point and is also a subset of the input set. Points that do not satisfy this structural criterion are removed from the set. If a grain is not sufÞciently large or appropriately shaped to contain a translate of the structuring element, then the entire grain is removed. If we desire a true sieve, one that either fully passes or fully eliminates a grain, then we can apply a reconstructive opening, which is simply the union of all grains not fully eliminated by the opening. The problem is to Þnd structuring elements that provide good opening estimators. The earliest applications of the approach were heuristic, with structuring elements designed by experts, and not in the context of a random model (Giardina and Dougherty, 1988; Serra,
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
3
1982). The earliest paper to approach optimal parametric design of opening structuring elements assumed a very simple model in which image and granulometric generators are geometrically similar (Dougherty et al., 1992a). Another early paper on optimal morphological-Þlter design considered the optimal number of iterations to employ during application of an alternating sequential Þlter, these being iterations of openings and closings (dual operation to opening) of increasingly large structuring elements (Schonfeld and Goutsias, 1991). This article is essentially based on four papers that address the following approaches: single-parameter granulometric Þlters formed as unions of reconstructive openings (Chen and Dougherty, 1996), multiple-parameter granulometric Þlters formed as unions of reconstructive openings (Chen and Dougherty, 1999), single-parameter reconstructive granulometric bandpass Þlters formed as unions of differences of granulometric Þlters (Chen and Dougherty, 1997), and logical structural Þlters (LSFs), which determine whether a grain passes based on Þtting and nonÞtting requirements for structuring elements (Dougherty and Chen, 1998). This does not exhaust the subject. In particular, the underlying bandpass theory is very general and assumes neither reconstructive granulometries nor granular random sets (Dougherty, 1997). However, there is a unity to both the theory and the application of logical granulometric and structural Þltering, and it appears appropriate to treat them in a single exposition.
II. Euclidean Granulometries In this section we review some basic properties of openings and Euclidean granulometries (Matheron, 1975). We refer the reader to Dougherty (1999) and Dougherty and Chen (1999) for concise and fairly complete accounts of the classical algebraic theories of binary openings and granulometries, respectively, and to Heijmans (1995) for the algebraic theory of openings in the context of lattice theory. For subsets S and B of d-dimensional Euclidean space ℜd , the opening of S by (the structuring element) B is deÞned Ŵ B (S) = By (1) B y ⊂S
where B y , the translate of B by y, is deÞned by B y = {x + y: x ∈ B}. In Eq. (1), we use the operator notation, Ŵ B (S), for opening. This notation is most convenient for discussing operator properties. In later sections, when we focus on the form of the structuring element, we will switch to the binary-operation notation, S ◦ B. The set of all subsets of ℜd is denoted by P.
4
EDWARD R. DOUGHERTY AND YIDONG CHEN
Opening satisÞes four basic operator properties. It is 1. 2. 3. 4.
Translation invariant: Ŵ B (Sx ) = Ŵ B (S)x Increasing: S1 ⊂ S2 ⇒ Ŵ B (S1 ) ⊂ Ŵ B (S2 ) Antiextensive: Ŵ B (S) ⊂ S Idempotent: Ŵ B Ŵ B = Ŵ B
These properties will play a fundamental role throughout this article. The kinds of operators we consider may satisfy all or some of them. The properties say a good deal about the manner in which an operator acts on a granular set. All operators we consider will be translation invariant. A set A is said to be open with respect to B if Ŵ B (A) = A. We say that A is B-open. If A is B-open, then Ŵ A (S) ⊂ Ŵ B (S) and Ŵ A (Ŵ B (S)) = Ŵ B (Ŵ A (S)) = Ŵ A (S)
(2)
for any set S. The preceding equation says that if A is B-open, then iteratively opening by A and B is equivalent to opening by A, independent of which is applied Þrst. A disk is open with respect to any disk possessing a smaller radius. Opening can be generalized in accordance with the four preceding basic properties. A mapping : P →P is called an algebraic opening if it is increasing, antiextensive, and idempotent. Further, is called a τ -opening if it is translation invariant. The invariant class of any mapping , denoted by Inv[], is the collection of all sets S for which (S) = S. Owing to idempotence, for any algebraic opening and any set S, (S) ∈ Inv[]. If is a τ -opening, then S ∈ Inv[] if and only if Sy ∈ Inv[] for all y, which means that Inv[] is closed under translation. In fact, an algebraic opening is a τ -opening if and only if its invariant class is closed under translation. A class B is called a base for , or a base for Inv[], if Inv[] is the class generated by B under translations and unions. This means that S ∈ Inv[] if and only if there exists a subfamily {Bi }i∈I of B and points yi such that (Bi ) yi (3) S= i∈I
Bases for τ -openings are not unique. Indeed, Inv[] is a base for itself. To see this, we need to show only that Inv[] is closed under unions since we already know it is closed under translation. If Si ∈ Inv[] for all i ∈ I, then (∪i Si ) ⊃ (Si ) = Si for all i, and therefore (∪i Si ) ⊃ ∪i Si . The reverse inclusion follows from antiextensivity, so that ∪i Si ∈ Inv[]. The intent is to Þnd small bases that determine Þlter behavior. τ -Opening representation is based on openings by structuring elements: a mapping : P →P is a τ -opening if and only if there exists a class of sets B
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
5
such that (S) =
Ŵ B (S)
(4)
B∈B
Moreover, B is a base for . The preceding representation provides a Þlterdesign paradigm. If a set is composed of a disjoint union of grains (connected components), then unwanted grains can be eliminated according to their sizes relative to the structuring elements in the base B. A key to good Þltering is selection of appropriately sized structuring elements, since we wish to minimize elimination of desired grains and maximize elimination of undesired grains. Convex sets play a signiÞcant role in the construction of set operators. A set S is convex if, for any two points x, y ∈ S, the line segment between x and y is a subset of S. For any t > 0, t S is convex if S is convex. If A and B are convex sets, so is the opening of A by B. The following convexity property of openings is basic to many applications: if A is convex, then tA is A-open for any t ≥ 1. The converse is not generally valid; however, a key theorem states that it is valid under the assumption that A is compact: if A is a compact set, then t A is A-open for any t ≥ 1 if and only if A is convex. Convex sets have the property that, for t > r, Ŵt A (S) ⊂ Ŵr A (S). This is because t A = (t/r )r A is open with respect to r A. Having introduced basic properties of openings, we now turn to the classical granulometric theory. A one-parameter family {t }, t > 0, of τ -openings is called a granulometry if, for r ≥ s > 0, Inv[r ] ⊂ Inv[s ]. {t } is called a Euclidean granulometry if, for t > 0, t satisÞes the Euclidean property, which states that t (S) = t1 (S/t). The simplest Euclidean granulometry is a parameterized class of openings, {Ŵt B }. The most general Euclidean granulometry takes the form t (S) =
Ŵr B (S)
(5)
B∈G r ≥t
where G is a collection of sets called a generator of the granulometry. If the sets in G are compact (which we assume), then the double union of the preceding equation reduces to the single union t (S) =
Ŵt B (S)
(6)
B∈G
if and only if the sets in G are convex, in which case we shall say the granulometry is convex. The single union represents a parameterized τ -opening. If G consists of connected sets, and S1 , S2 , . . . are mutually disjoint compact
6
EDWARD R. DOUGHERTY AND YIDONG CHEN
sets, then t
∞ i=1
Si
=
∞
t (Si )
(7)
i=1
A granulometry that distributes over disjoint unions of compact sets will be called distributive. We restrict our attention to Þnite-generator convex Euclidean granulometries n Ŵt Bi (S) (8) t (S) = i=1
where G = {B1 , B2 , . . . , Bn } is a collection of compact, convex sets and t > 0, and where, for t = 0, we deÞne 0 (S) = S. For any Þxed t, t is a τ -opening and tG = {t B1 , t B2 , . . . , t Bn } is a base for t , meaning that set U ∈ Inv[t ] if and only if U can be represented as a union of translates of sets in tG. According to the size and shape of the components (grains) relative to the structuring elements, some components are eliminated, whereas others are either diminished or passed in full. The larger the value of t, the more grains are sieved from the set.
III. Logical Granulometries If our purpose is to Þlter a granular subset of ℜd by fully passing some grains and eliminating others, then the original deÞnition of a granulometry must be altered because, as deÞned, a granulometry typically diminishes passed grains. True sieving is accomplished by applying reconstruction. Given a set operator , the reconstructive operator induced by is deÞned by passing in full any component not completely eliminated by and eliminating any component eliminated by . We denote reconstruction by = . The reconstructive granulometry {t } induced by the granulometry {t } is deÞned by applying reconstruction to each t . Some grains pass the sieve; some do not. A reconstructive granulometry is a granulometry, but it is not Euclidean. Reconstructive openings belong to the class of connected operators (Crespo et al., 1995; Crespo and Schafer, 1997; Heijmans, 1999). These are operators that either pass or completely eliminate grains in both the set and its complement. Boundaries between S and S c cannot be broken or changed; they are either left as is or deleted. To discuss reconstruction in a more general parametric setting, we note that the representation of Eq. (8) can be generalized by separately parameterizing each structuring element, rather than simply scaling each by a common
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
7
parameter. The result is a family {r } of multiparameter τ -openings of the form n S ◦ Bk [rk ] (9) r (S) = k=1
where r1 , r2 , . . . , rn are parameter vectors governing the convex, compact structuring elements B1 [r1 ], B2 [r2 ], . . . , Bn [rn ] composing the base of r and r = (r1 , r2 , . . . , rn ). With our emphasis now on the form of the structuring element, we have switched to binary-operation notation for opening. To keep the notion of sizing, we require (here and subsequently) the sizing condition that rk ≤ sk implies Bk [rk ] ⊂ Bk [sk ] for k = 1, 2, . . . , n, where order in the vector lattice is deÞned by (t1 , t2 , . . . , tm ) ≤ (v1 , v2 , . . . , vm ) if and only if t j ≤ v j for j = 1, 2, . . . , m. A homothetic model arises in Eq. (9) when rk = rk is a positive scalar and there exist primitive sets B1 , B2 , . . . , Bn such that Bk [rk ] = rk Bk for k = 1, 2, . . . , n. Regarding the deÞnition of a granulometry, r is a τ -opening because any union of openings is a τ -opening. The Euclidean condition is not satisÞed. As for the required ordering of the invariant classes, since the parameter is now a vector, the condition does not apply as stated. The obvious generalization is to order the lattice composed of vectors r in the usual componentwise fashion and rewrite the condition in the following form: if r ≥ s > 0, then Inv[r ] ⊂ Inv[s ]. This restatement means that the mapping r → Inv[r ] is order preserving, and we shall say that any family {r } for which it holds is invariance ordered. If r is a τ -opening for any r and {r } is invariance ordered, then we call {r } a granulometry. The family deÞned by Eq. (9) is not necessarily a granulometry because it need not be invariance ordered. It is not generally true that r ≥ s > 0 implies r ⊂ s . As it stands, the family {r } deÞned by Eq. (9) is simply a collection of τ -openings over a parameter space. This is not to say that such operators are not useful, for indeed they are. However, they are not granulometries. Although Eq. (9) does not generally yield a granulometry without reconstruction, a salient special case occurs in the homothetic model when each generator set is multiplied by a separate scalar. Then, for any n-vector r = (r1 , r2 , . . . , rn ), rk > 0, for k = 1, 2, . . . , n, the Þlter takes the form r (S) =
n k=1
S ◦ rk Bk
(10)
To avoid useless redundancy, we suppose that no generator set is open with respect to another generator set. For any r = (r1 , r2 , . . . , rn ) for which there exists rk = 0, we deÞne r (S) = S. {r } is a multivariate granulometry (even without reconstruction), shares many statistical properties with univariate
8
EDWARD R. DOUGHERTY AND YIDONG CHEN
granulometries, and can be used for enhanced texture classiÞcation (Batman and Dougherty, 1997). Although the family {r } deÞned by Eq. (9) is not a granulometry, the induced reconstructive family {r } is a granulometry (since it is invariance ordered), and we call it a disjunctive granulometry. Moreover, reconstruction can be performed termwise rather than on the union: n n
S ◦ Bk [rk ] r (S) = (11) S ◦ Bk [rk ] = k=1
k=1
If the union of Eq. (9) is changed to an intersection and all conditions qualifying Eq. (9) hold, then the result is a family of multiparameter operators of the form r (S) =
n k=1
S ◦ Bk [rk ]
(12)
Each operator r is translation invariant, increasing, and antiextensive but, unless n = 1, r need not be idempotent. Hence, r is not generally a τ -opening and the family {r } is not a granulometry. Each induced reconstruction r is a τ -opening (is idempotent) but the family { r } is not a granulometry because it is not invariance ordered. However, if reconstruction is performed termwise, then the resulting intersection of reconstructions is invariance ordered and a granulometry. A conjunctive granulometry is a family of operators of the form r (S) =
n k=1
S ◦ Bk [rk ]
(13)
In the conjunctive case, the equality of Eq. (11) is softened to an inequality: the reconstruction of the intersection is a subset of the intersection of the reconstructions. Combining conjunction and disjunction yields the more general reconstructive granulometry r (S) =
mk n k=1 j=1
S ◦ Bk, j [rk, j ]
(14)
If Si is a component of S and xi,k, j and yi are the logical variables determined by the truth values of the equations Si ◦ Bk, j [rk, j ] = ⭋ and r (Si ) = ⭋ [or, equivalently, Si ◦ Bk, j [rk, j ] = Si and r (Si ) = Si ], respectively, then y possesses
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
9
Figure 1. A logical granulometry.
the logical representation yi =
mk n
xi,k, j
(15)
k=1 j=1
We call {r } a logical granulometry (Dougherty and Chen, 1997). Component Si is passed if and only if there exists k such that, for j = 1, 2, . . . , m k , there exists a translate of Bk, j [rk, j ] that is a subset of Si . Figure 1 shows a logical granulometry with the input image on the left. The upper row is a conjunctive operator using vertical and horizontal linear structuring elements, the lower row is a conjunctive operator using diagonal structuring elements, and the union of the two conjunctive outputs gives the logical granulometric output on the right. We wish to characterize optimal logical granulometric Þlters with respect to their ability to pass desired grains, the signal, and eliminate undesired grains, the noise (clutter). The signal-union-noise model is deÞned by S ∪ N , where S= N =
I i=1
J j=1
C[si ] + xi (16) D[n j ] + y j
I and J are random natural numbers; si = (si1 , si2 , . . . , sim ) and n j = (n j1 , n j2 , . . . , n jm ) are independent random vectors identically distributed to random vectors s and n; C[si ] and D[n j ] are random connected, compact grains governed by si and n j , respectively; and xi and y j are random translations governing grain locations constrained by grain disjointness. Error results from signal grains erroneously removed and noise grains erroneously passed. Optimization with respect to a logical granulometry {r } is
10
EDWARD R. DOUGHERTY AND YIDONG CHEN
achieved by Þnding r to best estimate S by r (S ∪ N ). This is accomplished by minimizing the expected area, e[r] = E[ν[r (S ∪ N )S]]
(17)
where ν and denote area (Lebesgue measure) and symmetric difference, respectively. Because grains are disjoint, the signal and noise are disjoint and r (S ∪ N ) = r (S) ∪ r (N ). Because C[s] is a random set depending on the multivariate distribution of s, the parameter set MC[s] = {r: r (C[s]) = C[s]}
(18)
is a random set composed of parameter vectors r for which r passes the random primary grain C[s]. Mc[s] and M D[n] , called the signal and noise pass sets, are the regions in the parameter space where signal and noise grains, respectively, are passed. We often write MC[s] and M D[n] as M S and M N , respectively. As functions of s and n, M S = M S (s1 , s2 , . . . , sm ) and M N = M N (n 1 , n 2 , . . . , n m ). Filter error corresponding to the parameter r can be expressed via the signal and noise pass sets in conjunction with the densities f S and f N for s and n. The error consists of noise grains passed and signal grains not passed. If we let χ A denote the indicator function for set A, then ⎞⎤ ⎡ ⎛ ⎞⎤ ⎡ ⎛ D[n j ]⎠⎦ C[si ]⎠⎦ + E ⎣ν ⎝ e[r] = E ⎣ν ⎝ =E
I i=1
r∈M / C[si]
r∈M D[n j]
J ν (C[si ]) 1 − χMC[s j] (r) + E ν(D[n j ])χM D[n j] (r)
j=1
= E[I ]E ν (C[s]) 1 − χMC[s] (r) + E[J ]E ν(D[n])χM D[n] (r) = E[I ] · · · ν[C](s) f S (s) ds + E[J ] · · · ν[D](n) f N (n) dn {s:r∈M / C[s] }
{n:r∈M D[n] }
(19) where in the last expression we have written ν[C](s) instead of ν(C[s]), and similarly for the noise. These integrals can pose serious computational difÞculties. A special situation occurs when M S and M N are random rectangles. For two parameters, if M S is a rectangle of dimensions M S,1 × M S,2 with lowerleft corner situated at the origin, then (r1, r2) ∈ M S if and only if r1 ≤ M S,1
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
11
and r2 ≤ M S,2 . When inclusion of r in both pass sets can be expressed in terms of half-line inclusions of the component parameters, we say the model is separable and Eq. (19) reduces to −1 −1 MS,1 (r2 ) (r1 ) M S,2 ν[C](s1 , s2 ) f S (s1 , s2 ) ds2 ds1 e[r] = E[I ] −∞
+ E[J ]
−∞
∞
M N−1,1 (r1 )
∞
M N−1,2 (r2 )
ν[D](n 1 , n 2 ) f N (n 1 , n 2 ) dn 2 dn 1
(20)
The obvious reduction occurs for a single parameter r. If {r } is induced by a Euclidean granulometry {r } deÞned by Eq. (6), then, for any compact set X, the {r }-size (granulometric size) of X is deÞned by M X = sup{r : r (X ) = ⭋}, where assuming a Þnite generator, the maximum is attained because X is compact. For the univariate granulometry {r }: M S = sup MC[s] , M N = sup M D[n] , and the domains of integration for the Þrst and second integrals reduce to M S > r and M N ≤ r , respectively. Example III.1 For a mathematically straightforward example, consider the model of Eq. (16), and let the primary signal grain C[s] be a randomly rotated ellipse with random axis lengths 2u and 2v, the primary noise grain D[n] be a randomly rotated rectangle with random sides of length 2w and 2z, grain placement be constrained by disjointness, and the Þlter be generated by the opening r (S) = S ◦ r B, where B is the unit disk. This means that r (S) =
S ◦ r B. Then M S = min{u, v}, M N = min{w, z}, ν[C[u, v]] = π uv, and ν[D[w, z]] = 4wz. With f denoting probability densities and assuming the four sizing variables are independent, r ∞ ∞ r e[r ] = π E[I ] uv f (u) f (v) du dv + uv f (u) f (v) du dv 0
+ 4E[J ]
r
0
∞ ∞
r
wz f (w) f (z) dw dz
0
(21)
r
Suppose u and v are gamma distributed with parameters α and β, and w and z are exponentially distributed with parameter b. For model parameters α = 12, β = 1, b = 0.2, and E[I ] = E[J ] = 20, minimization of Eq. (21) occurs for r = 5.95 and e[5.95] = 1036.9. Because the total expected area of the signal is 18,086.4, the percentage of error is 5.73%. Example III.2 For a conjunctive example, let the primary signal grain C[s] be a nonrotated cross with each bar of width 1 and random length 2w ≥ 1 and the primary noise grain D[n] be a nonrotated cross with each bar of width 1, one bar of length z ≥ 1, and the other bar of length 2z. Constrain grain placement
12
EDWARD R. DOUGHERTY AND YIDONG CHEN
by disjointness and deÞne the Þlter by r (S) = (S ◦ r E) ∩ (S ◦ r F), where E and F are unit-length vertical and horizontal lines, respectively. Then M S = 2w, M N = z, ν[C[w]] = 4w − 1, ν[D[z]] = 3z − 1, and ∞ r/2 (4w − 1) f (w) dw + E[J ] (3z − 1) f (z) dz (22) e[r ] = E[I ] 0
r
Under the disjointness assumption, Eq. (19) is applied directly in terms of the probability models governing signal and noise. If grains are not disjoint, then segmentation needs to be performed and the model applies to the segmented grains. Segmentation is usually performed by the morphological watershed operator (Beucher and Meyer, 1992; Meyer and Beucher, 1990). Given the distributions of C[s] and D[n], it is necessary to Þnd the distributions of (C[s]) and (D[n]). Finding the output random-set distribution for the watershed is generally very difÞcult and involves statistical modeling of grain overlapping. For many granular images, when there is overlapping it is often very modest, with the probability of increased overlapping diminishing rapidly. The watershed produces a segmentation line between grains and its precise geometric effect depends on the random geometry of the grains and the degree of overlapping, which is itself random. Even when input grain geometry is very simple, output geometry can be very complicated (as well as dependent on overlap statistics). This problem has been addressed in the context of disjunctive granulometric optimization for circular grains (Dougherty and Cuciurean-Zapan, 1997).
IV. Adaptive Single-Parameter Disjunctive Granulometric Filters To circumvent the mathematical obstacles in deriving an optimal Þlter, as well as the difÞculty of estimating process statistics, we can use adaptive approaches to obtain a (hopefully) close-to-optimal Þlter. In adaptive design, a sequence of realizations, S1, S2, S3, . . . , is made and the Þlter is applied to each realization. r is adapted based on a criterion of goodness relating r (Sn ) and S. Adaptations yield a random-vector time series r0, r1, r2, r3, . . . resulting from transitions rn → rn+1, where rn is the state of the process at time n and r0 is the initial state vector. There are conditions on the scanning process, the form of the Þlter, and the adaptation protocol that result in rn being a Markov chain whose state space is the parameter space of r. When so, adaptive Þltering is characterized by the behavior of the Markov chain rn, which can be assumed to possess a single irreducible class. Convergence of the adaptive Þlter means existence of a steady-state distribution. Filter characteristics are the stationary characteristics of the Markov chain (mean function, covariance function, etc.) in the steady
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
13
state. Our adaptive estimate of the optimal Þlter is the Þlter r , where r is the mean vector of rn in the steady state. Owing to ergodicity in the steady state, r can be estimated from a single realization (from a single sequence of realizations S1, S2, S3, . . .). The size of the time interval over which rn needs to be averaged in the steady state for a desired degree of precision can be computed from the steady-state variance of rn. A. Transition Probabilities Here we consider adaptive design in the context of a single-parameter disjunctive granulometry n S ◦ r Bi (23) r (S) = i=1
We initialize the Þlter r and scan S ∪ N to successively encounter grains. The adaptive Þlter will be of the form r (n) , where n corresponds to the nth grain encountered. When a grain G Òarrives,Óthere are four possibilities: a. b. c. d.
G G G G
is a noise grain and r (n) (G) = G is a signal grain and r (n) (G) = ⭋ is a noise grain and r (n) (G) = ⭋ is a signal grain and r (n) (G) = G
(24)
In the latter two cases, the Þlter has acted as desired; in the Þrst two, it has not. Consequently, we employ the following adaptation rules: i. r → r + 1 ii. r → r − 1 iii. r → r
if condition a occurs if condition b occurs if condition c or d occurs
(25)
Each arriving grain determines a step and we treat r (n) as the state of the system at step n. Since all grain sizes are independent and there is no grain overlapping, r (n) determines a discrete state-space Markov chain over a discrete parameter space. Three positive stationary transition probabilities are associated with each state r: i. pr,r +1 = P(N )P(r (G) = G) ii. pr,r −1 = P(S)P(r (G) = ⭋) iii. pr,r = P(S)P(r (G) = G) + P(N )P(r (G) = ⭋)
(26)
where P(S) and P(N ) are the probabilities of a signal grain and a noise grain arriving, respectively. P(S) and P(N ) depend on the protocol for selecting grains. We will discuss these subsequently.
14
EDWARD R. DOUGHERTY AND YIDONG CHEN
The transition probabilities can be expressed in terms of granulometric measure: i. pr,r +1 = P(N )P(M N ≥ r ) ii. pr,r −1 = P(S)P(M S < r ) iii. pr,r = P(S)P(M S ≥ r ) + P(N )P(M N < r )
(27)
For clarity, we develop the theory with r a nonnegative integer and transitions of plus or minus one; in fact, r need not be an integer and transitions could be of the form r → r + ε and r → r − ε, where ε is some positive constant. B. Steady-State Distribution Equivalence classes of the Markov chain are determined by the distributions of M S and M N . To avoid trivial anomalies, we assume distribution supports are intervals with endpoints a S < b S and a N < b N , where 0 ≤ a S , 0 ≤ a N , and it may be that b S = ∞ or b N = ∞. We assume a N ≤ a S < b N ≤ b S . Non-null intersection of the supports ensures that the adaptive Þlter does not trivially converge to an optimal Þlter that totally restores S. There are four cases regarding state communication: (1) Suppose a S ≤ 1 and b N = ∞: then the Markov chain is irreducible since all states communicate (each state can be reached from every other state in a Þnite number of steps). (2) Suppose 1 ≤ a S and b N = ∞: then, for each state r ≤ a S , r is accessible from state s if s < r , but s is not accessible from r; on the other hand, all states r ≥ a S communicate and form a single equivalence class. (3) Suppose a S < 1 and b N < ∞: then, for each state r ≥ b N , r is accessible from state s if s > r , but s is not accessible from r; on the other hand, all states r ≤ b N communicate and form a single equivalence class. (4) Suppose 1 ≤ a S < b N < ∞: then states below a S are accessible from states below themselves, but not conversely; states above b N are accessible from states above themselves, but not conversely; and all states r such that a S ≤ r ≤ b N communicate and form a single equivalence class. In sum, the states between a S and b N form an irreducible equivalence class C of the state space and each state outside C is transient. With certainty, the chain will eventually enter C and once inside C will not leave. Thus, we focus our attention on C. Within C, the chain is irreducible and aperiodic. If it is also positive recurrent, then it will be ergodic and possess a stationary (steady-state) distribution. If b N is Þnite, then the state space is Þnite and the Markov chain must be positive recurrent and have a stationary distribution. We need to Þnd and prove the existence of a stationary distribution when b N = ∞. Without loss of generality, we assume a S < 1 so that the chain is irreducible and aperiodic over the state space {0, 1, 2, . . .}. We will take an engineering approach to arrive
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
15
at the stationary distribution via the state probabilities, and then comment on the existence of the distribution. Let pr (n) be the probability that the system is in state r at step n. According to the ChapmanÐKolmogorov equation and the transition probabilities of Eq. (27), pr (n + 1) = P(N )P(M N ≥ r − 1) pr −1 (n) + P(S)P(M S < r + 1) pr +1 (n) + [P(S)P(M S ≥ r ) + P(N )P(M N < r )] pr (n)
(28)
Let λk = P(N )P(M N ≥ k), and μk = P(S)P(M S < k). For r = 0, there are only two possible transitions, r → r and r → r + 1, and μ0 = 0. Since P(S) + P(N ) = 1 P(M S < r ) = 1 − P(M S ≥ r )
(29)
P(M N < r ) = 1 − P(M N ≥ r ) we obtain pr (n + 1) − pr (n) = P(N )P(M N ≥ r − 1) pr −1 (n) + P(S)P(M S < r + 1) pr +1 (n) − (P(S)P(M S < r ) + P(N )P(M N ≥ r )) pr (n) = λr −1 pr −1 (n) + μr +1 pr +1 (n) − (λr + μr ) pr (n)
(30)
for r ≥ 1. For r = 0, p−1 (n) = 0 and μ0 = 0 yield the initial state equation p0 (n + 1) − p0 (n) = μ1 p1 (n) − λ0 p0 (n)
(31)
Equations (30) and (31) are the forward Kolmogorov equations for the system. In the steady state, the left-hand sides of both equations become 0, so these equations form the system 0 = μ1 p1 − λ0 p0 (32) 0 = λr −1 pr −1 + μr +1 pr +1 − (λr + μr ) pr (r ≥ 1) Using the boundary condition ∞ k=0
pk = 1
and solving iteratively yields ⎧ λ0 ⎪ ⎪ ⎪ ⎨ p1 = μ1 p0 r λk−1 ⎪ ⎪ ⎪ p = p r 0 ⎩ μ k=1
k
(33)
(34) (r ≥ 1)
16
EDWARD R. DOUGHERTY AND YIDONG CHEN
where p0 =
1+
1 !r
∞ r =1
λk−1 k=1 μk
(35)
so long as p0 > 0, which we prove next. Solution of the forward Kolmogorov equations in the steady state to arrive at the stationary distribution is justiÞed here because it leads to the same system that must be solved to rigorously demonstrate existence and derive the stationary distribution. This solution will be complete once we establish convergence of the sum in Eq. (35). To establish this convergence, note that P(S) > 0, P(N ) > 0, P(M N ≥ k − 1) → 0 as k → ∞, and P(M S < k) → 1 as k → ∞. Hence, there exists k0 and q < 1 such that P(N )P(M N ≥ k − 1) λk−1 = 0, the mean tends toward c; for D = 0, the mean is (c + b)/2. From Eqs. (40) and (58), the steady-state expected error is E[e[r ]] =
b r =c
r 3 − c3 b3 − r 3 E[T ] P(S) 3 + P(N ) 3 pr d − c3 b − a3
(61)
24
EDWARD R. DOUGHERTY AND YIDONG CHEN
It can be shown that (Chen and Dougherty, 1996) ' P(S) P(N ) min 3 , E[T ](b3 − c3 ) d − c3 b3 − a 3 ' P(N ) P(S) , E[T ](b3 − c3 ) ≤ E[e[r ]] ≤ max 3 d − c3 b3 − a 3
(62)
The optimal Þlter has an error bounded by the expected steady-state error for the adaptive Þlter. For the special case D = 0, the two errors must agree because all Þlters whose parameters lie in the single recurrent class of the Markov chain have equal error. For D = 0, E[e[r ]] = E[T ]P(N )
b3 − c3 b3 − a 3
(63)
In this section we have concentrated on steady-state analysis and ignored transient analysis. Moreover, we have not taken advantage of birthÐdeath modeling for Markov chains. We defer to Chen and Dougherty (1996) for discussions of both transient analysis and birthÐdeathmodeling. We will continue to concentrate on the steady state and leave transient analysis to the references.
V. Adaptation in a Multiparameter Disjunctive Model Modeling is more complicated when there is more than one parameter. The mathematical description is more involved and there are various adaptation protocols that can be adopted. Here we restrict ourselves to a two-parameter disjunctive granulometry {r } induced by {r } with r = (r1 , r2 ). In accordance with r being a sizing parameter and increasing r resulting in diminishing Þlter outputs, r is decreasing relative to r: if r′ = (r1′ , r2′ ), r1′ ≤ r1 , and r2′ ≤ r2 , then r ≤ r′ . When a grain G arrives, the possibilities of Eq. (24) remain, with r replacing r. For two parameters, we employ the following generic adaptation rules: i. r1 → r1 + 1 and/or r2 → r2 + 1 ii. r1 → r1 − 1 and/or r2 → r2 − 1 iii. r1 → r1 and r2 → r2
if condition a occurs if condition b occurs if condition c or d occurs
(64)
Assuming that grain arrivals and primary-grain realizations are independent, (r1 , r2 ) determines a two-dimensional, discrete-state-space Markov chain. The
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
25
protocol is generic because an actual protocol depends on interpretation of both and/ors, which depends on the form of the granulometry. We consider some prototypical models. These depend on the form of the Þlter and whether or not we have information as to which parameter caused the structuring element not to Þt inside a grain. Although it is possible to create more reÞned models when more information is available or when other transition arrangements are deÞned, study of the four two-parameter systems discussed next provides sufÞcient understanding for analyzing other two-parameter systems. If there are more than two parameters, then more complex Markov systems arise (Chen and Dougherty, 1999). To avoid a trivial optimal solution, we assume non-null intersection between signal and noise pass sets. Each transition diagram will show an equivalence class C of communicating states and omit transient states. With certainty, the chain will enter C (assumed nonempty) and not leave once inside. Within C the chain is aperiodic. If it is also positive recurrent, then it will be ergodic and possess a stationary distribution.
A. State Transition Probability Equations 1. Type-[I, 0] Model If r is an opening by one two-parameter structuring element, then, unless there is information as to which parameter causes nonÞtting, adaptation must proceed solely on the basis of whether or not a translate of the structuring element Þts within the grain. If adaptation is limited to allow only one parameter to transition at each step, then, when a noise grain is passed, a randomly selected parameter is incremented and, when a signal grain is not passed, a randomly selected parameter is decremented. This deÞnes the type-[I, 0] model. The following transition probabilities result: i. p(r1 , r2 ), (r1 +1, r2 ) = 12 P(N )P((r1 , r2 ) ∈ M N )
ii. p(r1 , r2 ), (r1 , r2 +1) = 12 P(N )P((r1 , r2 ) ∈ M N )
iii. p(r1 , r2 ), (r1 −1, r2 ) = 21 P(S)P((r1 , r2 ) ∈ M S )
(65)
1 2
iv. p(r1 ,r2 ), (r1 , r2 −1) = P(S)P((r1 , r2 ) ∈ M S )
v. p(r1 , r2 ), (r1 , r2 ) = P(S)P((r1 , r2 ) ∈ M S ) + P(N )P((r1 , r2 ) ∈ M N )
To illustrate the type-[I, 0] model, suppose both signal and noise grains are rectangular with signal-grain and noise-grain widths and heights randomly
26
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 7. Signal and noise size distribution ranges for a type-[I, 0] model.
distributed over the respective labeled cross-shaded regions in Figure 7. Let the structuring element be a rectangle with sides r1 and r2 . When r1 < a and r2 < c, signal grains will always pass. States in this region are transient because both r1 and r2 can only transition upward throughout the region. When r1 > b or r2 > d, noise grains will never pass, so states in the region deÞned by these inequalities are also transient. Nontransient states occupy the remaining portion of the Þrst quadrant. A transition diagram corresponding to Figure 7 is shown in Figure 8. There are no diagonal transitions because only one parameter can adapt at each stage. As long as the signal and noise parameters are bounded, the transition diagram is Þnite. Transition probabilities at boundary states must be treated carefully. Figure 8 gives one type of boundary conÞguration; others will occur for different models. Let pr1 ,r2 (n) denote the probability that the system is in state (r1 , r2 ) at step n and let p(r1 ,r2 ),(r1 ±1,r2 ±1) denote the transition probability. For a type-[I, 0] system, internal (nonboundary) states are typiÞed by the internal states of the transition diagram of Figure 8. According to the ChapmanÐKolmogorov equation, at any internal state, pr1 ,r2 (n + 1) = p(r1 −1,r2 ),(r1 ,r2 ) pr1 −1,r2 (n) + p(r1 ,r2 −1),(r1 ,r2 ) pr1 ,r2 −1 (n) + p(r1 +1,r2 ),(r1 ,r2 ) pr1 +1,r2 (n) + p(r1 ,r2 +1),(r1 ,r2 ) pr1 ,r2 +1 (n) + p(r1 ,r2 ),(r1 ,r2 ) pr1 ,r2 (n)
(66)
Let λ1,r1 ,r2 , λ2,r1 ,r2 , μ1,r1 ,r2 , and μ2,r1 ,r2 denote the transition probabilities (i) through (iv) of Eq. (65), respectively. Since P(S) + P(N ) = 1, the transition
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
27
Figure 8. Type-[I, 0] model: transition diagram corresponding to Figure 7.
probability for case (v) of Eq. (65) is rewritten as p(r1 ,r2 ),(r1 ,r2 ) = P(S)P((r1 , r2 ) ∈ M S ) + P(N )P((r1 , r2 ) ∈ M N ) = P(S)[1 − P((r1 , r2 ) ∈ M S )] + P(N )[1 − P((r1 , r2 ) ∈ M N )] = 1 − P(S)P((r1 , r2 ) ∈ M S ) − P(N )P((r1 , r2 ) ∈ M N ) = 1 − (λ1,r1 ,r2 + λ2,r1 ,r2 + μ1,r1 ,r2 + μ2,r1 ,r2 )
(67)
Substitution into Eq. (66) yields the state-probability increment for internal states: pr1 ,r2 (n + 1) − pr1 ,r2 (n) = λ1,r1 −1,r2 pr1 −1,r2 (n) + λ2,r1 ,r2 −1 pr1 ,r2 −1 (n) + μ1,r1 +1,r2 pr1 +1,r2 (n) + μ2,r1 ,r2 +1 pr1 ,r2 +1 (n) − (λ1,r1 ,r2 + λ2,r1 ,r2 + μ1,r1 ,r2 + μ2,r1 ,r2 ) pr1 ,r2 (n) (68) Boundary states depend on the form of C, which depends on the distributions of signal and noise parameter vectors, and must be treated separately.
28
EDWARD R. DOUGHERTY AND YIDONG CHEN
2. Type-[I, 1] Model For this model, again consider a two-parameter opening, but now assume that if a signal grain is erroneously not passed, it is known which parameter has caused the erroneous decision. There are three possibilities: (1) r1 , but not r2 , causes the structuring element not to Þt inside the grain; (2) r2 , but not r1 , causes the structuring element not to Þt; (3) both r1 and r2 cause the structuring element not to Þt. Given a signal grain G, the three conditions can be rigorously stated in the following manner: (1) r (G) = ⭋ and there exists r1′ such that, for r′ = (r1′ , r2 ), r′ (G) = G; (2) r (G) = ⭋ and there exists r2′ such that, for r′ = (r1 , r2′ ), r′ (G) = G; (3) there does not exist an r1′ or an r2′ satisfying the preceding conditions. No such convenient characterization exists when a noise grain erroneously passes, since both parameters must be set so that the structuring element can Þt. The type[I, 1] model is illustrated by the rectangular signal and noise grains envisioned for Figure 7, along with r1 × r2 rectangular structuring elements. The three cases correspond to the structuring-element width, height, or both being too great. Referring to Figure 8, suppose r is in state (4, 3) and an arriving signal grain is not passed. Since the minimum value of r2 is 3 for a signal grain, lack of Þt must be caused by r1 and the transition is (4, 3) → (3, 3). Hence, the two left-most columns of Figure 8 cannot be re-entered. A similar comment applies to the bottom-most two rows. The appropriate transition diagram is the one shown in Figure 9. Note that boundary-state transitions are different. A fully general description of the type-[I, 1] model involves a description of the transition probabilities analogous to Eq. (65). The resulting equations are
Figure 9. Type-[I, 1] model: transition diagram.
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
29
very cumbersome. If the model is separable, then the transition probabilities are given by i. p(r1 , r2 ), (r1 +1, r2 ) = 21 P(N )P((M N ,1 ≥ r1 ) ∩ (M N ,2 ≥ r2 ))
ii. p(r1 , r2 ), (r1 , r2 +1) = 12 P(N )P((M N ,1 ≥ r1 ) ∩ (M N ,2 ≥ r2 ))
iii. p(r1 , r2 ),(r1 −1, r2 ) = P(S)P((M S,1 < r1 ) ∩ (M S,2 ≥ r2 ))
+ 21 P(S)P((M S,1 < r1 ) ∩ (M S,2 < r2 ))
iv. p(r1 , r2 ), (r1 , r2 −1) = P(S)P((M S,1 ≥ r1 ) ∩ (M S,2 < r2 ))
(69)
+ 21 P(S)P((M S,1 < r1 ) ∩ (M S,2 < r2 ))
v. p(r1 ,r2 ), (r1 , r2 ) = P(S)P((M S,1 ≥ r1 ) ∩ (M S,2 ≥ r2 ))
+ P(N )P((M N ,1 < r1 ) ∩ (M N ,2 < r2 ))
The only difference between a type-[I, 0] system and a type-[I, 1] system is that μ1,i, j and μ2,i, j are different. For the type-[I, 1] system, let λ1,r1 ,r2 , λ2,r1 ,r2 , μ1,r1 ,r2 , and μ2,r1 ,r2 denote transition probabilities (i) through (iv) of Eq. (69), respectively. The ChapmanÐKolmogorov equation again yields Eq. (68). Once again, boundary equations are derived separately. 3. Type-[II, 0] Model Now suppose r is a union of two openings, each by a structuring element Bi [ri ] depending on a single parameter. The type-[II, 0] model results if there is no information about which structuring element causes nonÞtting. The transition probabilities are given by Eq. (65). Because the Þlter is formed as a union of openings, Þttings of B1 [r1 ] and B2 [r2 ] are checked separately, the model is separable, and the transition-probability equations reduce to i. p(r1 , r2 ), (r1 +1, r2 ) = 21 P(N )P M NB1 ≥ r1 ∪ M NB2 ≥ r2 ii. p(r1 , r2 ), (r1 , r2 +1) = 21 P(N )P M NB1 ≥ r1 ∪ M NB2 ≥ r2 iii. p(r1 , r2 ), (r1 −1, r2 ) = 21 P(S)P M SB1 < r1 ∩ M SB2 < r2 (70) iv. p(r1 , r2 ), (r1 , r2 −1) = 12 P(S)P M SB1 < r1 ∩ M SB2 < r2 v. p(r1 , r2 ), (r1 , r2 ) = P(S)P M SB1 ≥ r1 ∪ M SB2 ≥ r2 + P(N )P M NB1 < r1 ∩ M NB2 < r2
where M SB is the granulometric size of the primary signal grain relative to the structuring element B. The adaptive Þlter has unbounded pass sets and therefore the state space may extend to inÞnity. A typical state transition diagram is shown in Figure 10. For the type-[II, 0] model, internal- and boundary-state analysis is similar to that for the type-[I, 0] model, but the state space may be inÞnitely extended, even when signal and noise parameter distributions are Þnite.
30
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 10. Type-[II, 0] model: typical transition diagram.
4. Type-[II, 1] Model This model occurs when Þtting information is fed back. If a signal grain erroneously does not pass, then neither structuring element Þts. Hence there must be a randomization regarding the choice of parameter to decrement. A typical transition diagram is shown in Figure 11. The transition probabilities are expressed via granulometric size by i. p(r1 , r2 ), (r1 +1, r2 ) = P(N )P M NB1 ≥ r1 ∩ M NB2 < r2 ii. p(r1 , r2 ), (r1 , r2 +1) = P(N )P M NB1 < r1 ∩ M NB2 ≥ r2 iii. p(r1 , r2 ), (r1 +1, r2 +1) = P(N )P M NB1 ≥ r1 ∩ M NB2 ≥ r2 iv. p(r1 , r2 ), (r1 −1, r2 ) = 21 P(S)P M SB1 < r1 ∩ M SB2 < r2 v. p(r1 , r2 ), (r1 , r2 −1) = 12 P(S)P M SB1 < r1 ∩ M SB2 < r2 vi. p(r1 , r2 ), (r1 , r2 ) = P(S)P M SB1 ≥ r1 ∪ M SB2 ≥ r2 + P(N )P M NB1 < r1 ∩ M NB2 < r2
(71)
31
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
Figure 11. Type-[II, 1] model: typical transition diagram.
The state space is not inÞnite when the noise density is bounded since only Þtting structuring elements get adapted for passing noise grains. As seen from Figure 11, internal-state analysis for a type-[II, 1] system is different owing to diagonal transitions. Let λ1,r1 ,r2 , λ2,r1 ,r2 , λ3,r1 ,r2 , μ1,r1 ,r2 , and μ2,r1 ,r2 denote transition probabilities (i) through (v) of Eq. (71), respectively. The ChapmanÐKolmogorov equation yields pr1 ,r2 (n + 1) − pr1 ,r2 (n) = λ1,r1 −1,r2 pr1 −1,r2 (n) + λ2,r1 ,r2 −1 pr1 ,r2 −1 (n) + λ3,r1 −1,r2 −1 pr1 −1,r2 −1 (n) + μ1,r1 +1,r2 pr1 +1,r2 (n) + μ2,r1 ,r2 +1 pr1 ,r2 +1 (n) − (λ1,r1 ,r2 + λ2,r1 ,r2 + λ3,r1 ,r2 + μ1,r1 ,r2 + μ2,r1 ,r2 ) pr1 ,r2 (n)
(72)
Left and bottom boundary conditions are partitioned into three cases (equations not shown): r1 = r2 = 2, r1 = 1 and r2 ≥ 2, and r2 = 1 and r1 ≥ 2. Whether right and upper boundary states exist depends on the signal and noise parameter distributions.
B. Numerical Analysis of Steady-State Behavior Convergence for two-parameter adaptive systems is characterized by the existence of steady-state probability distributions. If it exists, the steady-state
32
EDWARD R. DOUGHERTY AND YIDONG CHEN
distribution is deÞned by the limiting probabilities pr1 ,r2 = lim pr1 ,r2 (n)
(73)
n→∞
Because of complicated boundary-state conditions, it is extremely difÞcult to obtain general solutions for the systems discussed, although we are assured of the existence of a steady-state solution when the state space is Þnite. Setting p(n + 1) − p(n) = 0 does not appear to work. The difÞculty of Þnding a solution is supported by a queuing interpretation. A typical Markovian queuing network consists of N nodes, n 1 , n 2 , . . . , n N , each node being a Markov queue. Jobs may arrive at a node from other nodes or from an external source (open network) or jobs may arrive at a node only from other nodes and the total job number in the system is Þxed (closed network). The arrival rate of outside jobs at node i follows a Poisson distribution with parameter γi and there is a probability ri j that a job may complete service at node i and then go to node j. The service rate at node i is exponentially distributed with mean μi . The arrival and service rates are not functions of the system state. All arrivals are independent and all servers at a given node are identical. Job arrivals in the adaptive opening scheme correspond to arriving noise grains that should not pass but do; service completions correspond to arriving signal grains that should pass but do not. A key difference is that the arrival and service rates of an adaptive opening Þlter are related to the state of the system (the numbers of jobs in all queues). In the typical Markovian queuing network these rates are constant, or at least are not dependent on the numbers of jobs in other queues. Also, after a job in the queuing network of an adaptive opening completes its service, it leaves the system. There is no event that upon reduction of one structuring element requires an increase in another structuring element. In the Markovian queuing network, interaction between queues does not happen as a job completes service in a queue. In the adaptive opening network, a job arrival or departure from one queue affects the arrival and service rates of all other queues because Þtting and nonÞtting depend on all sizing parameters. The steady-state system equation for the Markovian queuing network with constant parameters γi and μi for queue i is given by N i=1
γi pn;i− +
=
N i=1
N N j=1 i=1
γi pn +
N i=1
μi ri j pn;i−, j+ +
μi (1 − rii ) pn
N
μi ri0 pn;i+
i=1
(74)
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
33
where pn denotes p(n 1 , n 2 , . . . , n N ); pn;i− denotes pn except that its ith component is n i − 1; pn;i+ denotes pn except that its ith component is n i + 1; and pn;i−, j+ denotes pn except that its ith component is n i − 1; and its jth component is n j + 1 (Jackson, 1957, 1963). For k = 2, ri j = 0, and ri0 = 1, Eq. (74) reduces to Eq. (72), except that λ and μ must be functions of the state, λn and μn , respectively. Each node (queue) in the network behaves as if it were an independent M/M/1 system with a Poisson input λi that satisÞes the equation λi = γi +
N
(75)
λ j r ji
j=i
for i = 1, 2, . . . , N . Thus, the steady-state probability associated with each state has a product form, p(n 1 , n 2 , . . . , n N ) = where pi (n i ) = p0,i
%
N
pi (n i )
(76)
i=1
λi μi
&k
(77)
is the classical solution for the ith queue with n i jobs in the queue. The product form of the Markovian queuing network has been greatly studied and generalized to many other queuing disciplines. The difÞculty for adaptive disjunctive granulometries is that the arrival/service rates are not constant at a given queue and therefore the product solution does not apply. Given these theoretical difÞculties, we now proceed numerically, throughout assuming signal and noise parameters to be uniformly distributed over the square regions from [7, 7] to [16, 16] and from [5, 5] to [14, 14], respectively. The large overlap between uniform distributions will result in optimal Þlters with relatively large errors, but it will serve our purposes by creating rather dispersed steady-state distributions whose geometry can easily be apprehended. We assume the arrival probabilities of signal and noise grains to be P(S) = 2/3 and P(N ) = 1/3. We Þrst treat the type-[I, 1] model because of the simple form of its steadystate distribution. Numerical simulation yields the steady-state density of Figure 12. The probability mass is concentrated about the center of mass (8.126, 8.216). In practice, the adaptive Þlter is selected by scanning realizations. The expected error in the steady state can be computed by using the numerically computed steady-state density. We denote this error by E ss [e[r]]. From the numerical density of Figure 8, E ss [e[r]] = 0.305E[T ]. This error
34
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 12. Numerically computed steady-state distribution for the type-[I, 1] model.
slightly exceeds the error at the point closest to the center of the probability mass, e[(8, 8)] = 0.292E[T ]. In the case of rectangular signal grains, noise grains, and single structuring element of dimensions r1 × r2 , Eq. (20) becomes r1 r2 ∞ ∞ e[r] = E[A] x y f S (x, y) d y d x + E[B] x y f N (x, y) d y d x 0
= E[T ] +
0
P(S)
r1
μ(1,1) 0 S P(N ) ∞ μ(1,1) N
r1
r1 ∞
r2
r2
r2
x y f S (x, y) d y d x 0
x y f N (x, y) d y d x
(78)
where μ(1,1) and μ(1,1) are the uncentered mixed second-order moments of f S S N and f N , respectively. If signal and noise parameters are uniformly distributed over the rectangles determined by points (a1 , a2 ) and (b1 , b2 ), and (c1 , c2 ) and (d1 , d2 ), respectively, then the algebraic expression for e[r] can be derived; however, because the expression is cumbersome, we omit it. The error surface has been computed by employing the distributions used for obtaining the adaptive Þlter. Minimal error occurs at (7, 7) with e[(7, 7)] = 0.246E[T ]. The increase in E ss [e[r]] = 0.305E[T ] over e[(7, 7)] represents the cost of adaptation. Given the large overlap between the signal- and noise-parameter distributions, a cost of 0.059E[T ] is reasonable. The center of the steady-state
35
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
0.04 0.03
15
0.02 0.01 10 0
5 5 10
15 Figure 13. Numerically computed steady-state distribution for the type-[I, 0] model.
probability mass is at (8, 8) and the points nearby to (7, 7) have increased errors e[(7, 8)] = e[(8, 7)] = 0.270E[T ] and e[(8, 8)] = 0.292E[T ]. The numerically computed steady-state distribution for the type-[I, 0] model (using the same signal and noise distributions) is shown in Figure 13. From the type-[I, 0] state diagram, we see that the steady-state distribution has a band of probability mass concentrated in a curve running internally to the transition diagram. Existence of such mass is reasonable since either an appropriate width or height of the structuring rectangle can discriminate signal from noise. Adaptation is likely to yield a suitable structuring rectangle whose dimensions lie along the concentration of steady-state probability mass. For this example, adaptation will likely lead to a rectangle of dimensions r1 × r2 , where r1 ≈ 9.5 and r2 ≤ 10 or r2 ≈ 9.5 and r1 ≤ 10. The numerically computed steady-state distribution for the type-[II, 1] model is shown in Figure 14. The shape is similar to that of the distribution for the type-[I, 1] model; however, its center of mass is at (12.86, 12.86). As in the type-[I, 1] model, there is a cumbersome expression for the error. The optimal Þlter occurs for either r = (7, 15) or r = (15, 7), with minimum error 0.293E[T ]. Owing to directional symmetry throughout the model, the existence of two optimal parameter vectors should be expected. The numerically approximated expected adaptive Þlter error is E ss [e[r]] = 0.327E[T ]. For the numerical steady-state distribution, the center of mass is at (12.86, 12.86). At Þrst glance, this appears to differ markedly from the two optima; however,
36
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 14. Numerically computed steady-state distribution for the type-[II, 1] model.
e[(13, 13)] = 0.327E[T ], which is close to the optimal value (and, to three decimal places, agrees with the expected Þlter error in the steady state). Because of strongly overlapping uniform signal and noise distributions, there is a large region in the parameter plane for which e[r] is fairly stable. The state space for the type-[II, 0] model extends to inÞnity and so will the steady-state distribution. Figure 15 shows the numerically derived steady-state
Figure 15. Numerically computed steady-state distribution (up to r1 ≤ 100 and r2 ≤ 100) for the type-[II, 0] model.
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
37
Figure 16. Plasma-cell image of multiple myeloma.
distribution up to r1 ≤ 100 and r2 ≤ 100. The probability mass is concentrated along an L-shaped region that splits the state space. A Þlter designed by adaptation is likely to have structuring elements with parameters r1 and r2 forming a point in this region. From the Þgure, we see that one of the structuring elements will likely have a size near 10 with the other being greater than 10, the exception being a concentration of mass in the region of (12, 12). Intuitively, if r1 = 10 and r2 is very large, then signal and noise are discriminated by the Þrst structuring element; if adaptation leads to r1 ≈ r2 , then Þltering is accomplished by both structuring elements working in tandem. Application V.1 (Locating Cancerous Cells) Diagnosis of many cancers depends on morphological identiÞcation of different cells. Consider the plasmacell image of multiple myeloma shown in Figure 16. Plasma cells may constitute 15 to 90% of the cells in bone marrow. The cells in Figure 16 are mostly mature but some are forming cancerous giant cells. To process the image automatically, we can use a watershed algorithm to Þnd and separate cells. It is not necessary to identify each cell at this stage or to identify small noise blobs, since a τ -opening will be adaptively designed to locate large tumor cells, thereby correctly classifying noise blobs with noncancerous cells. In Figure 17, the boundary image for the realization of Figure 16 is shown superimposed on the original image. To train an adaptive Þlter, we label each cell as signal (black) or noise (gray), as shown in Figure 18. We use a disjunctive granulometry with four elliptical structuring elements at rotations 0◦ , 45◦ , 90◦ , and 135◦ , each ellipse having axis lengths a and b, these being the two Þlter
38
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 17. Boundary image superimposed on original image shown in Figure 16.
parameters. We employ type-[I, 1] adaptation assuming an ellipseÕs major axis is longer than its minor axis. Figure 18 is used Þve times for adaptation and the empirical means of a and b are found to be 35.3 and 27.9, respectively. The result of applying the Þlter with a = 36 and b = 28 to the image of Figure 16 is shown in Figure 19.
Figure 18. Image showing each cell labeled as signal (black) or noise (gray).
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
39
Figure 19. The result of applying the Þlter with a = 36 and b = 27.9 to the image in Figure 16.
VI. Granulometric Bandpass Filters A bandpass approach to granulometric Þltering can be taken by decomposing the image into components related to its granulometric structure. According to the general theory, granulometric spectral components of a random set S are deÞned relative to the manner in which S is sieved by a granulometry {t } (Dougherty, 1997). For each t ≥ 0, there is a spectral component St ⊂ S such that the spectral components are mutually disjoint and the family {St } forms a partition of S, S = ∪t St . A granulometric bandpass Þlter results from passing certain spectral components and not passing others: there is a pass set ⊂ [0, ∞) and St (79) (S) = t∈
Optimization of granulometric bandpass Þlters is accomplished by Þnding a pass set yielding a Þlter that minimizes the error between the Þltered and ideal images. Here we discuss the bandpass theory as it applies to univariate disjunctive granulometric bandpass Þlters. The general theory permits grains to be decomposed into their granulometric spectral content; in the reconstructive setting, no matter the structuring elements, grains are either retained in full or eliminated in full, which thereby places each grain into a single spectral
40
EDWARD R. DOUGHERTY AND YIDONG CHEN
component. Optimal bandpass Þlters are analyzed in the reconstructive setting using granulometric size, and the Markovian adaptation theory is extended to the bandpass setting. So that the probabilistic analysis is not overly complicated, the adaptive theory is developed for a single passband. Two adaptive models are studied. In one, it is assumed there is a known point in the passband; in the other, no such assumption is made. The Þrst case requires more prior knowledge, possesses an analytic solution, and results in passband parameters having less variability in the steady state of the adaptation process.
A. Granulometric Spectral Theory Given a Euclidean granulometry {t }, the size distribution of a compact set S is deÞned by (t) = ν[S] − ν[t (S)]
(80)
The size distribution gives the volume of the set removed by t . It is increasing and continuous from the left (Matheron, 1975). We let (0) = 0 and assume at least one generating set for {t } contains more than a single point, so that (t) = ν[S] for sufÞciently large t. The normalization, (t) = (t)/ν[S] is called the pattern spectrum of S relative to {t }. If S is a random set, then (t) and (t) are random functions. The expectation of the size distribution is the mean size distribution (MSD), M(t) = E[(t)]. In bandpass analysis, the key role is played by the derivative H(t) = M′ (t), which is called the granulometric size density (GSD). Because M(t) need not be differentiable in the ordinary sense, the derivative may be taken in a generalized sense. In most physically realizable situations the MSD is continuously differentiable. We assume that H(t) is of bounded variation and continuous from the left (which creates no meaningful theoretical or practical constraint). The MSD and the GSD serve as partial descriptors of the random set inducing them, in much the same way as the power spectral density partially describes a wide-stationary random function, and they play a role analogous to that of the power spectral density in designing optimal Þlters. The continuous opening spectrum of S relative to the granulometry {t } is deÞned by St = [t (S) − t+τ (S)] (81) τ >0
for t ≥ 0. The collection {St } of spectral components forms a partition of S. For Þxed τ , the difference t (S) − t+τ (S) gives the band within S formed by subtracting the Þlter t+τ (S) from the Þlter output t (S). The intersection is taken so that the bands are continuously parameterized. A discrete spectrum
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
41
is generated by letting Sk = k (S) − k+1 (S)
(82)
for k = 0, 1, 2, . . . . Figure 20 shows discrete spectra for a triangle from three granulometries: (a) an opening by disks, (b) an opening by horizontal lines, and (c) a τ -opening using horizontal and vertical lines. Both discrete and continuous spectra were considered in the Þrst proposed morphological spectral theory for openings (Haralick et al., 1995). That theory applied only to specialized granular images and did not explicitly involve the GSD. Discrete bandpass approaches have been considered that involve the GSD (Dougherty, 1997; Sivakumar and Goutsias, 1997). Our interest here is in continuous spectra and representation of optimal bandpass Þlters in terms of the GSDs of the signal and the noise. A subset of the nonnegative real axis is called a countable-interval subset if it can be represented as a countable union of disjoint intervals i , where singleton point sets are treated as intervals of length zero. Without loss of generality we assume that i < j implies, for all t ∈ i and r ∈ j , that t < r and that there exists s ∈ such that t < s < r . This means that i is to the left of j and that i and j are separated by the complement of . The granulometric bandpass Þlter(GBF) corresponding to the countable-interval subset (relative to the spectrum {St }) is deÞned according to Eq. (79). and c are the pass and fail sets for the Þlter. Given the task of restoring signal S from observed image S ∪ N , Þlter optimization involves Þnding an optimal pass set relative to the spectral decomposition {(S ∪ N )t }: Þnd a pass set yielding a Þlter deÞned according to Eq. (79) having minimum error. Given a granulometry {t }, an optimal pass set and corresponding optimal Þlter are denoted by and , respectively. Let H S and H N denote the GSDs for the signal and noise relative to {t } and assume H S and H N are continuously differentiable except on sets without limit points (a mathematical restriction offering no practical constraint). If S and N are disjoint, then an optimal pass set is given by = {t: H S (t) ≥ H N (t)}
(83)
where the derivatives may involve delta functions and the inequality is interpreted in the usual manner wherever impulses are involved (Dougherty, 1997). Design of an optimal GBF relative to the granulometry {t } involves Þnding the GSDs of the signal and noise processes and then solving the differential inequality. The general theory is based on the Lebesgue decomposition of the MSD into absolutely continuous and singular parts. The optimal pass set of Eq. (83) results from minimizing the error expression for an arbitrary Þlter deÞned by a pass set. This error expression involves the MSDs of the signal and the noise.
42
EDWARD R. DOUGHERTY AND YIDONG CHEN
s0 s1 s2 s3 s4
(a)
s0 s1 s2 s3 s4 s5 s6 (b)
s0 = s1 = O s2 s3 s4 s5 s6 (c)
Figure 20. Discrete spectra for a triangle from three granulometries: (a) an opening by disks, (b) an opening by horizontal lines, and (c) a τ -opening using horizontal and vertical lines.
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
43
If both MS and MN are absolutely continuous (in particular, if they are both continuously differentiable), then the general error representation for the optimal Þlter reduces to (84) e[ ] = ! H S (t) dt + ! H N (t) dt
c
Moreover, if there exists a sequence 0 = t0 < t1 < t2 < . . . such that the optimal pass set is = [t1 , t2 ) ∪ [t3 , t4 ) ∪ [t5 , t6 ) ∪ . . .
(85)
then (S ∪ N ) =
∞ [t2k−1 (S ∪ N ) − t2k (S ∪ N )]
(86)
k=1
This shows the bandpass nature of the Þlter. There are other decompositions of leading to error representations. For instance, if = [t1 , t2 ] ∪ . . . ∪ [t2m+1 , t2m+2 ]
(87)
with t2m+2 < ∞, then the representation of Eq. (86) holds with m + 1 in place of ∞. Or, if = [t1 , t2 ] ∪ . . . ∪ [t2m+1 , ∞)
(88)
then (S ∪ N ) =
m k=1
[t2k−1 (S ∪ N ) − t2k (S ∪ N )] ∪ 2m+1 (S ∪ N )
(89)
Other Þlter representations exist for different decompositions of (for instance, if the Þrst interval for begins at 0).
B. Granulometric Spectral Theory for Univariate Disjunctive Granulometries The differential determinant of the optimal Þlter in Eq. (83) and the corresponding error representation of Eq. (84) have been stated as consequences of the general theory, which involves both generalized derivatives and measuretheoretic concepts. However, under simplifying assumptions, it is possible to give an elementary derivation of both in the case of univariate disjunctive granulometries applied to granular random sets. In particular, consider the
44
EDWARD R. DOUGHERTY AND YIDONG CHEN
random set X=
c k=1
X k + zk
(90)
where X 1 , X 2 , . . . , X C are identically distributed to X (which plays the role of a primary grain for X), C is a random positive integer independent of the grains, and z 1 , z 2 , . . . , z C are locations randomized up to the constraint that the union forming X is disjoint. Then (t) = {ν[X k ]: M X k < t} =
c
ν[X k ]T[X k ; t]
(91)
k=1
where T[X k ; t] = 1 if M X k < t and T[X k ; t] = 0 if M X k ≥ t. Taking expectations yields M(t) = μ X E[ν[X]T[X; t]]
(92)
where μ X = E[C]. Let the random set X depend on the parameter vector W. Then ν[X](w) f w (w) dw (93) M(t) = μ X {w:MX (w) < t}
where the integral has the dimensionality of W. Consider a pass set of the form given in Eq. (85), but do not assume that is optimal. Assume that the MSDs of the signal and the noise are differentiable, which either holds for practical situations or can provide a close-as-desired approximation in practical situations. The error of the corresponding GBF comes from signal grains not passed and noise grains passed. Denote these two errors by e[ S ] and e[ N ], respectively. Then N ∞ ν[Nk ](T[X k ; ti+1 ] − T[X k ; ti ]) e[ N ] = E i=1 k=1
= = =
∞
M N (tk+1 ) i=1 ∞ tk+1
− M N (tk )
H N (t) dt
i=1
!
tk
H N (t) dt
(94)
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
45
where the Þrst equality follows from the deÞnition of the Þlter, the second from Eq. (92), and the third by the fundamental theorem of calculus. A similar representation holds for e[ S ] in terms of the fail set and the noise GSD. From Eq. (94) and the analogous expression for e[ S ], we see that e[ N ] and e[ S ] are minimized by choosing the pass set in accordance with the differential determinant of Eq. (83), and that choice gives the error representation of Eq. (84). Note that under the assumptions the inclusion or lack of inclusion of any countable number of points in the pass set has no effect on Þlter error. In its most general form, the integral of Eq. (93) giving M(t) is intractable, which makes the derivative problematic; however, for certain stochasticgeometric models, it simpliÞes and H(t) can be found. Suppose X depends on n + 1 random parameters; X = X(W, U1 , . . . , Un ); W is independent of U1 , . . . , Un ; MX depends only on W; and MX (W ) is an increasing function of W. Then we can write MX (W, U1 , . . . , Un ) = r (W ), where r is an increasing function, and ∞ r −1 (t) ∞ ··· ν[X](w, u 1 , . . . , u n ) f W (w) M(t) = μ X 0
0
0
× f (u 1 , . . . , u n ) du 1 · · · du n dw
If f W is a continuous function of w, then ∞ f W (r −1 (t)) ∞ ··· ν[X](r −1 (t), u 1 , . . . , u n ) H(t) = μ X ′ −1 r (r (t)) 0 0 × f (u 1 , . . . , u n ) du 1 . . . du n
(95)
(96)
In the special case when r is the identity, MX (W, U1 , . . . , Un ) = W and H(t) = μ X f W (t)E[ν[X]|W =t ]
(97)
where ν[X]|W =t means the area of X is evaluated for W = t. This result is intuitive: the derivative of the MSD at t is the expected area of the primary grain when W is Þxed at t, weighted by the inÞnitesimal probability mass of W at t. If X depends only on a single random parameter W and MX (W ) = W , then we get the reduction H(t) = μ X f W (t)ν[X](t)
(98)
Example VI.1 Let X be a nonrotated ellipse with horizontal and vertical axes given by W and U, respectively. Let t be a τ -opening with structuring elements being horizontal and vertical lines of unit length. Then MX (W, U ) = max{W, U } and t t wu f W,U (w, u) dw du (99) M(t) = πμ X 0
0
46
EDWARD R. DOUGHERTY AND YIDONG CHEN
If W and U are independent, then the integral splits because f W,U = f W fU and if both densities are continuous functions, then t t H(t) = πμ X t f W (t) u fU (u) du + fU (t) w f W (w) dw (100) 0
0
Example VI.2 Consider a signal consisting of squares possessing random angle of rotation and random radius R, so that X = X(R, ). Let t be an ordinary opening by a disk of radius t (and t be the induced reconstructive Þlter). Then Eq. (98) applies, MX (R, ) = R, and HS (t) = 4μ S t 2 f S (t), where μ S is the expected number of signal squares. Let the noise consist of ellipses possessing random angle of rotation, random minor axis 2W, and random major axis 4W. With t being ordinary opening by a disk of radius t, H N (t) = 2πμ N t 2 f N (t). The pass set is determined by the inequality μ S f S (t) ≥ (π/2)μ N f N (t). Now, suppose μ S = μ N , R is normally distributed with mean 20 and standard deviation 3, and W possesses a bimodal distribution that is the sum of two Gaussians, one having mean 10, standard deviation 2, and mass 3/4, the other having mean 30, standard deviation 2, and mass 1/4. These are depicted in Figure 21, together with the curve for (π/2) f N (t). Referring to the Þgure, we can see that the pass set consists of all t in the interval determined by f S (t) ≥ (π/2) f N (t). With t ′ = 14.336 and t ′′ = 26.332 being the left and right endpoints of the passband, the optimal granulometric bandpass Þlter is (S ∪ N ) = t ′ (S ∪ N ) − t ′′ (S ∪ N )
(101)
The optimal Þlter has a single passband, a consequence of both the Þlter form
Figure 21. Curves for f N , f S , and (π/2) f N (t) in Example VI.2.
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
47
Figure 22. Realization of the noisy process S ∪ N .
and the form of the densities. Figure 22 shows a realization of the noisy process S ∪ N and Figure 23 shows the Þltered image (S ∪ N ) corresponding to the realization when the integer passband [14, 27) is employed. If we consider an arbitrary single-passband Þlter with passband [r1 , r2 ], denoted by r1 ,r2 , its error, according to Eq. (84), is given by r1 ∞ r2 H N (t) dt (102) H S (t) dt + H S (t) dt + e r1 ,r2 = 0
r2
r1
Figure 24 shows the error as a function of r1 and r2 . The error is minimized with r1 = 14.336 and r2 = 26.332: e[14.336,26.332 ] = 0.0323E[T ], where T is total image area.
Relative to logical granulometries as a class, we have considered only singlevariable GSDs for univariate disjunctive granulometries. Given a logical granulometry {t }, the multivariate size distribution and pattern spectrum for a compact set S are deÞned by (t) = ν[S] − ν[t (S)] and (t) = (t)/ν[S], respectively. For disjunctive granulometries, the multivariate pattern spectrum is a probability distribution function. For conjunctive granulometries, the multivariate pattern spectrum is not a probability distribution function. It is not true that (t1 , t2 , . . . , tn ) → 0 as t1 → 0 or t2 → 0 or . . . or tn → 0. Even if (t1 , t2 , . . . , tn ) has continuous partial derivatives of the second order, its mixed partial derivative, which would be a probability density were it a probability distribution function, need not be nonnegative. Treating S as a random
48
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 23. Filtered image (S ∪ N ) corresponding to the realization when the integer passband [14, 27) is employed.
set, (t) and (t) are random functions. The MSD is M(t) = E[(t)]. Appropriately deÞning the granulometric size density depends on determining a satisfactory spectral theory for bandpass analysis. While this has not yet been accomplished for multivariate disjunctive granulometries, it has been
Figure 24. Error as a function of r1 and r2 .
49
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
accomplished for multivariate conjunctive granulometries. For a multivariate conjunctive granulometry {t } generated by n structuring elements, the GSD of a random set S relative to {t } is given by H(t1 , t2 , . . . , tn ) = (−1)n+1
∂ n M(t1 , t2 , . . . , tn ) ∂t1 ∂t2 · · · ∂tn
(103)
where we assume the mixed partial derivative exists and is continuous (Dougherty, 2001). We will not pursue this matter here, noting only that even under the assumption of a continuous mixed partial derivative, one cannot avoid dealing with measure-theoretic issues.
C. Adaptive Bandpass Filters Given a Known Point in the Passband To extend the adaptation theory to the bandpass setting, we focus on a single passband (multiple passband theory is conceptually straightforward but computationally cumbersome). For a univariate disjunctive granulometry {t }, a single passband Þlter has the representation r1 ,r2 (S) = r1 (S) − r2 (S)
(104)
where r1 ≤ r2 . The Þlter bandwidth is r2 − r1 . Once again, the possibilities of Eq. (24) occur. There are two parameters, r1 (n) and r2 (n), to adapt. We have the following generic adaptation rules: i. r1 → r1 + 1 and/or r2 → r2 + 1 ii. r1 → r1 − 1 and/or r2 → r2 − 1 iii. r1 → r1 and r2 → r2
if condition a occurs if condition b occurs if condition c or d occurs
(105)
The adaptation rules are designed so that when a signal grain erroneously does not pass, the passband is expanded; when a noise grain erroneously passes, the passband is reduced; and when a grain is correctly passed or not passed, the passband is not changed. The and/or represents a choice of rules. Assuming grain arrivals and primary-grain realizations are independent, the parameter pair (r1 , r2 ) determines a two-dimensional, discrete-state-space Markov chain. Suppose there is a point ω0 known to be in the passband [r1 , r2 ). Then the transition rules governing adaptation are i. ii. iii. iv. v.
r1 → r1 + 1, r2 → r2 r1 → r1 − 1, r2 → r2 r1 → r1 , r2 → r2 − 1 r1 → r1 , r2 → r2 + 1 r1 → r1 , r2 → r2
for a noise grain G with MG ≤ ω0 and r1 ≤ MG for a signal grain G with MG ≤ ω0 and MG < r1 for a noise grain G with MG > ω0 and r2 ≥ MG for a signal grain G with MG > ω0 and MG > r2 otherwise (106)
50
EDWARD R. DOUGHERTY AND YIDONG CHEN
There is a partitioning of the events in terms of whether a signal grain or a noise grain is encountered and whether or not the granulometric size of the encountered grain is smaller or greater than ω0 . Taking the partition relative to ω0 allows us to treat the left and right passband endpoints as independent processes: r1 transitions when MG ≤ ω0 and r2 transitions when MG > ω0 . Let pr,r ′ denote the probability of the transition r → r ′ when a grain is encountered; M S and M N the granulometric measures resulting from the signal and clutter primary grains, respectively; and P(S) and P(N ) the probabilities of encountering a signal grain and a noise grain, respectively. From the transition rules, the transition probabilities governing r1 are i. pr1 ,r1 +1 = P(N )P(r1 ≤ M N < ω0 ) ii. pr1 ,r1 −1 = P(S)P(M S < r1 ) iii. pr1 ,r1 = 1 − P(S)P(M S < r1 ) − P(N )P(r1 ≤ M N < ω0 )
(107)
The transition probabilities governing r2 are i. pr2 ,r2 +1 = P(S)P(M S ≥ r2 ) ii. pr2 ,r2 −1 = P(N )P(ω0 ≤ M N < r2 ) iii. pr2 ,r2 = 1 − P(S)P(M S ≥ r1 ) − P(N )P(ω0 ≤ M N < r2 )
(108)
No matter what transitions are made, r1 ≤ ω0 ≤ r2 . Let λr1 = pr1 ,r1 +1 , μr1 = pr1 ,r1 −1 , αr2 = pr2 ,r2 +1 , βr2 = pr2 ,r2 −1 , and pr (n) be the probability of the chain being in state r. The ChapmanÐKolmogorov equations give the state transition probabilities pr1 (n + 1) = pr1 −1,r1 pr1 −1 (n) + pr1 +1,r1 pr1 +1 (n) + pr1 ,r1 pr1 (n) (1 < r1 ≤ ω0 ) = λr1 −1 pr1 −1 (n) + μr1 +1 pr1 +1 (n) − (λr1 + μr1 ) pr1 (n) + pr1 (n) p1 (n + 1) = μ2 p2 (n) − λ1 p1 (n) + p1 (n)
(109)
for r1 . For r2 , pr2 (n + 1) = pr2 −1,r2 pr2 −1 (n) + pr2 +1,r2 pr2 +1 (n) + pr2 ,r2 pr2 (n)
(r2 ≥ ω0 )
= αr2 −1 pr2 −1 (n) + βr2 +1 pr2 +1 (n) − (αr2 + βr2 ) pr2 (n) + pr2 (n) p1 (n + 1) = βω0 +1 pω0 +1 (n) − λω0 pω0 (n) + pω0 (n)
(110)
These are the same type of equations that we had for univariate disjunctive granulometries. There exists a steady-state distribution for r1 since the Markov chain is Þnite and there exists a steady-state solution for r2 based on the analysis for univariate disjunctive granulometries. The steady-state solutions arising
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
from Eqs. (109) and (110) are ⎧ r 1 −1 ⎪ λk ⎪ ⎪ = p p ⎪ 1 ⎨ r1 μ k+1 k=1 ⎪ 1 ⎪ ⎪ ⎪ ⎩ p1 = 1 + ω0 !r1 −1
(111)
λk k=1 μk+1
r1 =2
and
(2 ≤ r1 ≤ ω0 )
51
⎧ r 2 −1 ⎪ αk ⎪ ⎪ = p p ⎪ ω0 ⎨ r2 β k=ω0 k+1 ⎪ 1 ⎪ ⎪ !r2 −1 ⎪ ⎩ pω0 = 1 + ∞ r2 =ω0 +1
(r2 ≥ ω0 + 1)
(112)
αk k=ω0 βk+1
Example VI.3 Let the signal and the noise be the same as in Example VI.2. For adaptation, we employ the weighted random point selection protocol to sample the grain image. If we assume that the expected number of signal grains equals the expected number of noise grains, μ S /μ N = 1, then P(S) = 0.461 and P(N ) = 0.539. Let r be an ordinary opening by a disk of radius r. The granulometric-size density functions f MS and f M N are derived from the sizing parameter densities f R and f W governing the signal radii and noise axes, respectively. From Eq. (49), f MS (r ) = $ ∞ 0
4r 2 f R (r ) 4r 2 f R (r ) = E[ν[S]] 4s 2 f R (s) ds
(113)
where S is the primary grain of the signal process. From Eqs. (42) and (43), E[A] E[T ] E[B] P(N ) f M N (w) = 2πw2 f W (w) E[T ] P(S) f MS (r ) = 4r 2 f R (r )
(114) (115)
The known point in the passband is ω0 = 20. Numerical calculation according to Eqs. (111) and (112) yields the steady-state densities in Figure 25. The mean and standard deviation of r1 in the steady state are 14.249 and 0.734, respectively; the mean and standard deviation of r2 in the steady state are 26.478 and 0.743, respectively. The expected error of the adaptive Þlter as it is averaged over all states can be computed from Eq. (102), where, owing to independence of the r1 and r2 chains, pr1 ,r2 = pr1 pr2 . Computation yields E[e[r1 ,r2 ]] = 0.0430E[T ]. From the perspective of Þlter design, E[e[r1 ,r2 ]] is not crucial; rather, it is the Þlter error when the steady-state mean is used as the Þlter parameter. It would be poor design to simply stop at some arbitrary
52
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 25. Steady-state densities, in Example VI.3, resulting from numerical calculation according to Eqs. (111) and (112), along with marginal densities, for Example VI.4, derived from the estimated joint density.
adaptation step in the steady state; the steady-state mean should be employed as the parameter for the designed Þlter. Minimum error occurs when optimal values are used. The operational cost of adaptation can be obtained from Figure 24 by Þnding the increased error resulting from using nonoptimal parameter values. In the present case, adaptive design has done very well; indeed, the integer (digitized) passband is the same for optimal and adaptive design.
D. Adaptive Bandpass Filters Given No Known Point in the Passband Suppose we do not a priori know a point in the passband. If a signal grain is erroneously not passed, it is because the grain is either too large or too small. If a noise grain is erroneously passed, we know only that it is in the passband and that one or the other of the endpoints must be adjusted to decrease the probability of passage. Hence, we adopt the following transition rules: i. r1 → r1 − 1, r2 → r2 ii. r1 → r1 , r2 → r2 + 1 iii. r1 → r1 , r2 → r2 − 1 or r1 → r1 + 1, r2 → r2 iv. r1 → r1 , r2 → r2
for a signal grain G and MG < r1 for a signal grain G and MG ≥ r1 for a noise grain G and r1 ≤ MG < r2 otherwise (116)
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
53
Figure 26. Typical transition diagram for adaptive bandpass Þlters given no known point in the passband.
The following transition probabilities result from randomly deciding which endpoint to adjust when a noise grain falls in the passband: i. p(r1 ,r2 ),(r1 −1,r2 ) = P(S)P(M S < r1 ) ii. p(r1 ,r2 ),(r1 ,r2 +1) = P(S)P(M S ≥ r1 )
iii. p(r1 ,r2 ),(r1 +1,r2 ) = 12 P(N )P(r1 ≤ M N < r2 ) iv. p(r1 ,r2 ),(r1 ,r2 −1) = 21 P(N )P(r1 ≤ M N < r2 )
(117)
v. p(r1 ,r2 ),(r1 ,r2 ) = 1 − P(S)[P(M S < r1 ) + P(M S ≥ r1 )] − P(N )P(r1 ≤ M N < r2 )
where r1 ≤ r2 . Adaptation is more problematic because r1 and r2 cannot be separated into chains whose transitions can be treated independently. A typical transition diagram is shown in Figure 26. The diagram consists of internal and boundary states. These must be treated differently. The notation in the diagram indicates the various kinds of generic transitions that can occur and the corresponding transition parameters deÞned by αr1 ,r2 = p(r1 ,r2 ),(r1 −1,r2 ) , βr1 ,r2 = p(r1 ,r2 ),(r1 ,r2 +1) , μr1 ,r2 = p(r1 ,r2 ),(r1 ,r2 −1) , and λr1 ,r2 = P(r1 ,r2 ),(r1 +1,r2 ) .
54
EDWARD R. DOUGHERTY AND YIDONG CHEN
The ChapmanÐKolmogorov equations yield the state probability increments for internal states: pr1 ,r2 (n + 1) − pr1 ,r2 (n) = αr1 +1,r2 pr1 +1,r2 (n) + βr1 ,r2 −1 pr1 ,r2 −1 (n) + λr1 −1,r2 pr1 −1,r2 (n) + μr1 ,r2 +1 pr1 ,r2 +1 (n) − (αr1 ,r2 + βr1 ,r2 + λr1 ,r2 + μr1 ,r2 ) pr1 ,r2 (n)
(118)
Letting n → ∞ in Eq. (118) (as well as similar equations for boundary states) yields equations in terms of the limiting probabilities Pr1 ,r2 = lim Pr1 ,r2 (n) n→∞
(119)
The result is that all equations have 0 on their left-hand sides and limiting probabilities on their right-hand sides. If the limiting equations could be solved for the limiting probabilities, this would provide the desired steady-state distribution. Since these equations are of greater difÞculty than the corresponding unsolved equations for the Markovian queuing network previously discussed, we will employ simulation to arrive at estimated steady-state probabilities. Example VI.4 Consider the same setting as in Examples VI.2 and VI.3. Since we are unable to obtain an analytic solution for the state probability increment equations, a Monte Carlo simulation of the Markov chain has been run and the joint probabilities of being in states (r1 , r2 ) have been estimated from the simulation. The estimated joint density is shown in Figure 27. The marginal densities derived from the estimated joint density are shown in Figure 25 along with the steady-state densities for Example VI.3. For the present example, the
Figure 27. Estimated joint density in Example VI.4.
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
55
Figure 28. Image of disjoint binary grains. Signal (black); noise (gray).
mean and standard deviation of r1 are 14.842 and 1.101, respectively; the mean and standard deviation of r2 are 26.759 and 0.822, respectively. Expected Þlter error averaged over the steady-state distribution is E[e[r1 ,r2 ]] = 0.0503E[T ]. Expected Þlter error averaged over the steady state when there is no known point in the passband exceeds expected error when there is a known point in the passband. Comparison with Example VI.3 shows increased standard deviation with no known point in the passband. Application VI.1 (Bandpass Filtering of Silver-Halide T-Grain Crystals) Consider the electron micrograph of silver-halide T-grain crystals in emulsion shown in Figure 2 and, in Figure 3, the edge image shown superimposed over the original micrograph. The edge image is Þlled to produce an image of disjoint binary grains and these are labeled either black for signal or gray for noise in Figure 28. We assume the granulometry is generated by a single opening using a disk structuring element of radius r and that the bandpass Þlter will have a single passband between r1 and r2 . We do not assume we know beforehand a point in the passband. Running the adaptive procedure empirically over the sample data produces the joint distribution of r1 and r2 shown in Figure 29. The mean and standard deviation of r1 are 7.19 and 1.79, respectively; the mean and standard deviation or r2 are 12.82 and 2.01, respectively. Using r1 = 8 and r2 = 13 for the bandpass Þlter yields the Þltered image of Figure 30. Small and large grains whose granulometric measures are outside the passband have been eliminated.
56
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 29. Joint distribution of r1 and r2 resulting from the adaptive procedure.
VII. Logical Structural Filters We have concentrated thus far on adaptation for disjunctive granulometries. This is natural since these are the ones historically rooted in MatheronÕs original theory. It is possible to go on and study optimal and adaptive design for the general class of logical granulometries, including those that are conjunctive. Rather than do so, we will instead go to an even more general class of Þlters,
Figure 30. Filtered image resulting from use of r1 = 8 and r2 = 13 for the bandpass Þlter.
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
57
one that includes logical granulometries. The key idea is that logical granulometries depend on the existence of structuring elements that Þt inside a grain. In particular, for a connected set G, the reconstructive opening of G is deÞned according to G ◦ B = G if G ◦ B = ⭋, and G ◦ B = ⭋ if G ◦ B = ⭋. We can take a complementary approach and deÞne the complementary reconstructive opening of G by B by G ◦ B0 = ⭋ if G ◦ B = ⭋, and G ◦ B0 = G if G ◦ B = ⭋. In this context, we will denote the reconstructive opening itself by G ◦ B1 . In terms of translates, we have G ◦ B1 = G if there exists a translate of B that is a subset of G, and G ◦ B1 = ⭋ otherwise; G ◦ B0 = ⭋ if there is a translate of B that is a subset of G, and G ◦ B0 = G otherwise. The complementary reconstructive and reconstructive openings satisfy
G ◦ B1 = G − G ◦ B0
(120)
G ◦ B = G − G ◦ B
(121)
0
1
Owing to distributivity, if a compact set S is decomposed into a union of its maximally connected components, S = ∪i Si , then the reconstructive opening of S by a convex structuring element B is given componentwise by
Si ◦ B1 (122)
S ◦ B1 = i
A similar statement holds for the complementary reconstructive opening
S ◦ B0 . The component families of S ◦ B1 and S ◦ B0 partition the component family of S, meaning that the component families of S ◦ B1 and
S ◦ B0 are disjoint subfamilies of the component family of S and their union equals the component family of S. Reconstructive opening is a τ -opening, but complementary reconstructive opening is only translation invariant, antiextensive, and idempotent. It is not increasing.
A. Filter Representation For Þxed convex, compact structuring elements B1 , B2 , . . . , Bn and a set X of binary n-vectors x1, x2,. . ., xm, with xi = (xi1 , xi2 , . . . , xin ) a logical structural Þlter (LSF) is deÞned by (S) =
n m
S ◦ Bk x jk
(123)
j=1 k=1
Owing to componentwise representation, if S is decomposed into its maximally connected components Si , then (S) =
n m j=1 k=1
i
Si ◦ Bk x jk =
n m i
j=1 k=1
Si ◦ Bk x jk
(124)
58
EDWARD R. DOUGHERTY AND YIDONG CHEN
For any connected component Si of S, (Si ) = Si if there exists a vector x j such that for every x jk = 1 there is a translate of Bk that is a subset of Si , and for every x jk = 0 there is no translate of Bk that is a subset of Si ; otherwise, (Si ) = ⭋. DeÞne the logical variables Z ik and Yi to be the truth values of the statements ÒThereexists a translate of Bk that is a subset of Si Ó and Ò(Si ) = Si ,Órespectively. Then Yi has the logical representation Yi =
n m
x
Z ikjk
(125)
j=1 k=1
where the exponent x jk is interpreted to mean that the term is uncomplemented if x jk = 1 and complemented if x jk = 0. In this framework, the Þlter can be expressed as (S) = Si (126) Yi =1
Relative to Z i1 , Z i2 , . . . , Z in , the representation of Yi in Eq. (125) is in disjunctive normal form. Applying logic reduction yields a reduced expression, which in turn provides reduced expressions for (S) in Eqs. (123) and (124). For instance, consider three structuring elements and X = {011, 101, 110, 111}. Logically, for i = 1, 2, 3, Yi is the median of the variables Z i1 , Z i2 , Z i3 and has the reduced expression (127)
Yi = Z i1 Z i2 + Z i1 Z i3 + Z i2 Z i3 The LSF is given by (S) = ( S ◦ B1 1 ∩ S ◦ B2 1 ) ∪ ( S ◦ B1 1 ∩ S ◦ B3 1 ) ∪ ( S ◦ B2 1 ∩ S ◦ B3 1 )
(128)
We call it the median LSF for the base {B1 , B2 , B3 }. For this Þlter, there is a reduction void of complementary reconstructive openings. If such a reduction exists, the LSF is a positive LSF. Every logical operator on n variables has a corresponding LSF deÞned by a base of n structuring elements. For instance, extending the case of the three-variable median just discussed, we can consider the median LSF for a base of n structuring elements (n odd). In this case, X consists of all binary n-vectors having a majority of components being 1-valued, and the reduced form of is given by the union of all intersections of the form S ◦ Br,1 1 ∩ · · · ∩ S ◦ Br,(n+1)/2 1 , where the (n + 1)/2 structuring elements are elements of B = {B1 , B2 , . . . , Bn }: (S) =
{Br,1 ,Br,2 ,...,Br,(n+1)/2 }⊂{B1 ,B2 ,...,Bn }
(n+1)/2 k=1
S ◦ Br,k 1
(129)
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
59
The maximum and minimum LSFs are respectively deÞned by (S) = (S) =
n
S ◦ Bk 1
(130)
k=1
n
S ◦ Bk 1
(131)
k=1
For any pair of disjoint subsets S H = {B H,1 , B H,2 , . . . , B H,a } and S M = {B M,1 , B M,2 , . . . , B M,b } of B, the (SH, SM)-hit-and-miss LSF is deÞned by a b
S ◦ B H,l 1
S ◦ B M,l 0 (132) (S) = l=1
l=1
The Þlter selects those components that contain translates of B H,1 , B H,2 , . . . , B H,a but do not contain translates of B M,1 , B M,2 , . . . , B M,b . More shapeselecting Þlters are formed by unions of hit-and-miss structural Þlters. In fact, it is immediate from the deÞnition that every LSF is a union of hit-and-miss structural Þlters. Application VII.1 (Character Recognition) Consider the text image of Figure 31a printed in the Helvetica font. Figures 31bÐjshow: (b) reconstructive opening by a short horizontal line; (c) reconstructive opening by a vertical line; (d) intersection of parts b and c, which extracts any character made up in part of a short horizontal line and a vertical line; (e) reconstructive opening by a half-circle (half-o); (f) complementary reconstructive opening by a vertical line; (g) intersection of parts e and f, which extracts any character similar to ÒcÓbut with no vertical bar attached to it; (h) complementary reconstructive opening by a half-circle; (i) intersection of parts b, f, and h, which extracts character ÒzÓ;( j) the Þnal LSF output, which is the union of parts d, g, and i. Using the logical representation of Eq. (125), and suppressing the component subscript i, let Z 1 , Z 2 , and Z 3 correspond to the structuring elements short horizontal line, vertical line, and half-circle, respectively. According to the logic used to produce Figure 31j, the LSF is deÞned by Y = Z 11 Z 21 + Z 10 Z 31 + Z 10 Z 21 Z 30 = Z 21 + Z 10 Z 31
(133)
the second equality following from logic reduction. The reduced expression indicates that the Þnal result of Figure 31j can be obtained by the union of Figures 31b and 31g. B. Design of LSFs Our goal is Þnding bases for LSFs having good performance. For a given base n B with n structuring elements, there are 22 LSFs. A parameterized LSF is
60
EDWARD R. DOUGHERTY AND YIDONG CHEN
Figure 31. Application VII.1: character recognition.
obtained by extending the representation of Eq. (13) to allow complementation. Hence, we consider the parameterized base Br = {B1 [r1 ], B2 [r2 ], . . . , Bn [rn ]} and corresponding parameterized LSF r (S) =
n m j=1 k=1
S ◦ Bk [rk ]x jk
(134)
We continue to require the sizing condition that rk ≤ sk implies Bk [rk ] ⊂ Bk [sk ], where rk ≤ sk if and only if each component of rk is less than or equal to the corresponding component of sk. Logic reduction can be applied to the LSF representation of Eq. (134). If there is no complementation, reduction and relabeling of structuring elements
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
61
yields a logical granulometry. Hence, a logical granulometry is a parameterized positive LSF. Upon parameterization, the maximum and minimum LSFs become disjunctive and conjunctive granulometries, respectively. LSF optimization in the signal-union-noise model is characterized in exactly the same way as for logical granulometries. Pass and fail sets are deÞned in the same way, and so is the error. The mathematical difÞculties of optimal LSF design are prohibitive. Thus, we apply adaptive design. For adaptive design, we restrict ourselves to homothetic LSFs, which are of the form t (S) =
n m
S ◦ tk Bk x jk
(135)
j=1 k=1
where t = (t1 , t2 , . . . , tn ) and where B = {B1 , B2 , . . . , Bn } is an a priori family of primitive structuring shapes. There are three reasons for focusing on homothetic LSFs: (1) adaptation rules depend on only the size of the structuring elements, not their shapes; (2) it is possible to provide an automatic set of adaptation rules given the vector collection X; and (3) for the applications we have in mind, classiÞcation depends on discriminating grains based on a set of primary shape criteria, with grain separation depending on the sizes of the primary shapes and the manner in which they are logically combined to form an LSF. One can apply adaptive Þlter design for the more general class of LSFs, in which case structuring elements adapt in shape as well as size (as we have done for disjunctive granulometries); however, adaptation transitions lose the systematic character they have for homothetic LSFs. For adaptive design, we Þx X, initialize t for t and scan S ∪ N to successively select grains. The adaptive Þlter is of the form t(n) , where n corresponds to the nth grain encountered. When a grain G is encountered during scanning, there are the four possibilities of Eq. (24) depending on t(n) (G). The adaptation protocol is easier to understand if the Þlter is put into logical form in accordance with Eq. (125), where we again suppress the component index i. Subsequent to logic reduction, t(n) is of the form nj m
S ◦ t jk B jk w jk t(n) (S) =
(136)
j=1 k=1
where B jk ∈ B and t jk is the scaling factor for B jk at the nth adaptation step. The notation is complicated by the union of intersections. In fact, for each pair ( j, k) there exists Bu ∈ B such that t jl B jl = tu Bu . Relative to the logic of Eq. (125), with Z (t jk ) in place of Z k and the uncomplemented and complemented factors
62
EDWARD R. DOUGHERTY AND YIDONG CHEN
grouped, the Þlter can be expressed as Y = Rj =
m
(137)
Rj
j=1
rj k=1
Z (t jk )1
sj
Z (t jl )0
(138)
l=1
Given a connected component G, let Q be the random variable deÞned by Q = 1 if G is a signal grain and Q = 0 if G is a noise component. Then Y is an estimator of Q and the parameters need to transition in accordance with whether or not Y equals Q. If Y = Q, then the Þlter has acted as desired and the parameters are not adjusted. We need to consider the two cases in which Y = Q. Because the procedure will be implemented digitally, we assume that transition increments and decrements have size 1. For Q = 0 and Y = 1, there is a subcollection of {R1 , R2 , . . . , Rm } whose variables are 1-valued, whereas they should be 0-valued. Without loss of generality, suppose these variables are R1 , R2 , . . . , Rq . At least one variable needs to be altered in such a way as to reßect the fact that the Þlter response would be correct if they were all 0-valued. For our adaptation protocol, one variable, say R j , is randomly chosen, and one factor of R j is randomly selected for parameter transition. If Z (t jk )1 is selected, then there is the single transition t jk → t jk + 1; if Z (t jl )0 is selected, then there is the single transition t jk → t jk − 1. For Q = 1 and Y = 0, the variables R1 , R2 , . . . , Rm are all 0-valued, but at least one should be 1-valued. One is selected at random, say R j , and one of its factors is randomly selected for variable transition. If Z (t jk )1 is selected, then there is the single transition t jk → t jk − 1; if Z (t jl )0 is selected, then there is the single transition t jk → t jk + 1. The stated adaptation protocol is conservative. Its intent is to have t slowly transition into a steady state and stay there with as little variation as possible. Faster convergence to the steady state can be achieved by selecting more than one summand or more than one factor for transition. Adapting more than a single component of t decreases time in transient states at the cost of increased oscillation in the parameter vector. We illustrate the adaptation protocol for the case of a two-set generator with the LSF in disjunctive normal form. There are four possible products: Z (t1 )1 Z (t2 )1 , Z (t1 )1 Z (t2 )0 , Z (t1 )0 Z (t2 )1 , and Z (t1 )0 Z (t2 )0 . Excluding the null and identity Þlters there are 14 possible logical sums of the type given in Eq. (137). If Y = 1 and Q = 0, then one of the 1-valued products forming Y needs to be altered. Depending on the form of Y and the factor chosen, there are four cases: for Z (t1 )1 Z (t2 )1 , t1 → t1 + 1 or t2 → t2 + 1; for Z (t1 )1 Z (t2 )0 ,
DESIGN OF LOGICAL GRANULOMETRIC FILTERS
63
t_1 → t_1 + 1 or t_2 → t_2 − 1; for Z(t_1)^0 Z(t_2)^1, t_1 → t_1 − 1 or t_2 → t_2 + 1; for Z(t_1)^0 Z(t_2)^0, t_1 → t_1 − 1 or t_2 → t_2 − 1. If Y = 0 and Q = 1, then one of the products needs to be altered. Depending on the form of Y and the factor chosen, there are four cases, and these are simply the opposites of the cases for Y = 1 and Q = 0.

Returning to the general case, the adaptive procedure results in the parameter vector t(n) at step n being a Markov chain. Once we obtain the transition probabilities, the Chapman–Kolmogorov equations are derived from them similarly to the derivation for disjunctive filters. From a general perspective, if a signal component is encountered, then the probability of error is given by P(t(n) ∉ M_{C[s]}); for a noise component, the probability of error is P(t(n) ∉ M_{D[n]}). These probabilities depend on the form of the LSF. For a given parameter vector t(n), the probability of a non-null transition is given by

    P(t(n+1) ≠ t(n)) = P(S) P(t(n) ∉ M_{C[s]}) + P(N) P(t(n) ∉ M_{D[n]})    (139)

If a signal component is encountered, then there is a non-null transition if and only if all summands in Eq. (137) are 0. The probability of this event is given by P(t ∉ M_{C[s]}), where we have suppressed notation of the adaptation step n. Let

    p_S(t) = P(t ∉ M_{C[s]})    (140)

Define ζ^1 by ζ^1(j) = 1 if Z(t_i)^1 is a factor of R_j and ζ^1(j) = 0 if it is not. Define ζ^0 by ζ^0(j) = 1 if Z(t_i)^0 is a factor of R_j and ζ^0(j) = 0 if it is not. Let P(t_i → t_i ± 1 | S) denote the probability that t_i transitions up or down (and there are no other transitions) given the observation of a signal component, and let P(t → t | S) denote the probability of a null transition given a signal component. Then, n_j being the number of factors of R_j,

    P(t_i → t_i + 1 | S) = p_S(t) (1/m) Σ_{j=1}^{m} ζ^1(j)/n_j    (141)

    P(t_i → t_i − 1 | S) = p_S(t) (1/m) Σ_{j=1}^{m} ζ^0(j)/n_j    (142)

    P(t → t | S) = 1 − p_S(t)    (143)
For a noise component the situation is more complicated because only a factor of a 1-valued summand is adjusted. Let p_N(t; j_1, ..., j_q) denote the probability that, when the parameter vector is t and a noise component is encountered, the j_1, j_2, ..., j_q summands of Y are 1-valued and the remaining summands are
0-valued. Then

    P(t_i → t_i + 1 | N) = Σ_{1≤j_1<j_2<···<j_q≤m} p_N(t; j_1, j_2, ..., j_q) (1/q) Σ_{l=1}^{q} ζ^1(j_l)/n_{j_l}    (144)

    P(t_i → t_i − 1 | N) = Σ_{1≤j_1<j_2<···<j_q≤m} p_N(t; j_1, j_2, ..., j_q) (1/q) Σ_{l=1}^{q} ζ^0(j_l)/n_{j_l}    (145)

    P(t → t | N) = 1 − Σ_{1≤j_1<j_2<···<j_q≤m} p_N(t; j_1, j_2, ..., j_q)    (146)
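To make Eqs. (141)–(146) concrete, the following minimal Python sketch evaluates the signal-conditional transition probabilities (141)–(143) for an LSF given in disjunctive normal form. The encoding of the filter, the helper name signal_transition_probs, and the example value of p_S(t) are illustrative assumptions, not constructs from the text; the noise-conditional probabilities follow the same pattern with p_N and the 1-valued summands in place of p_S.

    # Sketch of Eqs. (141)-(143).  The LSF is encoded as a list of summands R_j;
    # each summand is a dict mapping the index i of a structuring element to
    # 1 (factor Z(t_i)^1) or 0 (factor Z(t_i)^0).  p_S is the probability
    # p_S(t) = P(t not in M_C[s]); here it is assumed given (in practice it is
    # estimated from the grain model).
    def signal_transition_probs(lsf, i, p_S):
        """Return P(t_i -> t_i+1 | S), P(t_i -> t_i-1 | S), P(t -> t | S)."""
        m = len(lsf)
        up = p_S * sum(1.0 / len(R) for R in lsf if R.get(i) == 1) / m
        down = p_S * sum(1.0 / len(R) for R in lsf if R.get(i) == 0) / m
        return up, down, 1.0 - p_S

    # Example: Y = Z(t1)^1 Z(t2)^1 + Z(t1)^0 Z(t2)^1  (two summands, n_j = 2)
    lsf = [{1: 1, 2: 1}, {1: 0, 2: 1}]
    print(signal_transition_probs(lsf, 1, p_S=0.3))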
The unconditioned transition probability for t_i → t_i ± 1 is given by

    P(t_i → t_i ± 1) = P(S) P(t_i → t_i ± 1 | S) + P(N) P(t_i → t_i ± 1 | N)    (147)

For general LSFs the Chapman–Kolmogorov equations are even more complicated than for multiparameter disjunctive granulometries, and hence there appears to be little hope of solving them analytically. If we assume that components are uniformly bounded, then the state space for the chain is finite and a steady state is ensured. If we proceed numerically, a selected filter results from taking t to be the center of mass of the empirical steady-state distribution for t(n): we estimate the state probabilities from the number of times t(n) visits its possible vector values as it adapts during training when in the steady state, form an empirical steady-state distribution from these estimates, and take t to be the center of mass of this distribution.

Application VII.2 (Blood-Cell Analysis)

Figure 32 shows a normal blood smear image and Figure 33 shows a binarized version of the image created by thresholding and hole filling. The image is to be processed to determine the sizes and shapes of blood cells, but before measurement, a filter must be applied to remove overlapping cells and noise. We apply an adaptively designed LSF with five structuring elements. The unit-width structuring elements B_1, B_2, B_3, and B_4 are a vertical, a horizontal, a 45°, and a −45° line, respectively, and B_5 is a unit disk. Guided by the heuristic that a disk can be used to filter out small noise and that overlapping cells can be eliminated by making sure that no long line fits, we employ the filter

    Y = Z_1^1(t_1) Z_2^0(t_2) Z_3^0(t_2) Z_4^0(t_2) Z_5^0(t_2)
(148)
where Z_i corresponds to B_i and, owing to symmetry of cell overlap, a single parameter, t_2, governs the lengths of the lines. Training is accomplished by using a synthetic blood-cell model that produces images similar to binarized blood-cell images. An LSF has been designed using 5000 simulated cells, and Figure 34 shows the result of applying the designed filter to the image of Figure 33.
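The training procedure just described can be summarized in a short simulation. The Python sketch below adapts a single sizing parameter under the protocol of this section and selects the filter from the empirical steady-state distribution; the grain model (Gaussian inscribed-line lengths) is an invented stand-in for the synthetic cell and chromosome simulators used in the text, so the numbers it produces are purely illustrative.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)

    # Grains are abstracted by the label W (1 = signal, 0 = noise) and the
    # maximal length L of a line fitting inside the grain; Y = 1 iff a line of
    # length t fits.  Both distributions are invented for the illustration.
    def draw_grain():
        if rng.random() < 0.5:
            return 1, rng.normal(40.0, 5.0)   # signal grain: fits long lines
        return 0, rng.normal(15.0, 5.0)       # noise grain: fits short lines

    t = 10                                    # initial sizing parameter
    visits = Counter()
    for n in range(5000):
        W, L = draw_grain()
        Y = 1 if L >= t else 0                # filter response
        if Y == 0 and W == 1:                 # misrecognized signal grain
            t -= 1
        elif Y == 1 and W == 0:               # misrecognized noise grain
            t += 1
        if n > 1000:                          # assume steady state is reached
            visits[t] += 1

    # Selected filter: center of mass of the empirical steady-state distribution
    t_star = sum(k * v for k, v in visits.items()) / sum(visits.values())
    print(f"steady-state mean of t: {t_star:.2f}")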
Figure 32. Normal blood smear image for blood-cell analysis.
Application VII.3 (Comparative Genomic Hybridization)

In comparative genomic hybridization (CGH) analysis (Piper et al., 1995), a test genome and a reference genome are simultaneously hybridized to normal metaphase target chromosomes (Fig. 35). We will use an adaptive LSF with four linear
Figure 33. Binarized version of the image shown in Figure 32, created by thresholding and hole filling.
Figure 34. Result of applying an LSF designed using 5000 simulated cells to the image in Figure 33.
structuring elements to identify overlapping chromosomes. The image is first binarized by using a gray-scale watershed to find chromosome boundaries and then filling the boundaries (Fig. 36). The unit-width structuring elements B_1, B_2, B_3, and B_4 are a vertical, a horizontal, a 45°, and a −45° line, respectively. Using the heuristic that crossing chromosomes can be fit by pairs of perpendicular lines (vertical and horizontal, or 45° and −45°), whereas noncrossing chromosomes are less easily so fit, we employ the LSF

    Y = Z_1^1(t) Z_2^1(t) + Z_3^1(t) Z_4^1(t)
(149)
where Z_i(t) corresponds to B_i and, owing to symmetry, we use a single parameter governing structuring-element length. Considering a component formed by overlapping chromosomes to be a signal grain and letting W be a binary random variable that is 1-valued if and only if a grain consists of crossing chromosomes, we have the following adaptation protocol: if Y = W, then t → t; if Y = 0 and W = 1 (misrecognized signal grain), then t → t − 1; if Y = 1 and W = 0 (misrecognized noise grain), then t → t + 1. A synthetic chromosome model that produces binary images closely resembling real chromosome images is used to train the filter. Training is accomplished with 5000 grains from the simulation program. The output of the trained filter applied to the image of Figure 36 is shown in Figure 37.

Figure 35. Comparative genomic hybridization: test genome and reference genome simultaneously hybridized to normal metaphase target chromosomes.

As just described, the form of the logical LSF has been chosen according to geometric heuristics. If there is only a single parameter to adapt, then there are 216 possible LSF forms that can be applied when four linear structuring
elements are used, with Eq. (149) giving one of them. Rather than depend on heuristics, we should, when it is computationally feasible, employ an optimal search methodology to arrive at the best adaptive LSF. Adaptation is run for each of the 216 LSFs (using 5000 training chromosomes). An error function ε[t] is computed, giving the percentage of grains misclassified by the filter with parameter t. From the steady-state probabilities, p_ss(t), and the error function, we compute the expected filter error relative to grains,

    E[ε[t]] = Σ_{t=0}^{M} ε[t] p_ss(t)    (150)

Figure 36. Binarized image with filled boundaries.
Figure 37. Output of the trained filter applied to the image in Figure 36.
where M is the maximum parameter value. The optimal filter is the one with minimum expected steady-state error. For CGH analysis, an exhaustive search of the 216 filters shows the best one to be

    Y = Z_1^1(t) Z_2^1(t) Z_3^1(t) Z_4^1(t)
(151)
for which t has steady-state mean 23.5 and standard deviation 0.69, and for which the expected error is 1.6% (80.0/5000). For the heuristically selected LSF form of Eq. (149), t has steady-state mean 32.9 and standard deviation 1.38, and expected error 2.4% (120.1/5000). Its error is 50% greater than that of the optimal LSF. The result of the optimal search can be appreciated by referring to Figure 38, in which the four structuring elements have equal length. If the filter in Eq. (149) is chosen, then to eliminate the nonoverlapping chromosome (the noise grain in this model) the sizing parameter must be sufficiently large that the product Z_1 Z_2 = 0; however, increasing the parameter reduces the chance of fitting an
Figure 38. Result of the optimal search: four structuring elements of equal length.
TABLE 1
Top 10 Performing LSF Forms

Filter form                                                     Expected error   Mean sizing parameter
Z_1^1 Z_2^1 Z_3^1 Z_4^1                                              80.02        23.50 ± 0.69
Z_1^1 Z_2^1 Z_4^1                                                    80.96        23.65 ± 0.71
Z_1^1 Z_2^1 Z_3^1 Z_4^1 + Z_2^1 Z_3^0 Z_4^0                          82.96        23.84 ± 0.74
Z_1^1 Z_2^1 Z_3^1 Z_4^1 + Z_1^1 Z_2^1 Z_3^0 Z_4^0                    83.17        23.74 ± 0.76
Z_1^1 Z_2^1 Z_4^1 + Z_1^1 Z_2^1 Z_3^0                                84.25        23.66 ± 0.78
Z_1^1 Z_2^1 Z_4^1 + Z_1^1 Z_2^0 Z_3^0 Z_4^1                          85.67        23.81 ± 0.80
Z_1^1 Z_2^1 Z_3^1 Z_4^1 + Z_1^1 Z_3^0 Z_4^0                          88.36        23.91 ± 0.84
Z_1^1 Z_2^1 Z_3^1 + Z_1^1 Z_2^1 Z_4^1                                90.03        23.60 ± 0.71
Z_1^1 Z_2^1 Z_3^1                                                    90.05        23.66 ± 0.70
Z_1^1 Z_2^1 Z_3^1 Z_4^1 + Z_1^1 Z_3^0 Z_4^0 + Z_2^1 Z_3^0 Z_4^0      90.27        23.69 ± 0.81
overlapping chromosome grain (signal grain), which in turn increases the error. However, for the filter of Eq. (151), the nonoverlapping chromosome does not pass the filter because Z_3 does not fit anywhere within the chromosome. Table 1 lists the top 10 performing LSF forms. The filter of Eq. (149) is rated in 103rd place.
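The selection rule of Eq. (150) amounts to a weighted average of the error function under the empirical steady-state distribution. A minimal sketch follows, with invented numbers for three of the 216 candidate forms; in practice eps and p_ss would come from the adaptation runs on the 5000 training grains.

    import numpy as np

    def expected_error(eps, p_ss):
        # E[eps[t]] = sum_t eps[t] p_ss(t), Eq. (150)
        return float(np.dot(eps, p_ss))

    # Hypothetical error functions and steady-state distributions over three
    # parameter states, for three candidate LSF forms (values are made up).
    candidates = {
        "Z1^1 Z2^1 Z3^1 Z4^1":   (np.array([0.10, 0.02, 0.05]), np.array([0.2, 0.6, 0.2])),
        "Z1^1 Z2^1 + Z3^1 Z4^1": (np.array([0.12, 0.03, 0.06]), np.array([0.1, 0.7, 0.2])),
        "Z1^1 Z2^1 Z4^1":        (np.array([0.11, 0.02, 0.07]), np.array([0.3, 0.5, 0.2])),
    }
    best = min(candidates, key=lambda k: expected_error(*candidates[k]))
    print("selected form:", best)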
References

Baeg, S., Batman, S., Dougherty, E. R., Kamat, V., Kehtarnavaz, N. D., Kim, S., Popov, A., Sivakumar, K., and Shah, R. (1999). Unsupervised morphological granulometric texture segmentation of digital mammograms. Electron. Imaging 8(1).
Batman, S., and Dougherty, E. R. (1997). Size distributions for multivariate morphological granulometries: Texture classification and statistical properties. Opt. Eng. 36(5).
Beucher, S., and Meyer, F. (1992). The morphological approach to segmentation: The watershed transformation, in Mathematical Morphology in Image Processing, edited by E. R. Dougherty. New York: Dekker.
Chakravarthy, B., Grivas, D., and Skolnick, M. (1993). Morphological analysis of pavement surface condition, in Mathematical Morphology in Image Processing, edited by E. R. Dougherty. New York: Dekker.
Chen, Y., and Dougherty, E. R. (1994). Gray-scale morphological granulometric texture classification. Opt. Eng. 33(8).
Chen, Y., and Dougherty, E. R. (1996). Adaptive reconstructive τ-openings: Convergence and the steady-state distribution. Electron. Imaging 5(3).
Chen, Y., and Dougherty, E. R. (1997). Optimal and adaptive reconstructive granulometric bandpass filters. Signal Processing 61.
Chen, Y., and Dougherty, E. R. (1999). Markovian analysis of adaptive reconstructive multiparameter τ-openings. Math. Imaging Vision 9(1).
Chen, Y., Dougherty, E. R., Totterman, S., and Hornak, J. (1993). Classification of trabecular structure in magnetic resonance images based on morphological granulometries. Magn. Reson. Med. 29(3).
Crespo, J., and Schafer, R. W. (1997). Locality and adjacency stability constraints for morphological connected operators. Math. Imaging Vision 7(1).
Crespo, J., Serra, J., and Schafer, R. W. (1995). Theoretical aspects of morphological filters by reconstruction. Signal Processing 47(2).
Dougherty, E. R. (1992). Euclidean gray-scale granulometries: Representation and umbra inducement. Math. Imaging Vision 1(1).
Dougherty, E. R. (1997). Optimal binary morphological bandpass filters induced by granulometric spectral representation. Math. Imaging Vision 7(2).
Dougherty, E. R. (1999). Translation-invariant set operators, in Nonlinear Filters for Image Processing, edited by E. R. Dougherty and J. T. Astola. Bellingham, WA: SPIE and IEEE Presses.
Dougherty, E. R. (2001). Optimal conjunctive granulometric bandpass filters. Math. Imaging Vision 14(1).
Dougherty, E. R., and Chen, Y. (1997). Logical granulometric filtering in the signal-union-clutter model, in Random Sets: Theory and Applications, edited by J. Goutsias, R. Mahler, and C. Nguyen. New York: Springer-Verlag.
Dougherty, E. R., and Chen, Y. (1998). Logical structural filters. Opt. Eng. 37(6).
Dougherty, E. R., and Chen, Y. (1999). Granulometric filters, in Nonlinear Filters for Image Processing, edited by E. R. Dougherty and J. T. Astola. Bellingham, WA: SPIE and IEEE Presses.
Dougherty, E. R., and Cuciurean-Zapan, C. (1997). Optimal reconstructive τ-openings for disjoint and statistically modeled nondisjoint grains. Signal Processing 56.
Dougherty, E. R., Haralick, R. M., Chen, Y., Agerskov, C., Jacobi, U., and Sloth, P. H. (1992a). Estimation of optimal τ-opening parameters based on independent observation of signal and noise pattern spectra. Signal Processing 29.
Dougherty, E. R., Newell, J. T., and Pelz, J. B. (1992b). Morphological texture-based maximum-likelihood pixel classification based on local granulometric moments. Pattern Recogn. 25(10).
Dougherty, E. R., and Pelz, J. (1991). Morphological granulometric analysis of electrographic images—Size distribution statistics for process control. Opt. Eng. 30(4).
Dougherty, E. R., Pelz, J., Sand, F., and Lent, A. (1992c). Morphological segmentation by local granulometric size distributions. Electron. Imaging 1(1).
Dougherty, E. R., and Sand, F. (1995). Representation of linear granulometric moments for deterministic and random binary Euclidean images. Vis. Commun. Image Representation 6(1).
Giardina, C. R., and Dougherty, E. R. (1988). Morphological Methods in Image and Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
Haralick, R. M., Katz, P. L., and Dougherty, E. R. (1995). Model-based morphology: The opening spectrum. CVGIP: Graphical Models Image Processing 57(1).
Heijmans, H. J. (1995). Morphological Operators. New York: Academic Press.
Heijmans, H. J. (1999). Introduction to connected operators, in Nonlinear Filters for Image Processing, edited by E. R. Dougherty and J. T. Astola. Bellingham, WA: SPIE and IEEE Presses.
Jackson, J. R. (1957). Jobshop-like queueing systems. Manage. Sci. 5.
Jackson, J. R. (1963). Networks of waiting lines. Operational Res. 10(1).
Kraus, E., Heijmans, H. J., and Dougherty, E. R. (1993). Gray-scale morphological granulometries compatible with spatial scaling. Signal Processing 34.
Maragos, P. (1989). Pattern spectrum and multiscale shape representation. IEEE Trans. Pattern Anal. Machine Intell. 11.
Matheron, G. (1975). Random Sets and Integral Geometry. New York: Wiley.
Meyer, F., and Beucher, S. (1990). Morphological segmentation. Vis. Commun. Image Representation 1(1).
Piper, J., Rutovitz, D., Sudar, D., Kallioniemi, A., Kallioniemi, O.-P., Waldman, F., Gray, J., and Pinkel, D. (1995). Computer image analysis of comparative genomic hybridization. Cytometry 19.
Sand, F., and Dougherty, E. R. (1992). Asymptotic normality of the morphological pattern-spectrum moments and orthogonal granulometric generators. Vis. Commun. Image Representation 3(2).
Sand, F., and Dougherty, E. R. (1998). Asymptotic granulometric mixing theorem: Morphological estimation of sizing parameters and mixture proportions. Pattern Recogn. 31(1).
Schonfeld, D., and Goutsias, J. (1991). Optimal morphological pattern restoration from noisy binary images. IEEE Trans. Pattern Anal. Machine Intell. 13.
Serra, J. (1982). Image Analysis and Mathematical Morphology. New York: Academic Press.
Sivakumar, K., and Goutsias, J. (1997). Discrete morphological size distributions and densities: Estimation techniques and applications. Electron. Imaging 6(1).
Theera-Umpon, N., and Gader, P. (2000). Counting white blood cells using morphological granulometries. Electron. Imaging 9(2).
Vincent, L., and Dougherty, E. R. (1994). Morphological segmentation for textures and particles, in Digital Image Processing Methods, edited by E. R. Dougherty. New York: Dekker.
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 117
Dyadic Warped Wavelets

GIANPAOLO EVANGELISTA

Audio-Visual Communications Laboratory, Federal Institute of Technology, EPFL, Ecublens, CH-1015 Lausanne, Switzerland; and Department of Physical Sciences, University of Naples "Federico II," Complesso Universitario MSA, I-80126 Naples, Italy
I. Introduction
II. Warped Wavelets in Brief
III. Multiresolution Approximation
    A. General Multiresolution Approximation
    B. Frequency Warping Operators
       1. Warped Fourier Analysis
    C. Warped Multiresolution Approximation
    D. Globally Warped Wavelets
IV. From WMRA and Warped Scaling Functions to Warped QMFs
    A. Warped Riesz Bases
    B. Warped Scaling Functions
    C. Warped Two-Scale Equations
    D. Warped Quadrature Mirror Filters
V. From Warped QMF to Warped Scaling Functions and WMRA
    A. L²(R) Orthogonality of the Scaling Function System
    B. Satisfaction of the GMRA Axioms
VI. Warped Wavelets
    A. Regularity
VII. Construction of Iterated Warping Maps
    A. Realizable Discrete-Time Warping Maps: The Laguerre Transform
    B. Laguerre Warped Filter Banks and Discrete-Time Wavelets
       1. Two-Channel Laguerre Warped Filter Bank
       2. Discrete-Time Warped Wavelets
    C. Schröder's Equation and Generalized Königs' Models
       1. Realization of Warping Maps by Iterated Laguerre Maps: The Constant Parameter Case
       2. Realization of Warping Maps by Iterated Laguerre Maps with Variable Parameter
VIII. Computation of the Dyadic Warped Wavelet Transform
IX. Conclusions
References
I. Introduction

Since its introduction in 1982 (Morlet et al., 1982), the wavelet paradigm has provided new ways of looking at signals. The underlying premise is that signal decompositions in terms of a trend plus fluctuations (details) at several scales may be more interesting or more efficient than conventional representations based on local Fourier analysis. This is particularly apparent in image processing where, depending on the distance from the camera and the zoom factor, the same object may appear at different scales in different pictures. Machine analysis of images must take this aspect into consideration, for example by including variable-resolution capabilities.

The same concept also has relevance in the representation of audio signals. The analysis of most features of sound requires mixed time and frequency characterization. The human hearing sense is particularly gifted at classifying acoustic patterns according to both their time of occurrence and the brightness of their local frequency spectrum. Gabor expansions and the short-time Fourier transform (STFT) were among the first tools employed to capture the time–frequency picture in terms of a spectrogram. Since time and frequency are conjugate variables, the uncertainty principle states that we cannot achieve indefinitely accurate resolution in both time and frequency domains. Local Fourier analysis is based on a uniform frequency resolution, while our perception is based on a nonlinear frequency scale, as reflected by both cochlear models and psychoacoustic theories. In Western music, musical tones are arranged on a tempered scale (i.e., tones that are spaced by one octave, corresponding to power-of-2 frequency ratios, are given the same name); tones within the same octave progress as powers of 2^(1/12). While the tempered scale is to be considered a compromise between complexity and accuracy, it does reflect the "power-of-something" organization of wavelet representations.

In wavelet analysis, signals are correlated to several scaled versions of a unique bandpass function, the wavelet. Scaling a bandpass function changes both its passband and its center frequency. Hence, the wavelet representation has nonuniform frequency resolution. Since wavelets are localized in time, scaling also changes the time resolution while preserving the uncertainty product. As a result, higher frequencies are represented with higher time resolution and lower frequencies with lower time resolution.

One of the first formal contributions to the theory of wavelets (Grossmann and Morlet, 1984) was the definition of the integral wavelet transform, in which scale and shift are continuous parameters and a 1D signal leads to a 2D representation. The integral transform is redundant and computationally not efficient. Redundancy can be removed by suitable sampling, leading to a discrete 2D representation. This very idea generated efficient algorithms
as well. Time-scale sampling led to the construction of bases adjusted on a nonuniform time–frequency scale (Daubechies, 1988) and to fast iterative algorithms for their computation based on multirate filter banks (Mallat, 1989) for the dyadic case where scale is sampled to powers of 2. Excellent reference books in the field are those of Daubechies (1992), Mallat (1998), Strang and Nguyen (1996), and Vetterli and Kovacevic (1995), among others.

However, more research was necessary because the dyadic wavelet bases constrain the frequency resolution to one octave, which is clearly not sufficient in many applications, including audio. Attempts to improve the frequency resolution of wavelet bases by a rational factor, while feasible from a computational point of view, led to difficult design procedures in order to satisfy regularity constraints (Blu, 1998). In practice, the frequency resolution factor is limited to ratios of small integers. The invariance in shape of the wavelets at different scales is also lost. Other attempts led to wavelet packets (Coifman and Wickerhauser, 1992), which disrupt the scale concept by subdividing each dyadic analysis band into uniform bands.

Recently, frequency warping was introduced in wavelet analysis to provide wavelets with arbitrary frequency resolution (Baraniuk and Jones, 1993, 1995). Frequency warping techniques allow for arbitrary choice of analysis bands by means of a deformation of the frequency axis. In fact, by deforming the frequency axis according to a suitable map, one can adjust the signal to the analysis bands of the transform (e.g., smaller bands are mapped into octaves). This can be achieved in an arbitrary fashion, which is an important point if we want to adapt the signal representation to either objective or perceptual bands associated to a single feature. For example, in audio processing one may be interested in adjusting the transform on a perceptual scale (e.g., Bark or ERB). However, frequency warping a continuous-time signal is not an exactly computable and efficient operation.

Fortunately, there exist computationally attractive methods for frequency warping discrete-time signals. These are based on the discrete Laguerre transform, whose computation is achieved by means of a chain of digital all-pass filters in a structure based on rational transfer functions (Braccini and Oppenheim, 1974; Broome, 1965; Oppenheim et al., 1971; Oppenheim and Johnson, 1972). This transform can be shown to be the unique solution for implementing one-to-one warping maps in digital structures. In Evangelista and Cavaliere (1998d) the author applied the Laguerre transform to the discrete-time wavelet transform and the pitch-synchronous wavelet transform (Evangelista, 1993, 1994) in order to obtain, respectively, design-constrained flexible-bandwidth wavelet analysis and pitch-synchronous adapted analysis of quasi-periodic inharmonic signals.

The Laguerre transform yields a one-parameter family of frequency warping maps. Although the family is quite rich, it does not allow for true arbitrary bandwidth selection when cascaded with the wavelet transform. In particular,
one cannot exactly achieve the desired power-of-something character of ideal wavelet transforms. In Evangelista and Cavaliere (1998c) the author showed that iterated frequency warping by means of Laguerre maps leads to wavelets with arbitrary frequency resolution, whose bandwidths can be assigned by selecting a set of parameters. Interestingly enough, the resulting wavelets are still based on a dyadic scheme, which is a natural setting for iterated band splitting. The computation of the transform is achieved by intertwining the discrete-time Laguerre transform with a two-channel critically sampled quadrature mirror filter bank. This defines an equivalent frequency warped filter bank whose resampling operators are modified to warped operators, equivalent to resampling in the Laguerre domain. The general scheme preserves the pruned-tree structure typical of ordinary dyadic wavelets, in which all the filter banks are suitably warped.

In this article, the case where the choice of the design parameters leads to orthogonal and complete wavelets with arbitrary resolution factor 0 < a < 1 is examined in detail. Single-step continuous-time frequency warping requires an a-homogeneous self-similar warping map and is otherwise a trivial generalization of the wavelet transform, obtained by global unitary equivalence. The construction of continuous-time wavelets based on iterated frequency warping requires an extension of many theorems in wavelet analysis.

We begin by considering the general multiresolution approximation in which shift operators are replaced by generic unitary operators. As a particular case we consider a group of frequency-dependent generalized shift operators. We show that these operators are unitarily equivalent to ordinary shift operators via the warping operator and, together with the dilation operator, generate the warped multiresolution approximation. Next, the theory of warped Riesz bases is presented together with a generalization of shift-invariant subspaces. The introduction of the dilation operator brings us to warped scaling functions and warped two-scale equations. We show that the warped multiresolution approximation leads to orthogonal and complete frequency warped dyadic wavelets. Conversely, warped scaling functions defined by means of infinite products of iteratively warped mirror filters lead to a warped multiresolution approximation.

Next we move to the construction of the iterated warping map. We show that iterated warping is most conveniently associated with conjugacy and Schröder's equation, which leads to a generalized form of self-similarity. This property is also useful for showing convergence of the parameter sequence via an extension of Königs' theorem (Shapiro, 1998) to parametric maps on the real axis. We show that the Laguerre family of maps satisfies all the properties required to define a multiresolution approximation on arbitrary scale, from which the mentioned computational structure follows.

Like the globally frequency warped wavelets, the dyadic warped wavelets are not obtained by scaling and translating a unique mother wavelet. While
invariance in shape by scaling is preserved, the wavelets at each scale level are obtained by repeated application of the generalized frequency-dependent shift. As a result, time shift is replaced by a continuous-time all-pass filtering operation, resulting in a discrete-time all-pass filtering operation for the computation of the analysis coefficients.

The article is organized as follows. In Section II we outline the construction of warped wavelets and illustrate some of the alternatives in the use and definition of iterated warped wavelets. In Section III we illustrate a general construction of multiresolution approximation based on dilation and generalized shift operators. This brings us naturally to the definition of a unitary warping operator and to its application to Fourier and wavelet analysis. In Section IV we show that the warped multiresolution approximation based on warped Riesz bases leads to warped forms for the two-scale equations and to the definition of warped quadrature mirror filters and warped scaling functions. In Section V we show that, under proper conditions, warped quadrature mirror filters generate well-defined warped scaling functions in L²(R) forming an orthogonal scaling system. There we show that the axioms of the warped multiresolution approximation are indeed satisfied. In Section VI we define warped wavelets from warped scaling functions and investigate their regularity. In Section VII we provide procedures for the construction of warping maps oriented to the discrete computation of warped wavelet expansions. In this context we show that the discrete-time Laguerre transform is a fundamental tool. We detail properties of the Laguerre warped filter banks and discrete-time warped wavelets. An extension of Königs' models for the determination of the eigenfunctions of the composition operator is presented, whose application to the iterated Laguerre maps is crucial for checking the consistency of the theory. In Section VIII we are concerned with the computation of the warped wavelet expansion from zero-level signal approximations. This leads to the construction of asymptotically scale-a warped wavelets based on variable-parameter iterated Laguerre maps.
II. Warped Wavelets in Brief

Recently, frequency warping has been considered as a gateway to wavelets with arbitrary frequency resolution. However, the success of dyadic wavelets rests upon the availability of a fast algorithm for the computation of the transform. Our iterated frequency warping construction allows for the definition of warping maps resulting from the interaction of discrete-time frequency warping and dyadic upsampling.

The frequency warping operation considered in this article is obtained by the application of a unitary linear operator in L²(R) whose action on a function f
is defined in the Fourier domain as follows:

    F[W f](ω) = √(dŴ/dω) F[f](Ŵ(ω))
(1)
where F denotes the L²(R) Plancherel extension of the Fourier transform operator. The operator W includes frequency axis deformation by means of a continuously differentiable map Ŵ(ω) and energy equalization by multiplication by the square root of the derivative Ŵ′(ω). Clearly, application of global frequency warping is always possible in order to transform wavelet bases into other bases whose frequency resolution is controlled by the map Ŵ(ω). Unfortunately, frequency warping of continuous-time signals cannot be exactly computed. We have to resort to an incremental warping method in which the map Ŵ(ω) results from an infinite iteration of discrete-time frequency warping operations. Not all maps Ŵ(ω) are nicely embedded in the iterative algorithm for computing signal expansions over dyadic wavelet bases. In this article we show that the following condition

    (1/2) Ŵ(a⁻¹ Ŵ⁻¹(ω + 2kπ)) = (1/2) Ŵ(a⁻¹ Ŵ⁻¹(ω)) + 2kπ
(2)
is sufficient for embedding scale a wavelets in the dyadic scheme. One should recognize that Eq. (2) imposes a strong structure on the map Ŵ(ω). In fact, the function θ(ω) = (1/2)Ŵ(a⁻¹Ŵ⁻¹(ω)) is well suited as the characteristic of a discrete-time warping operator acting, in the Fourier domain, from L²([−π, +π]) onto itself. By iteration of the frequency-halved inverse map θ⁻¹(ω/2) = Ŵ(aŴ⁻¹(ω)) one can define an auxiliary warped scaling function ϕ̃, given in the Fourier domain as follows:

    Φ̃(ω) = F[ϕ̃](ω) = ∏_{k=0}^{∞} H((1/2) Ŵ(a^k Ŵ⁻¹(ω))) / √2
where H is the frequency response of the low-pass section of a quadrature mirror filter bank. If Ŵ(ω) = ω and a = 1/2, one can easily recognize that the warped scaling function reverts to the scaling function associated to ordinary wavelet bases. The pure translates ϕ̃(t − m), m ∈ Z, are shown to form an orthogonal Riesz basis for a linear space that can be used as reference space in a generalized multiresolution approximation scheme. However, one should recognize that frequency resolution increase is not obtained by pure dilation in this case. It is easy to see that the scaling function ϕ̂ = Wϕ̃, whose Fourier transform is

    Φ̂(ω) = √(dŴ/dω) ∏_{k=0}^{∞} H((1/2) Ŵ(a^k ω)) / √2

nicely works with respect to a dilation-by-a operator D_a, which makes it a good
candidate for the generation of scale a wavelets. However, if the translates of ϕ̃ generate an orthogonal basis, the same is not true for the translates of ϕ̂. Clearly, since the warping operator is unitary, the basis generator operator is a warped form of the shift-by-one operator, corresponding to multiplication by e^{−jŴ(ω)} in the Fourier domain. Correspondingly, wavelets at the same scale level are obtained by all-pass filtering a unique function.

Our construction of generalized multiresolution approximation is based on these simple ideas. We show that starting from either ϕ̂ or ϕ̃ one can construct orthogonal warped wavelet bases that are amenable to discrete computation. However, considerations related to the computation of the zero-level approximation lead us to the conclusion that expansion on the wavelet basis generated by ϕ̂ requires signal prewarping, which, as already pointed out, cannot be implemented. While not requiring signal prewarping, the basis generated by ϕ̃ is not rooted on a pure scale a multiresolution. A better solution considers the use of the analogon of ϕ̃ built with heterogeneous warping maps. A theorem on iterated maps and eigenfunctions of the composition operator proves that the same map Ŵ can be realized in infinitely many ways with equal or unequal maps. We present an algorithm for building approximate scale a wavelets such that the computation of the associated signal expansion does not require signal prewarping.

A key issue is that the Laguerre transform represents the unique discrete-time unitary warping operator that is computable by means of realizable signal processing structures. This means that in order to obtain warped wavelets in a computable form one needs to constrain the map θ(ω) to belong to the one-parameter family of Laguerre maps. In the heterogeneous map case, the Laguerre parameters can be obtained by enforcing an exponential cutoff condition on the scaling function. The computational structure intertwines discrete-time Laguerre warping with quadrature mirror filter (QMF) banks. This gives rise to a warped form of QMF, equivalent to frequency warping both filters and resampling operators.
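As an illustration of the infinite product above, the following Python sketch evaluates a truncated version of Φ̃(ω) numerically. The Haar low-pass response and the toy map Ŵ(ω) = ω + 0.3 sin ω are arbitrary choices made for the example; they are not the maps constructed later in the article.

    import numpy as np

    # Truncated product Phi_tilde(w) = prod_{k=0..K-1} H(Gamma(a^k Gamma^-1(w))/2)/sqrt(2).
    # Gamma stands for the warping characteristic; its inverse is tabulated.
    H = lambda w: (1 + np.exp(-1j * w)) / np.sqrt(2)     # Haar low-pass, |H(0)| = sqrt(2)
    Gamma = lambda w: w + 0.3 * np.sin(w)                # toy increasing map, Gamma(0) = 0
    grid = np.linspace(-64 * np.pi, 64 * np.pi, 2**16)
    Gamma_inv = lambda w: np.array([np.interp(x, Gamma(grid), grid)
                                    for x in np.atleast_1d(w)])

    a, K = 0.5, 25

    def Phi_tilde(w):
        x = Gamma_inv(w)
        out = np.ones_like(w, dtype=complex)
        for k in range(K):
            out *= H(Gamma(a**k * x) / 2) / np.sqrt(2)   # one factor of the product
        return out

    w = np.linspace(-np.pi, np.pi, 5)
    print(np.abs(Phi_tilde(w)))                          # low-pass: largest at w = 0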
III. Multiresolution Approximation

In this section we define a general multiresolution approximation scheme mimicking the dyadic multiresolution scheme on which ordinary wavelet sets are based (Mallat, 1998). The new ingredient is the introduction of a family of unitary operators replacing the shift operators of the ordinary multiresolution approximation. The case of interest for the construction of dyadic warped wavelets is when the family of unitary operators forms a group and is completely characterized by a unique phase function. This is specialized in the warped multiresolution approximation. The dilation operator considered here has arbitrary scale factor a < 1. This will allow for the construction of dyadic warped wavelets with arbitrary scale.
A. General Multiresolution Approximation

Inspired by the definition of multiresolution approximation on dyadic scale (Mallat, 1998), we can provide the following definition of multiresolution approximation on arbitrary scale, allowing room for general shift operators as "translation" generators.

Definition III.1 A general multiresolution approximation (GMRA) of L²(R) is a sequence {V_n}_{n∈Z} of closed subspaces of L²(R) satisfying the following properties:

• {V_n}_{n∈Z} is decreasing; that is,

    V_{n+1} ⊂ V_n,  ∀n ∈ Z    (3)

• There exists a family of unitary operators F = {U_k}_{k∈Z} in L²(R), with U_0 = I (identity operator) and U_{−k} = U_k† (adjoint operator), such that V_0 is closed under the action of F; that is,

    f ∈ V_0 ⟺ U_m f ∈ V_0,  ∀m ∈ Z    (4)

• There exists a real number a ∈ (0, 1) called the scale factor such that

    ∀n ∈ Z,  f ∈ V_n ⟺ D_a f ∈ V_{n+1}    (5)

  where D_a is the unitary dilation operator

    [D_a f](t) = √a f(at)    (6)

  with inverse D_a⁻¹ = D_{1/a} and nth power D_a^n = D_{a^n}.

• ⋂_{n∈Z} V_n = {0}    (7)

• The closure of ⋃_{n∈Z} V_n is L²(R)    (8)

• There exists a function ζ such that {U_m ζ(t)}_{m∈Z} is a Riesz basis of V_0.
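The dilation axiom is easy to check numerically. The following sketch verifies, on a sampled Gaussian, that the operator of Eq. (6) preserves the L²(R) norm and that its nth power coincides with D_{a^n}; all numerical settings are illustrative choices.

    import numpy as np

    a, n = 0.5, 3
    t = np.linspace(-50, 50, 200001)
    dt = t[1] - t[0]
    f = lambda x: np.exp(-x**2)

    Da = lambda g, s: (lambda x: np.sqrt(s) * g(s * x))   # D_s of Eq. (6)
    g = f
    for _ in range(n):
        g = Da(g, a)                                      # D_a applied n times
    h = Da(f, a**n)                                       # D_{a^n} applied once

    norm = lambda u: np.sqrt(np.sum(np.abs(u(t))**2) * dt)
    print(norm(f), norm(g))                   # equal: D_a is isometric
    print(np.max(np.abs(g(t) - h(t))))        # ~ 0: D_a^n = D_{a^n}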
Remark III.1 From (4) and (5) it follows that

    ∀(n, m) ∈ Z²,  f ∈ V_n ⟺ D_a^n U_m D_a^{−n} f ∈ V_n
(9)
In fact, by (5), f ∈ V_n ⟺ D_a^{−n} f ∈ V_0, and by (4), g ∈ V_0 ⟺ U_m g ∈ V_0, so that

    f ∈ V_n ⟺ U_m D_a^{−n} f ∈ V_0 ⟺ D_a^n U_m D_a^{−n} f ∈ V_n

Example III.1 An example of GMRA is obtained from a family F of unitary operators forming a group G with respect to the product of operators. Let T be the generator of the group and suppose that the group is canonically ordered (i.e., U_m = T^m, ∀m ∈ Z). Then U_{m+n} = U_m U_n = U_n U_m. Notice that the GMRA reverts to the ordinary multiresolution approximation (MRA) if the generator T = S, where S is the shift-by-one operator [Sf](t) = f(t − 1).

Remark III.2 In this article we will be mostly concerned with groups of unitary operators generated by a time-invariant generalized shift operator T defined in the Fourier domain by the product

    F[T f](ω) = F(ω) Q_{0,1}(ω)
where F denotes the Fourier–Plancherel transform on L²(R) and F(ω) = F[f](ω). Since T is unitary, ∀f ∈ L²(R) we must have

    ‖f‖² − 2‖T f‖² + ‖T†T f‖² = 0

By the Plancherel theorem the last equality becomes

    (1/2π) ∫_{−∞}^{+∞} |F(ω)|² (1 − |Q_{0,1}(ω)|²)² dω = 0
(10)
This implies that |Q_{0,1}(ω)|² = 1 a.e. Except on a set of zero Lebesgue measure, we can write

    Q_{0,1}(ω) = e^{−jŴ(ω)}
(11)
where Ŵ is a real function of ω ∈ R. The action of T is equivalent to multiplication by e^{−jŴ(ω)} in the Fourier domain. By repeated application of (11), for any m ∈ Z the action of T^m is equivalent to multiplication by Q_{0,m}(ω) = e^{−jmŴ(ω)} in the Fourier domain. Furthermore, the action of the dilation operator may be described in the Fourier domain by

    F[D_a f](ω) = a^{−1/2} F(a⁻¹ω)
(12)
It follows that the action of the operator D_a^n T^m D_a^{−n} in Remark III.1 is equivalent to multiplication by

    Q_{n,m}(ω) = e^{−jmŴ(a^{−n}ω)}
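On sampled data, the generalized shift is simply a phase multiplication in the FFT domain. The sketch below applies T^m to a discrete signal; with Ŵ(ω) = ω it reduces to an ordinary delay, which provides an easy correctness check. The grid size and test signal are arbitrary choices.

    import numpy as np

    N = 1024
    omega = 2 * np.pi * np.fft.fftfreq(N)            # frequencies in (-pi, pi]

    def generalized_shift(x, m, Gamma):
        # T^m: multiply the spectrum by exp(-j m Gamma(omega));
        # Gamma stands for the warping characteristic of Eq. (11).
        return np.fft.ifft(np.fft.fft(x) * np.exp(-1j * m * Gamma(omega)))

    x = np.zeros(N); x[100] = 1.0                    # unit impulse at sample 100
    y = generalized_shift(x, 5, lambda w: w)         # Gamma = identity: delay by 5
    print(np.argmax(np.abs(y)))                      # -> 105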
B. Frequency Warping Operators

Discrete- and continuous-time frequency warping operators are at the heart of our construction of dyadic warped wavelets. We consider a regular class of warping operators whose warping characteristic is a smooth monotonically increasing function. We provide the following definition.

Definition III.2 Suppose that Ŵ is a strictly increasing and continuously differentiable real function.* Given any f ∈ L²(R), the warping operator W is defined by its action in the Fourier domain as follows:

    F[W f](ω) = √(dŴ/dω) F(Ŵ(ω))    (13)

Based on Definition III.2 we can show the following lemma.

Lemma III.1 The warping operator W is unitary in L²(R). Its adjoint W† has Fourier domain action

    F[W† f](ω) = √(dŴ⁻¹/dω) F(Ŵ⁻¹(ω))    (14)

Proof. The warping operator preserves the L²(R) norm; in fact,

    ‖W f‖² = (1/2π) ∫_{−∞}^{+∞} (dŴ/dω) |F(Ŵ(ω))|² dω
(15)
where, by Definition III.2, Ŵ is strictly increasing and continuously differentiable. Thus Ŵ is invertible and we can perform the change of variable α = Ŵ(ω) in (15), obtaining ‖W f‖² = ‖f‖². With the same change of variable one can show that, ∀(f, g) ∈ L²(R) × L²(R),

    ⟨W f, W g⟩ = ⟨f, g⟩

* Throughout this article we consider strictly increasing maps in C¹(R). However, most of our results can be easily generalized to maps in C¹(R) whose first derivative is zero at most on a countable set of points.
Thus, the warping operator is isometric. Moreover, W is surjective on L²(R). In fact, let g ∈ L²(R) and pick

    f(t) = (1/2π) ∫_{−∞}^{+∞} √(dŴ⁻¹/dω) G(Ŵ⁻¹(ω)) e^{jωt} dω

where G = F[g]. Then

    F(ω) = √(dŴ⁻¹/dω) G(Ŵ⁻¹(ω))

and

    F[W f](ω) = √(dŴ/dω) √(dŴ⁻¹/dα)|_{α=Ŵ(ω)} G(Ŵ⁻¹ ∘ Ŵ(ω)) = G(ω)
Thus, for any g ∈ L²(R) there exists an f ∈ L²(R) such that W f = g. It follows that W is invertible in L²(R) with inverse

    F[W⁻¹ f](ω) = √(dŴ⁻¹/dω) F(Ŵ⁻¹(ω))

Since W is isometric and invertible in L²(R), W is unitary with W† = W⁻¹.

1. Warped Fourier Analysis

1a. Warped Fourier Series

Consider a function F ∈ L²([−π, π]). The Fourier series

    Σ_{n∈Z} f[n] e^{−jnω}
where

    f[n] = (1/2π) ∫_{−π}^{+π} F(ω) e^{jnω} dω,  n ∈ Z
(16)
converges in the L²([−π, π]) sense to F(ω). Let θ ∈ C¹([−π, π]) be a strictly increasing function mapping [−π, π] one-to-one and onto itself. (This hypothesis can be weakened a bit by including one-to-one maps in C¹([−π, π]) whose first derivative is zero at most on a countable set of points; in this case the derivative θ′ must be replaced by its absolute value.) In (16) we
can perform the change of variable ω = θ(Ω) to obtain

    f[n] = (1/2π) ∫_{−π}^{+π} (dθ/dΩ) F(θ(Ω)) e^{jnθ(Ω)} dΩ = (1/2π) ∫_{−π}^{+π} √(dθ/dΩ) F(θ(Ω)) Λ_n*(Ω) dΩ    (17)

where

    Λ_n(ω) ≡ √(dθ/dω) e^{−jnθ(ω)},  n ∈ Z

The set {Λ_n}_{n∈Z} is orthonormal and complete in L²([−π, π]). In fact,

    (1/2π) ∫_{−π}^{+π} Λ_n*(ω) Λ_m(ω) dω = (1/2π) ∫_{−π}^{+π} (dθ/dω) e^{j[m−n]θ(ω)} dω = (1/2π) ∫_{−π}^{+π} e^{j[m−n]Ω} dΩ = δ_{n,m}

Furthermore, given any G ∈ L²([−π, π]) one can find F ∈ L²([−π, π]) such that

    F(ω) = √(dθ⁻¹/dω) G(θ⁻¹(ω))

and, in the L²([−π, π]) sense,

    G(ω) = √(dθ/dω) F(θ(ω)) = √(dθ/dω) Σ_{n∈Z} f[n] e^{−jnθ(ω)} = Σ_{n∈Z} f[n] Λ_n(ω)
(18)
where, from (17),

    f[n] = (1/2π) ∫_{−π}^{+π} G(ω) Λ_n*(ω) dω    (19)
Equations (18) and (19) define a warped form of the Fourier series expansion. The relationship between the coefficients g[n] of the Fourier series of G(ω) and those of the warped Fourier series is also interesting. By Parseval's theorem, from (19) we have

    f[n] = Σ_{k∈Z} g[k] λ_n*[k]    (20)

and from (18) we have

    g[k] = Σ_{n∈Z} f[n] λ_n[k]    (21)
where

    λ_n[k] = (1/2π) ∫_{−π}^{+π} Λ_n(ω) e^{jkω} dω = (1/2π) ∫_{−π}^{+π} √(dθ/dω) e^{j[kω−nθ(ω)]} dω
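The unitarity of this kernel can be verified numerically. In the sketch below, θ is a Laguerre-type phase (anticipating Section VII) obtained by integrating a standard all-pass group delay; the parameter b and the truncation orders are illustrative choices.

    import numpy as np

    b, M = 0.2, 4096
    w = np.linspace(-np.pi, np.pi, M, endpoint=False)
    dw = 2 * np.pi / M
    dth = (1 - b**2) / (1 - 2 * b * np.cos(w) + b**2)   # theta'(w) > 0
    th = -np.pi + np.cumsum(dth) * dw                   # theta(w), onto [-pi, pi)

    # lambda_n[k] by quadrature, for |n| <= 4 and |k| <= 32
    n_max, k_max = 4, 32
    lam = np.array([[np.sum(np.sqrt(dth) * np.exp(1j * (k * w - n * th))) * dw
                     for k in range(-k_max, k_max + 1)]
                    for n in range(-n_max, n_max + 1)]) / (2 * np.pi)
    gram = lam @ lam.conj().T                           # should be ~ identity
    print(np.max(np.abs(gram - np.eye(len(gram)))))     # small residual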
The coefficients of the warped series are obtained by projecting the Fourier coefficients over the set {λ_n}_{n∈Z}. Indeed, the completeness and orthonormality of the set {Λ_n}_{n∈Z} in L²([−π, π]) is equivalent to the completeness and orthonormality of the set {λ_n}_{n∈Z} in ℓ²(Z). In fact, the sequence of Fourier coefficients g[k] can be expanded over this set as in (21), and the expansion coefficients are exactly those of the warped Fourier series representing G(ω). Independently of the Fourier series representation, if f is a signal in ℓ²(Z) then (21) is equivalent to frequency warping the signal, and by means of (20) the warped signal is unwarped to obtain the original signal. In other words, the discrete-time unitary warping operator is characterized by the kernel

    w[k, n] = λ_n[k]

From (20) the inverse kernel (i.e., the kernel of the unwarping operator) is given by

    w⁻¹[k, n] = λ_k*[n]

and this, of course, coincides with the kernel of the adjoint warping operator obtained by exchanging the indices n and k in the complex conjugate of λ_n[k].

1b. Warped Fourier Transform

Similarly to the warped Fourier series, warped Fourier transform pairs can be introduced. Given F ∈ L¹(R), let

    f(t) = (1/2π) ∫_{−∞}^{+∞} F(ω) e^{jωt} dω

If Ŵ ∈ C¹(R) is a strictly increasing function mapping the real axis one-to-one onto itself (here again it suffices to assume that Ŵ′ = 0 on at most a countable set of points), we can perform the change of variable ω = Ŵ(Ω) in the last integral to obtain

    f(t) = (1/2π) ∫_{−∞}^{+∞} G₁(Ω) e^{jŴ(Ω)t} dΩ    (22)

where

    G₁(Ω) = (dŴ/dΩ) F(Ŵ(Ω))
Notice that G₁ ∈ L¹(R) since

    ∫_{−∞}^{+∞} |G₁(Ω)| dΩ = ∫_{−∞}^{+∞} |F(Ω)| dΩ < ∞
Suppose for the sake of simplicity that f ∈ L¹(R), so that F is continuous; then G₁(Ω) is continuous and

    G₁(Ω) = (dŴ/dΩ) ∫_{−∞}^{+∞} f(t) e^{−jŴ(Ω)t} dt    (23)

Hence, (23) and (22) constitute a warped Fourier transform pair. We have the following warped form of the Riemann–Lebesgue lemma.

Lemma III.2 (Generalized Riemann–Lebesgue) If F₁ ∈ L¹(R) and Ŵ ∈ C¹(R) is a strictly increasing function mapping the real axis one-to-one onto itself and

    f(t) = (1/2π) ∫_{−∞}^{+∞} F₁(Ω) e^{jŴ(Ω)t} dΩ    (24)

then f(t) is continuous and bounded on R and f(t) → 0 as |t| → ∞.

Proof. It suffices to perform the change of variable ω = Ŵ(Ω) in the integral in (24) and to apply the ordinary Riemann–Lebesgue lemma.

If F ∈ L²(R) then it is convenient to rewrite (22) as follows:

    f(t) = (1/2π) ∫_{−∞}^{+∞} √(dŴ/dΩ) G₂(Ω) e^{jŴ(Ω)t} dΩ
(25)
where the integral is in the Fourier–Plancherel sense and

    G₂(Ω) = √(dŴ/dΩ) F(Ŵ(Ω))

Notice that G₂ ∈ L²(R) and

    G₂(Ω) = √(dŴ/dΩ) ∫_{−∞}^{+∞} f(t) e^{−jŴ(Ω)t} dt    (26)
in the L²(R) sense. Thus, (26) and (25) constitute a warped Fourier transform pair. If G₂ = F[g₂] then g₂ is related to f via the warping operator W defined in (13):

    g₂(t) = [W f](t)
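A direct numerical rendering of Eq. (13) — resampling a sampled spectrum at the warped frequencies and weighting by the square root of the derivative — is sketched below. The particular map and the test spectrum are arbitrary choices; the point is that the L² norm is preserved up to discretization error.

    import numpy as np

    # Gamma stands for the warping characteristic: an increasing map of
    # [-pi, pi] onto itself, chosen arbitrarily for the illustration.
    Gamma = lambda w: np.sign(w) * np.pi * (np.abs(w) / np.pi) ** 0.8

    N = 4096
    w = np.linspace(-np.pi, np.pi, N, endpoint=False)
    dw = w[1] - w[0]
    F = np.exp(-8 * (w - 1.0) ** 2)                    # a test spectrum

    dGamma = np.gradient(Gamma(w), dw)                 # Gamma'(w), numerically
    WF = np.sqrt(dGamma) * np.interp(Gamma(w), w, F)   # Eq. (13)

    norm = lambda X: np.sqrt(np.sum(np.abs(X) ** 2) * dw)
    print(norm(F), norm(WF))                           # nearly equal (unitarity)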
C. Warped Multiresolution Approximation

In this section we specialize the definition of general multiresolution approximation provided in Section III.A to the case where the family of unitary "translation" operators forms a group of time-invariant operators, characterized by a smooth increasing phase function Ŵ(ω). In the sense of Definition III.2, this function serves as the warping characteristic of a warping operator. We show that the generalized shift operators are unitarily equivalent to ordinary shift operators via the associated warping operator.

First, we provide the following definition of warped multiresolution approximation:

Definition III.3 A warped multiresolution approximation (WMRA) is any realization of the GMRA in which the family F of unitary operators forms a group with generator T defined by the following action in the Fourier domain:

    F[T f](ω) = F(ω) e^{−jŴ(ω)},  ∀f ∈ L²(R)
(27)
where Ŵ(ω) is a strictly increasing and continuously differentiable real function called the warping characteristic. According to (13), the function Ŵ(ω) defines the warping operator associated to the WMRA.

Next, we show that generalized shift operators are unitarily equivalent to ordinary shift operators via the warping operator built on the phase function Ŵ(ω). We have the following lemma:

Lemma III.3 The warping operator W of a WMRA establishes a unitary equivalence of the generator T and the shift-by-one operator S; that is,

    T = W S W†

Proof. By definition,

    F[W† T f](ω) = √(dŴ⁻¹/dω) F(Ŵ⁻¹(ω)) e^{−jŴ(Ŵ⁻¹(ω))}
                 = √(dŴ⁻¹/dω) F(Ŵ⁻¹(ω)) e^{−jω} = F[S W† f](ω)
hence T = W S W†.

Remark III.3 From Lemma III.3 it immediately follows that

    T^m = W S^m W†,  ∀m ∈ Z
Therefore the group of generalized shift operators is unitarily equivalent to the group of ordinary shift operators. This property is preserved through scale via the scaled warping operator D_a^n W as follows:

    D_a^n T^m D_a^{−n} = D_a^n W S^m W† D_a^{−n},  ∀(n, m) ∈ Z²
D. Globally Warped Wavelets

Baraniuk and Jones (1993, 1995) have exploited unitary frequency warping in order to define warped wavelets. Their construction may be summarized as follows: the signal is unwarped by means of the inverse warping operator W†, then the expansion coefficients on a dyadic wavelet basis are computed. Reconstruction is achieved by applying the warping operator W to the wavelet expansion of the unwarped signal. Given an orthogonal and complete set of dyadic wavelets {ψ_{n,m}}_{n,m∈Z}, where

    ψ_{n,m}(t) = 2^{−n/2} ψ(2^{−n}t − m) = [D_{1/2}^n S^m ψ](t)    (28)

one defines the warped wavelets as follows:

    ψ̂_{n,m} = W ψ_{n,m}
where W is defined as in (13). The set {ψ̂_{n,m}}_{n,m∈Z} is orthogonal since

    ⟨W ψ_{n′,m′}, W ψ_{n,m}⟩ = ⟨ψ_{n′,m′}, W†W ψ_{n,m}⟩ = ⟨ψ_{n′,m′}, ψ_{n,m}⟩ = δ_{n′,n} δ_{m′,m}
and complete since, by unitary equivalence, given x ∈ L²(R) it is always possible to find y ∈ L²(R) such that x = W y. Hence, by expanding y over the dyadic wavelet set and exploiting the continuity of the warping operator, we have

    x(t) = W y(t) = W Σ_{n,m∈Z} y_{n,m} ψ_{n,m} = Σ_{n,m∈Z} y_{n,m} ψ̂_{n,m}(t)    (29)

where

    y_{n,m} = ⟨y, ψ_{n,m}⟩ = ⟨W†x, ψ_{n,m}⟩ = ⟨x, W ψ_{n,m}⟩ = ⟨x, ψ̂_{n,m}⟩

In the Fourier domain the warped wavelets are related to the dyadic wavelets as follows:

    Ψ̂_{n,m}(ω) = F[W ψ_{n,m}](ω) = √(Ŵ′(ω)) Ψ_{n,m}(Ŵ(ω)) = √(2^n Ŵ′(ω)) Ψ(2^n Ŵ(ω)) e^{−j2^n mŴ(ω)}    (30)
The warped wavelets are not simply generated by dilating and translating a mother wavelet. Rather, the "translated wavelets" are generated by all-pass
filtering e^{−j2^n mŴ(ω)} (i.e., by repeated action of the operator T^{2^n}), where T is given in (27). Scaling also depends on the warping map Ŵ(ω). However, frequency warped wavelets have the remarkable property that their essential frequency support can be arbitrarily assigned by proper choice of the map Ŵ(ω). Indeed, if the cutoff frequencies of dyadic wavelets are fixed at 2^{−n}π, the cutoff frequencies of the warped wavelets are

    ω_n = Ŵ⁻¹(2^{−n}π)

Genuine scale a wavelets, 0 < a < 1, may be generated by selecting as warping map an a-homogeneous function (as defined in Wornell and Oppenheim, 1992), satisfying

    Ŵ(aω) = (1/2) Ŵ(ω)
(31)
The example shown in Figure 1 considers the function

    Ŵ(ω) = (πω/|ω|) 2^{−log_a(|ω|/π)}

as a dyadic-to-scale-a conversion function. Notice that (31) denotes a partial form of self-similarity of the warping map. Repeated application of (31) shows that

    Ŵ(a^{−n}ω) = 2^n Ŵ(ω)
(32)
Differentiating both sides of (32) with respect to ω, we obtain

    Ŵ′(a^{−n}ω) = (2a)^n Ŵ′(ω)    (33)

Figure 1. Example of a-homogeneous warping map.
Substituting (32) and (33) in (30), we obtain

    Ψ̂_{n,m}(ω) = a^{−n/2} Ψ̂_{0,m}(a^{−n}ω) = a^{−n/2} Ψ̂(a^{−n}ω) e^{−jmŴ(a^{−n}ω)}

which shows that

    ψ̂_{n,m} = D_a^n T^m ψ̂

(i.e., warped wavelets generated by an a-homogeneous warping map are obtained by generalized shift and dilation of a unique warped mother wavelet ψ̂). Without loss of generality we can assume that Ŵ(π) = π. In that case the cutoff frequencies of the scale a warped wavelets are fixed at

    ω_n = a^n π
If a > 1/2 then the warped wavelets achieve a finer frequency resolution than the one obtained by ordinary dyadic wavelets. However, as already pointed out, there is no exact algorithm for computing the globally warped wavelet transform.
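The two properties just derived are easy to confirm numerically for the map of Figure 1. In the sketch below the value of a is an arbitrary illustrative choice.

    import numpy as np

    a = 0.6

    def Gamma(w):
        # The a-homogeneous map of Figure 1 (Gamma stands for the warping map)
        w = np.asarray(w, dtype=float)
        return np.sign(w) * np.pi * 2.0 ** (-np.log(np.abs(w) / np.pi) / np.log(a))

    w = np.linspace(0.1, np.pi, 50)
    print(np.max(np.abs(Gamma(a * w) - 0.5 * Gamma(w))))   # ~ 0: Eq. (31)
    n = np.arange(5)
    print(Gamma(a ** n * np.pi) / np.pi)                   # -> 2**(-n): cutoffs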
IV. From WMRA and Warped Scaling Functions to Warped QMFs

In this section we begin our construction of dyadic wavelet bases by iterated frequency warping. First we show that, given a Riesz basis in the WMRA, one can find a unitarily equivalent basis spanning a shift-invariant subspace of L²(R). This auxiliary basis and its associated space are easier to deal with and will lead to the definition of the auxiliary scaling function. Next we find a frequency-domain condition equivalent to the orthogonality of warped Riesz bases and we obtain warped forms of the two-scale equation. This equation leads to warped quadrature mirror filters (QMFs) obtained by discrete-time warping a classical QMF pair. In order to obtain a genuine QMF pair one needs an extra requirement on the warping maps, which guarantees that the frequency responses of the filters are 2π-periodic, hence that they do correspond to discrete-time filters. This result is formalized in Theorem IV.1.
A. Warped Riesz Bases

Among the axioms defining the GMRA and the WMRA (see Sections III.A and III.C) one postulates the existence of a warped Riesz basis obtained by repeated application of the generalized shift operator to a unique function. The next proposition shows that this Riesz basis is unitarily equivalent to an auxiliary
Riesz basis obtained by repeated applications of the ordinary shift operator. This basis spans a shift-invariant subspace of L²(R) unitarily equivalent to the space V₀ spanned by the warped Riesz basis, as shown in the next proposition.

Proposition IV.1 Let W be the warping operator associated to a WMRA with generator T. The family {T^m ζ(t)}_{m∈Z} is a Riesz basis for the space

    V₀ = span{T^m ζ | m ∈ Z} ⊆ L²(R)

if and only if the family {ζ̃(t − m)}_{m∈Z}, where ζ̃ = W†ζ, is a Riesz basis for the space

    Ṽ₀ = span{ζ̃(t − m) | m ∈ Z} ⊆ L²(R)

Furthermore, the family {T^m ζ(t)}_{m∈Z} is orthonormal if and only if the family {ζ̃(t − m)}_{m∈Z} is orthonormal.
Proof. Let ζ_m = T^m ζ = W S^m W† ζ, where the last equality follows from Lemma III.3, and let ζ̃_m(t) = S^m ζ̃(t) = ζ̃(t − m) = [W† ζ_m](t). If {ζ_m(t)}_{m∈Z} is a Riesz basis for V₀, then the functions ζ_m(t) are linearly independent and there exist two positive constants A and B such that for any f ∈ V₀ one can find a sequence of coefficients c[m] satisfying

    f(t) = Σ_{m=−∞}^{+∞} c[m] ζ_m(t)

in L²(R), with

    (1/B) ‖f‖² ≤ Σ_{m=−∞}^{+∞} |c[m]|² ≤ (1/A) ‖f‖²    (34)
Since W is unitary, ‖W† f‖ = ‖f‖ and

    (1/B) ‖W† f‖² ≤ Σ_{m=−∞}^{+∞} |c[m]|² ≤ (1/A) ‖W† f‖²    (35)
Moreover, W† is continuous. Let {f_n} be a Cauchy sequence in V₀ considered as a subspace of L²(R). Then ∀ε > 0, ∃N ∈ N such that ‖f_n − f_m‖ < ε, ∀n, m > N. Since ‖f_n − f_m‖ = ‖W† f_n − W† f_m‖, it follows that {W† f_n} is a Cauchy sequence in Ṽ₀, whose limit is in Ṽ₀ since Ṽ₀ is a closed subspace of L²(R). Also,

    [W† f](t) = Σ_{m=−∞}^{+∞} c[m] [W† ζ_m](t) = Σ_{m=−∞}^{+∞} c[m] ζ̃_m(t)    (36)

in L²(R).
Since W is unitary, the functions {ζ̃_m(t)}_{m∈Z} are linearly independent if and only if the functions {ζ_m(t)}_{m∈Z} are linearly independent. By (35) and (36), if f ∈ V₀ ⊆ L²(R) then W† f ∈ Ṽ₀. It follows that if {ζ_m(t)}_{m∈Z} is a Riesz basis for V₀ then {ζ̃_m(t)}_{m∈Z} is a Riesz basis for Ṽ₀. Conversely, by acting on Ṽ₀ with W, one can show with a similar argument that {ζ_m(t)}_{m∈Z} is a Riesz basis for V₀. From Lemma III.3 it follows that
    ⟨S^k ζ̃, S^m ζ̃⟩ = ⟨S^k W†ζ, S^m W†ζ⟩ = ⟨W S^k W†ζ, W S^m W†ζ⟩ = ⟨T^k ζ, T^m ζ⟩

Hence the family {T^m ζ}_{m∈Z} is orthonormal if and only if the family {S^m ζ̃}_{m∈Z} is orthonormal.

The next proposition provides equivalent Fourier domain conditions for a function to generate an orthogonal warped Riesz basis.

Proposition IV.2 Let T be the generalized shift generator of a WMRA with warping characteristic Ŵ(ω) and scale factor a. A function ζ(t) generates a Riesz basis {T^m ζ(t)}_{m∈Z} for the space V₀ = span{T^m ζ | m ∈ Z} if and only if there exist two constants A > 0 and B > 0 such that for almost all ω ∈ [−π, π]

    1/B ≤ Σ_{k=−∞}^{+∞} |Z̃(ω + 2kπ)|² ≤ 1/A    (37)

where Z̃ = F[ζ̃] = F[W†ζ]. Furthermore, let {T^m ζ(t)}_{m∈Z} be a Riesz basis for V₀ and let

    Φ̃(ω) = F[ϕ̃](ω) = Z̃(ω) / √( Σ_{k=−∞}^{+∞} |Z̃(ω + 2kπ)|² )
Then {D_a^n T^m ϕ̂}_{m∈Z}, where ϕ̂ = Wϕ̃, is an orthonormal basis for V_n for any n ∈ Z.
+∞
m=−∞
c[m]ζm (t)
(38)
DYADIC WARPED WAVELETS
93
there corresponds a g(t) = [W † f ](t) with the following decomposition: [W † f ](t) =
+∞
c[m]ζ÷m (t)
(39)
m=−∞
The two decompositions (38) and (39) share the same coefÞcients. The Fourier transform of the last equation yields
where Z÷ = F[W ζ ] and
F[W † f ](ω) = C(ω) Z÷(ω)
(40)
†
C(ω) =
+∞
c[m]e− jmω
m=−∞
is a 2π -periodic Fourier series. Since W is unitary, we have +∞ 1 2 † 2 |C(ω)|2 | Z÷(ω)|2 dω f = W f = 2π −∞ +∞ +π 1 |C(ω)|2 | Z÷(ω + 2kπ)|2 dω = 2π k=−∞ −π
The expressions
+K
k=−K
and
+π
−π
+π −π
|C(ω)|2 | Z÷(ω + 2kπ)|2 dω
|C(ω)|2
+K
k=−K
| Z÷(ω + 2kπ)|2 dω
either diverge or both converge to the same limit as K → ∞. If (37) is satisÞed and f ∈ V0 ⊆ L2 (R) then they both converge and +π +∞ 1 2 A f ≤ |c[m]|2 ≤ B f 2 (41) |C(ω)|2 dω = 2π −π m=−∞
for any sequence c[m] satisfying (39). Any nonzero sequence c[m] obtains f > 0 (i.e., f = 0). Therefore the elements of {T m ζ (t)}m∈Z are linearly independent and they form a Riesz basis. Conversely, if {T m ζ (t)}m∈Z is a Riesz basis then (41) is satisÞed for any sequence c[m] in ℓ2 (Z). If (37) is not veriÞed then one can construct a nonzero 2π-periodic function C(ω) such that (37) is not veriÞed on a subset of [−π, π ]
94
GIANPAOLO EVANGELISTA
of nonzero measure. Hence (41) is not veriÞed, which contradicts the hypothesis that {T m ζ (t)}m∈Z is a Riesz basis. Let {ζm }m∈Z , where ζm = T m ζ , be a Riesz basis for V0. We want to Þnd a ö m∈Z are orthonormal for any n ∈ Z. Since ϕö such that the families {Dan T m ϕ} ö m∈Z is orthonormal the scaling operator Da is unitary then the family {Dan T m ϕ} ö m∈Z is orthonormal. It sufÞces to construct an if and only if the family {T m ϕ} orthonormal basis for V0. In order to accomplish this we expand ϕö ∈ V0 over {ζm }m∈Z : ϕ(t) ö =
+∞
b[m]ζm (t)
m=−∞
To the function ϕ(t) ö there corresponds the function †
ö = ϕ(t) ÷ = [W ϕ](t)
+∞
b[m]ζ÷m (t)
m=−∞
ö m∈Z is orthonormal if and only if the By Proposition IV.1 the family {T m ϕ} family {ϕ(t ÷ − m)}m∈Z is orthonormal. By performing the same calculations for deriving (40) we obtain ÷ ÷ = F[W † ϕ](ω) ö = B(ω) Z÷(ω) (ω) = F[ϕ](ω)
(42)
where B(ω) = is 2π-periodic. Since
+∞
b[m]e− jmω
m=−∞
ϕ(t ÷ − m), ϕ(t ÷ − r ) = ϕ(t), ÷ ϕ(t ÷ + m − r ) then, by Plancherel theorem, the set {ϕ(t ÷ − m)}m∈Z is orthonormal if and only if +∞ 1 2 − jkω ÷ ÷ ϕ(t ÷ + k) = |(ω)| e dω δk,0 = ϕ(t), 2π −∞ +∞ (2m+1)π 1 2 − jkω ÷ |(ω)| e dω = 2π m=−∞ (2m−1)π +∞ +π 1 ÷ + 2mπ)|2 e− jkω dω = |(ω (43) 2π m=−∞ −π Suppose that {ϕ(t ÷ − m)}m∈Z is orthonormal. By the monotone convergence
DYADIC WARPED WAVELETS
95
theorem applied to the partial sums +M
m=−M
÷ + 2mπ)|2 |(ω
we obtain 1=
+π +∞ +π +∞ 1 ÷ + 2mπ)|2 dω ÷ + 2mπ)|2 dω = 1 |(ω |(ω 2π m=−∞ −π 2π −π m=−∞
so that 1 2π
+π
−π
+∞
m=−∞
÷ + 2mπ)|2 dω < ∞ |(ω
Also, " " +M +∞ " " " " 2 − jkω ÷ + 2mπ)| e ÷ + 2mπ)|2 |(ω |(ω " "≤ "m=−M " m=−∞
Therefore by the dominated convergence theorem we can interchange the sum and integral signs in the last member of (43), obtaining +π +∞ 1 ÷ + 2mπ )|2 e− jkω dω δk,0 = |(ω (44) 2π −π m=−∞ which implies +∞
m=−∞
÷ + 2mπ )|2 = 1 a.e. |(ω
(45)
Conversely, if (45) is veriÞed then (44) holds and we can apply the dominated convergence theorem to the partial sums +M
m=−M
÷ + 2mπ)|2 e− jkω |(ω
to obtain (43) so that the set {ϕ(t ÷ − m)}m∈Z is orthonormal. By (42) since B(ω) is 2π-periodic then +∞
m=−∞
÷ + 2mπ)|2 = |B(ω)|2 |(ω
+∞
m=−∞
| Z÷(ω + 2mπ)|2
(46)
96
GIANPAOLO EVANGELISTA
By (37), the quantity R(ω) =
+∞
m=−∞
| Z÷(ω + 2mπ)|2 > 0
If in (42) we let B(ω) =
+
1 R(ω)
(47)
then C(ω) is 2π-periodic with Þnite energy and (45) is veriÞed so that the ÷ − m)}m∈Z is set {ϕ(t ÷ − m)}m∈Z is orthonormal. Conversely, if the set {ϕ(t orthonormal then (45) is veriÞed and from (46) we obtain (47).
B. Warped Scaling Functions Like ordinary scaling functions, warped scaling functions are strategical for the construction of warped wavelet bases. Based on our results for warped Riesz bases we can provide the following deÞnition of warped scaling function: DeÞnition IV.1 A scaling function of the WMRA is a function ϕö ∈ L2 (R) ö m∈Z are orthonormal for any n ∈ Z. such that the families {Dan T m ϕ} Remark IV.1 From the proof of Proposition IV.2 it follows that ϕö ∈ L2 (R) is a scaling function if and only if +∞
m=−∞
÷ + 2mπ)|2 = 1 a.e. |(ω
(48)
ö It follows from (14) that an equivalent condition is where ϕ÷ = W † ϕ. " +∞ dŴ −1 "" ö −1 (ω + 2mπ))|2 = 1 a.e. |(Ŵ dα "α=ω+2mπ m=−∞ C. Warped Two-Scale Equations Suppose that ϕö ∈ L2 (R) is a scaling function of a WMRA and let ϕön,m = ö Consider the coefÞcients of the orthogonal projection of f ∈ L2 (R) Dan T m ϕ. over the basis {ϕön+1,m }m∈Z of Vn+1: f n+1 [m] = f, ϕön+1,m
(49)
97
DYADIC WARPED WAVELETS
We are looking for an algorithm to compute f n+1 [m] from the coefÞcients f n [k] = f, ϕön,k of the orthogonal projection of f ∈ L2 (R) over the basis {ϕön,m }m∈Z of Vn. The GMRA property (3) ensures that Vn+1 ⊂ Vn . Hence ϕön+1,m (t) ∈ Vn+1 has the following expansion over {ϕön,m }m∈Z : ϕön+1,m (t) =
+∞
ϕön+1,m , ϕön,k ϕön,k (t)
(50)
k=−∞
From DeÞnition IV.1 and the fact that the dilation operator Da is unitary it follows that the scalar product in (50) does not depend on the scale index n: , ö Dan T k ϕö = Da T m ϕ, ö T k ϕ ö ≡ bm [k] (51)
ϕön+1,m , ϕön,k = Dan+1 T m ϕ, Substituting (50) in (49) we obtain +∞ +∞ +∞ bm∗ [k] f, ϕön,k = bm∗ [k] f n [k] bm [k]ϕön,k = f n+1 [m] = f, k=−∞
k=−∞
k=−∞
(52)
Therefore, our algorithm for iteratively computing the projection coefÞcients does not depend on scale and is based on operations on sequences only. In particular, the coefÞcients b0 [k] relate a dilation by a of the scaling function to its generalized shifts obtained by repeated application of the operator T: +∞ √ a ϕ(at) ö = [Da ϕ](t) ö = b0 [k][T k ϕ](t) ö
(53)
k=−∞
Equation (53), which generalizes the two-scale equation of ordinary MRA schemes, is called the (time domain) warped two-scale equation. More generally, we have ö = [Da T m ϕ](t)
+∞
bm [k][T k ϕ](t) ö
(54)
k=−∞
By Fourier transforming both sides of (54) and taking (27) and (12) into account, we obtain 1 ö . ω / − jmŴ(ω/a) ö e = Bm (Ŵ(ω))(ω) (55) √ a a where
Bm (ω) =
+∞
k=−∞
bm [k]e− jkω
98
GIANPAOLO EVANGELISTA
is a 2π-periodic function. Its argument is only relevant modulo 2π: Bm (Ŵ(ω)) = Bm ( Ŵ(ω)2π ) Applying Lemma III.3 and using the fact that the warping operator W is unitary we obtain from (51) ÷ S m ϕ, ÷ W S k ϕ ö = W † Da W S m ϕ, ÷ S k ϕ ÷ = W ÷ S k ϕ ÷ bm [k] = Da W S m ϕ, where ϕ÷ = W † ϕö and we deÞned
÷ ≡ W † Da W W
(56)
ϒ(ω) ≡ Ŵ(a −1 Ŵ −1 (ω))
(57)
÷ has Fourier domain action The operator W ( ÷ g](ω) = dϒ G(ϒ(ω)) F[ W dω where
Like the function Ŵ(ω), the map ϒ(ω) is increasing and continuously differen÷ is a well-deÞned unitary warping operator. Furthermore, tiable. Therefore W ( dϒ ÷ m − jmϒ(ω) ÷ S ϕ](ω) ÷ ϕ](ω)e (ϒ(ω))e− jmϒ(ω) F[ W ÷ = F[ W ÷ = dω In other words, by deÞning the generalized shift operator T÷ with Fourier domain action F[T÷g](ω) ≡ G(ω)e− jϒ(ω)
for any m ∈ Z, we have
÷ m = T÷m W ÷ WS
(58)
The warped two-scale equation (54) induces a similar equation for ϕ. ÷ In fact, by substituting for ϕö = W ϕ÷ and by applying the operator W † on both sides of (54) we obtain ÷ = [W † Da T m W ϕ](t)
+∞
bm [k][W † T k W ϕ](t) ÷
k=−∞
which, from Lemma III.3, (56), and (58), becomes ÷ ϕ](t) [T÷m W ÷ =
+∞
k=−∞
bm [k]ϕ(t ÷ − k)
(59)
DYADIC WARPED WAVELETS
99
or, in the Fourier domain, ( dϒ ÷ ÷ (60) (ϒ(ω))e− jmϒ(ω) = Bm (ω)(ω) dω Equation (59) relates a warped and shifted version of ϕ÷ to its integer translates. The scaling factor is embedded in the map ϒ(ω) given in (57). Therefore (59) is implicitly another form of the two-scale equation. By taking the magnitude square of (60) and exploiting the 2π-periodicity of Bm (ω) and orthogonality condition (45), we obtain " +∞ dϒ "" 2 ÷ |Bm (ω)| = |(ϒ(ω + 2kπ))|2 (61) " dα α=ω+2kπ k=−∞ The right-hand side does not depend on m. Thus, the functions Bm (ω) differ at most for a phase factor. A consequence of (60) is that
÷ on the set {ω|(ω) = 0}.
Bm (ω) = e− jmϒ(ω) B0 (ω)
(62)
D. Warped Quadrature Mirror Filters A discrete-time quadrature mirror Þlter (QMF) is characterized by a 2πperiodic frequency response H (ω) satisfying the following power complementarity relationship: ∀ω ∈ R
|H (ω)|2 + |H (ω + π)|2 = 2
QMFs play an essential role in both the construction of dyadic wavelets and the computation of signal expansion over dyadic wavelet bases. In this section we show that the warped two-scale equation relating warped scaling functions at two different scales leads to a warped form of QMF. We will constrain the warped frequency response to be 2π-periodic (i.e., to represent a discrete-time Þlter). By applying the operator W÷† , deÞned in (56), on both sides of (59), we can rewrite the warped two-scale equation for ϕ÷ with m = 0 in the Fourier domain as follows: ( dϒ −1 ÷ −1 (ω)) ÷ B0 (ϒ −1 (ω))(ϒ (63) (ω) = dω Let ( dθ −1 B0 (θ −1 (ω)) H (ω) ≡ dω
100
GIANPAOLO EVANGELISTA
where θ(ω) = so that θ −1
ϒ(ω) 1 = Ŵ(a −1 Ŵ −1 (ω)) 2 2
(64)
.ω/
= ϒ −1 (ω) = Ŵ(aŴ −1 (ω)) 2 With this substitution, (63) becomes 1 . ω / ÷ . −1 . ω // ÷ θ (ω) =√ H 2 2 2
(65)
(66)
Except for the presence of the mapping θ −1 , this form of the two-scale equation is similar to that obtained from ordinary MRA. Repeated application of (66) ÷ leads to the following expression for (ω) in terms of a product: / // . . . 1 ω ÷ −1 ω ÷ θ (ω) =√ H 2 2 2 % & % && % / . . 1 −1 ω / ÷ −1 1 −1 . ω / ω 1 H θ θ θ = H 2 2 2 2 2 2 n−1 H 21 Ŵ(a k Ŵ −1 (ω)) N −1 ÷ Ŵ (ω))) = ··· = (Ŵ(a √ 2 k=0 where we used (65) and the fact that ϒ −1 ◦ ϒ −1 ◦ · · · ◦ ϒ −1 (ω) = Ŵ(a N Ŵ −1 (ω)) 0 12 3 N times
÷ If Ŵ(0) = 0 and if (ω) is continuous at ω = 0 then
N −1 ÷ ÷ Ŵ (ω))) = (0) lim (Ŵ(a
N →∞
and ÷ ÷ (ω) = (0)
∞ H k=0
1
2
Ŵ(a k Ŵ −1 (ω)) √ 2
(67)
The following theorem relates the warped scaling function with QMFs. In order to ensure that H (ω) is 2π-periodic we need an extra condition on the map ϒ(ω), given in (70) and (71). Theorem IV.1 Let ϕö ∈ L2 (R) ∩ L1 (R) be a scaling function associated with a WMRA with scale factor a ∈ (0, 1) and increasing and continuously
DYADIC WARPED WAVELETS
101
differentiable warping characteristic Ŵ(ω) satisfying the following conditions: (odd parity)
Ŵ(ω) = − Ŵ(ω)
(68)
Ŵ(π) = π
(69)
ϒ(ω) ≡ Ŵ(a −1 Ŵ −1 (ω)) = 2θ(ω)
(70)
θ (ω + 2kπ) = θ(ω) + 2kπ
(71)
and
where
Let ö T k ϕ ö b0 [k] = Da ϕ, and B0 (ω) = Then the 2π-periodic function H (ω) ≡
+∞
b0 [k]e− jkω
k=−∞
(
dθ −1 B0 (θ −1 (ω)) dω
(72)
satisÞes ∀ω ∈ R
|H (ω)|2 + |H (ω + π)|2 = 2
and H (0) =
√
2
Proof. From (71) and (70) it follows that θ ′ (ω) is 2π-periodic and ϒ ′ (ω) = 2θ (ω). Substituting this result in (61) we obtain ′
|B0 (ω)|2 = 2 where ϕ÷ = W † pt ϕ. ö Therefore |H (ω)|2 =
+∞ dθ ÷ |(2θ(ω) + 4kπ)|2 dω k=−∞
+∞ dθ −1 ÷ |B0 (θ −1 (ω))|2 = 2 |φ(2ω + 4kπ)|2 dω k=−∞
Hence, from (48) we have
|H (ω)|2 + |H (ω + π)|2 = 2
+∞
k=−∞
÷ |(2ω + 2kπ)|2 = 2
102
GIANPAOLO EVANGELISTA
From (66), (68), and (70) it follows that ÷ −1 (0)) = √1 H (0)(0) ÷ ÷ = √1 H (0)(θ (73) (0) 2 2 √ ÷ = 0. We are Therefore, to show that H (0) = 2 it sufÞces to show that (0) ÷ going to show that the GMRA property (8) implies |(0)| = 1. Consider the orthogonal projection of f ∈ L2 (R) over Vn : PVn f =
+∞
f, ϕön,m ϕön,m
(74)
m=−∞
Property (8) of GMRA is equivalent to 42 4 lim 4 f − PVn f 4 = 0 n→−∞
By Plancherel theorem this is in turn equivalent to 4 42 lim 4F[ f ] − F PVn f 4 = 0 n→−∞
(75)
In particular, select an f ∈ L2 (R) such that F(ω) = F[ f ](ω) = 0 is bounded and has compact support included in [−π, π] (e.g., pick f (t) = sin c(t)). Let gn [m] = f, ϕön,m
(76)
then +∞ ö n,m (ω) gn [m] F PVn f (ω) =
(77)
m=−∞
where
+∞ 1 ö.ω/ n = √ gn [m]e− jmŴ(ω/a ) a n m=−∞ an . . ω // 1 ö.ω/ Ŵ n =√ G n an a an
G n (ω) =
+∞
gn [m]e− jmω
(78)
m=−∞
is a 2π-periodic function represented by the Fourier series on the right-hand side. By (76), (74), and BesselÕs inequality, +∞
m=−∞
|gn [m]|2 ≤ f 2 < ∞
DYADIC WARPED WAVELETS
103
then the series (78) converges in L2 ([−π, π ]). Notice that 1
F[ f ], F[ϕön,m ] (79) 2π +π . / 1 ö ∗ ω e jmŴ(ω/a n ) dω = F(ω) √ an 2π a n −π √ +Ŵ(π/a n ) an dŴ −1 ö ∗ (Ŵ −1 ())e jm d F(a n Ŵ −1 ()) = 2π −Ŵ(π/a n ) d
gn [m] = f, ϕön,m =
where the last equality was obtained by performing the variable change = Ŵ(ω/a n ). Since for n ≤ 0 .π / Ŵ n ≤ Ŵ(π) = π a
then
G n (ω) = ⎧ . / −1 ⎪ ⎨ √a n dŴ F(a n Ŵ −1 (ω)) ö ∗ (Ŵ −1 (ω)) if |ω| < Ŵ π dω . π / an ⎪ ⎩0 if Ŵ n < |ω| < π a
(n ≤ 0) (80)
One can check that G n (ω) ∈ L2 ([−π, π ]). In fact, there exists a B > 0 such that ( dŴ −1 |F(a n Ŵ −1 (ω))| ≤ B dω on |ω| ≤ Ŵ(π/a n ). Therefore dŴ −1 ö −1 |(Ŵ (ω))|2 dω 2 ÷ = a n B 2 |(ω)| |ω| ≤ π
|G n (ω)|2 ≤ a n B 2
÷ and (ω) ∈ L2 (R), hence its restriction to [−π, π ] is in L2 ([−π, π ]). From (77) and (80) we obtain F[PVn f ](ω) =
" . . ω //"2 n 2 ö F(ω)|(ω/a )| "÷ " = F(ω) " Ŵ n " ′ n Ŵ (ω/a ) a
(81)
ö ∈ L∞ (R) ∩ C0 (R) and Ŵ(ω) ∈ C1 (R) with But ϕö ∈ L2 (R) ∩ L1 (R), thus (ω)
104
GIANPAOLO EVANGELISTA
÷ Ŵ ′ (ω) = 0, therefore (Ŵ(ω)) is continuous and ÷ 2 lim F[PVn f ](ω) = F(ω)|(0)|
n→−∞
On the other hand, from (75) and (81) we have " ". . //" "2 +π " 2" " " 2" ÷" Ŵ ω "" " dω = 0 |F(ω)| "1 − "" lim " n n→−∞ −π a
÷ From (48) it follows that |(ω)| ≤ 1 a.e., hence " " . . ω // "2 ""2 " " "÷ |F(ω)|2 ""1 − " Ŵ n " "" ≤ F(ω)2 a
(82)
a.e.
and, by applying the dominated convergence theorem to (82), we obtain " +π " . . ω //"2 ""2 " "÷ " 2 0= Ŵ n " "" dω |F(ω)| lim ""1 − " n→−∞ a −π +π " " ÷ 2 "2 dω = |F(ω)|2 "1 − |(0)| −π
÷ 2 = 1 and (73) yields H (0) = which gives |(0)|
√ 2.
Remark IV.2 The previous theorem shows that the Fourier domain two-scale equations (55) (60) are governed by 2π-periodic functions Bm (ω) satisfying the following power complementarity relationship: " dθ −1 dθ −1 "" −1 2 ∀ω ∈ R |Bm (θ (ω))| + |Bm (θ −1 (ω + π))|2 = 2 dω dα "α=ω+π
that is, they are warped QMF, where by (62) and (64) we have Bm (ω) = e− j2mθ (ω)
V. From Warped QMF to Warped Scaling Functions and WMRA In this section we reverse the steps undertaken in Section IV. We establish sufÞcient conditions for the warping map and warped QMF to generate by generalized shift an orthogonal warped scaling functions system in L2(R), via the inÞnite product (67). For this product to converge and deÞne a genuine scaling function we need extra requirements on the warping map and on the QMF, as formalized in Theorem V.1.
DYADIC WARPED WAVELETS
105
A. L2 (R) Orthogonality of the Scaling Function System The following theorem provides a sufÞcient condition for a QMF to generate a genuine scaling function. The technical condition (90) on the QMF is classical in wavelet analysis and it can be generalized to a necessary and sufÞcient condition (Cohen, 1990). For the sake of simplicity we shall refrain from taking this step. We found that an additional technical condition for the convergence of the inÞnite product (91) can be stated in terms of a linear bound on the elementary maps θ(ω) associated to the warping map, as shown in the next theorem. Theorem V.1 Let Ŵ(ω) ∈ C1 (R) be a monotonically increasing function such that Ŵ(ω) = −Ŵ(−ω)
(odd parity)
(83)
and Ŵ(π) = π
(84)
Suppose that there exists a ∈ (0, 1) such that ϒ(ω) ≡ Ŵ(a −1 Ŵ −1 (ω)) = 2θ(ω)
(85)
θ (ω + 2kπ) = θ(ω) + 2kπ
(86)
where
and that there exists A ∈ (0, 1) such that ∀ω ∈ R .ω/ ≤ A|ω| θ −1 2
(87)
Let H (ω) be a 2π-periodic function continuously differentiable in a neighborhood of ω = 0 and satisfying the following requirements: ∀ω ∈ R
|H (ω)|2 + |H (ω + π)|2 = 2 √ H (0) = 2
(88) (89)
and inf
|ω| 0
then the set of functions ÷ n S m ϕ} {W ÷ m∈Z
(90)
106
GIANPAOLO EVANGELISTA
where ÷ ÷ = (ω) = F[ϕ](ω)
∞ H k=0
1 2
Ŵ(a k Ŵ −1 (ω)) √ 2
and ÷ ϕ](ω) F[ W ÷ = F[W † Da W ϕ](ω) ÷ =
)
(91)
÷ ϒ ′ (ω)(ϒ(ω))
is orthonormal in L2 (R) for any n ∈ Z. Equivalently, the set of functions 5 n m 6 Da T ϕö m∈Z where T acts in the Fourier domain by multiplication by e− jŴ(ω) and ) ö ÷ (ω) = F[ϕ](ω) ö = F[W ϕ](ω) ÷ = Ŵ ′ (w)(Ŵ(ω))
(92)
is orthonormal in L2 (R) for any n ∈ Z. Proof. Consider the functions
÷n (ω) ≡ (ω)χ ÷ [−2n π,2n π ] (ω) where χ[a,b] (ω) =
1 0
(93)
if ω ∈ [a, b] otherwise
is the characteristic function of the interval [a, b]. DeÞne ϒk−1 (ω) ≡ Ŵ(a k Ŵ −1 (ω)) = ϒ −1 ◦ ϒ −1 ◦ · · · ◦ ϒ −1 (ω) 0 12 3
(94)
k times
with
ϒ −1 (ω) = ϒ1−1 (ω) = Ŵ(aŴ −1 (ω)) = θ −1 and
ϒ0−1 (ω)
DeÞne 1 In [m] = 2π
.ω/ 2
= ω. From (94), (91), and (93) we can write n−1 H 21 ϒk−1 (ω) ÷ n (ω) = χ[−2n π,2n π] (ω) √ 2 k=0
÷n (ω)|2 e jmω dω = 1 | 2π
+2n π
−2n π
e
jmω
" 1 −1 "2 n−1 " H ϒk (ω) " 2
k=0
2
dω (95)
DYADIC WARPED WAVELETS
107
We are going to prove by induction that In [m] = δm,0 . First consider +2π |H (ω/2)|2 1 I1 [m] = dω e jmω 2π −2π 2 and split the integral on the right-hand side into two integrals: % 0 2π " . ω /"2 & " . ω /"2 1 " " jmω " jmω " e "H e "H I1 [m] = " dω + " dω 4π 2 2 0 −2π
Then perform the change of variable ω′ = ω + 2π in the Þrst integral and combine the two integrals back together to obtain %" . 2π /"2 " . ω /"2 & 1 ω " " " jmω " e − π " + "H I1 [m] = " dω "H 4π 0 2 2 By condition (88) the terms in parentheses sum up to 2, hence, 2π 1 I1 [m] = e jmω dω = δm,0 2π 0
as required. Consider now In [m] for n > 1. By splitting the integral in (95) into two integrals we have " 1 −1 "2 0 n−1 " H 2 ϒk (ω) " 1 jmω dω e In [m] = 2π −2n π 2 k=0 " 1 −1 "2 +2n π n−1 " H 2 ϒk (ω) " 1 jmω dω e + 2π 0 2 k=0
If we perform the change of variable ω′ = ω + 2n π in the Þrst integral and combine the two integrals back together we obtain " 1 −1 "2 2n π n−1 " H 2 ϒk (ω − 2n π) " 1 jmω In [m] = e 2π 0 2 k=0 " " 2 n−1 " H 12 ϒk−1 (ω) " dω + 2 k=0
From (86) we obtain
θ(θ −1 (ω) + 2r π) = θ (θ −1 (ω)) + 2r π = ω + 2r π thus θ −1 (ω + 2r π) = θ −1 (ω) + 2r π
108
GIANPAOLO EVANGELISTA
hence ϒ −1 (ω + 22r π) = ϒ −1 (ω) + 2r π In particular, for any s ≥ 0
ϒ −1 (ω − 2s+2 π) = ϒ −1 (ω) − 2s+1 π
and for any n > k −1 −1 −1 ϒ2 (ω) − 2n−2 π (ϒ −1 (ω) − 2n−1 π) = ϒk−2 ϒk−1 (ω − 2n π) = ϒk−1 −1 = · · · = ϒ1−1 ϒk−1 (ω) − 2n−k+1 π = ϒk−1 (ω) − 2n−k π
Consequently, since H (ω) is 2π-periodic, we can write &"2 " % &"2 & %" % 2n π " " " " 1 −1 1 −1 1 jmω " " " H ϒn−1 (ω) − π) " + " H ϒn−1 (ω) "" e In [m] = " 4π 0 2 2 " " 2 n−2 " H 12 ϒk−1 (ω) " dω × 2 k=0 By condition (88) the terms in parentheses sum up to 2, hence, " 1 −1 "2 2n−1 π n−2 " H 2 ϒk (ω) " 1 e jmω dω = In−1 [m] In [m] = 2π −2n−1 π 2 k=0 By induction, since we proved that I1 [m] = δm,0 then In [m] = δm,0
∀n ≥ 1
(96)
The product ∞ H k=0
1 2
ϒk−1 (ω) √ 2
is pointwise convergent for any ω ∈ R and uniformly convergent on compact sets. In order to see this, we remark that the absolute convergence of the product is equivalent to the convergence of the sums " " 1 −1 ∞ " " " H 2 ϒk (ω) " − 1" √ " " " 2 k=0
Since H√ (ω) is continuously differentiable in a neighborhood of ω = 0 and H (0) = 2 then there exist ε > 0 and C0 > 0 such that " " " H (ω) " " √ − 1" ≤ C0 |ω| " 2 "
DYADIC WARPED WAVELETS
109
whenever |ω| < ε. By condition (87) we have |ϒ −1 (ω)| ≤ A|ω| and ∀k ≥ 0 " −1 " "ϒ (ω)" ≤ Ak |ω| with A < 1 (97) k √ From (88) it follows that |H (ω)| ≤ 2. Given ω ∈ R, pick k0 such that Ak0 |ω| ≤ ε, then " " " " 1 −1 1 −1 k ∞ " " " 0 −1 " " H 2 ϒk (ω) " H 2 ϒk (ω) " " − 1" + − 1" √ √ " " " " " " 2 2 k=k0 k=0 ≤ 2k0 +
∞ " ∞ " C0 "ϒ −1 (ω)" ≤ 2k0 + C0 |ω| Ak = 2k0 + C1 |ω| k 2 k=k0 2 k=k0
Therefore the product converges absolutely, hence pointwise, and uniformly on any compact set. Clearly, " 1 −1 "2 ∞ " H 2 ϒk (ω) " 2 2 ÷ ÷ = |(ω)| lim |n (ω)| = n→∞ 2 k=0 By FatouÕs lemma and (96) we have +∞ +∞ 1 1 2 ÷ ÷n (ω)|2 dω = 1 |(ω)| dω ≤ lim | n→∞ 2π −∞ 2π −∞
÷ hence the product converges to the function (ω) ∈ L2(R). By Plancherel theorem we have +∞ +∞ 1 ∗ 2 jmω ÷
ϕ(t), ÷ ϕ(t ÷ − m) = ϕ(t) ÷ ϕ÷ (t − m) dt = |(ω)| e dω 2π −∞ −∞ (98) Suppose that there exists C > 0 such that 2 ÷n (ω)|2 e jmω | = | ÷n (ω)|2 ≤ C|(ω)| ÷ || ∈ L1 (R)
(99)
÷n (ω) then by the dominated convergence theorem applied to the sequence e jmω , we have +∞ +∞ 1 1 2 jmω ÷ ÷n (ω)|2 e jmω dω |(ω)| e dω = lim | n→∞ 2π −∞ 2π −∞ = lim In [m] = δm,0 n→∞
÷n (ω) converges in L (R) to (ω) ÷ Therefore, and by (98) the family of functions 2
÷ − m)}m∈Z {(t
forms an orthonormal set in L2(R).
110
GIANPAOLO EVANGELISTA
÷n (ω) = 0, thus (99) is We need to show (99). For |ω| > 2n π we have n obviously veriÞed for any C > 0. Suppose |ω| ≤ 2 π then n−1 1 −1 ∞ ∞ H ϒk (ω) H 1 ϒk−1 (ω) H 12 ϒk−1 (ω) 2 2 ÷ = (ω) = √ √ √ 2 2 2 k=n k=0 k=0 −1 ÷ ϒn (ω) ÷n (ω) (100) =
The function ϒn−1 (ω) maps [−2n π, 2n π] one-to-one and onto [−π, π ]. In fact, ϒ −1 (ω) = θ −1 (ω/2) is by hypotheses odd and monotonically increasing. Furthermore, θ −1 (π) = θ −1 (−π + 2π) = −θ −1 (π) + 2π hence ϒ −1 (2π) = θ −1 (π) = π and for any r > 0 ϒ −1 (2r +1 π) = θ −1 (2r π) = θ −1 (0) + 2r π = 2r π Therefore −1 −1 ϒn−1 (2n π) = ϒn−1 (2n−1 π) = · · · = ϒ1−1 (2π) = π (ϒ −1 (2n π)) = ϒn−1
Hence, for us to prove (99) it sufÞces to show that 2 ÷ |(ω)| ≥
1 C
for ω ≤ π
(101)
since if (101) is true then from (100) we have 2 ÷ ≥ |(ω)|
1 ÷ |n (ω)|2 C
for ω ≤ 2n π
In order to show (101) observe that by (88) |H (ω)|2 ≤ 2 = |H (0)|2 . Since H (ω) is continuously differentiable in a neighborhood of ω = 0 then this point is a maximum for |H (ω)|2 and " " d ln |H (ω)|2 "" d|H (ω)|2 "" =2 =0 " dω "ω=0 dω ω=0 Therefore, there exists ε > 0 such that −|ω| ≤ ln
|H (ω)|2 ≤0 2
∀|ω| ≤ ε
DYADIC WARPED WAVELETS
111
For |ω| ≤ ε, by (97) we have 7 7 " 1 −1 "2 8 "8 " ∞ ∞ " −1 " H ϒ (ω) " " (ω) ϒ k k 2 ÷n (ω)|2 = exp ≥ exp − ln | 2 2 k=0 k=0 8 7 ∞ |ω| Ak = e−C2 |ω| ≥ e−C2 ε (102) ≥ exp − 2 k=0
Suppose now ε < |ω| ≤ π and pick an integer k0 such that Ak0 |ω| ≤ ε. By condition (90), since " −1 " "ϒ (ω)" ≤ ϒ −1 (π) ≤ π ∀k ≥ 0 k k then we have
" 1 −1 "2 % 2 &k 0 k 0 −1 " " " H 2 ϒk (ω) " K 2 −1 −C2 ε 2 " " ÷ ÷ |(ω)| = ϒk0 (ω) ≥e 2 2 k=0 √ where K = inf|ω| n we have Wk ⊂ Vk−1 ⊂ · · · ⊂ Vn and we know that Wn is orthogonal to Vn so that Wn is orthogonal to Wk as well. Furthermore, given any m ∈ Z and n > m, we have n ; Wr ⊕ Vn Vm = Wm+1 ⊕ Vm+1 = Wm+1 ⊕ Wm+2 ⊕ Vm+2 = · · · = r =m+1
Since {Vn }n∈Z is a WMRA then limn→∞ Vn = {0} and limm→−∞ Vm = L2 (R). Therefore +∞ ; 2 Wr L (R) = r =−∞
Hence the union of orthonormal bases of Wn , n ∈ Z, is an orthonormal basis of L2(R).
126
GIANPAOLO EVANGELISTA
A. Regularity In the theory of dyadic wavelet bases, it is important to investigate the approximation power of wavelets on classes of regular functions. We will brießy detail ö has N vanishing moments, the analogous concept for warped wavelets. If ψ(t) that is, if +∞ ö dt = 0 t n ψ(t) 0≤n 0 implies k−1 (ωk ) > 0 and bk < 1. Furthermore, since ωk < ωk−1 then k−1 (ωk ) < k−1 (ωk−1 ) = π, therefore bk < −1. By induction on k the solution exists. Since, as observed, for any k ∈ N we have 0 ≤ k−1 (ωk ) ≤ π then the argument of the tangent in (161) is always within the interval [π/4, +π/4]. Since the tangent is one-to-one in this interval then the solution is unique. Thus, at least for discrete-time wavelets, iterated warping achieves arbitrary band allocation in the dyadic scheme by moving the original cutoff frequencies 2−n π to arbitrary but ordered locations. In audio applications, cutoff frequencies may be arranged on a perceptual scale, such as the Bark or the ERB scale, which leads to perceptually organized orthogonal transforms (Evangelista and Cavaliere, 1998a, 1998b). In other applications, the cutoff frequencies may match characteristics of the signal or optimum criteria.
C. Schr¬ oderÕs Equation and Generalized K¬ onigsÕModels Eigenvalue equations for the composition operator were introduced by Schr¬ oder and solved by K¬ onigs in a different setting than the one needed for our purposes. An account for the theory in the inner unit disk can be found in an excellent book by J. H. Shapiro (1993). In our case we need to transpose the results to the real line and to allow for parametric maps leading to an asymptotic eigenvalue equation. The following theorem will play an essential role for the construction of iterated warping maps. Theorem VII.1 Let f (ω; η) be a one-parameter family of maps of the real line into itself, Þxing the point ω = 0 for any value of the parameter η. Let f n (ω) ≡ f (ω; ηn ), where ηn , n ∈ N, is a sequence of values of the parameter η such that r r r
∀n ∈ N, f n has bounded derivatives up to order 2 and | f n′ (0)| > 0 ∀n ∈ N, | f n (ω)| ≤ αn |ω| with 0 < αn 0 and Fn ≡ f 1 ◦ f 2 ◦ · · · ◦ f n , then, for any Þnite n, Fn′ (0) =
n k=1
f k′ (0) = 0
In order to show uniform convergence of Fn (ω)/Fn′ (0) we transform this ratio into a product: Fn (ω) Fn (ω) = !n ′ ′ Fn (0) k=1 f k (0) &% & % & % f n−1 ◦ f n (ω) f 1 ◦ f 2 ◦ · · · f n (ω) f n (ω) ··· =ω ′ f n′ (0)ω f n−1 (0) f n (ω) f 1′ (0) f 2 ◦ f 3 ◦ · · · ◦ f n (ω) Thus, n Fn (ω) = ω Q k (ϕk,n (ω)) Fn′ (0) k=1
(163)
where Q k (ω) ≡
f k (ω) f k′ (0)ω
and ϕk,n (ω) ≡ f k+1 ◦ f k+2 ◦ · · · ◦ f n (ω) with ϕn,n (ω) = ω. Convergence of the product in (163) is equivalent to convergence of the sum n k=1
Notice that
|1 − Q k (ϕk,n (ω))|
" " ′ " f k (0)ω − f k (ω) " " " |1 − Q k (ω)| = " " f ′ (0)ω k
Let Ak be an upper bound for 12 | f k′′ (ω)|. Since f k (ω) is twice differentiable and f k (0) = 0, it follows from LagrangeÕs theorem that | f k (ω) − f k′ (0)ω| ≤
142
GIANPAOLO EVANGELISTA
Ak ω2 . Thus, |1 − Q k (ω)| ≤
Ak |ω| ≤ A|ω| | f k′ (0)|
where A ≡ sup k
Ak | f k′ (0)|
is Þnite since | f k′ (0)| > 0. By hypothesis, there exists an integer K such that f k (ω) is linearly bounded with constant αk ≤ α < 1 for any k > K . Therefore |ϕk,n (ω)| = | f k+1 ◦ f k+2 ◦ · · · ◦ f n (ω)| ≤ α n−k |ω|
∀k ≥ K
(164)
Thus, n
k=K
|1 − Q k (ϕk,n (ω))| ≤ A
n
k=K
α n−k |ω| = A|ω|
n−K
αm
m=0
On the other hand, for k < K and any Þnite n we have |1 − Q k (ϕk,n (ω))| ≤ 1 + |Q k (ϕk,n (ω))| " " " " " f k ◦ ϕk,n (ω) " " αk " " " " =1+" ′ ≤ 1 + " ′ "" < ∞ f k (0)ϕk,n (ω) " f k (0)
where the last inequality follows from the fact that f k ω is linearly bounded and f k′ (0) = 0. Therefore n k=1
with
|1 − Q k (ϕk,n (ω))| ≤ (1 + C)(K − 1) + A|ω| " " " αk " " C ≡ max " ′ "" k∈{1,...,K −1} f (0) k
n−K
αm
m=0
and α < 1, so that the sum on the left-hand side and the product in (163) both converge uniformly on any compact subset of R. The convergence of the sequence n f k′ ◦ ϕk,n (ω) Fn′ (ω) = Fn′ (0) f k′ (0) k=1
DYADIC WARPED WAVELETS
is equivalent to the convergence of the sum " " n "" ′ n "" ′ ′ " " f f (0) − f ◦ ϕ (ω) ◦ ϕ (ω) k,n k,n k k k " "1 − "= " ′ ′ " " " " f k (0) f k (0) k=1 k=1
143
(165)
Since f k′ is differentiable with bounded f k′′ and f k′ (0) = 0 then " ′ " " f k (ω) − f k′ (0) " 2Ak |ω| " "≤ ≤ 2A|ω| " " f k′ (0) | f k′ (0)| From (164) we have
n " ′ n−K " ′ " f k (0) − f k ◦ ϕk,n (ω) " ≤ 2A|ω| αm " " ′ (0) f k k=K m=0
with α < 1. Furthermore, since f k′ is bounded then there exists a constant B such that for any ω ∈ R K −1 " ′ " ′ " f k (0) − f k ◦ ϕk,n (ω) " "≤B " ′ f k (0) k=1
Hence the sum in (165) and Fn′ (ω)/Fn′ (0) both converge uniformly on any compact subset of R. For us to prove that F ≡ lim n
Fn Fn′ (0)
(166)
satisÞes Schr¬ oderÕs equation it sufÞces to observe that for any Þnite n we have Fn+1 Fn ◦ f n+1 ′ = f n+1 (0) ′ Fn′ (0) Fn+1 (0) and that both limits as n → ∞ of the left- and right-hand sides exist. To prove the unicity of the solution, suppose that F and G are two distinct solutions of Schr¬ oderÕs equation obtained as in (166) by means of two distinct sequences f n (ω) and gn (ω) both converging to the same function f ∞ (ω) as ′ (0)| = 1. Since f n (ω) and gn (ω) are one-to-one and continn → ∞, with | f ∞ uously differentiable for any n ∈ N, then F, G, and f ∞ (ω) are one-to-one. Let ′ (0), then from (162) we have a = f∞ F −1 (a F(ω)) = f ∞ (ω)
(167)
G ◦ f ∞ ◦ G −1 (ω) = aω
(168)
and
144
GIANPAOLO EVANGELISTA
Substituting (167) into (168) we obtain G ◦ F −1 ◦ a F ◦ G −1 (ω) = aω or H (aω) = a H (ω)
(169)
where H = G ◦ F −1 . Without loss of generality we can assume a < 1, since if a > 1 we can transform (169) in the following way: H (a −1 ω) = a −1 a H (a −1 ω) = a −1 H (aa −1 ω) = a −1 H (ω) with a −1 < 1. By repeated use of (169) we obtain H (a n ω) = a n H (ω)
(170)
The function H is continuously differentiable. By deriving both sides of (170) with respect to ω we obtain " " d n dH " H (a n ω) = a n H ′ (ω) = a " dβ β=a n ω dω or, dividing by a n , for any n ∈ N and for any ω ∈ R we have " d H "" = H ′ (ω) dβ "β=a n ω
which implies that
H ′ (ω) = H ′ (0) Since by continuity (169) also implies that H (0) = 0, then G ◦ F −1 (ω) = H (ω) = H ′ (0)ω Therefore G = H ′ (0)F and the solution of (162) is unique up to a multiplicative constant. 1. Realization of Warping Maps by Iterated Laguerre Maps: The Constant Parameter Case Theorem VII.1 can be applied to the composition-by-ϒ −1 operator, in order to solve (147). In our case of interest, ϒ −1 = θ −1 (ω/2) where θ belongs to the
145
DYADIC WARPED WAVELETS
family of Laguerre warping maps (152). We let .ω/ ω b sin(ω/2) = − 2 arctan ϑ −1 (ω; b) = θ −1 2 2 1 + b cos(ω/2)
(171)
instantiate the one-parameter family of Theorem VII.1. The family ϑ −1 (ω; b) is indeÞnitely differentiable with respect to ω. Furthermore, " ν ∂ϑ −1 "" = >0 " ∂ω ω=0 2
for −1 < b < 1, where
1−b 1+b The simplest case arises when the sequence of parameters is constant. According to Theorem VII.1, in order to satisfy the eigenfunction equation ν=
Ŵ −1 ◦ ϒ −1 = aŴ −1
(172)
with eigenvalue a, we need ν = 2a Hence, for 0 < a < 1 we have
" ∂ϑ −1 "" 0< 0 since by Lemma VII.2 either νk ≥ 1 for 12 ≤ a < 1 or νk ≥ ν1 = tan (aπ/2) > 0 for 0 < a < 12 . Let A = supk∈N Ak < ∞, then Sn ≤ A
n
ξn,k
k=1
Suppose 12 < a < 1, then, for any k ∈ N, νk > 1 and k (ω) is convex on (0, a k+1 π), hence ξn,k = k (a n π) ≤ a n−k k (a k π ) = a n−k π so that Sn ≤ Aπ
n−1 r =0
ar
160
GIANPAOLO EVANGELISTA
Suppose 0 < a < 12 , then, for any k ∈ N, νk ≤ 1 and ϑk−1 is convex or linear at most on (0, 2π), therefore |ϑk−1 (ω)| ≤ |ω|/2 so that π ξn,k ≤ n−k 2 and n−1 1 Sn ≤ Aπ r 2 r =0 Thus, for 0 < a < 1 we always have
lim Sn < ∞
n→∞
and the product Pn converges as n → ∞. Remark VII.1 The convergence of the product lim νk = 2a
k→∞
!∞
k=1
(2a/νk ) implies that (188)
This is quite a remarkable result since the dependency of νk on a is quite complex. To give an idea we write the fourth iterate: tan 2 arctan
tan 2 arctan tan 2 arctan
ν4 = tan 2 arctan
tan 2 arctan
tan 2 arctan tan 2 arctan
4 tan a π 2 aπ tan 2 2 tan a π 2 tan aπ 2 3 tan a π 2 tan aπ 2 2 tan a π 2 tan aπ 2
Even though tan 2 arctan x can be simpliÞed to the rational function 2x/(1 − x 2 ), the dependency on a remains inaccessible as k grows, while (188) shows that asymptotically νk depends linearly on a. In particular, since a < 1 then limk→∞ νk < 2 and νk ≥ 2 for at most a Þnite number of integers k. Thus, Theorem VII.1 applies to the iterated maps obtained by enforcing the exponential cutoff condition. Numerical experiments show that for a ≥ 12 at most ν1 ≥ 2 while νk < 2 for k > 1. Furthermore, the rate of convergence is exponential: νn = 2a + O(a 2n−1 )
÷−1 (ω) in (181) is an eigenfunction of the By Theorem VII.1 the map −1 operator with eigenvalue a: composition-by-ϑ∞ −1 ÷−1 ÷−1 ◦ ϑ∞ = a
(189)
DYADIC WARPED WAVELETS
161
−1 where ϑ∞ has parameter ν∞ = 2a. Since the solution of Schr¬ oderÕs equation ÷−1 by (189) is unique up to a multiplicative constant, one can normalize deÞning
÷−1 (ω) π ÷−1 (π )
−1 (ω) = By (182) we have n
Therefore
0) can be expressed in terms
162
GIANPAOLO EVANGELISTA
of f 0,k as follows:
f, ϕön,m = = where
+∞
k=−∞ +∞
f 0,k ϕö0,k , ϕön,m =
+∞
k=−∞
f 0,k W S k ϕ, ÷ Dan W S m ϕ ÷
∗ f 0,k ϕn,m [k]
k=−∞
÷ W S k ϕ ÷ = W † Dan W S m ϕ, ÷ S K ϕ ÷ ϕn,m [k] = Dan W S m ϕ, are scaling sequences. Clearly, by Plancherel theorem we have n,m (ω) =
+∞
ϕn,m [k]e− jkω
k=−∞
+∞ ( +∞ 1 dϒn ÷ − jkw ÷∗ ()e jk d (ϒn ())e− jmϒn () e = 2π k=−∞ d −∞ ( +∞ dϒn − jmϒn (ω) ÷∗ (ω + 2kπ)(ϒ ÷ n (ω + 2kπ)) (190) e = dω k=−∞
where we used the same notation as in Section V and the fact that ϒn (ω + 2kπ) = ϒn (ω) + 2n+1 kπ. But, ∞ H 21 ϒr−1 (ω) ÷ (ω) = √ 2 r =0
and ϒr−1 (ϒn (ω)) = ϒn−r (ω) for r > n and ϒr−1 (ϒn (ω)) = ϒr−1 −n (ω) for r ≤ n, hence n H 21 ϒs (ω) ÷ + 2kπ) ÷ n (ω + 2kπ)) = (ω (ϒ √ 2 s=1 Plugging this result into (190) and using (48) we obtain ( n 1 dϒn − jmϒn (ω) H 12 ϒs (ω) n,m (ω) = e n 2 dω s=1 Furthermore,
n dϒn ϒ ′ (ϒs−1 (ω)) = dω s=1
(191)
DYADIC WARPED WAVELETS
163
hence n ) √ n,m (ω) = e− jmϒn (ω) 2−n ϒ ′ (ϒs−1 (ω))H 21 ϒs (ω) s=1
Similarly, one can show that √ ) n,m (ω) = e− jmϒn (ω) 2−1 ϒ ′ (ϒn−1 (ω))G 21 ϒn (ω) n−1,0 (ω)
where ψn,m [k] = ψön,m , ψö0,k . The last two equations should be compared with (157) and (160) obtained from iterated Laguerre warped Þlter banks. Identity is obtained by making the following associations: ϒ(ω) = 2θ(ω) ϒs (ω) = s (ω) with identical parameters for the Laguerre maps θk (ω) = θ(ω) and recalling that 0 (ω) is the causal square root of θ ′ (ω). The iterated Laguerre warped Þlter bank with identical maps is therefore suitable for the computation of the dyadic warped wavelet transform of causal signals. Moreover, the constructive results shown in Section VII.C.1 show that the common Laguerre parameter has to be set to 1 − 2a b= 1 + 2a
and the global warping map Ŵ(ω) (i.e., the characteristic of the operator W) results from (176). A problem in the computation of the dyadic warped wavelets lies in the fact that the coefÞcients f 0,k should be obtained either by unwarping the continuous time signal f (t) with the map Ŵ −1 (ω) or by computing the scalar products of the signal with the scaling functions ϕö0,k (t). This is exactly the same drawback as that encountered in the globally warped wavelets described in Section III.D. In the ordinary dyadic wavelet expansion, the coefÞcients f 0,k are assimilated with the samples of the signal f (k). This is a reasonable assumption if the signal is bandlimited in [−π, π ] since, due to the fact that the level n = 0 scaling functions are obtained by pure time-shifted copies of the same function ϕ(t), ö the coefÞcients f (0,k) are obtained by means of a sampled convolution of f (t) with ϕ ∗ (−t) (i.e., by passing the bandlimited signal through a full-band quasi-ideal Þlter. This approximation is not feasible in the dyadic warped wavelets since the level n = 0 scaling functions are obtained by all-pass Þltering the same function ϕ(t), ö which results in frequency-dependent delays. In fact, assuming that ÷ f (k) ≈ f, ϕö0,k = f, W S k ϕ
164
GIANPAOLO EVANGELISTA
is equivalent to assuming that ÷ f (k) ≈ W † f, S k ϕ Since ϕ÷is a quasi-ideal Þlter, this would imply that W † f ≈ f (i.e., that warping does not considerably alter the signal), which is of course not true. Equivalently, ÷ − k) as a starting point. One can construct one could use the set ϕ÷0,k (t) = ϕ(t the wavelets ψ÷n,m = W † ψön,m, which are globally unwarped versions of the wavelets ψön,m (i.e., unitarily equivalent to the scale a wavelet set ψön,m ). For n > 0 the cutoff frequencies ω÷n of the wavelets ψön,m are determined by the cutoff frequency of the narrowest band Þlter in (191), that is, by the equation ω÷n = ϒn−1 (π) = Ŵ a n Ŵ −1 (π) = Ŵ(a n π) while the cutoff frequencies ωön of the wavelets ψön,m are given by ω÷n = Ŵ −1 (ω÷n ) = Ŵ −1 ϒn−1 (π) = a n π
A viable solution, satisfactory from both computational and approximation points of view, is to use warped wavelets based on iterated Laguerre warping maps with variable parameter as shown in Section VII.C.2. The iterated Laguerre map −1 −1 −1 −1 n (ω) = ϑ1 ◦ ϑ2 ◦ · · · ◦ ϑn (ω)
is constrained to satisfy the exponential cutoff condition (182). We showed that lim n
ϒn−1 (ω) −1 n (ω) = lim = Ŵ −1 (ω) n an an
(192)
For n ≤ 0 we let ξn,m = ϕön,m while for n > 0 we deÞne new scaling functions as follows: ξn,m (t) =
+∞
k=−∞
ϕn,m [k]ϕ(t ÷ − k)
where the sequences ϕn,m [k] form the coefÞcients of the Fourier series ( & % n 1 1 dn − jmn (ω) H s (ω) n,m (ω) = e 2n dω s=1 2 Hence, in the Fourier domain we have
÷ n,m (ω) = n,m (ω)(ω)
(193)
DYADIC WARPED WAVELETS
165
Clearly, since for n ≤ 0 the scaling functions ξn,m and ϕ÷n,m coincide then Theorem V.1 applies to {ξn,m }m∈Z for any n ∈ Z − N. For any n > 0 the set {ξn,m }m∈Z is orthogonal in L2 (R) by construction since the sequences ϕn,m [k] are orthogonal in ℓ2 (Z). Hence the set {ξn,m }m∈Z is orthogonal for any n ∈ Z. Similarly, the spaces Un = span{ξn,m |m ∈ Z} satisfy Un+1 ⊂ Un
m
∀n ∈ Z
f ∈ U0 ⇐⇒ S f ∈ U0 by construction and by Theorem V.2. Theorem V.3 is also veriÞed since the spaces Un are unitarily equivalent to the spaces Vn for n ≤ 0 and Un+1 ⊂ Un . In order to show the completeness of the set of wavelets ⎧ ÷ n≤0 ⎪ ⎨ ψ n,m (t) +∞ ζn,m (t) = ⎪ ψn,m [k]ϕ(t ÷ − k) n>0 ⎩ k=−∞
where
√ ) n,m (ω) = e− jmn (ω) 2−1 ϑn′ (n−1 (ω))G 21 n (ω) n−1,0 (ω)
we must prove that
n∈Z
Notice that for any n ∈ Z − N
Un = {0}
(194)
f ∈ Un−1 ⇐⇒ W † Da W f ∈ Un
however, this property is no longer true when n > 0 in view of deÞnition (193). To prove (194) consider the orthogonal projection of f over the space Un: PUn f = =
+∞
m=−∞
f, ξn,m ξn,m =
Dan PU÷n Da−n
f
+∞
Da−n f, Da−n ξn,m Dan Da−n ξn,m
m=−∞
where PU÷n is the orthogonal projection operator over the space 5 6 ÷n = span D −n ξn,m |m ∈ Z U a Hence
4 4 44 4 4 4 4 4 4 4 PUn f 4 = 4 D n PU÷ D −n f 4 = 4 PU÷ D −n f 4 ≤ 4 PU÷ 44 D −n f 4 a a a a n n n
166
GIANPAOLO EVANGELISTA
As in the proof of Theorem V.4 if f ∈ S(R) then lim Da−n f = 0
n→∞
In order to complete the proof of (194) we need to show that limn→∞ PU÷n exists Þnite. Clearly, for any Þnite n, PU÷n is a well-deÞned orthogonal projection operator since {Da−n ξn,m | m ∈ Z} is an orthogonal set in L2 (R), hence PU÷n = 1. We have √ √ ÷ n ω) F Da−n ξn,m (ω) = a n n,m (ω) = a n n,m (a n ω)(a √ = 1 L (R) and H (0) = 2 We have For ϕ÷ ∈ L2 (R) ÷ n ω) = (0) ÷ =1 lim (a
n→∞
Furthermore, √ lim a n n,0 (a n ω) = lim
n→∞
n→∞
(
n H dn (a n ω) dω s=1
1 2
s (a n ω) √ 2
(195)
As shown in Section VII.C.2 Theorem VII.1 applies to iterated Laguerre maps under exponential cutoff constraint. Hence (192) is true and lim n (a n ω) = Ŵ(ω)
(196)
n→∞
Furthermore, n ∈ C1 (R) and both (196) and the sequence of its Þrst derivatives are uniformly convergent. Thus, ( dn (a n ω) ) ′ = Ŵ (ω) lim n→∞ dω Concerning the product in (195) we have n n n H 12 Q −1 H 12 s (a n ω) n,s (n (a ω)) = √ √ 2 2 s=1 s=1
where Q −1 n,s (ω)
=
7 −1 −1 ◦ · · · ◦ ϑn−1 ϑs+1 ◦ ϑs+2 ω
if s < n if s = n
−1 (ω) is Since by Proposition VII.2 and (174), for s large enough, each map ϑs+1 linearly bounded with coefÞcient less than one, and since (196) holds uniformly on ω, we can use the same arguments as in the proof of Theorem V.1 to show
DYADIC WARPED WAVELETS
that n H
lim
n→∞
s=1
1 2
167
s (a n ω) ÷ = (Ŵ(ω)) √ 2
Collecting our results, we have ) √ ÷ ÷ ö lim a n n,m (ω) = Ŵ ′ (ω)(Ŵ(ω)) (0) = (ω) n→∞
Furthermore,
lim e− jmn (a
n
ω)
n→∞
= e− jmŴ(ω)
Thus, 4 4 4 4 lim 4 PU÷n 4 = 4 PV0 4 = 1
n→∞
Consequently,
4 4 lim 4 PUn f 4 = 0
n→∞
for any f ∈ S (R). Since S (R) is dense in L2 (R) and using the same arguments as in Theorem VI.1, we have proved the following theorem. Theorem VIII.1 Let H (ω) be a 2π-periodic function continuously differentiable in a neighborhood of ω = 0 satisfying the following requirements: |H (ω)|2 + |H (ω + π)|2 = 2 √ H (0) = 2
∀ω ∈ R and
inf
|ω| 0
Let G(ω) = e− jω H ∗ (ω + π) and let Ŵ(ω) = lim n (a n ω) n→∞
where k (ω) =
ϑk (k−1 (ω)) ω
0 0
m∈Z
k=−∞
where
and
.ω/ 1 −1 ÷ ϑ∞ ÷ (ω) (ω) = √ G 2 2
√ ) n,m (ω) = e− jmn (ω) 2−1 ϑn′ (n−1 (ω))G 12 n (ω) n−1,0 (ω)
is orthonormal and complete in L2 (R).
DYADIC WARPED WAVELETS
169
Remark VIII.1 In the proof of the previous theorem we showed that lim Da−n ξn,m = ϕö0,m
n→∞
For n large we have ξn,m ≈ Dan ϕö0,m = ϕön,m Hence, the modiÞed warped scaling functions and wavelets generated by iterated Laguerre maps with exponential cutoff asymptotically tend to the globally warped scaling functions and wavelets, respectively. Furthermore, a Þnite number of cutoff frequencies can always be altered while we preserve orthogonality and completeness of the wavelets. Therefore, we can adjust the modiÞed wavelets on arbitrary perceptual scales.
IX. Conclusions In this article we presented an extension of the theory of orthogonal wavelet bases using iterated frequency warping as a tool for designing the frequency resolution of the basis elements. Our theory started with the deÞnition of generalized multiresolution approximation, from which the construction of wavelet bases by iterated warping follows. Most of the theorems in Sections III to VI are extensions of the results found in Mallat (1998) and Daubechies (1992), from which many ideas for their proof were drawn. A relevant part of this work is devoted to the construction of warping maps. In this context, original results were obtained by considering the warping map as an eigenfunction of the composition operator. We were concerned with realizable frequency warping methods that can be translated into efÞcient algorithms. For this reason we constrained the warping map to the class of maps obtained by alternating dyadic scaling with the discrete-time Laguerre transform. We showed that asymptotically scale a wavelets can be constructed in computable form by constraining the wavelet cutoff frequencies to obey an exponential law. Warped wavelet decompositions lead to unconventional tilings of the timeÐfrequency plane in which, due to the frequency-dependent delay of each basis element, the uncertainty zones are characterized by curved boundaries. Their ßexible frequency resolution allocation makes them interesting for the applications.
Acknowledgments The author wishes to thank Dr. J« erö ome Lebrun for fruitful discussions. This work was partially supported by the Swiss National Funds under Grant 21-57220.99.
170
GIANPAOLO EVANGELISTA
References Baraniuk, R. G., and Jones, D. L. (1993). Warped wavelets bases: Unitary equivalence and signal processing. Proc. ICASSPÕ93,IEEE III, 320Ð323. Baraniuk, R. G., and Jones, D. L. (1995). Unitary equivalence: A new twist on signal processing. IEEE Trans. Signal Processing 43, 2269Ð2282. Blu, T. (1998). A new design algorithm for two-band orthonormal Þlter banks and orthonormal rational wavelets. IEEE Trans. Signal Processing 46, 1494Ð1504. Braccini, C., and Oppenheim, A. V. (1974). Unequal bandwidth spectral analysis using digital frequency warping. IEEE Trans. Acoustics, Speech Signal Processing 22, 236Ð244. Broome, P. W. (1965). Discrete orthonormal sequences. J. Assoc. Comput. Machinery 12, 151Ð 168. Cohen, A. (1990). Ondelettes, analyses multir« esolutions et Þltres miroire en quadrature, in Ann. Inst. H. Poincar« e, Anal. non lin« eaire. vol. 7, pp. 439Ð459. Coifman, R. R., and Wickerhauser, M. V. (1992). Entropy-based algorithms for best basis selection. IEEE Trans. Inform. Theory 38, 713Ð718. Daubechies., I. (1988). Orthonormal bases of compactly supported wavelets. Common un. Pure Appl. Math. XLI, 909Ð996. Daubechies, I. (1992). Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, Philadelphia. Evangelista, G. (1992). Wavelet transforms and wave digital Þlters, in Wavelets and Applications edited by Meyer. Y. New York: Springer-Verlag, pp. 396Ð412. Evangelista, G. (1993). Pitch synchronous wavelet representations of speech and music signals. IEEE Trans. Signal Processing (Special issue on Wavelets and Signal Processing) 41, 3313Ð 3330. Evangelista, G. (1994). Comb and multiplexed wavelet transforms and their applications to signal processing. IEEE Trans. Signal Processing 42, 292Ð303. Evangelista, G., and Cavaliere, S. (1998a). Arbitrary bandwidth wavelet sets. Proc. ICASSPÕ98 III, 1801Ð1804. Evangelistar, G., and Cavaliere, S. (1998b). Auditory modeling via frequency warped wavelet transform. Proc. EUSIPCO98 I, 117Ð120. Evangelista, G., and Cavaliere, S. (1998c). Discrete frequency warped wavelets: Theory and applications. IEEE Trans. Signal Processing (Special issue on Theory and Applications of Filter Banks and Wavelets) 46, 874Ð885. Evangelista, G., and Cavaliere, S. (1998d). Frequency warped Þlter banks and wavelet transforms: A discrete-time approach via Laguerre expansion. IEEE Trans. Signal Processing 46, 2638Ð 2650. Grossmann, A, and Morlet, J. (1984). Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM J. Math. Anal. 723Ð736. Mallat, S. (1989). Multiresolution approximation and wavelets. Trans. Am. Math. Soc. 69Ð88. Mallat, S. (1998). A Wavelet Tour of Signal Processing. Boston: Academic Press. Morlet, J., Arens, G., Fourgeau, I., and Giard, D. (1982). Wave propagation and sampling theory. Geophysics. 203Ð236. Oppenheim, A. V., and Johnson, D. H. (1972). Discrete representation of signals. Proc. IEEE 60, 681Ð691. Oppenheim, A. V., Johnson, D. H., and Steiglitz, K. (1971). Computation of spectra with unequal resolution using the Fast Fourier Transform, Proc. IEEE 59, 299Ð301. Shapiro, J. H. (1993). Composition Operators and Classical Function Theory, New York: Universitext: Tracts in Mathematics, Springer-Verlag.
DYADIC WARPED WAVELETS
171
Shapiro, J. H. (1998). Composition operators and Schr¬ oderÕs functional equation. Trans. Am. Math. Soc. Contemp. Math. 213Ð228. Strang, G., and Nguyen, T. (1996). Wavelets and Filter Banks. Wellesley, MA, and Boston, Wellesley-Cambridge. Vetterli, M., and Kovacevic, J. (1995). Wavelets and Subband Coding. Englewood Cliffs, NJ: Prentice Hall. Wornell, G. W., and Oppenheim, A. V. (1992). Wavelet-based representations for a class of self-similar signals with applications to fractal modulation. IEEE Trans. Inform. Theory 38, 785Ð800.
This Page Intentionally Left Blank
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 117
Recent Developments in Stack Filtering and Smoothing JOSE« L. PAREDES1 and GONZALO R. ARCE2 1
Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716 Department of Electrical Engineering, University of Los Andes, M« erida, Venezuela 2 Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716
I. Introduction . . . . . . . . . . . . . . . . . . . . . II. Threshold Decomposition and Stack Smoothers . . . . . . . A. Threshold Decomposition. . . . . . . . . . . . . . . B. Stack Smoothers. . . . . . . . . . . . . . . . . . . III. Mirrored Threshold Decomposition and Stack Filters. . . . . A. Mirrored Threshold Decomposition. . . . . . . . . . . B. Stack Filters . . . . . . . . . . . . . . . . . . . . IV. Integer Domain Filters of Linearly Separable PBFs. . . . . . A. Weighted Median (WM) Filters . . . . . . . . . . . . V. Analysis of WM Filters Using Threshold Logic . . . . . . . A. Finding the PBF Corresponding to a Weighted Median Filter B. Finding the Weighted Median Filter Corresponding to a PBF VI. Recursive Weighted Median Filters and Their Nonrecursive WM Filter Synthesis . . . . . . . . . . . . . . . . . . VII. Weighted Median Filters with N Weights. . . . . . . . . . A. Weighted Median Filter Computation . . . . . . . . . . B. Recursive Weighted Median Filters . . . . . . . . . . . VIII. Stack Filter Optimization . . . . . . . . . . . . . . . . A. Thresholded Signals Generated by Mirrored Threshold Decomposition . . . . . . . . . . . . . . . . . . . B. Stack Filter Optimization . . . . . . . . . . . . . . . C. Adaptive Optimization Algorithm . . . . . . . . . . . D. Fast Adaptive Optimization Algorithm . . . . . . . . . E. Optimal WM Filtering: The Least Mean Absolute (LMA) Algorithm . . . . . . . . . . . . . . . . . . . . . F. Optimal Recursive Weighted Median Filtering . . . . . . IX. Applications of Stack Filters . . . . . . . . . . . . . . . A. Design of a High-Pass Stack Filter . . . . . . . . . . . B. Image Denoising with WM Filters . . . . . . . . . . . C. Optimal Frequency Selection WM Filtering . . . . . . . D. Sharpening with WM Filters. . . . . . . . . . . . . . X. Conclusion . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
174 175 175 177 179 179 181 183 185 189 190 191
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
195 200 201 203 205
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
205 208 210 214
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
216 221 223 223 226 228 233 237 238
173 Volume 117 ISBN 0-12-014759-9
C 2001 by Academic Press ADVANCES IN IMAGING AND ELECTRON PHYSICS Copyright All rights of reproduction in any form reserved. ISSN 1076-5670/01 $35.00
174
JOSE« L. PAREDES AND GONZALO R. ARCE
I. Introduction Linear Þltering has dominated the Þeld of signal processing and its applications. This is a result of the theoretical basis provided by the rich theory of linear systems and by the computational efÞciency of linear Þlters. Although linear Þlters perform well in a number of applications, their performance is less than adequate in many situations, particularly with signals characterized by sharp discontinuities, such as edges in digital images, and with signals corrupted by non-Gaussian noise (Pitas and Venetsanopoulos, 1990; Pratt, 1991). Stack smoothers and stack Þlters have emerged as a useful class of nonlinear Þlters which overcome many of the limitations of linear Þlters. This nonlinear Þlter class enjoys a rich theory for their analysis, design, and optimization. As a result, stack Þlters have become widely accepted in the signal processing Þeld. At Þrst, the stack smoother structure was limited to have low-pass frequency characteristics (Wendt et al., 1986; Yli-Harja et al., 1991). Recently, these structures have been generalized to allow not only low-pass frequency characteristics but also bandpass and high-pass Þltering characteristics (Arce, 1998; Arce and Paredes, 2000; Paredes and Arce, 1999). The new structures are thus referred to as stack Þlters whereas the limited stack structures are referred to as stack smoothers. Stack smoothers and stack Þlters compose a very large class of Þlters. Among these, the most popular and efÞcient in terms of implementation are the so-called weighted median (WM) Þlters (Arce, 1998; Brownrigg, 1984). WM Þlters have a structure analogous to that of linear Finite Impulse Response (FIR) Þlters and, consequently, share many similar attributes. Stack Þlters and WM Þlters are intimately related, thus the understanding of stack Þlters reveals many important properties of WM Þlters (Paredes and Arce, 1999). The goals of this article are: (1) to describe the class of stack smoothers and an important tool for their analysis and design, namely threshold decomposition; (2) to show how the structure of stack smoothers can be generalized to a richer and more versatile class of Þlters referred to as stack Þlters; (3) to present mirrored threshold decomposition as a theoretical tool for the analysis of stack Þlters; (4) to study nonrecursive and recursive weighted median Þlters admitting real-valued weights as subclasses of stack Þlters; (5) to elaborate conversion methods which map the ÞlterÕs binary representation to its integervalued representation and vice-versa; (6) to develop adaptive optimization algorithms for the design of stack Þlters and stack smoothers; and (7) to illustrate the use of stack Þlters in several applications that require frequency selection type of Þltering characteristics.
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 175
II. Threshold Decomposition and Stack Smoothers Threshold decomposition provides the foundation needed for the deÞnition of stack smoothers. Threshold decomposition was originally introduced by Fitch et al. (1984) to analyze the median Þltering operation acting over nonnegative integer-valued signals. Threshold decomposition accepting nonnegative real-valued signals, and threshold decomposition accepting real-valued signals were later introduced in Yin and Neuvo (1994) and Arce (1998), respectively. For the purposes of this article, threshold decomposition must Þrst be extended to admit integer-valued signals taking on positive and negative values. In Section III, mirrored threshold decomposition, also for integer-valued signals, is then introduced to deÞne stack Þlters. It will be shown that stack Þlters, in turn, can be used to deÞne weighted median Þlters admitting positive and negative weights.
A. Threshold Decomposition Consider an integer-valued set of samples X 1 , . . . , X N forming the vector X = [X 1 , . . . , X N ]T , where X i ∈ {−M, . . . , −1, 0, . . . , M}. The threshold decomposition of X amounts to decomposing this vector into 2M binary vectors x−M+1 , . . . , x0 , . . . , x M where the ith element of xm is deÞned by 1 if X i ≥ m (1) xim = T m (X i ) = −1 if X i < m where T m (·) is referred to as the thresholding operator. Although deÞned for integer-valued signals, the thresholding operation in Eq. (1) can be extended to real-valued signals (Arce, 1998; Yin and Neuvo, 1994). The threshold decomposition of the vector X = [0, 0, 2, −2, 1, 1, 0, −1, −1]T
(2)
with M = 2, for instance, leads to the four binary vectors x2 = [−1, −1, 1, −1, −1, −1, −1, −1, −1]T x1 = [−1, −1, 1, −1, 1, 1, −1, −1, −1]T x0 = [ 1, 1, 1, −1, 1, 1, 1, −1, −1]T −1 x = [ 1, 1, 1, −1, 1, 1, 1, 1, 1]T
(3)
Threshold decomposition has several important properties. First, threshold decomposition is reversible. Given a set of thresholded signals, each of the
JOSE« L. PAREDES AND GONZALO R. ARCE
176
samples in X can be exactly reconstructed as Xi =
M 1 xm 2 m=−M+1 i
(4)
Thus, an integer-valued discrete-time signal has a unique threshold signal T.D. T.D. representation, and vice-versa: X i ←→ {xim }, where ←→ denotes the oneto-one mapping provided by the threshold decomposition operation. A second property of importance is the partial ordering obeyed by the threshold decomposed variables. For all thresholding levels m > ℓ, it can be shown that xim ≤ xiℓ . In particular, if xim = 1 then xiℓ = 1 for all ℓ < m. Similarly, if xiℓ = −1 then xim = −1 for all m > ℓ. The partial order relationships among samples across the various thresholded levels emerge naturally in thresholding and are referred to as the stacking constraints (Wendt et al., 1986). Threshold decomposition is of particular importance in median smoothing since threshold decomposition and median smoothing are commutable operations. That is, applying a median smoother to a 2M + 1 valued signal is equivalent to decomposing the signal to 2M binary thresholded signals, processing each binary signal separately with the corresponding median smoother, and then adding the binary outputs to obtain the integer-valued output. Thus, the median of a set of samples X 1 , X 2 , . . . , X N is related to the set of medians of the thresholded signals as (Fitch et al., 1984) MEDIAN (X 1 , . . . , X N ) = T.D.
M 1 MEDIAN x1m , . . . , x Nm 2 m=−M+1
(5)
T.D.
N N Since X i ←→ {xim } and MEDIAN (X i |i=1 )}, the re) ←→ {MEDIAN (xim |i=1 lationship in Eq. (5) establishes a weak superposition property satisÞed by the nonlinear median operator, which is important from the fact that the effects of median smoothing on binary signals are much easier to analyze than those on multilevel signals. In fact, the median operation on binary samples reduces to a simple Boolean operation. The median of three binary samples x1 , x2 , x3 , for example, is equivalent to x1 x2 + x2 x3 + x1 x3 , where the + (OR) and xi x j (AND) ÒBooleanÓoperators in the {−1, 1} domain are deÞned as
xi + x j = max(xi , x j ) xi x j = min(xi , x j )
(6)
Note that the operations in Eq. (6) are also valid for the standard Boolean operations in the {0, 1} domain.
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 177
Figure 1. Weak superposition property of median smoothers.
Figure 1 illustrates the weak superposition property of the 3-point median smoother. Note that applying the 3-point median smoother to the multilevel signal X = [−2, 1, −1]T is equivalent to decomposing the signal to four binary thresholded signals; processing each binary signal separately with the corresponding binary median smoother, f (x1 , x2 , x3 ) = x1 x2 + x1 x3 + x2 x3 ; and then adding the binary outputs to obtain the integer-valued output. B. Stack Smoothers To deÞne the running stack smoother, let {X (·)} be a discrete time sequence where each element in the sequence can take on integer values in {−M, −M + 1, . . . , 0, . . . , M}. The running stack smoother passes a window over the sequence {X (·)} that selects, at each instant n, a set of samples to compose the observation vector X(n). The observation window is centered at n, which results in X(n) = [X (n − N L ), . . . , X (n), . . . , X (n + N R )]T = [X 1 (n), X 2 (n), . . . , X N (n)]T
(7)
where X i (n) = X (n − N L + i − 1), N L and N R may range in value over the nonnegative integers, and N = N L + N R + 1 is the window size. The stack smoother operating on the input sequence {X (·)} produces the output sequence
178
JOSE« L. PAREDES AND GONZALO R. ARCE
Figure 2. The stack smoothing operation.
{S(·)}, where at time index n S(n) = S(X 1 (n), . . . , X N (n)) =
M 1 f x1m (n), . . . , x Nm (n) 2 m=−M+1
(8)
where {xim (n)} is the threshold decomposition of X i (n) and f (·) is a ÒBooleanÓ operation satisfying Eq. (6) and the stacking property. More precisely, if two binary vectors u ∈ {−1, 1}N and v ∈ {−1, 1}N stack (i.e., u i ≥ vi for all i ∈ {1, . . . , N }), then their respective outputs stack, f (u) ≥ f (v). A necessary and sufÞcient condition for a function to possess the stacking property is that it can be expressed as a Boolean function which contains no complements of input variables (Gilbert, 1954). Such functions are known as positive Boolean functions (PBFs). The stack smoothing operation is schematically described in Figure 2 where the discrete time index has been dropped for notational simplicity. Given a positive Boolean function f (x1m , . . . , x Nm ) which characterizes a stack smoother, it is possible to Þnd the equivalent smoother in the integer domain by replacing the binary AND and OR Boolean functions acting on the xi Õs with min and max operations acting on the multilevel X i samples. For instance, if the Boolean function that characterizes the stack smoother is given by f (x1 , x2 , x3 ) = x1 x3 + x2 , the equivalent smoother in the integer domain is S(X 1 , X 2 , X 3 ) = max(min(X 1 , X 3 ), X 2 ). A more intuitive class of smoothers is obtained, however, if the positive Boolean functions are further restricted (Yli-Harja et al., 1991). When self-duality and separability is imposed, for instance, the equivalent integer domain stack smoothers reduce to the well-known class of weighted median smoothers with positive weights. For example, if the Boolean function in the stack smoother representation is selected as f (x1 , x2 , x3 , x4 ) = x1 x3 x4 + x2 x4 + x2 x3 + x1 x2 , the equivalent WM smoother takes on the positive weights (W1 , W2 , W3 , W4 ) = (1, 2, 1, 1). The procedure of how to obtain the weights Wi from the PBF will be described in detail shortly.
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 179
Admitting only positive weights, WM smoothers are severely constrained, as are all Þlters deÞned by stack smoothers (Yli-Harja et al., 1991). Much as the WM smoother can be generalized to a WM Þlter structure admitting realvalued weights (Arce, 1998), a generalized stack Þlter representation must exist which includes Þlters with much richer characteristics than those of stack smoothers. As we show next, this is in fact true where the key to the new stack Þlter structure lies in the deÞnition of mirrored threshold decomposition and on the Þlters associated with the new decomposition. III. Mirrored Threshold Decomposition and Stack Filters Much as threshold decomposition provides the underlying basis for the deÞnition of stack smoothers, the class of stack Þlters can be deÞned in a similar fashion through a more general threshold decomposition architecture referred to as mirrored threshold decomposition (Paredes and Arce, 1999). A. Mirrored Threshold Decomposition Consider again the set of integer-valued samples X 1 , X 2 , . . . , X N forming the vector X where X i ∈ {−M, . . . , −1, 0, . . . , M}. Unlike threshold decomposition, mirrored threshold decomposition of X generates two sets of binary vectors, each consisting of 2M vectors. The Þrst set consists of the 2M vectors associated with the traditional deÞnition of threshold decomposition, x−M+1 , x−M+2 , . . . , x0 , . . . , x M . The second set of vectors is associated with the decomposition of the mirrored vector of X, which is deÞned as S = [ S1 , S2 , . . . , S N ]T
= [−X 1 , −X 2 , . . . , −X N ]T
(9) (10)
Since Si takes on symmetrical values about the origin from X i , Si is referred to as the mirror sample of X i , or simply as the signed sample X i . Threshold decomposition of S leads to the second set of 2M binary vectors, s−M+1 , s−M+2 , . . . , s0 , . . . , s M . The ith element of xm is as speciÞed before by 1 if X i ≥ m m m xi = T (X i ) = (11) −1 if X i < m whereas the ith element of sm is deÞned by 1 sim = T m (−X i ) = −1
if (−X i ) ≥ m if (−X i ) < m
(12)
JOSE« L. PAREDES AND GONZALO R. ARCE
180
The thresholded mirror signal can be written as sim = sgn(−X i − m) = −sgn(X i + m − 1) where sgn(·) denotes the sign function deÞned as +1 if X ≥ 0 sgn(X ) = (13) −1 if X < 0 The two sets of decomposed signals preserve all the desirable properties previously described: 1. X i and Si are both reversible from their corresponding set of decomposed signals and, consequently, an integer-valued signal X i has a unique mirrored threshold signal representation, and vice-versa: 6 5 6 T.D. 5 X i ←→ xim ; sim T.D.
where ←→ denotes the one-to-one mapping provided by the mirrored threshold decomposition operation. 2. The stacking property is obeyed by each set of binary vectors. That is, given the set of thresholded binary vectors x−M+1 , . . . , x0 , . . . , x M and s−M+1 , . . . , s0 , . . . , s M , it follows from the deÞnition of threshold decomposition that the set of thresholded binary vectors satisfy the partial ordering [xi ; si ] ≤ [x j ; s j ]
if i ≥ j
That is [xi ; si ] ∈ {−1, +1}2N stack (i.e., xki ≤ 2, . . . , N ).
j xk
and ski ≤
(14) j sk
for all k = 1,
In addition, since the vector S is the mirror of X, a partial ordering relation exists between the two sets of thresholded signals. With X i = −Si , the thresholded samples satisfy s M−ℓ = −x −M+1+ℓ and x M−ℓ = −s −M+1+ℓ for ℓ = 0, 1, . . . , 2M − 1. As an example, the representation of the vector X = [2, −1, 0, −2, 1, 2, 0] in the binary domain of mirrored threshold decomposition is x2 x1 x0 x−1
= = = =
[1, −1, −1, −1, −1, 1, −1]T [1, −1, −1, −1, 1, 1, −1]T [1, −1, 1, −1, 1, 1, 1]T [1, 1, 1, −1, 1, 1, 1]T
s2 s1 s0 s−1
= = = =
[−1, −1, −1, 1, −1, −1, −1]T [−1, 1, −1, 1, −1, −1, −1]T [−1, 1, 1, 1, −1, −1, 1]T [−1, 1, 1, 1, 1, −1, 1]T
For notational convenience, the mirrored threshold decomposition of a vector X at level m will be denoted by a 2N-component binary vector where the Þrst N components correspond to the binary vector xm , whereas the last N components correspond to the binary vector sm , namely T m (X) = [(xm )T ; (sm )T ]T
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 181
Furthermore, we denote [xT ; sT ]T as the mirrored threshold decomposition of X at an arbitrary threshold level. It will be shown in Section VIII that not all 22N binary vectors of length 2N can be outputted by the threshold operator. Mirrored threshold decomposition has been formulated to be used with integer-valued signals since it is much easier to understand intuitively. However, mirrored threshold decomposition can also be extended to admit realvalued signals. Consider the real-valued vector X = [X 1 , X 2 , . . . , X N ]T . Mirrored threshold decomposition maps this real-valued vector to an inÞnite set of binary vectors xm ∈ {−1, 1} N , sm ∈ {−1, 1} N , m ∈ (−∞, +∞), where
and
xm = [sgn(X 1 − m), sgn(X 2 − m), . . . , sgn(X N − m)]T T = x1m , x2m , . . . , x Nm
(15)
sm = [sgn(−X 1 − m), sgn(−X 2 − m), . . . , sgn(−X N − m)]T T = s1m , s2m , . . . , s Nm
(16)
Since threshold decomposition is invertible, X can be reconstructed from its binary representation as 1 +∞ m x dm i = 1, 2, . . . , N (17) Xi = 2 −∞ i Since m can take any real value, the inÞnite set of binary vectors {[(xm )T ; (s ) ] } seems redundant in representing the real-valued vector X. As will be seen later, threshold signal representation can be simpliÞed based on the fact that there are at most 2N + 1 different binary vectors {[(xm )T ; (sm )T ]T } for each observation vector X. m T T
B. Stack Filters Much as traditional threshold decomposition has led to the deÞnition of stack smoothers, mirrored threshold decomposition leads to the deÞnition of a richer and more powerful class of nonlinear Þlters referred to as stack Þlters (Paredes and Arce, 1999). DeÞnition III.1 Given an integer-valued vector X = [X 1 , X 2 , . . . , X N ]T , X i ∈ {−M, . . . , 0, . . . , M}, and its corresponding mirrored threshold decomposition x−M+1 , . . . , x0 , . . . , x M , and s−M+1 , . . . , s0 , . . . , s M , the output of a
182
JOSE« L. PAREDES AND GONZALO R. ARCE
stack Þlter is deÞned as follows: S f (X 1 , X 2 , . . . , X N ) =
M 1 f x1m , x2m , . . . , x Nm ; s1m , s2m , . . . , s Nm (18) 2 m=−M+1
where f : {−1, +1}2N −→ {−1, +1} is a 2N -variable positive Boolean function that satisÞes the stacking property. That is, if u ∈ {−1, +1}2N and v ∈ {−1, +1}2N stack (i.e., u k ≤ vk for k = 1, 2, . . . , 2N ), then their respective outputs stack: f (u) ≤ f (v)
(19)
The stacking property, Eq. (19), ensures that the outputs on different threshold levels are consistent. Thus, if the Boolean function outputs +1 on a given level m, it must output +1 on all levels less than m, or if the Boolean function outputs −1 on a given level m, it must output −1 on all levels greater than m. Note that f (·), in Eq. (18), maps a 2N -component binary vector to {−1, 1}. Therefore, f (·) can be deÞned in the binary domain of mirrored threshold decomposition as a disjunction of conjunction of the binary variables x1 , . . . , x N ; s1 , . . . , s N with no complements, where the OR and AND Boolean operators are deÞned in the binary domain of mirrored threshold decomposition, respectively, as xi + x j = max(xi , x j )
(20)
xi x j = min(xi , x j )
(21)
A different, but equivalent, way to deÞne a stack Þlter in the binary domain of mirrored threshold decomposition is through the use of a Boolean table lookup, namely a truth table with N∗ entries, where each entry corresponds to a possible binary input. In this case, the PBF f (·) can be represented as a length N ∗ vector z = [z 1 , z 2 , . . . , z N ∗ ]T
= [ f (ξ 1 ), f (ξ 2 ), . . . , f (ξ N ∗ )]T
(22)
where each z i = f (ξi ) ∈ {−1, 1} satisÞes the stacking property, Eq. (19), and ξi ∈ {−1, +1}2N is a 2N-component binary vector that can be outputted by the threshold operator. Note that each component of the decision vector z is related through f (·) to a 2N-component binary vector. It will be seen in Section VIII that an optimization algorithm for the design of stack Þlters Þnds the optimal values for the decision vector z. As deÞned in Eq. (18), stack Þlter input signals are assumed to be quantized to a Þnite number of signal levels. This deÞnition, however, can be extended to admit real-valued input signals.
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 183
DeÞnition III.2 (Continuous Stack Filters) Given a set of N real-valued samples X = [X 1 , X 2 , . . . , X N ]T , the output of a stack Þlter deÞned by a PBF f (·) is given by S f (X) = max[ℓ ∈
R:
f (T ℓ (X 1 ), . . . , T ℓ (X N );
T ℓ (−X 1 ), . . . , (T ℓ (−X N )) = 1]
(23)
where the thresholding function T ℓ (·) is deÞned in Eq. (11) and R denotes the set of real numbers. The link between the continuous stack Þlter S f (·) and the corresponding PBF f (·) is given by the following property. Property III.1 Let X = [X 1 , X 2 , . . . , X N ]T and S = [−X 1 , −X 2 , . . . , −X N ]T be, respectively, a real-valued vector and its corresponding mirrored vector that are inputted to a stack Þlter S f (·) deÞned by the positive Boolean function f (x1 , . . . , x N ; s1 , . . . , s N ). The PBF with the sum-of-products expression K f (x1 , . . . , x N ; s1 , . . . , s N ) = x j sk i=1
j∈Pi ,k∈Q i
where Pi and Q i are subsets of {⭋, 1, . . . , N }, has the stack Þlter representation 5 S f (X) = max min{X j Sk : j ∈ P1 , k ∈ Q 1 }, . . . , 6 min{X j Sk : j ∈ PK , k ∈ Q K } with Pi and Q i not being the empty set, ⭋, at once.
Thus, given a positive Boolean function f (x1 , . . . , x N ; s1 , . . . , s N ) which characterizes a stack Þlter in the binary domain of mirrored threshold decomposition, it is possible to Þnd the equivalent Þlter in the real domain by replacing the binary AND and OR operations acting on the xi Õs and si Õs with min and max operations acting on the real-valued X i and Si samples. For example, consider the PBF f (x1 , x2 , x3 ; s1 , s2 , s3 ) = x1 s2 + x2 s3 = x3 s1 ; in the domain of real numbers, this stack Þlter can be expressed as S(X 1 , X 2 , X 3 ) = max(min(X 1 , S2 ), min(X 2 , S3 ), min(X 3 , S1 )) IV. Integer Domain Filters of Linearly Separable PBFs Stack Þlters and stack smoothers both use PBFs in the binary domain with the difference that stack Þlters use a mirrored threshold decomposition while stack
184
JOSE« L. PAREDES AND GONZALO R. ARCE
smoothers use the standard threshold decomposition architecture. In general, stack Þlters and stack smoothers can be implemented by maxÐminnetworks in the integer domain. Although simple in concept, maxÐminnetworks lack an intuitive interpretation. However, if the PBFs in the stack Þlter representation are further constrained, a number of more appealing Þlter structures emerge. These Þlter structures are more intuitive to understand and, in many ways, they are similar to linear FIR Þlters. Yli-Harja et al. (1991) describe the various types of stack smoothers attained when the PBFs are constrained to be linearly separable. Weighted order statistic (WOS) smoothers and weighted median smoothers are, for instance, obtained if the PBFs are restricted to be linearly separable and self-dual linearly separable, respectively. A Boolean function f (x) is said to be linearly separable if and only if it can be expressed as N Wi xi − T (24) f (x1 , . . . , x N ) = sgn i=1
where xi are binary variables, and the weights Wi and threshold T are nonnegative and real-valued (Sheng, 1969). A self-dual linearly separable Boolean function is deÞned by further restricting Eq. (24) as N Wi xi (25) f (x1 , . . . , x N ) = sgn i=1
A Boolean function f (x) is said to be self-dual if and only if f (x1 , x2 , . . . , x N ) = 1 implies f (xø1 , xø2 , . . . , xøN ) = −1, and f (x1 , x2 , . . . , x N ) = −1 implies f (xø1 , xø2 , . . . , xøN ) = 1, where xø denotes the Boolean complement of x (Muroga, 1971). Furthermore, it has been shown in Yli-Harja et al. (1991) that if linearly separable PBFs are further restricted to be isobaric∗ and self-dual isobaric, the integer domain stack smoothers reduce to order statistic (OS) smoothers and to the median smoother, respectively. Within the mirrored threshold decomposition representation, a similar strategy can be taken where the separable Boolean functions are progressively constrained, which leads to a series of stack Þlter structures that can be easily implemented in the integer domain. In particular, weighted order statistic Þlters, weighted median Þlters, and order statistic Þlters emerge by appropriately selecting the appropriate PBF structure in the binary domain. Figure 3 depicts the relationship among subclasses of stack Þlters and stack Þlters and stack smoothers. As this Þgure shows, stack Þlters are much richer than stack smoothers. The class of WOS Þlters, for example, contains all WOS and OS smoothers, whereas even the simplest OS Þlter is not contained in the entire class of stack smoothers. ∗
Isobaric refers to the special case where all weights Wi are set to unity.
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 185
Figure 3. Relationship among subclasses of stack Þlters and stack smoothers. (Reproduced C 1999 IEEE.) with permission from Paredes and Arce 1999,
The analysis of each type of linearly separable Boolean function and its relationship to the equivalent integer domain Þlter can be done by following the theory established by Yli-Harja et al. (1991). Such analysis is not included here as it is a straightforward generalization of the results in Yli-Harja et al. (1991). Once the linearly separable functions have been determined, however, there are a number of differences between the methods described in Yli-Harja et al. (1991) for WOS smoothers and the methods applicable to WOS Þlters. These differences emerge as a consequence of the different characteristics of the threshold decomposition representations used. The remainder of this section focuses on the analysis of the properties of mirrored threshold decomposition and its effects on the equivalent integer domain Þlters. Without loss of generality, we restrict the discussion to self-dual linearly separable PBFs, which deÞne the class of weighted median Þlters admitting real-valued weights. A. Weighted Median (WM) Filters WM Þlters are generated if the positive Boolean function that deÞnes the stack Þlter in Eq. (18) is constrained to be self-dual and linearly separable. Thus, in
186
JOSE« L. PAREDES AND GONZALO R. ARCE
the binary domain of mirrored threshold decomposition, WM Þlters are deÞned in terms of the thresholded vectors xm = [x1m , . . . , x Nm ]T and the corresponding thresholded mirror vector sm = [s1m , . . . , s Nm ]T as S(X 1 , X 2 , . . . , X N ) =
M 1 sgn(WT xm + |H|T sm − T0 ) 2 m=−M+1
(26)
where W = [W1 , W2 , . . . , W N ]T and |H| = [|H1 |, |H2 |, . . . , |HN |]T are 2N positive-valued weights which uniquely characterize the WM Þlter. The constant T0 is 0 or 1 if the weights are real-valued or integer-valued adding up to an odd integer, respectively. The symbol | · | represents the absolute-value operator and is used in the deÞnition of binary domain WM Þlters for reasons that will become clear shortly. The role that Wi Õs and |Hi |Õs play in WM Þltering is very important as is described next. Since the threshold logic gate sgn(·) in Eq. (26) is self-dual and linearly separable, and since the xim and sim respectively represent X i and its mirror sample Si = −X i , the integer domain representation of Eq. (26) is given by (Muroga, 1971; Paredes and Arce, 1999; Yli-Harja et al., 1991) Y = MEDIAN(W1 ⋄ X 1 , |H1 | ⋄ S1 , . . . , W N ⋄ X N , |HN | ⋄ S N )
(27)
where Wi ≥ 0 and |Hi | ≥ 0, and where ⋄ is the replication operator deÞned as Wi times 2 30 1 Wi ⋄ X i = X i , X i , . . . , X i . At this point it is convenient to associate the sign of the mirror sample Si with the corresponding weight Hi as Y = MEDIAN(W1 ⋄ X 1 , H1 ⋄ X 1 , . . . , W N ⋄ X N , HN ⋄ X N ) = MEDIAN( W1 , H1 ⋄ X 1 , . . . , W N , HN ⋄ X N )
(28) (29)
with Wi ≥ 0 and Hi ≤ 0, where we have used the equivalence Hi ⋄ X i = |Hi | ⋄ sgn(Hi )X i (Arce, 1998; Paredes and Arce, 1999). Thus, weighting in the WM Þlter structure is equivalent to uncoupling the weight sign from its magnitude, merging the sign with the observation sample, and replicating the ÒsignedÓsample according to the magnitude of the weight (Arce, 1998). Notice that each sample X i in Eq. (27) is weighted twiceÑonce positively by Wi and once negatively by Hi . In Eq. (29), the double weight Wi , Hi is deÞned to represent the positive and negative weighting of X i . In Arce (1998), a single real-valued weight was associated with each observation sample, in much the same way linear FIR Þlters only use N Þlter weights. Double weighting emerges, however, through the analysis of mirrored threshold decomposition and the stack Þlter representation. At Þrst, the WM Þlter structure in Eq. (29) seems redundant. After all, linear FIR Þlters only require a set of N weights, albeit real-valued. The reason for this is the
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 187
associative property of the sample mean. As shown in Arce (1998), the linear Þlter structure analogous to Eq. (28) is Yø = MEAN(W1 · X 1 , |H1 | · S1 , . . . , W N · X N , |HN | · S N ) = MEAN((W1 + H1 ) · X 1 , . . . , (W N + HN ) · X N )
(30) (31)
where Wi ≥ and Hi ≤ 0 collapse to a single real-valued weight in Eq. (31). For the sample median, however, N N = MEDIAN (Wi + Hi ) ⋄ X i |i=1 (32) MEDIAN Wi , Hi ⋄ X i |i=1
thus the weight pair Wi , Hi is needed in general. Weighted median Þlters have an alternate interpretation. Extending the concepts in Arce (1998), we can show that the WM Þlter output in Eq. (29) is the value β minimizing the cost function G 1 (β) =
N i=1
(Wi |X i − β| + |Hi ||X i + β|)
(33)
where β can only be one of the samples X i or one of the mirrored samples Si since Eq. (33) is piecewise linear and convex. Figure 4 depicts the effects of double weighting in WM Þltering where the absence of double weighting,
Wi , Hi , distorts the shape of the cost function G 1 (β) and can lead to a distorted global minimum.
Figure 4. An observation vector [X 1 , X 2 , X 3 , X 4 , X 5 ] = [−7, −2, 1, 5, 8] Þltered by the two sets of weights 3, 2, 2, −3, 1 (solid line) and 3, 2, 2, 2, −3, 1 (dashed line), respectively. Double weighting of X4 shows the distinct cost function and minimum attained. (Reproduced C 1999 IEEE.) with permission from Paredes and Arce 1999,
188
JOSE« L. PAREDES AND GONZALO R. ARCE
Some interesting properties of the WM Þlter emerge naturally as a consequence of the mirrored threshold decomposition. The proofs of these properties can be found in Paredes and Arce (1999). In the following, the WM Þlter is speciÞed by the weight pairs Wi , Hi , Wi ≥ 0, Hi ≤ 0, i = 1, . . . , N . Property IV.1 Filtering the input observation vector X by a WM Þlter with N weights Wi , Hi |i=1 produces an output whose mirror is equal to the output generated by the same WM Þlteracting over the corresponding mirrored input observations N N = MEDIAN Wi , Hi ⋄ Si |i=1 (34) −MEDIAN Wi , Hi ⋄ X i |i=1
Remark IV.1 Property IV.1 establishes that the WM Þltering operation can be thought of as an odd nonlinear function since Eq. (34) can be rewritten as N N ) = −MEDIAN( Wi , Hi ⋄ −X i |i=1 ). MEDIAN( Wi , Hi ⋄ X i |i=1
Property IV.2 If an observation vector X is Þltered by a WM Þlter with N weights Wi , Hi |i=1 , the mirror sample of the resultant value is equeal to N acting over the same the output of a WM Þlter with weights −Hi , −Wi |i=1 observation vector X, that is, N N (35) = MEDIAN −Hi , −Wi ⋄ X i |i=1 −MEDIAN Wi , Hi ⋄ X i |i=1
Since WM Þlters are characterized in the binary domain by positive Boolean functions, it turns out that there is a close relationship between the PBF of a WM Þlter with weights Wi , Hi and the PBF of a WM Þlter with weights
−Hi , −Wi . The following property describes this relationship.
Property IV.3 The PBF g(x; s) of a WM Þlter with weights −Hi , −Wi can be directly obtained from the PBF f (x; s) of a WM Þlter with weights
Wi , Hi . The PBF g(x; s) is found by interchanging xi and si whenever they appear in f (x; s). For example, if f (x; s) = x1 x2 + x1 s3 + x2 s3 then g(x; s) = s1 s2 + s1 x3 + s2 x3 . Since g(x; s) is derived from f (x; s), g(x; s) is referred to as the mirrored PBF of f (x; s). This notation emphasizes the fact that if the same input vector X is Þltered by the WM Þlters with PBFs f (x; s), and g(x; s), the output of the latter Þlter is the mirror sample of the output of the former Þlter. Before we end this section, the ßexible Þltering capabilities of WM Þlters should be demonstrated. In particular, Figure 5b shows the output of a highpass WM Þlter when the two-tone signal of Figure 5a is inputted. The two-tone signal has normalized frequencies of 0.015 and 0.5 Hz and isÞltered by the WM Þlter with output Yk = MEDIAN Wℓ , Hℓ ⋄ X k+ℓ |3ℓ=−3 , where X k is the center sample of the moving window, and the set of weights Wℓ , Hℓ |3ℓ=−3
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 189
Figure 5. Example of a WM Þlter acting as a Òhigh-passÓÞlter. (a) The two-tone input signal. (b) The WM Þlter output. (Reproduced with permission from Paredes and Arce 1999, C 1999 IEEE.)
are, respectively, 1, −2, 1, 0, 1, −1, 1, −1, 1, −1, 1, 0, 3, 0. Note that the low-frequency component has been Þltered out. In later sections, we will learn how to obtain the Þlter coefÞcients. Finally, it should be mentioned that the deÞnitions of WOS Þlters and OS Þlters, from the corresponding PBFs in the mirrored threshold decomposition, follow a procedure similar to that used for WM Þlters. For instance, WOS Þlters are deÞned as Y = T ′ th LARGEST( W1 , H1 ⋄ X 1 , . . . , W N , HN ⋄ X N )
(36)
with Wi ≥ 0 and Hi ≤ 0. OS Þlters have a deÞnition similar to that of WOS Þlters, except that all weights Wi and |Hi | are set equal to 1 and only the threshold T can be varied.
V. Analysis of WM Filters Using Threshold Logic As described in Section III, WM Þlters can be completely characterized in the binary domain by a self-dual linearly separable PBF. Binary representations can be used to better understand the properties of the WM Þlters. However, a
190
JOSE« L. PAREDES AND GONZALO R. ARCE
real implementation of a WM Þlter in the binary domain requires a considerable number of binary operations (Wendt et al., 1986). Therefore, conversion methods which map the ÞlterÕs binary representation to its integer-valued representation, and vice-versa, are needed. In this section, we present Þrst the problem of Þnding the PBF corresponding to an integer-valued WM Þlter with weights Wi , Hi , and second the problem of Þnding the integer-valued WM Þlter corresponding to a given PBF.
A. Finding the PBF Corresponding to a Weighted Median Filter In the binary domain, WM Þlters are characterized by self-dual linearly separable PBFs which, as shown in Eq. (26), can be efÞciently expressed as threshold logic gates. The argument of the threshold logic function, WT x + |H|T s, deÞnes a 2N -dimensional hyperplane that separates the binary vectors (x; s) = (x1 , . . . , x N ; s1 , . . . , s N ) for which f (x; s) = +1 where WT x + |H|T s ≥ 0, from those binary vectors (x; s) for which f (x; s) = −1 where WT x + |H|T s < 0. Therefore, to Þnd the Boolean function corresponding to a WM Þlter, we must determine the minimum sum-of-products of all the binary vectors (x; s) for which WT x + |H|T s ≥ 0. A simple method to attain this was described in Yli-Harja et al. (1991) for the analysis of WM smoothers. For WM Þlters, that method can be modiÞed as follows. In mirrored threshold decomposition, two sets of signals are present: (1) the thresholded signals xim emerging from the X i Õs and (2) the thresholded signals sim associated with the mirror Si samples. In determining the equivalence between the integer and binary domains, the xim Õs are associated with the positive weights, Wi , whereas the absolute values of the negative weights, |Hi |, are linked to the sim signals. The problem of Þnding the PBFs corresponding to a given WM Þlter with the set of weights {W, H} reduces thus to combine in a logical OR operation all the possible logical products of the binary samples whose sum of the corresponding weights (magnitude values) is greater than N (Wi + |Hi |)/2. The resulting sum-of-products or equal to the threshold ( i=1 can then be minimized by using rules of Boolean algebra. This procedure, as shown in Paredes and Arce (1999), is equivalent to Þnding the corresponding PBF by minimizing the sum-of-products of all the binary vectors (x; s) that satisfy WT x + |H|T s ≥ 0. Example V.1 Consider the weighted median Þlter Y = MEDIAN(W1 ⋄ X 1 , W2 , H2 ⋄ X 2 , W3 ⋄ X 3 ) = MEDIAN(1 ⋄ X 1 , 2, −1 ⋄ X 2 , 1 ⋄ X 3 )
(37) (38)
The logical products and their corresponding sum of weights that are greater
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 191
than or equal to the threshold (equal to 2.5) are listed as follows: x1 x2 x3 s2 x1 x2 x3 x1 x2 s2 x1 x3 s2
=⇒ =⇒ =⇒ =⇒
W1 + W2 + W3 + |H2 | W1 + W2 + W3 W1 + W2 + |H2 | W1 + W3 + |H2 |
x1 x2 x2 x3 s2 x 2 s2 x2 x3
=⇒ =⇒ =⇒ =⇒
W1 + W2 W2 + W3 + |H2 | W2 + |H2 | W2 + W3
where the thresholded mirrored sample s2 is associated with the absolutevalued weight |H2 |. The positive Boolean function is formed by combining these products as f (x; s) = x1 x2 x3 s2 + x1 x2 x3 + x1 x2 s2 + x1 x3 s2 + x2 x3 s2 + x1 x2 + x2 s2 + x2 x3
(39)
After some Boolean algebra simpliÞcations, Eq. (39) becomes f (x; s) = x1 x3 s2 + x1 x2 + x2 s2 + x2 x3 . Note that the Boolean representation in Eq. (39) contains the term s2 , in addition to all the xi Õs. This is because in the integer domain WM Þlter representation, only the sample X 2 is given a positive weight and a negative weight. Note also that Þnding the PBF corresponding to a WM Þlter with a single real-valued weight per sample as in Arce (1998) can be thought of as a special case of the preceding where either Wi or Hi is set to zero. For instance, the PBF related to the 3-point median Þlter with weights W1 , H2 , W3 =
1, −1, 1 is x1 s2 + x1 x3 + x1 s2 . B. Finding the Weighted Median Filter Corresponding to a PBF Here we present the reverse problem of that described in the previous subsection. That is, given a positive Boolean function f (x; s), the goal is to Þnd the weights Wi and |Hi | of the corresponding WM Þlter such that f (x; s) can be realized by the single threshold logic element sgn(WT x + |H|T s). Consequently, the WM Þlter in the integer domain corresponding to the PBF can be implemented by Eq. (26). Thus the goal here is to Þnd the hyperplane that divides the 2N -dimensional space such that the set {x; s} = {(x; s) ∈ {−1, +1}2N | f (x; s) = 1} lies on one side of the hyperplane and the set {x; s} = {(x; s) ∈ {−1, +1}2N | f (x; s) = −1} lies on the other side (Hu, 1965). Recall that a Boolean function is said to be linearly separable if it is possible to Þnd weights (W, |H|) such that f (x; s) can be expressed as +1 if WT x + |H|T s ≥ T ′ (40) f (x; s) = −1 if WT x + |H|T s < T ′
192
JOSE« L. PAREDES AND GONZALO R. ARCE
where W = [W1 , . . . , W N ]T and |H| = [|H1 |, . . . , |HN |]T are nonnegative real-valued weights, x = [x1 , . . . , x N ]T and s = [s1 , . . . , s N ]T are threshold binary signals, and T ′ is a nonnegative real number to be deÞned shortly. Furthermore, recall that f (x; s) in Eq. (40) will lead to a weighted median Þlter in the integer domain if and only if f (x; s) is a self-dual function. Theorems V.1 and V.2, whose proofs can be found in Paredes and Arce (1999), show two particular cases for which f (x; s) is a self-dual function. Theorem V.1 A PBF that satisÞes Eq. (40) using strict inequality is a selfdual PBF if T ′ = 0.
Theorem V.2 A PBF given by Eq. (40) is a self-dual PBF for T ′ = 1, if the N weights Wi , |Hi | are constricted to be nonnegative integers, with i=1 (Wi + |Hi |) odd.
The problem of Þnding the weighted median Þlter corresponding to a positive Boolean function is formally addressed next. Let f (x; s) be a positive Boolean function represented by its minimum sumof-products f (x; s) = π1 + π2 + . . . + π p
(41)
where πi is a minterm given by the logical product of some binary variables x j ∈ {−1, +1} and/or s j ∈ {−1, +1}. For instance, πi could be x1 x2 x3 or x1 s1 x2 or s1 s2 s3 . Remarks V.1 through V.3 hold directly from the deÞnitions. Remark V.1 Since f (x; s) is a PBF, the πi for i = 1, 2, . . . , p contain no complements of binary variables. Remark V.2 Since f (x; s) is given by its minimum sum-of-products, the πi for i = 1, 2, . . . , p are disjunctive terms and they are sufÞcient to completely deÞne f (x; s) (Muroga, 1971). Remark V.3 If in a binary vector (x; s) = [x1 , . . . , x N ; s1 , . . . , s N ] all the variables presented in any one of the πi Õs are set to +1, then by deÞnition of minterm, f (x; s) = +1, regardless of the values taken by the other variables of the vector (x; s) that are not present in πi . Next assume that f (x; s) is linearly separable, such that it can be realized by Eq. (40) with an appropriate set of weights W, |H|. Associate the weights Wi and |Hi | to the binary variables xi and si , respectively (i.e., xi is related to Wi and si is related to |Hi | for i = 1, 2, . . . , N , where N is the largest index of the binary variables presented in f (x; s)). Since f (x; s) has to satisfy Eq. (40) and if we invoke Remarks V.1 and V.3, the sum of the weights related to the binary variables presented in πi has to be greater than or equal to the difference between T ′ and the sum of the weights related to the binary variables
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 193
not present in πi . Thus, the problem of Þnding the WM Þlter corresponding to a positive Boolean function f (x; s) reduces to solving an optimization problem in 2N -dimensional space, where the solution W1 , . . . , W N , |H1 |, . . . , |HN | has to satisfy not only the inequality constraints generated by each logical product, πi , but the restriction of taking only positive values. These constraints can be written as follows: αi1 W1 + · · · + αi N W N + ξi1 |H1 | + · · · + ξi N |HN | ≥ T ′
i = 1, 2, . . . , p
W1 ≥ 0, W2 ≥ 0, . . . , W N ≥ 0 where the αi j and ξi j satisfy, respectively, ⎧ if x j ∈ πi ⎨ 1 if x j ∈ / πi and αi j = −1 ⎩ 0 / f (x; s) if x j ∈ ξi j
⎧ ⎨ 1 = −1 ⎩ 0
(42)
x j ∈ f (x; s)
if s j ∈ πi if s j ∈ / πi and s j ∈ f (x; s) / f (x; s) if s j ∈
Thus αi j = 1 (ξi j = 1) if x j (s j ) is present in πi ; αi j = −1 (ξi j = −1) if x j (s j ) is not present in πi ; and αi j = 0 (ξi j = 0) if x j (s j ) is not present in any πi Ñfor i = 1, . . . , p and j = 1, . . . , N . The values taken by αi j (ξi j ) take into account the fact that the PBF may not be a function of the xi Õs(si Õs). For instance, the PBF may be a function of x1 but not a function of s1 . The preceding results in an optimization problem with fewer unknowns since the weights corresponding to the missing variables are set to zero (Nieweglowski et al., 1993). Remark V.2 states that the πi for i = 1, 2, . . . , N are disjunctive terms, thus each inequality constraint in Eq. (42) deÞnes a region in the 2N -dimensional space. Each region is not completely contained in any other region deÞned by another constraint. Unnecessary constraints emerge if the positive Boolean function is not deÞned by the minimum sum-of-products. We are interested in Þnding, among the solutions of system of inequalities Eq. (42), the solution that minimizes a certain cost function. As in Yli-Harja et al. (1991), we choose the sum of weights as the cost function. Thus, the optimization problem reduces to N Wi + |Hi | (43) min Wi ,|Hi |
i=1
subject to the constraints given by Eq. (42).
194
JOSE« L. PAREDES AND GONZALO R. ARCE
This optimization problem can be enormously simpliÞed if the nonlinear operators | · | are pulled out of the cost function in Eq. (43), as well as out of the constraints given by Eq. (42), and a new cost function and set of constraints are deÞned. This can be easily done if |Hi | is redeÞned as Hi+ and the set of constraints Hi+ ≥ 0, i = 1, 2, . . . , N , are added to Eq. (42). This results in the following optimization: N Wi + Hi+ (44) min+ Wi ,Hi
i=1
subject to αi1 W1 + · · · + αi N W N + ξi1 H1+ + · · · + ξi N HN+ ≥ T ′ W1 ≥ 0, W2 ≥ 0, . . . , W N ≥ 0,
H1+
≥ 0,
H2+
i = 1, 2, . . . , p
≥ 0, . . . , HN+ ≥ 0 (45)
The optimization in Eqs. (44) and (45) reduce to a well-known linear programming problem. Methods of solving the linear programming problem can be found in Hu (1965), Muroga (1971), Sheng (1969), and Zukhovitskiy and Avdeyeva (1966). If the conditions of Theorem V.1 are used in Eq. (45) (i.e., T ′ = 0 and strict inequalities are used in the inequality constraints generated by each logic product), it is impossible to Þnd a solution to the linear programming problem since the feasible region is an open set (Nieweglowski et al., 1993). To overcome this problem, we restrict our solution to the case stated in Theorem V.2 (i.e., integerN (Wi + Hi+ ) is an odd number). This, however, valued weights and where i=1 is not a true limitation, since it can be proved that integer-valued weighted median Þlters are equivalent to real-valued weighted median Þlter realization as long as both synthesize the same positive Boolean function (Nieweglowski et al., 1993). By equivalence we mean both Þlters produce the same output signals under the same input signals. Once the linear programming problem has been solved, the weighted median Þlter in the integer domain corresponding to the given positive Boolean function is formed as Y = MEDIAN W1 , −H1+ ⋄ X 1 , W2 , −H2+ ⋄ X 2 , . . . , W N , −HN+ ⋄ X N (46)
The following example illustrates this procedure. Example V.2 Find the WM Þlter corresponding to the self-dual PBF given by f (x; s) = x1 x3 s2 + x1 x3 s3 + x3 s2 s3 + x1 x4 s2 + x1 x4 s3 + x3 x4 + s2 s3 x4
(47)
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 195
Clearly, the PBF in Eq. (47) does not depend on s1 , x2 , and s4 , therefore their corresponding weights, H1+ , W2 , and H4+ , are set to zero, thus the optimization problem is reduced to a Þve-dimensional space instead of an eight-dimensional space. Next we list the logical products, πi Õs, and their corresponding constraints that they generate according to Eq. (42): x 1 x 3 s2 x3 s2 s3 x 1 x 4 s3 s2 s3 x4 x 1 x 3 s3 x 1 x 4 s2 x3 x4
=⇒ W1 + W3 − W4 + H2+ − H3+ =⇒ −W1 + W3 − W4 + H2+ + H3+ =⇒ W1 − W3 + W4 − H2+ + H3+ =⇒ −W1 − W3 + W4 + H2+ + H3+ =⇒ W1 + W3 − W4 − H2+ + H3+ =⇒ W1 − W3 + W4 + H2+ − H3+ =⇒ −W1 + W3 + W4 − H2+ + H3+
≥1 ≥1 ≥1 ≥1 ≥1 ≥1 ≥1
(48)
Finding the weights W1 , W3 , W4 , H2+ , and H3+ reduces to minimize the sum of weights W1 + W3 + W4 + H2+ + H3+ subject to the constraints given in Eq. (48) and W1 ≥ 0, W3 ≥ 0, W4 ≥ 0, H2+ ≥ 0, H3+ ≥ 0. The solution using linear programming is W1 = 1, W3 = 2, W4 = 2, H2+ = 1, and H3+ = 1, and by Eq. (46) the integer-valued WM Þlter corresponding to f (x; s) is as follows: Y = MEDIAN( 1, 0 ⋄ X 1 , 0, −1 ⋄ X 2 , 2, −1 ⋄ X 3 ,
2, 0 ⋄ X 4 ). VI. Recursive Weighted Median Filters and Their Nonrecursive WM Filter Synthesis Thus far we have presented stack Þlters where the Þlter output is computed based on input samples only. The underlying stack Þltering operations are thus nonrecursive, as prior outputs do not affect future outputs. The stack Þltering characteristics can be signiÞcantly enriched if the previous outputs are taken into account to compute future outputs. Recursive stack Þlters, taking advantage of prior outputs, exhibit a signiÞcant advantage over their nonrecursive counterparts, particularly if negative as well as positive weights are used. Recursive WM (RWM) Þlters and recursive WOS Þlters are special cases of recursive stack Þlters, which can be thought of as analogous to linear inÞnite impulse response (IIR) Þlters but with improved robustness and stability characteristics. Recursive WM Þlter structures admitting weight pairs per sample are particularly important due to the fact that they can be used to model resonances which appear in many natural phenomena such as speech. Moreover, recursive WM Þlters often lead to computational complexity reduction since they can synthesize nonrecursive WM Þlters of much larger window size.
196
JOSE« L. PAREDES AND GONZALO R. ARCE
DeÞnition VI.1 Given an N -input observation Xk = [X k−L , . . . , X k , . . . , X k+L ], the recursive counterpart of Eq. (26) is obtained by replacing the leftmost L samples of the input vector Xk with the previous L output samples Yk−L , Yk−L+1 , . . . , Yk−1 (Nodes and Gallagher, 1982). Thus Eq. (26) becomes Yk =
M 1 sgn WTR rkm + |H R |T zmk − T0 2 m=−M+1
(49)
m m m m m , . . . , sk−1 , ]T and zmk = [sk−L , . . . , yk−1 , xkm , . . . , xk+L where rkm = [yk−L m m T sk , . . . , sk+L ] are the mirrored threshold decomposition of the redeÞned input observation X′k = [Yk−L , . . . , Yk−1 , X k , . . . , X k+L ]T · W R = [W R−L , . . . , W R0 , . . . , W R L ]T and |H R | = [|H R−L |, . . . , |H R0 |, . . . , |H R L |]T are 2N nonnegative Þlter weights with N = 2L + 1, and T0 is as deÞned in Eq. (26).
Clearly, Yk in Eq. (49) is a function of previous outputs as well as the input signal. Unlike linear IIR Þlters, recursive WM Þlters are always stable under the bounded-input bounded-output (BIBO) criterion regardless of the values taken by the Þlter coefÞcients. Property VI.1 RWM Þlters, as deÞnedin Eq. (49), are stable under the BIBO criterion, regardless of the values taken by the feedback weights W Rℓ , |H Rℓ | for ℓ = −1, . . . , −L . Proof. Given a bounded input signal X n such that |X n | < Mx , and initial conditions Y−k , k = 1, . . . , L , denote M y = max(|Y−L |, . . . , |Y−1 |) and let M = max(Mx , M y ). Since Y0 is the output of a median Þlter with inputs [Y−L , . . . , Y−1 , X 0 , . . . , X L ], Y0 is restricted to the dynamic range of these inputs. But |Y−k | < M for k = 1, . . . , L , and |X n | < M for all n, hence, we have |Y0 | < M. If we use this argument recursively for [Y1 , Y2 , . . .] with Yi depending on [Yi−L , . . . , Yi−1 , X i , . . . , X i+L ], i = 1, 2, . . . , it follows by induction that |Yn | < M for all n, hence, the output is bounded. Like their nonrecursive counterparts, recursive WM Þlters can be represented in the binary domain by a positive Boolean function f (rk ; zk ) having feedback elements in its structure. For instance, the recursive 3-point WM Þlter Yk = MEDIAN( W R−1 , H R−1 ⋄ Yk−1 , W R0 , H R0 ⋄ X k , W R1 , H R1 ⋄ X k+1 ) = MEDIAN( 1, 0 ⋄ Yk−1 , 1, 0 ⋄ X k , 0, −1 ⋄ X k+1 )
(50)
has the following recursive PBF: yk = f (rk ; zk ) = yk−1 xk + yk−1 sk+1 + xk sk+1
(51)
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 197
Note in Eq. (51) that the Boolean expression contains the term yk−1 , which is the binary representation of the previous output Yk−1 , producing a feedback operation over the output at time k. Note also that the conversion between recursive positive Boolean functions and integer-valued recursive WM Þlters follows the same procedures as those described in Section V. For short notation, RWM Þlters are denoted with double brackets and the weight related to the center sample of the window is underlined. For instance, the RWM Þlter given by Eq. (50) becomes
1, 1, −1. Recursive WM Þlters can be used to synthesize a nonrecursive WM Þlter of much larger window size. To date, there is not a known method of computing the recursive WM Þlter equivalent to a nonrecursive one. However, a method, in the binary domain, can be used to Þnd nonrecursive WM Þlter approximations of a recursive WM Þlter. The method consists of replacing the feedback terms in the PBF of the recursive WM Þlter with input binary samples (i.e., yk−1 is replaced by xk−1 ; yk−2 is replaced by xk−2 ; and so on). Thus the resultant binary expression is nonrecursive, depending on binary input samples only. The next example illustrates this procedure. Example VI.1 Find the nonrecursive WM approximation of the recursive 3-point WM Þlter given by Yk = MEDIAN( 1, 0 ⋄ Yk−1 , 1, 0 ⋄ X k ,
0, −1 ⋄ X k+1 ). Consider the following three approximations of increasing order: First-order. Replace yk−1 by xk−1 in Eq. (51) and with the resultant PBF Þnd the corresponding WM Þlter in the integer domain using the method described in Section V. This results in the Þrst-order approximation: (52) Yk1 = MEDIAN 1, 0 ⋄ X k−1 , 1, 0 ⋄ X k , 0, −1 ⋄ X k+1 Second-order. RedeÞne yk in Eq. (51) as a function of yk−2 and then replace yk−2 by xk−2 . This can be done by noting that yk−1 = f (rk−1 ; zk−1 ) = yk−2 xk−1 + yk−2 sk + xk−1 sk ; therefore, Eq. (51) becomes yk = (yk−2 xk−1 + yk−2 sk + xk−1 sk )(xk + sk+1 ) + xk sk+1 = yk−2 xk−1 xk + yk−2 sk xk + xk−1 sk xk + yk−2 xk−1 sk+1 + yk−2 sk sk+1 + xk−1 sk sk+1 + xk sk+1
(53)
Replacing yk−2 by xk−2 in Eq. (53) and using the conversion method described in Section V leads to the second-order approximation: Yk2 = MEDIAN( 1, 0 ⋄ X k−2 , 1, 0 ⋄ X k−1 , 2, −1 ⋄ X k , 0, −2 ⋄ X k+1 ) (54) Note in Eq. (54) that sample X k is weighted negatively and positively. This occurs naturally as a consequence of mirrored threshold decomposition.
198
JOSE« L. PAREDES AND GONZALO R. ARCE
Third-order. RedeÞne yk in Eq. (51) as a function of yk−3 and then replace yk−3 by xk−3 . This is possible if yk−2 in Eq. (53) is redeÞned as yk−3 xk−2 + yk−3 sk−1 + xk−2 sk−1 and yk−3 is subsequently replaced by xk−3 . After some Boolean simpliÞcations, the nonrecursive WM Þlter corresponding to the resultant PBF is found by using the method in Section V. Thus the third-order nonrecursive WM Þlter is given by Yk3 = MEDIAN 1, 0 ⋄ X k−3 , 1, 0 ⋄ X k−2 , 2, −1 ⋄ X k−1 , (55)
4, −2 ⋄ X k , 0, −4 ⋄ X k+1
In illustration of the effectiveness of the various nonrecursive WM approximations of the recursive WM Þlter, white Gaussian noise is applied to the recursive WM Þlter and to the various nonrecursive approximations. The results are shown in Figure 6. Observe in Figure 6 that the approximation improves with the order, as expected. Figure 6c shows that the output of the nonrecursive WM Þlter of length 5 is very close to the output of a-RWM Þlter of length 3. This corroborates that recursive WM Þlters can synthesize a nonrecursive WM Þlter of much larger window. Notice also in Eqs. (54) and (55) that the nonrecursive realizations of the recursive 3-point WM Þlter given by Eq. (50) require the use of weight pairs for some of the input samples. Indeed, binary representations having both xi and si as part of the positive Boolean function will inevitably lead to having a weight pair Wi , Hi on X i . In illustration of the importance of the double weighting operation on the Þlter output, the same input signal used with the previous nonrecursive approximations is next fed into the nonrecursive WM Þlter given by Eq. (55), but with the positive weight related to X k−1 set to zero (i.e., Wk−1 , Hk−1 has been changed from 2, −1 to 0, −1). The output of this Þltering operation and the output of the recursive 3-point WM Þlter are shown in Figure 6d. Comparing Figure 6c and Figure 6d, we can easily see the strong inßuence of double weighting on the Þlter output.
Some interesting variants of the recursive 3-point WM Þlter and their corresponding approximate nonrecursive WM Þlters are presented in Table 1, where the underlined weight is related to the center sample of the window. For short notation, only the nonzero weights are listed. Particular attention has to be given to those cases where the weight related to the feedback sample Yk−1 is negative because this yields a PBF that is a function of sk−1 and not of yk−1 . Thus, the binary expression sk = g(rk ; zk ) has to be previously computed since the approximate method replaces the feedback element yk−1 or sk−1 by a delayed version of the corresponding binary Þltering operation (i.e., f (rk−1 ; zk−1 ) or
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 199
,, -Figure 6. Output of the recursive 3-point WM Þlter 1, 1, −1 (solid line) and its nonrecursive approximations (dashed line): (a) Þrst order, (b) second order, (c) third order, and (d) third order when Wk−1 is set to zero. (Reproduced with permission from Paredes and Arce 1999, C 1999 IEEE.)
g(rk−1 ; zk−1 ), respectively). Property IV.3 of WM Þlters, discussed in Section IV, can be used to compute the mirrored positive Boolean function g(rk ; zk ). For instance, for the recursive WM Þlter
−1, 1, 1, deÞned by the PBF f (rk ; zk ) = sk−1 xk + sk−1 xk+1 + xk xk+1 , the mirrored positive Boolean function associated with it is g(rk ; zk ) = yk−1 sk + yk−1 sk+1 + sk sk+1 .
JOSE« L. PAREDES AND GONZALO R. ARCE
200
TABLE 1 Recursive 3-point WM Filters and Their Approximate Nonrecursive Counterparts (Reproduced with permission from Paredes C 1999 IEEE) and Arce 1999, Recursive 3-point WM Þlter ,,
-1, 1, 1 -,, 1, 1, −1 ,, -1, −1, 1 ,, -−1, 1, 1 ,, -1, −1, −1 -,, −1, 1, −1 -,, −1, −1, 1 -,, −1, −1, 1
First-order approximation ,
,
1, 1, 1,
-
1, 1, −1 , 1, −1, 1 , −1, 1, 1 , 1, −1, −1 , −1, 1, −1 , −1, −1, 1 , −1, −1, −1
Second-order approximation ,
,
1, 1, 2, 1
1, 1, 2, −1, −2 , 1, −1, 1, −2, 2 , 1, −1, 2, −1, 2 , 1, −1, −2, −1 , 1, −1, 2, −1 , 1, 1, −2, 1 , 1, 1, 1, −2, −2
Third-order approximation ,
,
1, 1, 2, 3, 2
1, 1, 2, −1 , 4, −2, −4 , 1, −1, 1, −2 , 2, −4, 4 , −1, 1, 1, −2 , 4, −2, 4 , 1, −1, −2, −3, −2 , −1, 1, −2, 3, −2 , −1, −1, 2, −3, 2 , −1, −1, 2, −1 , 2, −4, −4
VII. Weighted Median Filters with N Weights The double weighting operation in Eq. (29) naturally emerges as a result of the mirrored threshold decomposition and the stack Þlter representation. A particular case of this Þltering structure is obtained if a single weight is assigned to each observation sample (i.e., each observation sample is weighted by a positive weight or a negative weight). In the binary representation, this single weighting operation is equivalent to further constraining the positive Boolean function that characterizes the WM Þlters. In fact, if the PBF f (·) depends on only one of the binary representations of each observation sample, the equivalent integer domain WM Þlter weights each sample by a single weight. More precisely, in mirrored threshold decomposition, two binary representations are associated to each observation sample: (1) the thresholded signals xi emerging from the observation samples X i and (2) the thresholded signals si associated with the mirror samples Si . In the integer domain representation, the xi Õs are associated with the positive weights, Wi , whereas the absolute values of the negative weights, |Hi |, are linked to the si signals. Thus, if the PBF f (·) is a function of xi and not of si , the equivalent integer domain WM Þlter weights the observation sample X i by a positive weight. On the other hand, if f (·) depends on si and not on xi , the integer domain WM Þlter weights the observation sample X i by a negative weight. Let us illustrate this with an example. The equivalent integer domain WM Þlter that corresponds
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 201
to the PBF f (x1 , x2 , x3 ; s1 , s2 , s3 ) = x1 s2 + x3 s2 + x1 x3 is Y = MEDIAN(1 ⋄ X 1 , −1 ⋄ X 2 , 1 ⋄ X 3 )
(56)
Note in Eq. (56) that the integer domain WM Þlter assigned a single weight to each sample. This is due to the fact that the PBF that describes this Þlter has only one binary representation for each observation sample. Extending these results, we can show that the integer domain representation of a selfdual linearly separable PBF that has only one binary representation of each observation sample is as follows: Y = MEDIAN |W1 | ⋄ sgn(W1 )X 1 , |W2 | ⋄ sgn(W2 )X 2 , . . . , (57) |W N | ⋄ sgn(W N )X N
where Wi ∈ R. Note that the weight signs are uncoupled from the weight magnitude values and are merged with the observation samples. The weight magnitudes play the equivalent role of positive weights in the framework of weighted median smoothers. It has been shown, in Arce (1998) and Arce and Paredes (2000), that this Þltering structure (single real-valued weight per sample) yields excellent performance in applications that require frequency selection type of Þltering characteristics. In the rest of this article, WM Þlter refers to the Þltering operation given by Eq. (57). A. Weighted Median Filter Computation The computation of the WM Þlter is best illustrated by means of an example. Consider Þrst the case where the weights are integer-valued and where these add up to an odd integer number. Let the window size be 5, deÞned by the symmetric weight vector W = 1, −2, 3, −2, 1. For the observation vector X(n) = [2, −6, 9, 1, 12]T , the weighted median Þlter output is found as Y (n) = MEDIAN[1 ⋄ 2, −2 ⋄ −6, 3 ⋄ 9, −2 ⋄ 1, 1 ⋄ 12] = MEDIAN[1 ⋄ 2, 2 ⋄ 6, 3 ⋄ 9, 2 ⋄ −1, 1 ⋄ 12] = MEDIAN[2, 6, 6, 9, 9, 9, −1, −1, 12] = MEDIAN[−1, −1, 2, 6, 6, 9, 9, 9, 12] =6 where the median Þlter output value is underlined in Eq. (58).
(58)
202
JOSE« L. PAREDES AND GONZALO R. ARCE
Next consider the case where the WM Þlter weights add up to an even integer with W = 1, −2, 2, −2, 1. Furthermore, assume the observation vector consists of a set of constant-valued samples X(n) = [5, 5, 5, 5, 5]T . The weighted median Þlter output in this case is found as Y (n) = MEDIAN[1 ⋄ 5, −2 ⋄ 5, 2 ⋄ 5, −2 ⋄ 5, 1 ⋄ 5] = MEDIAN[1 ⋄ 5, 2 ⋄ −5, 2 ⋄ 5, 2 ⋄ −5, 1 ⋄ 5] = MEDIAN[5, −5, −5, 5, 5, −5, −5, 5]
(59)
= MEDIAN[−5, −5 − 5, −5, 5, 5, 5, 5] =0 where the median Þlter output is the average of the underlined samples in Eq. (59). Note that in order for the WM Þlter to have band- or high-pass frequency characteristics where constant signals are annihilated, the weights, absolute values must add to an even number such that averaging of the middle rank samples occurs. When the WM Þlter absolute-value weights add to an odd number, the output is one of the signed input samples and consequently the Þlter is unable to suppress constant-valued signals. In general, the WM Þlter output can be computed without replicating the sample data according to the corresponding weights, as this increases the computational complexity. A more efÞcient method to Þnd the WM is shown next, which not only is attractive from a computational perspective but also admits real-valued weights. The weighted median Þlter output for noninteger weights can be determined as follows (Arce, 1998): N |Wi |. 1. Calculate the threshold T0 = 12 i=1 2. Sort the ÒsignedÓobservation samples sgn (Wi )X i . 3. Sum the magnitude of the weights corresponding to the sorted ÒsignedÓ samples, beginning with the maximum and continuing down in order. 4. The output is the signed sample whose magnitude weight causes the sum to become greater than or equal to T0 . For band- and high-pass characteristics, the output is the average of the signed sample whose weight magnitude causes the sum to become greater than or equal to T0 and the next smaller signed sample.
The following example illustrates this procedure. Consider the window-size-5 WM Þlter deÞned by the real-valued weights W1 , W2 , W3 , W4 , W5 = 0.1, 0.2, 0.3, −0.2, 0.1. The output for this Þlter operating on the observation set [X 1 , X 2 , X 3 , X 4 , X 5 ]T = [−2, 2, −1, 3, 6]T is found as follows. Summing the 5 absolute weights gives the threshold T0 = 21 i=1 |Wi | = 0.45. The ÒsignedÓ observation samples, sorted observation samples, their corresponding weights,
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 203
and the partial sum of weights (from each ordered sample to the maximum) are Observation samples Corresponding weights Sorted signed observation samples Corresponding absolute weights Partial weight sums
−2 0.1 −3 0.2 0.9
2 0.2 −2 0.1 0.7
−1 0.3 −1 0.3 0.6
3 −0.2 2 0.2 0.3
6 0.1 6 0.1 0.1
Thus, the output is −1 since when we start from the right (maximum sample) and sum the weights, the threshold T0 = 0.45 is not reached until the weight associated with −1 is added. The underlined sum value indicates that this is the Þrst sum which meets or exceeds the threshold. To warrant high- or bandpass characteristics, the WM Þlter output would be modiÞed so as to compute the average between −1 and −2, which would lead to −1.5 as the output value. It should be noted that as a result of the negative weights, the computation of the weighted median Þlter is not shift invariant. Consider the previous case involving Eq. (58) and add a shift of 2 on the samples of X such that X i′ = X i + 2. The weighted median Þltering of X′ = [4, −4, 11, 3, 14]T with the weight vector W = 1, −2, 3, −2, 1 leads to the output Y ′ (n) = 4 which does not equal the previous output in Eq. (58) of 6 plus the appropriate shift.
B. Recursive Weighted Median Filters It is natural now to extend the weighted median Þlters having N weights only deÞned in Eq. (57) to other more general signal processing structures. Here, the class of recursive weighted median Þlters admitting a single realvalued weight per sample is deÞned. These Þlters are analogous to the class of IIR linear Þlters. Much as IIR linear Þlters provide several advantages over linear FIR Þlters, recursive WM Þlters also exhibit characteristics superior to those of nonrecursive WM Þlters. Recursive WM Þlters can synthesize nonrecursive WM Þlters of much larger window sizes. In terms of noise attenuation, recursive median smoothers have far superior characteristics than those of their nonrecursive counterparts (Arce, 1986; Arce and Gallagher, 1988). DeÞnition VII.1 (Recursive Weighted Median Filters) Given a set of N N and a set of M + 1 real-valued feedreal-valued feed-back coefÞcients Ai |i=1 M forward coefÞcients Bi |i=0 , the noncausal recursive WM Þlter output is deÞned
204
JOSE« L. PAREDES AND GONZALO R. ARCE
Figure 7. Structure of a recursive WM Þlter. (Reproduced with permission from Arce and C 2000 IEEE.) Paredes 2000,
as (Arce and Paredes, 2000) Y (n) = MEDIAN(|A N | ⋄ sgn(A N )Y (n − N ), . . . , |A1 | ⋄ sgn(A1 )Y (n − 1), |B0 | ⋄ sgn(B0 )X (n), . . . , |B M | ⋄ sgn(B M )X (n + M))
(60)
The recursive WM operation is schematically described in Figure 7. The recursive WM Þlter output for noninteger weights can be determined as follows (Arce and Paredes, 2000):. / N M |A | + |B | . 1. Calculate the threshold T0 = 12 ℓ k ℓ=1 k=0 2. Jointly sort the ÒsignedÓpast output samples sgn (Aℓ )Y (n − ℓ) and the ÒsignedÓinput observations sgn (Bk )X (n + k). 3. Sum the magnitudes of the weights corresponding to the sorted ÒsignedÓ samples, beginning with the maximum and continuing down in order. 4. Choose the output as the average of the signed sample whose weight magnitude causes the sum to become greater than or equal to T0 and the next smaller signed sample. For ÒselectionÓtype Þltering, choose the output as the signed sample whose weight magnitude causes the sum to become greater than or equal to T0 . The following example illustrates this procedure. Consider the window-size6 RWM Þlter deÞned by the real-valued weights
A2 , A1 , B0 , B1 , B2 , B3 =
0.2, 0.4, 0.6, −0.4, 0.2, 0.2. The output for this Þlter operating on the observation set [Y (n − 2), Y (n − 1), X (n), X (n + 1), X (n + 2), X (n + 3)]T = [−2, 2, −1, 3, 6, 8]T is found as follows. Summing the absolute weights gives the threshold T0 = 12 (|A1 | + |A2 | + |B0 | + |B1 | + |B2 | + |B3 |) = 1. The ÒsignedÓset of samples spanned by the ÞlterÕs window, the sorted set, their
RECENT DEVELOPMENTS IN STACK FILTERING AND SMOOTHING 205
corresponding weights, and the partial sum of weights (from each ordered sample to the maximum) are Sample set in the window Corresponding weights Sorted signed samples Corresponding absolute weights Partial weight sums
−2 2 −1 3 6 0.2 0.4 0.6 −0.4 0.2 −3 −2 −1 2 6 0.4 0.2 0.6 0.4 0.2 0.8 0.4 2.0 1.6 1.4
8 0.2 8 0.2 0.2
Thus, the output is (−1 − 2)/2 = −1.5 since when we start from the right (maximum sample) and sum the weights, the threshold T0 = 1 is not reached until the weight associated with −1 is added. The underlined sum value indicates that this is the Þrst sum which meets or exceeds the threshold. VIII. Stack Filter Optimization In a practical implementation, the stack Þlter may be of little use if the PBF f (·) that deÞnes the stack Þlter is not determined in some optimal fashion. As mentioned in Section III, the PBF f (·) can be represented as a vector z with N ∗ components where the decision variable z i ∈ {−1, +1} satisÞes the stacking property. The optimization problem thus reduces to Þnding the optimal decision for each component of the vector z such that a performance cost criterion is minimized (Lin et al., 1990; Paredes and Arce, 2001, in press.). Before presenting the stack Þlter optimization, we should deduce the total number of possible binary vectors that can be outputted by the mirrored threshold operator acting on an observation window of size N since this, in turn, deÞnes the number of components of the vector z that characterizes the stack Þlter, and, consequently, the number of binary variables to be optimized. A. Thresholded Signals Generated by Mirrored Threshold Decomposition Although mirrored threshold decomposition outputs binary vectors of length 2N , not all the 22N possible binary vectors can appear at the output of the threshold operator. This is due to the dependency between the sample vector X and its mirror S that leads to certain constraints in the components of the binary vector [xT; sT ]T . To illustrate this point, consider a three-sample vector X = [X 1 , X 2 , X 3 ]T and its corresponding mirror vector S = [S1 , S2 , S3 ]T = [−X 1 , −X 2 , −X 3 ]T . The binary vector [xT; sT ]T = [x1 , x2 , x3 ; s1 , s2 , s3 ]T = [−1, 1, 1; −1, 1, 1]T is not, for instance, a valid thresholded signal generated by the mirrored threshold decomposition since it implies that for a given
206
JOSE« L. PAREDES AND GONZALO R. ARCE
m, X_1 < m, S_1 < m and X_2 ≥ m, S_2 ≥ m, X_3 ≥ m, S_3 ≥ m. That is, the preceding conditions imply that X_1 and S_1 are the two smallest samples of the set {X_1, X_2, X_3, S_1, S_2, S_3}. This cannot be true since S_i takes on a value symmetric about the origin from X_i; therefore, if X_1 is the smallest sample of the subset {X_1, X_2, X_3}, then S_1 must be the greatest sample of the subset {S_1, S_2, S_3}. Extending the concepts used in the preceding example, we can show that there are numerous binary vectors of length 2N that are never outputted by the mirrored threshold decomposition. Before stating the preceding results in a more formal manner, let us clarify some notation. Since x_i and s_i are binary variables taking values in {−1, +1}, the operations of conjunction and complement are defined as

    x_1    x_2    x_1 ∧ x_2
    −1     −1     −1
    −1     +1     −1
    +1     −1     −1
    +1     +1     +1                                                  (61)

and

    x      x̄
    −1     +1
    +1     −1                                                         (62)
Furthermore, for any two vectors, a ∧ b = (a_1 ∧ b_1, …, a_N ∧ b_N) and ā = (ā_1, …, ā_N). Let w(a) be the Hamming weight of a, that is, the number of elements equal to +1. Theorem VIII.1 states formally the conditions satisfied by the components of the inadmissible binary vectors, and Theorem VIII.2 gives the total number of those vectors.

Theorem VIII.1 A binary vector [x^T; s^T]^T = [x_1, …, x_N; s_1, …, s_N]^T where w(x ∧ s) > 0 and w(x̄ ∧ s̄) > 0 can never appear at the output of mirrored threshold decomposition.

Proof (by contradiction). Assume that the vector [x^T; s^T]^T is outputted by the mirrored threshold decomposition and that w(x ∧ s) > 0 and w(x̄ ∧ s̄) > 0. Since w(x ∧ s) > 0 and w(x̄ ∧ s̄) > 0, there are at least two pairs of components (x_i, s_i) and (x_j, s_j) of [x^T; s^T]^T that satisfy x_i = s_i = −1 and x_j = s_j = +1. According to Eqs. (11) and (12), there exists a threshold value m > 0
for which X_i < m, S_i < m and X_j ≥ m, S_j ≥ m. Since S_j = −X_j, either S_j or X_j takes a negative value; therefore, for m > 0 either s_j or x_j has to be equal to −1, and this is a contradiction of our previous assumptions.

Theorem VIII.1 states that not all the 2^{2N} possible binary vectors of length 2N can be generated as a result of the threshold operator. The following theorem gives the number of binary vectors that do not appear at the output of the threshold decomposition operator.

Theorem VIII.2 For an observation window of size N, the number of binary vectors of length 2N that are not generated by the mirrored threshold decomposition is

    Σ_{k=1}^{N−1} C(N, k) (2^k − 1)(2^{N−k} − 1)                     (63)

where C(N, k) denotes the binomial coefficient.
Proof. The proof relies on the structure of mirrored threshold decomposition. Let [x^T; s^T]^T = [x_1, …, x_N; s_1, …, s_N]^T be a 2N-component binary vector. We have to count all binary vectors for which w(x ∧ s) > 0 and w(x̄ ∧ s̄) > 0, since those vectors are not generated by the decomposition. Let k be the number of samples in the set {x_1, x_2, …, x_N} taking on the value −1. Let k = 1, such that x_i = −1 for some i ∈ {1, …, N}; then all binary vectors [x^T; s^T]^T for which s_i = −1 and s_j = +1 for at least one j ∈ {1, …, N}, j ≠ i, cannot be generated by the decomposition. There are 2^{N−1} − 1 of those binary vectors. Since i can take on any value in {1, …, N}, there are C(N, 1)(2^{N−1} − 1) binary vectors [x^T; s^T]^T with one −1 in the first N components that are not outputted by the threshold operator. Let k = 2, such that x_i = −1 and x_ℓ = −1 for some i, ℓ ∈ {1, 2, …, N}, i ≠ ℓ. Those binary vectors [x^T; s^T]^T whose components [s_i, s_ℓ] take on the values [−1, −1], [1, −1], or [−1, 1] and s_j = +1 for at least one j ∈ {1, 2, …, N}, j ≠ i, j ≠ ℓ, are not generated by the decomposition. There are (2^{N−2} − 1)(2^2 − 1) of those vectors. Since i and ℓ can take on any values in {1, 2, …, N}, i ≠ ℓ, there are C(N, 2)(2^{N−2} − 1)(2^2 − 1) binary vectors with two (−1)'s in the first N components that are not outputted by the threshold operator. Following a similar approach, we can show that for 3 ≤ k ≤ N − 1 there are C(N, k)(2^{N−k} − 1)(2^k − 1) binary vectors with k (−1)'s in the first N components that are not outputted by the threshold operator. Adding all the binary vectors for k = 1, …, N − 1 completes the proof.

Theorem VIII.2 provides the key to finding the number of possible binary vectors of length 2N that can be generated by the mirrored threshold decomposition.
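The counting argument is easy to confirm by exhaustive enumeration for small windows. The following Python sketch (our own illustration; the function name is hypothetical) compares the closed form of Eq. (63) with a brute-force count of the vectors ruled out by Theorem VIII.1, and anticipates the admissible count stated in Theorem VIII.3 below.

    from itertools import product
    from math import comb

    def inadmissible_count(N):
        """Brute-force count of 2N-vectors with w(x ^ s) > 0 and w(~x ^ ~s) > 0."""
        count = 0
        for v in product((-1, 1), repeat=2 * N):
            x, s = v[:N], v[N:]
            w_and = sum(a == 1 and b == 1 for a, b in zip(x, s))     # w(x ^ s)
            w_comp = sum(a == -1 and b == -1 for a, b in zip(x, s))  # w(~x ^ ~s)
            count += (w_and > 0 and w_comp > 0)
        return count

    for N in range(1, 7):
        formula = sum(comb(N, k) * (2**k - 1) * (2**(N - k) - 1)
                      for k in range(1, N))
        assert inadmissible_count(N) == formula              # Eq. (63)
        assert 2**(2 * N) - formula == 2 * 3**N - 2**N       # Theorem VIII.3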
Theorem VIII.3 For an observation window of size N, the number of possible binary vectors of length 2N that can be outputted by the mirrored threshold decomposition is N* = 2(3^N) − 2^N.

B. Stack Filter Optimization

A criterion widely used for the design of stack smoothers is the minimization of the mean absolute error (MAE) between the filter's output and some desired signal; we use the same criterion for the optimization of stack filters. Let {X(·)} be a discrete-time process given as the input to a stack filter. {X(·)} can be thought of as a noise-corrupted version of some desired process {D(·)}. Furthermore, assume that both processes are jointly stationary and that both take on integer values in {−M, −M + 1, …, 0, …, M} for some integer value M. Let X(n) be the N-component observation vector at time n, that is,

    X(n) = [X(n − (N − 1)/2), …, X(n), …, X(n + (N − 1)/2)]^T
         = [X_1(n), X_2(n), …, X_N(n)]^T

where X_i = X(n − (N − 1)/2 + i − 1) and N is the size of the observation window, assumed here to be an odd number. Under the MAE criterion the goal is to find the 2N-variable PBF f(·), or equivalently the decision vector z, so as to minimize the cost function

    J(f) = E|D(n) − S_f(X(n))|                                       (64)

where E denotes the statistical expectation and S_f(·) is the output of the stack filter given by Eq. (18). Note that the optimal stack filter can be found through an exhaustive search over the set of all possible PBFs f(·) of length N*. This, however, is computationally demanding since the space to be searched is quite large. Using mirrored threshold decomposition, we can rewrite the cost function in Eq. (64) as

    J(f) = E | (1/2) Σ_{m=−M+1}^{M} [d^m − f(x_1^m, …, x_N^m; s_1^m, …, s_N^m)] |        (65)

where d^m is the thresholded signal of the desired process at level m given by Eq. (11). In Eq. (65) the discrete-time index n has been dropped for the sake of notational simplicity. Since f(·) satisfies the stacking property and since the sequence obtained by thresholding D(n) obeys the stacking property, every nonzero term of the sum in Eq. (65) is either always positive or always negative (i.e., it is either +2
or −2). Therefore, the absolute value and the sum operators in Eq. (65) can be interchanged, which leads to

    J(f) = (1/2) Σ_{m=−M+1}^{M} E|d^m − f(x_1^m, …, x_N^m; s_1^m, …, s_N^m)|             (66)
Thus, the multilevel mean absolute error criterion reduces to the sum of the absolute errors on each threshold level. Following the derivations for the optimization of stack smoothers as in Coyle and Lin (1988), we can show that the minimization of the cost function in Eq. (66) reduces to solving the binary linear program

    min_z Σ_{i=1}^{N*} C_i z_i                                       (67)

subject to the constraints

    z_i ≤ z_j    if ξ_i ≤ ξ_j and H(ξ_i, ξ_j) = 1                    (68)
    z_i = −1 or z_i = +1                                             (69)
where C_i is a weight that depends on the statistical model assumed for the input and desired signals, and H(ξ_i, ξ_j) is the Hamming distance between the binary sequences ξ_i and ξ_j. Note that the set of constraints given in Eqs. (68) and (69) defines an N*-dimensional polytope whose vertices contain the feasible solutions of the linear problem. A linear programming problem equivalent to Eqs. (67)–(69) can be formulated if the decision variables z_i are allowed to take continuous values. That is, minimize Eq. (67) subject to the constraints of Eq. (68) and

    −1 ≤ z_i ≤ +1                                                    (70)

Since the set of constraints imposed by the mirrored local stacking property leads to a constraint matrix that is totally unimodular (Coyle and Lin, 1988), the solution to the relaxed linear program Eq. (70) turns out to be the solution to the binary linear program Eqs. (67)–(69) (Parker and Rardin, 1988). Note that in the linear program, the set of constraints given in Eq. (68) is forced by the stacking property and the condition that the Hamming distance between binary sequences is equal to one. These two conditions, stacking property and Hamming distance equal to one, are referred to here as the mirrored local stacking property. It turns out that the stacking property alone leads to a set of constraints with many redundant equations, which can be reduced to an equivalent set of constraints if the condition H(·, ·) = 1 is also taken into account. To illustrate this
point better, consider the binary sequences ξ_i = [−1, −1, −1; −1, −1, +1]^T, ξ_j = [−1, −1, −1; −1, +1, +1]^T, and ξ_k = [−1, −1, −1; +1, +1, +1]^T. The stacking property forces the constraints z_i ≤ z_j, z_i ≤ z_k, and z_j ≤ z_k, whereas the mirrored local stacking property forces the constraints z_i ≤ z_j and z_j ≤ z_k. Since z_i ≤ z_j and z_j ≤ z_k implies z_i ≤ z_k, the set of inequality constraints forced by the mirrored local stacking property is not redundant and is equivalent to the set of constraints forced by the stacking property. It is interesting to observe that the relaxed linear program Eq. (70) leads to a PBF f(·) that makes soft decisions (i.e., for a given input vector ξ_i, z_i = f(ξ_i) is in the interval [−1, 1]). In fact, taking on continuous values, z_i can be expressed as a function of the probability that the PBF f(·) outputs +1 when the vector ξ_i is observed (Coyle and Lin, 1988). That is,

    z_i = 2 Prob(+1|ξ_i) − 1                                         (71)

where Prob(+1|ξ_i) = 1 − Prob(−1|ξ_i) is the probability that f(·) outputs +1 whenever ξ_i is observed. Note that if z_i is increased, the probability that f(·) outputs +1 when the binary sequence ξ_i is observed is also incremented. Although the complexity of the linear program is of polynomial type,* the number of variables in the linear program, N*, grows exponentially with the size of the observation window. Therefore, this approach to the design of stack filters becomes computationally prohibitive when the size of the observation window is greater than 5.

* A significant reduction in complexity is achieved with the linear program compared with the complexity of an exhaustive search for the optimal solution in the space of possible PBFs of length N*.
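As an illustration of the relaxed linear program of Eqs. (67)–(70), the following Python sketch sets up the mirrored local stacking constraints for a toy window (N = 2) and solves the relaxation with scipy. The cost vector C is a random placeholder here; in an actual design it would be computed from the joint statistics of the input and desired signals.

    import numpy as np
    from itertools import product
    from scipy.optimize import linprog

    N = 2  # window size; N* = 2 * 3**2 - 2**2 = 14 admissible binary vectors
    vecs = [v for v in product((-1, 1), repeat=2 * N)
            if not (any(a == 1 and b == 1 for a, b in zip(v[:N], v[N:])) and
                    any(a == -1 and b == -1 for a, b in zip(v[:N], v[N:])))]
    Nstar = len(vecs)

    rng = np.random.default_rng(0)
    C = rng.normal(size=Nstar)   # placeholder costs C_i (statistics-dependent)

    # Constraints (68): z_i <= z_j whenever xi_i <= xi_j and H(xi_i, xi_j) = 1
    A_ub, b_ub = [], []
    for i, vi in enumerate(vecs):
        for j, vj in enumerate(vecs):
            if (sum(a != b for a, b in zip(vi, vj)) == 1
                    and all(a <= b for a, b in zip(vi, vj))):
                row = np.zeros(Nstar)
                row[i], row[j] = 1.0, -1.0   # z_i - z_j <= 0
                A_ub.append(row)
                b_ub.append(0.0)

    # Relaxation (70): -1 <= z_i <= +1
    res = linprog(C, A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(-1, 1)] * Nstar)
    z_hat = np.sign(res.x)   # soft-to-hard mapping onto a vertex

Because the constraint matrix is totally unimodular, the relaxed solution already sits at a ±1 vertex for generic costs, so the final sign operation is only a safeguard.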
C. Adaptive Optimization Algorithm

In this section, we present an adaptive optimization algorithm for the design of stack filters. In the adaptive approach, the goal is to update the components of the decision vector z at each iteration of the algorithm such that Eq. (66) is minimized. Moreover, the updating operation has to take into account the stacking property (i.e., z, or equivalently f(·), has to satisfy the stacking constraint, Eq. (19)) at each iteration of the algorithm. During the updating operation, we allow z to take continuous values in the interval [−1, 1], but it should always satisfy the stacking property or, equivalently, the mirrored local stacking property. Thus, at each iteration the decision vector will be at some interior point in the polytope defined by Eqs. (68) and (69). At the end of the optimization algorithm, the optimal soft decision vector is mapped to a binary vector through the use of a soft-to-hard transformation.
The basic idea of the adaptive algorithm is as follows. Let ξ_i be the mirrored threshold decomposition of the input vector at a certain threshold level m (i.e., ξ_i = [(x^m)^T; (s^m)^T]^T). Furthermore, let d^m be the desired signal at threshold level m. At each iteration, the adaptive algorithm tries to increase the probability that the PBF f(·) makes the correct decision when ξ_i is observed. To achieve this, the decision variable z_i will increase (decrease if d^m = −1) proportionally to d^m. As a second stage of the adaptive algorithm, the stacking constraint is checked at each iteration. If the stacking property is violated, that is, z_i becomes greater than (smaller than) z_k even though ξ_i ≤ (≥) ξ_k, an operation that forces the stacking constraint takes place. This operation can be a swap operation in which z_i is reset to its previous value and z_k is incremented (decremented if d^m = −1) by a factor proportional to d^m. This second stage is repeated to ensure that the new updated value of z_k does not violate the stacking constraint.
The concepts behind the proposed algorithm can be further described as follows. Consider that at iteration n′ the binary vector ξ_i = [(x^m)^T; (s^m)^T]^T = [−1, 1, 1; 1, 1, −1]^T is observed; furthermore, assume that z_i(n′) is the decision variable that is associated with ξ_i. The updating consists of modifying only the ith component of the decision vector z(n′) as follows:

    z_i(n′) = z_i(n′ − 1) + μ d^m                                    (72)

where μ is a step size defined subsequently and d^m is the desired signal. The other components of the decision vector are left without modifications (i.e., z_j(n′) = z_j(n′ − 1) for j = 1, 2, …, N*, j ≠ i). The second stage of the algorithm is to ensure that z_i(n′) does not violate the stacking constraint. Assuming that d^m = 1, z_i is incremented by a factor μ. This increment, however, can violate the stacking constraint, for example, with the vector ξ_k = [−1, 1, 1; 1, 1, 1]^T, which is associated with the decision variable z_k. A swap operation takes place between z_i and z_k as follows:

    z_i(n′) = z_i(n′ − 1)
    z_k(n′) = z_k(n′ − 1) + μ d^m

That is, z_i is reset to its previous value while z_k is incremented by μ. This increment on z_k can violate the stacking constraint; hence, the second stage of the algorithm is repeated. This process continues until no stacking constraint is violated. Note that by incrementing (decrementing if d^m = −1) z_i, the probability of making a correct decision is increased. Thus, as the training progresses, the decision vector moves closer in the stochastic sense to the optimal decision vector.
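The update-and-swap cycle just described can be stated compactly in code. The sketch below is our own illustration (the container names are hypothetical, and the step-size safeguard introduced later in Eq. (76) is omitted for brevity): it performs one update of Eq. (72) and repeats the swap stage until no mirrored local stacking constraint is violated.

    def update_with_swap(z, i, d, mu, nu_up, nu_dn):
        """One adaptive update of z_i followed by stacking-repair swaps.

        z      : list of soft decisions in [-1, 1]
        i      : index of the observed binary vector xi_i
        d      : desired binary decision at this threshold level (+1 or -1)
        nu_up  : nu_up[i] lists indices j with xi_i <= xi_j, H(xi_i, xi_j) = 1
        nu_dn  : nu_dn[i] lists indices j with xi_i >= xi_j, H(xi_i, xi_j) = 1
        """
        while True:
            z[i] += mu * d                          # Eq. (72)
            nbrs = nu_up[i] if d > 0 else nu_dn[i]
            bad = [j for j in nbrs
                   if (d > 0 and z[i] > z[j]) or (d < 0 and z[i] < z[j])]
            if not bad:
                return z
            j = bad[0]
            z[i] -= mu * d                          # swap: reset z_i ...
            i = j                                   # ... and pass the step to z_j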
Next, we formally describe the adaptive algorithm. Let X(n) and D(n) be the observation vector and the desired signal, respectively, at time n. Furthermore, let [(x^m(n))^T; (s^m(n))^T]^T and d^m(n) be their respective thresholded signals at level m, where m = −M + 1, …, 0, …, M. For a fixed discrete-time index n, mirrored threshold decomposition generates 2M thresholded signals that are fed to the adaptive algorithm. Since the adaptive algorithm updates the decision vector z based on thresholded signals, there are 2M updates at each discrete-time index n. To account for those updates, let us define n′ = (2n + 1)M + m as the new time index and let us denote d(n′) = d^m(n). The following notation is introduced to describe the adaptive algorithm. Let the N* possible binary vectors of length 2N generated by the mirrored threshold decomposition be ordered in some fashion as ξ_1, ξ_2, …, ξ_{N*}. Let us associate to each of those vectors an element of the soft decision vector z of length N*. That is, the decision variable z_i is associated with the binary vector ξ_i. Furthermore, let ν_i = ν_i^{+1} ∪ ν_i^{−1} be the set of decision variables that are subject to the mirrored local stacking constraints with respect to z_i, where

    ν_i^{+1} = {z_j | ξ_i ≤ ξ_j, H(ξ_i, ξ_j) = 1}
    and                                                              (73)
    ν_i^{−1} = {z_j | ξ_i ≥ ξ_j, H(ξ_i, ξ_j) = 1}
where H(ξ_i, ξ_j) denotes the Hamming distance between the binary sequences ξ_i and ξ_j. Thus, the mirrored local stacking constraint implies that z_i ≤ z_j for all z_j ∈ ν_i^{+1}, whereas for all z_j ∈ ν_i^{−1} the mirrored local stacking constraint implies that z_i ≥ z_j. The structure of the adaptive algorithm is as follows. At iteration n′, the binary signals [(x^m(n))^T; (s^m(n))^T]^T and d(n′) are presented at the input of the adaptive algorithm and the decision vector is updated based on these thresholded signals. Assume that ξ_i = [(x^m(n))^T; (s^m(n))^T]^T for i ∈ {1, 2, …, N*}. The updating operation on the decision vector z is as follows:

    z(n′) = z(n′ − 1) + μ(n′) d(n′) 1_i                              (74)

where 1_i is a vector of length N* whose components are all zeros except the ith component, which is set to one. The second step of the algorithm is to check whether the stacking constraints are satisfied after the update has been performed. If z_i and z_j obey the stacking constraints for all z_j ∈ ν_i^{d(n′)}, the updated value z_i(n′) does not cause violation of the stacking constraint; therefore, no enforcing is needed and the adaptive algorithm continues with the next iteration. On the other hand, if the stacking constraint is violated, pick any z_j ∈ ν_i^{d(n′)} for which the stacking constraint is
violated and perform a swap operation between z_i and z_j. The swap operation is as follows:

    z(n′) = z(n′) + μ(n′) d(n′) (1_j − 1_i)                          (75)

Since z_j has been modified in the previous step, the stacking constraints must be checked again with respect to z_j (i.e., the second stage is repeated for i = j). This second stage continues until no stacking constraint is violated. In Eqs. (74) and (75), μ(n′) is the step size of the algorithm. It is adjusted as the training progresses to ensure that at any iteration all the components of the decision vector z(n′) are in the interval [−1, 1]. The step size is defined as

    μ(n′) = { 0     if |z_i(n′ − 1) + μ_0 d(n′)| > 1
            { μ_0   otherwise                                        (76)

where μ_0 is set to 1/N′ and N′ is a large positive integer. Thus, no ambiguity is present in the bounds of the components of the decision vector z. It will be seen shortly that by changing the step size dynamically, a significant reduction of computation can be achieved. At the end of the training, it is quite possible that the decision vector is at some interior point of the polytope, and since the optimal solution is always at some vertex of the polytope, a soft-to-hard transformation that approximates the soft decision vector to the optimal vertex is necessary. Although there are different ways to achieve this approximation, the simplest and most reasonable way is to apply the sgn(·) operator on each component of the soft decision vector, that is,

    ẑ = sgn(z)                                                       (77)

where ẑ is the optimal binary decision vector, z is the optimal soft decision vector obtained using the optimization algorithm, and sgn(·) is defined as in Eq. (13).
The adaptive algorithm has an interesting interpretation. Suppose that no stacking constraint is imposed on the decision variables; then the best decision for a binary sequence ξ_i depends only on the difference of the frequencies of occurrence of the correct decisions, which can be 1 or −1, when ξ_i is observed. In fact, if μ_0 = 1/N′ and N′ is set to the number of occurrences of the binary sequence ξ_i during the training, the decision variable z_i at the end of the training is

    z_i = z_i(0) + Prob(+1|ξ_i) − Prob(−1|ξ_i)                       (78)

where z_i(0) is the initial value of the decision variable z_i and Prob(+1|ξ_i) is the probability that the correct decision is 1 when the observation vector is ξ_i.
Note that the second stage of the algorithm checks only the mirrored local stacking constraints of the decision variable that has just been incremented or decremented. This reduces significantly the number of comparisons that the algorithm performs at each iteration. Note also that the optimization algorithm requires only increments, decrements, and local comparisons on the decision variables. The initial values for the components of the decision vector (i.e., z(0)) cannot be chosen arbitrarily, since z has to satisfy the stacking constraints at all times. Moreover, each component of z(0) has to be in the interval [−1, 1]. We set the initial decision vector as z(0) = 0, where 0 is a vector of length N* whose components are all zeros. Note that by choosing the all-zero vector as the initial condition, we are assuming that at the beginning of the adaptive algorithm the probability that the correct decision is 1 is equal to the probability that the correct decision is −1 when ξ_i is observed (i.e., Prob(+1|ξ_i) = Prob(−1|ξ_i) = 1/2 for i = 1, 2, …, N*). Note that the proposed adaptive optimization algorithm can also be used for the design of stack smoothers, since stack smoothers are a subclass of stack filters. In fact, a stack smoother is simply a stack filter whose PBF does not depend on the variables s_1, …, s_N (i.e., they are fictitious).
D. Fast Adaptive Optimization Algorithm

The speed of the adaptive optimization algorithm described in the previous section can be increased significantly by exploiting the fact that for each observation vector of length N, there are at most 2N + 1 different binary vectors [x^T; s^T]^T among the 2M binary vectors generated by the mirrored threshold decomposition (Lin and Kim, 1994; Paredes and Arce, 1999; Yin and Neuvo, 1994). Thus, if the update of the decision vector z is performed each time a different binary vector is present, the training time is cut down from 2M to 2N + 1 (Lin and Kim, 1994). To understand this point better, let Y be a 2N-component vector whose first N components are the observation vector X and whose last N components are the mirrored observation vector S (i.e., Y = [X^T; S^T]^T). Furthermore, let Y_(k) be the kth order statistic of Y, where Y_(1) ≤ Y_(2) ≤ … ≤ Y_(2N). Note that Y_(k) is the joint order statistic of the observation samples and the mirrored samples. For any integer threshold level m ∈ (Y_(k), Y_(k+1)], mirrored threshold decomposition generates the same binary vector. Consequently, for a fixed observation window of size N, there are 2N + 1 different binary vectors. They
are

    [(x^m)^T; (s^m)^T]^T = { [1, 1, …, 1]^T                          for Y_(0) < m ≤ Y_(1)
                           { [x_1^m, …, x_N^m; s_1^m, …, s_N^m]^T    for Y_(k) < m ≤ Y_(k+1), 1 ≤ k ≤ 2N − 1
                           { [−1, −1, …, −1]^T                       for Y_(2N) < m ≤ Y_(2N+1)        (79)

where Y_(0) and Y_(2N+1) are always assigned the values −M and M, respectively. Note that if m takes on integer values in the set {Y_(0), Y_(1), Y_(2), …, Y_(2N+1)}, then the binary vectors {[(x^m)^T; (s^m)^T]^T} outputted by the decomposition are different.*
The basic idea of the fast adaptive algorithm is to combine into a single update all those updates where the observed binary vectors are the same. That is, for all the binary vectors [(x^m)^T; (s^m)^T]^T, m ∈ (Y_(k), Y_(k+1)], only one update of the decision vector is performed. This single update, however, has to take into account the number of times that the binary vector [x^T; s^T]^T is repeated in each interval (Y_(k), Y_(k+1)]. In order to achieve that, we use a variable step size that dynamically sets its value according to the number of repeated thresholded signals. That is,

    μ(n′) = { 0                          if |z_i(n′ − 1) + μ_0 (Y_(k) − Y_(k−1)) d(n′)| > 1
            { μ_0 (Y_(k) − Y_(k−1))      otherwise                                                (80)

where k = 1, …, 2N + 1 and n′ is redefined as n′ = n(2N + 1) + k, since now for each window position only 2N + 1 updates take place. Thus, the fast adaptive algorithm reduces to updating the decision vector and checking the stacking constraints according to Eqs. (74) and (75), respectively, using the variable step size given by Eq. (80), the new index n′ = n(2N + 1) + k, and the threshold levels m = Y_(k) for k = 1, 2, …, 2N + 1. Since the total number of updates at each window position is reduced from 2M to at most 2N + 1, the training time is reduced by the same proportion. In presenting the fast adaptive algorithm, we have assumed that the desired signal D(n) is either one of the observation samples or one of the mirrored samples. This assumption is not necessary, but since it provides accurate results and simplifies the notation significantly, we find it useful. Table 2 summarizes the fast adaptive optimization algorithm.

* Strictly speaking, the threshold signals [(x^m)^T; (s^m)^T]^T for m ∈ {Y_(0), Y_(1), Y_(2), …, Y_(2N+1)} are different if and only if Y_(k) ≠ Y_(k+1) for k = 0, 1, …, 2N.
TABLE 2
Fast Adaptive Optimization Algorithm for Stack Filters

    Initialize z(0) = 0
    Define ν_i^{+1} = {z_j | ξ_i ≤ ξ_j, H(ξ_i, ξ_j) = 1} and
           ν_i^{−1} = {z_j | ξ_i ≥ ξ_j, H(ξ_i, ξ_j) = 1},  i = 1, 2, …, N*
    For each observation vector X(n) and desired signal D(n), n = 0, 1, 2, …, do
        Y = [X(n)^T; S(n)^T]^T
        For k = 1 to 2N + 1
            n′ = n(2N + 1) + k
            ξ_i(n′) = T^{Y_(k)}(Y)
            d(n′) = T^{Y_(k)}(D(n))
            Update the decision vector z as follows:
                z(n′) = z(n′ − 1) + μ(n′) d(n′) 1_i
            Check and force stacking constraints for all z_j ∈ ν_i^{d(n′)}
        end
    end
    Apply a soft-to-hard transformation: ẑ = sgn(z)
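Table 2 translates almost directly into code. The following Python sketch is our own illustration under simplifying assumptions: the boundary levels ±M are skipped, ties between order statistics simply produce repeated updates, a fixed μ_0 replaces the variable step of Eq. (80), and all container names are hypothetical.

    import numpy as np
    from itertools import product

    def admissible_vectors(N):
        """The N* = 2 * 3**N - 2**N binary vectors allowed by Theorem VIII.1."""
        keep = []
        for v in product((-1, 1), repeat=2 * N):
            x, s = v[:N], v[N:]
            if not (any(a == 1 and b == 1 for a, b in zip(x, s)) and
                    any(a == -1 and b == -1 for a, b in zip(x, s))):
                keep.append(v)
        return keep

    def train_fast(windows, targets, N, mu0=0.05):
        vecs = admissible_vectors(N)
        index = {v: i for i, v in enumerate(vecs)}
        # mirrored local stacking neighbours (Eq. 73)
        nu = {+1: [], -1: []}
        for vi in vecs:
            nu[+1].append([index[vj] for vj in vecs
                           if sum(a != b for a, b in zip(vi, vj)) == 1
                           and all(a <= b for a, b in zip(vi, vj))])
            nu[-1].append([index[vj] for vj in vecs
                           if sum(a != b for a, b in zip(vi, vj)) == 1
                           and all(a >= b for a, b in zip(vi, vj))])
        z = np.zeros(len(vecs))
        for Xw, Dn in zip(windows, targets):
            Y = np.concatenate([Xw, -Xw])       # mirrored samples S = -X
            for m in np.sort(Y):                # threshold levels m = Y_(k)
                i = index[tuple(np.where(Y >= m, 1, -1))]
                d = 1 if Dn >= m else -1
                mu = mu0                        # cf. the variable step of Eq. (80)
                if abs(z[i] + mu * d) > 1:      # Eq. (76): freeze out-of-range steps
                    continue
                # update (Eq. 74) and force stacking by swapping (Eq. 75)
                while True:
                    z[i] += mu * d
                    bad = [j for j in nu[d][i]
                           if (d == 1 and z[i] > z[j]) or (d == -1 and z[i] < z[j])]
                    if not bad:
                        break
                    z[i] -= mu * d
                    i = bad[0]
                    if abs(z[i] + mu * d) > 1:  # keep the swapped step in range too
                        break
        return np.sign(z)                       # soft-to-hard, Eq. (77)

For a window of size N = 3, for example, this enumerates N* = 2(3^3) − 2^3 = 46 decision variables; windows is a sequence of N-sample arrays and targets the matching desired samples.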
E. Optimal WM Filtering: The Least Mean Absolute (LMA) Algorithm

A particular class of stack filters that has been widely used in signal and image processing is the class of median-based filters, which includes WM smoothers and WM filters. Due to the importance of this subset of stack filters, significant research has been devoted to the study of WM smoothers (Arce and Gallagher, 1988; Nodes and Gallagher, 1982; Yin et al., 1996; Yli-Harja et al., 1991) and WM filters (Arce, 1998; Arce and Paredes, 2000; Shmulevich et al., in press). In particular, optimization algorithms for the design of median-based filters have been developed which do not use the WM binary representation, as is the case for the adaptive optimization algorithm described in the previous section.
We now consider the important issue of the design of optimal weighted median filters. Much as linear filters can be optimized using Wiener filter theory, weighted median filters enjoy an equivalent theory for optimization. In order to develop the various optimization algorithms, we must extend the concept of threshold decomposition to admit real-valued inputs. Consider the set of real-valued samples X_1, X_2, …, X_N, with X_i ∈ R, and define a weighted median filter by the corresponding real-valued weights W_1, W_2, …, W_N. Decompose each sample X_i as x_i^q = sgn(X_i − q), where −∞ < q < ∞ and where

    sgn(X_i − q) = { +1  if X_i ≥ q
                   { −1  if X_i < q                                  (81)

Thus, each sample X_i is decomposed into an infinite set of binary points taking
values in [−1, 1]. Threshold decomposition is reversible: the original real-valued sample X_i can be perfectly reconstructed from the infinite set of thresholded signals as

    X_i = (1/2) ∫_{−∞}^{∞} x_i^q dq = (1/2) ∫_{−∞}^{∞} sgn(X_i − q) dq                   (82)

The sample X_i can be reconstructed from its corresponding set of decomposed signals; consequently, the real-valued X_i has a unique threshold-signal representation, and vice versa:

    X_i  ⟺  {x_i^q}                                                  (83)
where ⟺ denotes the one-to-one mapping provided by the threshold decomposition operation acting on real-valued signals. Threshold decomposition in the real-valued sample domain also allows the order of the median and threshold decomposition operations to be interchanged without affecting the end result. Given N samples X_1, X_2, …, X_N and their corresponding threshold decomposition representations x_1^q, x_2^q, …, x_N^q, the median of the decomposed signals at a fixed value of q is

    y^q = MEDIAN(x_1^q, x_2^q, …, x_N^q) = { +1  for X_((N+1)/2) ≥ q
                                           { −1  for X_((N+1)/2) < q                     (84)

where X_(i) is the ith order statistic of the vector [X_1, X_2, …, X_N] with X_(1) ≤ X_(2) ≤ … ≤ X_(N). Reversing the threshold decomposition, we obtain Y as

    Y = (1/2) ∫_{−∞}^{∞} y^q dq = (1/2) ∫_{−∞}^{∞} sgn(X_((N+1)/2) − q) dq = X_((N+1)/2)        (85)
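Both the reconstruction property of Eq. (82) and the median/decomposition interchange of Eqs. (84)–(85) can be checked numerically. The sketch below is our own illustration; the finite grid bounds stand in for the infinite integration limits, which is valid when all samples lie inside them.

    import numpy as np

    X = np.array([0.7, -1.2, 0.3])      # real-valued samples, all inside (-M, M)
    M = 2.0
    q = np.linspace(-M, M, 200001)      # quadrature grid for (-inf, inf)

    xq = np.sign(X[:, None] - q)        # thresholded signals x_i^q
    recon = 0.5 * xq.mean(axis=1) * (q[-1] - q[0])   # Riemann sum of Eq. (82)
    assert np.allclose(recon, X, atol=1e-3)          # Eq. (82)

    yq = np.median(xq, axis=0)          # median of the decomposed signals, Eq. (84)
    Y = 0.5 * yq.mean() * (q[-1] - q[0])             # reverse the decomposition
    assert np.isclose(Y, np.median(X), atol=1e-3)    # Eq. (85)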
Thus, applying the median operation on a set of samples, and applying the median operation on the threshold-decomposed set of samples and then reversing the decomposition, give exactly the same result. With threshold decomposition, the weighted median filter operation can be implemented as

    β̂ = MEDIAN(|W_i| ⋄ sgn(W_i) X_i |_{i=1}^{N})
      = MEDIAN(|W_i| ⋄ (1/2) ∫_{−∞}^{∞} sgn[sgn(W_i) X_i − q] dq |_{i=1}^{N})            (86)
where |W_i| ⋄ sgn(W_i) X_i |_{i=1}^{N} = |W_1| ⋄ sgn(W_1) X_1, |W_2| ⋄ sgn(W_2) X_2, …, |W_N| ⋄ sgn(W_N) X_N. The expression in Eq. (86) represents the median operation on a set of weighted integrals, each synthesizing a signed sample. Note that the same result is obtained if the weighted median of these functions, at each value of q, is taken first and the resultant signal is integrated over its domain. Thus, the order of the integral and the median operator can be interchanged without affecting the result, which leads to

    β̂ = (1/2) ∫_{−∞}^{∞} MEDIAN(|W_i| ⋄ sgn[sgn(W_i) X_i − q] |_{i=1}^{N}) dq            (87)
In this representation, the "signed" samples play a fundamental role; thus, we define the "signed" observation vector S as

    S = [sgn(W_1) X_1, sgn(W_2) X_2, …, sgn(W_N) X_N]^T = [S_1, S_2, …, S_N]^T           (88)

The threshold-decomposed signed samples, in turn, form the vector s^q defined as

    s^q = [sgn[sgn(W_1) X_1 − q], sgn[sgn(W_2) X_2 − q], …, sgn[sgn(W_N) X_N − q]]^T
        = [s_1^q, s_2^q, …, s_N^q]^T                                                     (89)
If we let W_a be the vector whose elements are the magnitude weights, W_a = [|W_1|, |W_2|, …, |W_N|]^T, the WM filter operation can be expressed as

    β̂ = (1/2) ∫_{−∞}^{∞} sgn(W_a^T s^q) dq                                               (90)
The WM filter representation using threshold decomposition is compact, although it may seem that the integral term may be difficult to implement in practice. Equation (90), however, is used for the purpose of analysis and not implementation. Next, the threshold decomposition architecture is used to develop the optimization algorithm for WM filters, namely the least mean absolute (LMA) adaptive algorithm. The LMA algorithm shares many of the desirable attributes of the least mean square (LMS) algorithm, including simplicity and efficiency. Assume that the observed process {X(n)} is statistically related to some desired process {D(n)} of interest. {X(n)} is typically a transformed or corrupted version of {D(n)}. Furthermore, it is assumed that these processes are jointly stationary. A window of width N slides across the input process, pointwise estimating the desired sequence. The vector containing the N samples in the
window at time n is

    X(n) = [X(n − N_1), …, X(n), …, X(n + N_2)]^T = [X_1(n), X_2(n), …, X_N(n)]^T        (91)

with N = N_1 + N_2 + 1. The running weighted median filter output estimates the desired signal as

    D̂(n) = MEDIAN(|W_i| ⋄ sgn(W_i) X_i(n) |_{i=1}^{N})

where both the weights, W_i's, and the samples, X_i(n), take on real values. The goal is to determine the weight values in W = [W_1, W_2, …, W_N]^T which will minimize the estimation error. Under the MAE criterion, the cost to minimize is

    J(W) = E{|D(n) − D̂(n)|}
         = E{ | (1/2) ∫_{−∞}^{∞} [sgn(D − q) − sgn(W_a^T s^q)] dq | }                    (92)

where the threshold decomposition representation of the signals is used. The absolute-value and integral operators in Eq. (92) can be interchanged since the integral acts on a strictly positive or a strictly negative function. This results in

    J(W) = (1/2) ∫_{−∞}^{∞} E{ |sgn(D − q) − sgn(W_a^T s^q)| } dq                        (93)

Furthermore, since the argument inside the absolute-value operator in Eq. (93) can take on values only in the set {−2, 0, 2}, the absolute-value operator can be replaced by a properly scaled second-power operator. Thus

    J(W) = (1/4) ∫_{−∞}^{∞} E{ [sgn(D − q) − sgn(W_a^T s^q)]^2 } dq                      (94)

Evaluation of the gradient of the above results in

    (∂/∂W) J(W) = −(1/2) ∫_{−∞}^{∞} E{ e^q(n) (∂/∂W) sgn(W_a^T s^q) } dq                 (95)

where e^q(n) = sgn(D − q) − sgn(W_a^T s^q). Since the sign function is discontinuous at the origin, its derivative introduces Dirac impulse terms that are inconvenient for further analysis. To overcome this difficulty, we can approximate the sign function in Eq. (95) by a smoother differentiable function. A simple
approximation is given by the hyperbolic tangent function, sgn(x) ≈ tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}). Since (d/dx) tanh(x) = sech²(x), with sech(x) = 2/(e^x + e^{−x}), it follows that

    (∂/∂W) sgn(W_a^T s^q) ≈ sech²(W_a^T s^q) (∂/∂W)(W_a^T s^q)                           (96)

Evaluating the derivative in Eq. (96) and doing some simplifications leads to

    (∂/∂W) sgn(W_a^T s^q) ≈ sech²(W_a^T s^q) [sgn(W_1) s_1^q, sgn(W_2) s_2^q, …, sgn(W_N) s_N^q]^T        (97)

Using Eq. (97) in Eq. (95) yields

    (∂/∂W_j) J(W) = −(1/2) ∫_{−∞}^{∞} E{ e^q(n) sech²(W_a^T s^q) sgn(W_j) s_j^q } dq      (98)

Using the gradient, we can find the optimal coefficients through the steepest descent recursive update

    W_j(n + 1) = W_j(n) + 2μ (−(∂/∂W_j) J(W))
               = W_j(n) + μ ∫_{−∞}^{∞} E{ e^q(n) sech²(W_a^T(n) s^q(n)) sgn(W_j(n)) s_j^q(n) } dq        (99)
Using the instantaneous estimate for the gradient, and after some simplifications, we derive the fast LMA WM filter adaptive optimization algorithm as

    W_j(n + 1) = W_j(n) + μ (D(n) − D̂(n)) sgn(W_j(n)) sgn(sgn(W_j(n)) X_j(n) − D̂(n))    (100)

for j = 1, 2, …, N. The adaptive optimization algorithm is suitable for the design of recursive WM smoothers, which do not admit negative weight values. Upon closer examination, it turns out that the constraint of having nonnegative weights can be accomplished by a projection operator that maps all the negative weights to zero. If we use this constraint, the adaptive optimization algorithm for recursive
WM smoothers reduces to

    W_j(n + 1) = P(W_j(n) + μ (D(n) − D̂(n)) sgn(X_j(n) − D̂(n)))

where P(·) is the projection operator defined as

    P(x) = { x  if x ≥ 0
           { 0  if x < 0
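A compact implementation of the LMA recursion of Eq. (100) needs only the weighted median itself plus the sign factors of the update. The following Python sketch is our own illustration: the threshold-crossing rule in wm_filter follows the usual "first signed sample whose accumulated absolute weight reaches half the total absolute weight" convention, zero weights are assumed not to occur (np.sign(0) = 0 would otherwise deviate from Eq. (81)), and the alignment of the desired sample with the window center in train_lma is our own assumption.

    import numpy as np

    def wm_filter(Xw, W):
        """Weighted median with real-valued weights: signs couple to the samples."""
        S = np.sign(W) * Xw                 # signed samples
        aw = np.abs(W)
        order = np.argsort(S)[::-1]         # largest signed sample first
        csum = np.cumsum(aw[order])
        k = np.searchsorted(csum, 0.5 * aw.sum())   # first index meeting T0
        return S[order][k]

    def lma_step(W, Xw, D, mu):
        """One LMA update of Eq. (100) for all weights simultaneously."""
        Dhat = wm_filter(Xw, W)
        return W + mu * (D - Dhat) * np.sign(W) * np.sign(np.sign(W) * Xw - Dhat)

    def train_lma(x, d, N, mu=1e-3, W0=None):
        """Slide a window over the input process and adapt the weights."""
        W = np.ones(N) if W0 is None else W0.astype(float).copy()
        for n in range(len(x) - N + 1):
            W = lma_step(W, x[n:n + N], d[n + N // 2], mu)
        return W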
Figure 11. Image denoising using 3 × 3 recursive and nonrecursive WM filters: (a) original, (b) image with salt-and-pepper noise, (c) nonrecursive center WM filter, (d) recursive center WM filter, (e) optimal nonrecursive WM filter, and (f) optimal RWM filter.
Figures 11d and 11c show their respective filter outputs with a center weight W_c = 5. The recursive WM filter is more effective in removing outliers than its nonrecursive counterpart is. A small 60 × 60 pixel area in the upper left part of the original (Fig. 11a) and noisy (Fig. 11b) images is used to train the recursive WM filter using the LMA algorithm. The same training data are used to train a nonrecursive WM filter. The initial conditions for the weights for both algorithms were the filter coefficients of the center WM filters just described. The step size used was
TABLE 4
Results for Impulse Noise Removal

    Image                             Normalized mean square error    Normalized mean absolute error
    Noisy image                       2545.20                         12.98
    Recursive center WM filter        189.44                          1.69
    Nonrecursive center WM filter     243.83                          1.92
    Optimal nonrecursive WM filter    156.30                          1.66
    Optimal RWM filter                88.13                           1.57
10^{−3} for both adaptive algorithms. The optimal weights found by the adaptive algorithms are

    1.38  1.64  1.32         1.24  1.52  2.34
    1.50  5.87  2.17   and   2.07  4.89  1.45
    0.63  1.36  2.24         1.95  0.78  2.46

for the nonrecursive and recursive WM filters, respectively, where the center weight of each mask (5.87 and 4.89, respectively) is the one associated with the center sample of the 3 × 3 window. The optimal filters determined by the training algorithms were used to filter the entire image. Figures 11f and 11e show the output of the optimal RWM filter and the output of the nonrecursive WM filter, respectively. The normalized mean square errors and the normalized mean absolute errors produced by each of the filters are listed in Table 4. As can be seen by a visual comparison of the various images and by the error values, recursive WM filters outperform nonrecursive WM filters.
C. Optimal Frequency Selection WM Filtering

We now consider the design of a robust bandpass recursive WM filter using the LMA adaptive optimization algorithm. The performance of the optimal recursive WM filter is compared with the performances of a linear FIR filter, a linear IIR filter, and a nonrecursive WM filter, all designed for the same task. Moreover, to show the noise attenuation capability of the recursive WM filter and compare it with those of the other filters, we use an impulse-noise-corrupted signal as a test signal. Examples are shown for one-dimensional signals for illustration purposes, but the extension to two-dimensional signals is straightforward. The application at hand is the design of a 62-tap bandpass RWM filter with passband 0.075 ≤ ω ≤ 0.125 (normalized Nyquist frequency = 1). We use white Gaussian noise with zero mean and variance equal to one as input
training signals. The desired signal is provided by the output of a large FIR filter (a 122-tap linear FIR filter) designed by Matlab's fir1 function. The 31 feedback filter coefficients were initialized to small random numbers (on the order of 10^{−3}). The feed-forward filter coefficients were initialized to the values outputted by Matlab's fir1 with 31 taps and the same passband of interest. A variable step size, μ(n), was used in both adaptive optimizations, where μ(n) changes according to μ_0 e^{−n/100} with μ_0 = 10^{−2}. A signal that spans the entire range of frequencies of interest is used as a test signal. Figure 12a depicts a linear swept-frequency signal spanning instantaneous frequencies from 0 to 400 Hz, with a sampling rate of 2 kHz. Figure 12b shows the chirp signal filtered by the 122-tap linear FIR filter that was used to produce the desired signal during the training stage. Figure 12c shows the output of a 62-tap linear FIR filter used here for comparison purposes. The adaptive optimization algorithm described in Section VIII.E was used to optimize a 62-tap nonrecursive WM filter admitting negative weights. The filtered signal attained with the optimized weights is shown in Figure 12d. Note that the nonrecursive WM filter tracks the frequencies of interest but fails to attenuate completely the frequencies outside the desired passband. Matlab's yulewalk function was used to design a 62-tap linear IIR filter with passband 0.075 ≤ ω ≤ 0.125. Figure 12e depicts the linear IIR filter's output. Finally, Figure 12f shows the output of the optimal recursive WM filter determined by the LMA training algorithm described in Section VIII.F. Note that the frequency components of the test signal that are not in the passband are attenuated completely. Moreover, the RWM filter generalizes very well on signals that were not used during the training stage. Comparing the different filtered signals in Figure 12, we can see that the recursive filtering operation performs much better than its nonrecursive counterpart having the same number of coefficients. Alternatively, to achieve a specified level of performance, a recursive WM filter generally requires considerably fewer filter coefficients than the corresponding nonrecursive WM filter does. As a test of the robustness of the different filters, the test signal is next contaminated with additive α-stable noise, as shown in Figure 13a. Impulse noise was generated using the parameter alpha (α) set to 1.4. Figure 13a is truncated so that the same scale is used in all the plots. Figures 13b and 13d show the filter outputs of the linear FIR and the linear IIR filters, respectively. Both outputs are severely affected by the noise. On the other hand, the outputs of the nonrecursive and recursive WM filters, shown in Figures 13c and 13e, respectively, remain practically unaltered. Figure 13 clearly depicts the robust characteristics of median-based filters. To better evaluate the frequency response of the various filters, we must perform a frequency-domain analysis. Due to the nonlinearity inherent in
Figure 12. Bandpass filter design: (a) input test signal, (b) desired signal, (c) linear FIR filter output, (d) nonrecursive WM filter output, (e) linear IIR filter output, and (f) RWM filter output. (Reproduced with permission from Arce and Paredes 2000, © 2000 IEEE.)
the median operation, traditional linear tools, such as transfer-function-based analysis, cannot be applied. However, if the nonlinear filters are treated as a single-input, single-output system, the magnitude of the frequency response can be experimentally obtained as follows. A single-tone sinusoidal signal sin(2π f t) is given as the input to each filter, where f spans the complete range
Figure 13. Performance of the bandpass filter in noise: (a) chirp test signal in stable noise, (b) linear FIR filter output, (c) nonrecursive WM filter output, (d) linear IIR filter output, and (e) RWM filter output. (Reproduced with permission from Arce and Paredes 2000, © 2000 IEEE.)
of possible frequencies. A sufficiently large number of frequencies spanning the interval [0, 1] is chosen. For each frequency value, the mean power of each filter's output is computed. Figure 14a shows a plot of the normalized mean power versus frequency attained by the different filters. Upon closer examination of Figure 14a, it can be seen that the recursive WM filter yields
Figure 14. Frequency response (a) to a noiseless sinusoidal signal and (b) to a noisy sinusoidal signal: (solid) RWM filter, (− · − · −) nonrecursive WM filter, (- - -) linear FIR filter, and (- - -) linear IIR filter. (Reproduced with permission from Arce and Paredes 2000, © 2000 IEEE.)
the flattest response in the passband of interest. A similar conclusion can be drawn from the time-domain plots shown in Figure 12. So that we can see the effects that impulse noise has on the magnitude of the frequency response, a contaminated sinusoidal signal, sin(2π f t) + η, is given as the input to each filter, where η is α-stable noise with parameter α = 1.4. Following the same procedure just described, we obtain the mean power versus frequency diagram, which is shown in Figure 14b. As expected, the magnitudes of the frequency responses of the linear filters are highly distorted, whereas the magnitudes of the frequency responses of the median-based filters do not change significantly with noise.
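The single-tone probing procedure just described is straightforward to script. In this sketch (our own; filt is any callable single-input single-output filter, such as a running weighted median), each normalized frequency is probed with a sinusoid and the mean output power is recorded.

    import numpy as np

    def empirical_response(filt, freqs, n=4096):
        """Mean output power of a (possibly nonlinear) filter per probe frequency."""
        power = []
        for f in freqs:                            # f in (0, 1), Nyquist = 1
            x = np.sin(np.pi * f * np.arange(n))   # digital frequency pi * f
            y = np.asarray(filt(x))
            power.append(np.mean(y ** 2))
        power = np.array(power)
        return power / power.max()                 # normalized mean power

    # example: probe 200 frequencies of some filter `my_wm` over (0, 1)
    # response = empirical_response(my_wm, np.linspace(0.01, 0.99, 200))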
D. Sharpening with WM Filters

In principle, image sharpening consists of adding to the original image a signal that is proportional to a high-pass-filtered version of the original image. Figure 15 illustrates this procedure, often referred to as unsharp masking, on a one-dimensional signal (Jain, 1989). As shown in Figure 15, the original image is first filtered by a high-pass filter which extracts the high-frequency components, and then a scaled version of the high-pass filter output is added to the original image, which produces a sharpened image of the original. Note that the homogeneous regions of the signal (i.e., where the signal is constant) remain unchanged. The sharpening operation can be represented by

    Y(m, n) = X(m, n) + λ F(X(m, n))                                 (108)

where X(m, n) is the original pixel value at the coordinate (m, n), F(·) is the output of the high-pass filter, λ is a tuning parameter greater than or equal to
Figure 15. Image sharpening by high-frequency emphasis.
zero, and Y(m, n) is the sharpened pixel at the coordinate (m, n). The value taken by λ depends on the grade of sharpness desired. Increasing λ yields a more sharpened image. If background noise is present, however, increasing λ will rapidly amplify the noise. The key point in an effective sharpening process lies in the choice of the high-pass filtering operation. Traditionally, linear filters have been used to implement the high-pass filter; however, linear techniques can lead to rapid performance degradation should the input image be corrupted with noise. A trade-off between noise attenuation and edge highlighting can be obtained if a weighted median filter with appropriate weights is used. To illustrate this, consider a WM filter applied to a gray-scale image where the following filter mask is used:

        −1  −1  −1
    W = −1   8  −1                                                   (109)
        −1  −1  −1

Due to the weight coefficients in Eq. (109), for each position of the moving window, the output is proportional to the difference between the center pixel and the smallest pixel around the center pixel. That is,

    Y′ = (X_c − X_(1)) / 2                                           (110)

where X_c is the center pixel of the observation window. Thus, the filter output takes relatively large values for prominent edges in an image and small values in regions that are fairly smooth, being zero only in regions that have a constant gray level. The effect that Eq. (110) has on edges is somewhat different from that of an FIR high-pass filter with the same filter coefficients. To illustrate this point better, consider a one-dimensional input signal with a positive-slope edge.* The linear high-pass filter responds on both sides of the edge, outputting a negative value on the left-hand side of the edge and a positive value on the right-hand side of the edge. On either side the output is proportional to the height of the edge. The WM high-pass filter exhibits a different behavior. The output in Eq. (110) is very small if the window is located on the left-hand side of the edge. This follows from the fact that X_c and X_(1) are on the same side of the edge and thus have similar gray levels. As the window shifts over the edge, X_c becomes one of the higher gray levels on the right-hand side of the edge, while X_(1), the smallest sample, is still one of the low gray levels on the left-hand side. The WM high-pass filter thus outputs a positive value
∗ A change from a gray level to a higher gray level is referred to as a positive-slope edge, whereas a change from a gray level to a lower gray level is referred to as a negative-slope edge.
Figure 16. Image sharpening based on the weighted median filter.
proportional to the height of the edge on the right-hand side. Hence, the WM high-pass filter responds to only one side of the edge. This behavior is reversed for negative-slope edges. To overcome this limitation, we must modify the basic image-sharpening structure shown in Figure 15 such that positive-slope edges as well as negative-slope edges are highlighted in the same proportion. A simple way to accomplish this is: (a) extract the positive-slope edges by filtering the original image with the filter mask described previously; (b) extract the negative-slope edges by first preprocessing the original image such that the negative-slope edges become positive-slope edges and then filtering the preprocessed image with the filter described previously; and (c) combine appropriately the original image, the filtered version of the original image, and the filtered version of the preprocessed image to form the sharpened image. Thus both positive-slope edges and negative-slope edges are equally highlighted. This procedure is illustrated in Figure 16, where the top branch extracts the positive-slope edges and the middle branch extracts the negative-slope edges. So that we understand the effects of edge sharpening, a row of a test image is plotted in Figure 17 together with a row of the sharpened image when only the positive-slope edges are highlighted (Fig. 17a), only the negative-slope edges are highlighted (Fig. 17b), and both positive-slope and negative-slope edges are jointly highlighted (Fig. 17c). In Figure 16, λ_1 and λ_2 are tuning parameters that control the amount of sharpness desired in the positive-slope direction and in the negative-slope direction, respectively. The values of λ_1 and λ_2 are generally selected to be equal. The output of the prefiltering operation is defined as

    X′(m, n) = M − X(m, n)                                           (111)
Figure 17. Original row of a test image (solid line) and row sharpened (dotted line) with (a) only positive-slope edges, (b) only negative-slope edges, and (c) both positive-slope and negative-slope edges.
with M equal to the maximum pixel value of the original image. This prefiltering operation can be thought of as a flipping and shifting operation on the values of the original image such that the negative-slope edges are converted into positive-slope edges. Since the original image and the prefiltered image are filtered by the same WM filter, the positive-slope edges and negative-slope edges are sharpened in the same way. In Figure 16, the output of the first branch is given by Eq. (110), whereas the output of the middle branch can be easily obtained by combining the prefiltering operation with the WM high-pass filter. Because of the prefiltering operation, X_c becomes M − X_c, and X_(1) becomes M − X_(N). When we implement these modifications, the output of the middle branch in Figure 16 is given by

    Y_m(n) = (X_(N) − X_c) / 2                                       (112)
Combining the three branches shown in Figure 16 yields the output of the WM sharpener:

    Y(n) = (1 + λ) X_c − λ V_n                                       (113)
where V_n = (X_(1) + X_(N))/2 is the midrange of the observation set. Note that the computation of Eq. (113) does not require sorting the entire observation set. It is sufficient to determine the minimum and maximum samples, which is simpler than sorting the full set of samples.
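Equation (113) makes the WM sharpener essentially a min/max operation, which the following Python sketch implements directly (our own illustration, using scipy's sliding min/max filters over a 3 × 3 window).

    import numpy as np
    from scipy.ndimage import maximum_filter, minimum_filter

    def wm_sharpen(img, lam=2.0):
        """WM sharpener of Eq. (113): Y = (1 + lam) * X_c - lam * midrange."""
        img = img.astype(float)
        lo = minimum_filter(img, size=3)     # X_(1) over each 3 x 3 window
        hi = maximum_filter(img, size=3)     # X_(N)
        V = 0.5 * (lo + hi)                  # midrange V_n
        return (1.0 + lam) * img - lam * V

Only the window minimum and maximum are needed, matching the remark above that full sorting is unnecessary.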
Figure 18. (a) Original image sharpened with (b) the FIR sharpener and (c) the WM sharpener. (d) Image with added Gaussian noise sharpened with (e) the FIR sharpener and (f) the WM sharpener.
In Figure 18, the performance of WM filter image sharpening is compared with that of traditional image sharpening based on linear FIR filters. For the linear sharpener, the scheme in Figure 15 was used. The parameter λ was set to 1. For the WM sharpener, the scheme of Figure 16 was used with λ_1 = λ_2 = 2. The filter mask given by Eq. (109) was used in median image sharpening, whereas the filter mask for the linear image sharpening is (1/3)W, where W is given by Eq. (109). Sharpeners with WM filters do not introduce as much noise amplification as sharpeners equipped with FIR filters do.
X. Conclusion

In this article, recent developments in stack filtering and stack smoothing were presented. Unlike stack smoothers, the recently introduced class of stack
filters have been empowered not only with low-pass filtering characteristics but with bandpass and high-pass filtering characteristics as well. Much as threshold decomposition provides the foundation for the definition of stack smoothers, the new class of stack filters can be defined in a similar fashion through a more general threshold decomposition architecture referred to as mirrored threshold decomposition. A particular class of stack filters is the class of weighted median filters admitting negative weights. It has been shown that these nonlinear filtering structures can effectively address a number of signal and image problems that require frequency-selection types of applications. It was also shown that when a recursive structure is incorporated into the WM framework, the resultant filtering structure offers near-perfect "stopband" characteristics and robustness against noise. As illustrated in this article, there are several image and signal applications where WM filters provide significant advantages over traditional methods using linear filters.
Acknowledgments

This research has been supported through collaborative participation in the Advanced Telecommunications/Information Distribution Research Program (ATIRP) Consortium sponsored by the U.S. Army Research Laboratory under the Federated Laboratory Program, Cooperative Agreement DAAL01-96-2-0002, and by the National Science Foundation under grants MIP-9530923 and CDA-9703088.
References

Arce, G. R. (1986). Statistical threshold decomposition for recursive and non-recursive median filters. IEEE Trans. Inform. Theory IT-32.
Arce, G. R. (1998). A general weighted median filter structure admitting negative weights. IEEE Trans. Signal Processing SP-46.
Arce, G. R., and Gallagher, N. C. (1988). Stochastic analysis of the recursive median filter process. IEEE Trans. Inform. Theory IT-34.
Arce, G. R., and Paredes, J. L. (2000). Recursive weighted median filters admitting negative weights and their optimization. IEEE Trans. Signal Processing 48, 768–779.
Brownrigg, D. R. K. (1984). The weighted median filter. Commun. Assoc. Comput. Mach. 27.
Coyle, E. J., and Lin, J. (1988). Stack filters and the mean absolute error criterion. IEEE Trans. Acoustics, Speech Signal Processing 36, 1244–1254.
Fitch, J. P., Coyle, E. J., and Gallagher, N. C. (1984). Median filtering by threshold decomposition. IEEE Trans. Acoustics, Speech Signal Processing 32.
Gilbert, E. N. (1954). Lattice-theoretic properties of frontal switching functions. J. Math. Phys. 33.
Hu, S.-T. (1965). Threshold Logic. Berkeley, CA: Univ. of California Press.
Jain, A. K. (1989). Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice Hall.
Ko, S.-J., and Lee, Y. H. (1991). Center weighted median filters and their applications to image enhancement. IEEE Trans. Circuits Syst. 38.
Lin, J., and Kim, Y. T. (1994). Fast algorithms for training stack filters. IEEE Trans. Signal Processing 42, 772–781.
Lin, J., Sellke, T. M., and Coyle, E. J. (1990). Adaptive stack filtering under the mean absolute error criterion. IEEE Trans. Acoustics, Speech Signal Processing 38, 938–954.
Muroga, S. (1971). Threshold Logic and Its Applications. New York: Wiley.
Nieweglowski, J., Gabbouj, M., and Neuvo, Y. (1993). Weighted medians–positive Boolean functions conversion algorithms. Signal Processing 34, 146–162.
Nodes, T. A., and Gallagher, N. C. (1982). Median filters: Some modifications and their properties. IEEE Trans. Acoustics, Speech Signal Processing ASSP-30.
Paredes, J. L., and Arce, G. R. (1999). Stack filters, stack smoothers, and mirrored threshold decomposition. IEEE Trans. Signal Processing 47, 2757–2767.
Paredes, J. L., and Arce, G. R. (in press). Optimization of stack filters based on mirrored threshold decomposition. IEEE Trans. Signal Processing.
Parker, R. G., and Rardin, R. L. (1988). Discrete Optimization. Computer Science and Scientific Computing. Academic Press.
Pitas, I., and Venetsanopoulos, A. (1990). Nonlinear Digital Filters: Principles and Applications. Boston: Kluwer Academic.
Pratt, W. K. (1991). Digital Image Processing. New York: Wiley.
Samorodnitsky, G., and Taqqu, M. S. (1994). Stable Non-Gaussian Random Processes. New York: Chapman & Hall.
Sheng, C. L. (1969). Threshold Logic. Ontario, Canada: The Ryerson Press.
Shmulevich, I., Paredes, J. L., and Arce, G. R. (in press). Output distributions of stack filters based on mirrored threshold decomposition. IEEE Trans. Signal Processing.
Shynk, J. (1989). Adaptive IIR filtering. IEEE ASSP Magazine 6, 4–21.
Wendt, P., Coyle, E. J., and Gallagher, N. C., Jr. (1986). Stack filters. IEEE Trans. Acoustics, Speech Signal Processing 34.
Yin, L., and Neuvo, Y. (1994). Fast adaptation and performance characteristic of FIR-WOS hybrid filters. IEEE Trans. Signal Processing 42.
Yin, L., Yang, R., Gabbouj, M., and Neuvo, Y. (1996). Weighted median filters: A tutorial. Trans. Circuits Syst. II 43, 157–192.
Yli-Harja, O., Astola, J., and Neuvo, Y. (1991). Analysis of the properties of median and weighted median filters using threshold logic and stack filter representation. IEEE Trans. Acoustics, Speech Signal Processing 39.
Zukhovitskiy, S. I., and Avdeyeva, L. (1966). Linear and Convex Programming. Philadelphia: Saunders.
Resolution Reconsidered: Conventional Approaches and an Alternative

A. VAN DEN BOS AND A. J. DEN DEKKER
Department of Physics, Delft University of Technology, 2600 GA Delft, The Netherlands
I. Introduction
II. Classical Resolution Criteria
   A. Introduction
   B. Rayleigh and Rayleigh-like Resolution Criteria
   C. Proposals to Improve the Rayleigh Resolution
   D. Conclusions
III. Other Resolution Criteria
   A. Introduction
   B. Extrapolation and Superresolution
      1. Introduction
      2. Extrapolation
      3. Inverse Filtering
   C. Approaches Based on Information Theory
   D. Approaches Based on Signal-to-Noise Ratio
   E. Approaches Based on Decision Theory
   F. Approaches Based on Asymptotic Estimation
   G. Conclusions
IV. Modeling and Parameter Estimation
   A. Introduction
   B. Parametric Statistical Models of Observations
   C. Parameter Estimation
      1. Dependence of the Probability Density Function of the Observations on the Parameters
      2. Limits to Precision: The Cramér–Rao Lower Bound
      3. Maximum Likelihood Estimation
      4. Limits to Resolution of Parameters: A Numerical Example
   D. Conclusions
V. Elements of Singularity Theory
   A. Introduction
   B. Definitions and Notions
   C. Functions around Stationary Points
      1. The Morse Lemma and the Splitting Lemma
      2. Simple Examples of Singular Representations
      3. Bifurcation Sets
   D. Functions near Singularities
      1. Derivation of the Reduction Algorithm
      2. Useful Polynomial Substitutions
   E. Conclusions
241 Volume 117 ISBN 0-12-014759-9
C 2001 by Academic Press ADVANCES IN IMAGING AND ELECTRON PHYSICS Copyright All rights of reproduction in any form reserved. ISSN 1076-5670/01 $35.00
242
A. VAN DEN BOS AND A. J. DEN DEKKER
VI. Singularity of Likelihood Functions . . . . . . . . . . . . . A. Introduction . . . . . . . . . . . . . . . . . . . . . B. Likelihood Functions for Two-Component Models . . . . . C. Stationary Points of the Likelihood Functions. . . . . . . . D. Examples of Stationary Points . . . . . . . . . . . . . . E. The Hessian Matrix of the Log-Likelihood Function. . . . . 1. Change of Coordinates. . . . . . . . . . . . . . . . 2. The Hessian Matrix at the One-Component Stationary Point 3. Summary, Discussion, and Conclusions. . . . . . . . . F. The Degenerate Part of the Likelihood Function. . . . . . . 1. The Quadratic Term . . . . . . . . . . . . . . . . . 2. The Cubic Term. . . . . . . . . . . . . . . . . . . 3. The Quartic Term . . . . . . . . . . . . . . . . . . G. Summary and Conclusions . . . . . . . . . . . . . . . VII. Singularity and Resolution. . . . . . . . . . . . . . . . . A. Introduction . . . . . . . . . . . . . . . . . . . . . B. Coinciding Scalar Locations. . . . . . . . . . . . . . . 1. Conditions for Resolution. . . . . . . . . . . . . . . 2. Numerical Computation of the Resolution Criterion. . . . 3. Conditions for the Use of the Criterion . . . . . . . . . 4. Application to RayleighÕs sinc-Square Model . . . . . . 5. Resolution as a Property of Observations . . . . . . . . 6. Resolution from Error Disturbed Observations . . . . . . 7. Probability of Resolution and Critical Errors. . . . . . . C. Coinciding Two-Dimensional Locations. . . . . . . . . . 1. Conditions for Resolution. . . . . . . . . . . . . . . 2. Application to the Airy Model. . . . . . . . . . . . . D. Nonstandard Resolution: Partial Coherence . . . . . . . . E. A Survey of Related Literature. . . . . . . . . . . . . . F. Summary and Conclusions . . . . . . . . . . . . . . . VIII. Summary and Conclusions. . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
292 292 293 295 298 304 305 307 310 311 311 316 320 323 325 325 325 325 327 328 330 332 337 339 344 344 347 350 352 353 353 355
I. Introduction

The primary purpose of this article is to propose and explain an alternative definition of two-component resolution which may also be used as a resolution criterion. Two-component resolution is the ability to distinguish two overlapping component functions of the same family in a set of observations. For incoherent optics and electron optics, the components are often point spread functions. Thus two-component resolution is the same as what in the literature is called two-point resolution. The reason for the alternative definition is that the existing approaches, however useful, are often specialized or stem from an era when neither adequate photon or electron detectors nor fast computers and software existed.
So that the differences with the proposed approach can be seen, the existing approaches will first be reviewed briefly. Next, statistical and mathematical introductory material needed by the new approach will be presented. This material consists of two distinct parts. The first part is a description of the statistical model of two-component observations along with the most important statistical methods for estimation of the parameters of the components. The second part consists of elements of nonlinear mathematics. This introductory material is used later in the derivation of the remarkable and unusual properties of the estimates of the component parameters. The proposed resolution definition and the corresponding resolution criterion will be shown to be a direct consequence of these properties.

The detailed outline of the sections in this article is as follows. Section II is a description of classical resolution criteria. These are modified versions of Rayleigh's well-known resolution criterion. The main characteristic of classical resolution criteria is that they are simply measures of the width of the main lobe of the point spread function concerned. This is a consequence of the fact that these criteria are intended to describe the resolving capabilities of a human observer, and in this limited sense they are operational. Also, these definitions are based on the exact functional form of the point spread function, that is, on limitations of the optical system concerned, not on imperfections of the optical observations.

Section III starts with a brief description of digital image processing methods intended to improve resolution. The resolution achieved by these methods exceeds the classical limits but is empirically found to be limited by noise. Therefore, these methods show that the limit to resolution is not the width of the point spread function or, equivalently, diffraction, but imperfections of the observations. In Section III.C, literature is addressed relating information theory to resolution. The described results, which are not intended to assess practical limits to two-component resolution, also emphasize the influence of the signal-to-noise ratio. Then, in Section III.D, more or less empirical methods are discussed directly linking resolution with the signal-to-noise ratio. The signal-to-noise ratio is equally important in theories, briefly reviewed in Section III.E, relating resolution to decision theory. The methods discussed in Sections III.C through III.E share an emphasis on the influence of noise on resolution. Methods described in Section III.F, which make use of parametric statistical models of the observations combined with statistical parameter estimation techniques, do the same but are much more general. However, the examples of such methods found in the literature require three conditions to be met: the parametric component function must be correct, the probability density of the observations must be known, and the number of observations must be infinite since asymptotic expressions for the variances are used.
Sections IV through VII introduce and explain the alternative resolution definition and the corresponding criterion proposed in this article. These will also be based on parametric models of the observations and parameter estimation methods. However, the conditions of the parametric methods described in Section III.F will not be made, simply because they are not needed. Therefore, the alternative resolution definition and criterion are much more general. First, in Section IV, the parametric statistical model of the two-component observations will carefully be described. A description will also be given of the most important method for estimation of the parameters of such observation models: the maximum likelihood method. The maximum likelihood estimate is the solution for the parameters that maximizes the so-called likelihood function, a function of the parameters directly derived from the probability density function of the observations. The asymptotic properties of the maximum likelihood method will also be discussed. Section IV is concluded by a numerical example. In this example, least squares estimates of the locations of a pair of strongly overlapping Gaussian peaks are computed from simulated, noise-corrupted observations. The experiment is repeated many times to assess the distribution of the estimates. The results show that the properties of parameter estimates from numbers of observations such as are usual in practice may be fundamentally different from those computed from asymptotic expressions. In particular, the asymptotic distributions do not predict the frequent occurrence of exactly coinciding estimates of the locations, resulting in a one-component solution from two-component observations. Therefore, these solutions disclose the existence of limits to resolution not revealed by asymptotic considerations.

In Sections V through VII, it will be seen that this remarkable behavior of the location estimates is a consequence of the way the solutions for these parameters depend on the observations. Section V presents elements of singularity theory, a branch of mathematics dealing with structural change of parametric functions under the influence of changes of the parameters, where structure is the pattern of the stationary points of the function. In Section VI, the singularity theory discussed in Section V will be used to show that under the influence of modeling errors or statistical errors in the observations the structure of the likelihood function may change. In Section VII, the results of Section VI are used to explain the remarkable coincidence of estimates of the locations of the components under the influence of the observations and to derive the main result of this article: a criterion distinguishing the observations for which the location estimates are distinct from those for which the estimates coincide. If two components are defined as resolved if the estimates of their locations are distinct, the derived criterion is a natural resolution criterion. Moreover, by use of this criterion, it is easy to find out whether a set of observations corresponds to resolved components or not. In Section VII.C the criterion is extended to include resolution in two dimensions
and in Section VII.D it is applied to partially coherent observations. A survey of relevant literature concludes Section VII. Section VIII is devoted to conclusions.
II. Classical Resolution Criteria

A. Introduction

Resolution is widely used as a performance measure of imaging systems in optics and electron optics. It was Lord Rayleigh who made the concept of resolution popular by introducing his well-known criterion for two-point resolution. The criterion defines the possibility of perceiving separately two point sources in an image formed by a diffraction-limited imaging system or objective lens. The model of an object in the form of two points originated from astronomical problems, in which many objects are effectively point sources. In this section, the Rayleigh resolution criterion and criteria related to it are briefly discussed. Furthermore, attempts to improve the Rayleigh resolution are described. Finally, it will be pointed out that, in spite of their usefulness in the quality assessment of imaging systems, the so-called classical resolution criteria, with which this section is concerned, do not provide absolute limits to resolution.
B. Rayleigh and Rayleigh-like Resolution Criteria

The Rayleigh criterion states that two point sources of equal brightness are just resolvable when the maximum of the intensity pattern produced by the first point source falls on the first zero of the intensity pattern produced by the second point source (Rayleigh, 1902). Consequently, the Rayleigh resolution limit is given by the distance between the central maximum and the first zero of the diffraction-limited intensity point spread function of the imaging system concerned. The Rayleigh resolution criterion can be generalized to include point spread functions that have no zero in the neighborhood of their central maximum by taking the resolution limit as the distance for which the ratio of the value at the central dip in the composite intensity distribution to that at the maxima on either side is equal to 0.81. This corresponds to the original Rayleigh limit for a rectangular aperture. Rayleigh's choice of resolution limit is based on presumed resolving capabilities of the human visual system when it is used to detect differences in intensity at various points of the composite intensity distribution. Well aware of the arbitrariness of the proposed criterion, Rayleigh himself evaluated it as follows: "This rule is convenient on account of its simplicity and it is sufficiently accurate in view of the necessary
uncertainty as to what exactly is meant by resolution" (Rayleigh, 1902; p. 85). Since Rayleigh's days, several other resolution criteria have been proposed which are similar to Rayleigh's. Notable examples of such so-called classical criteria for two-point resolution are those of Buxton (1937), Houston (1927), Schuster (1924), and Sparrow (1916). Schuster's criterion states that the two point sources are just resolved if no portion of the main lobe (central band) of the intensity distribution of one overlaps the main lobe of the other. This criterion provides a resolution limit that is twice that of Rayleigh's. According to Houston's criterion, the two point sources are just resolved if the distance between the two maxima in the composite intensity distribution produced by the two point sources equals the full width at half maximum (FWHM) of the intensity distribution of either point source. The criterion proposed by Buxton is quite similar to that of Houston. Buxton, however, did not consider the intensity distributions, but their square roots, the amplitude distributions. According to Buxton's criterion, at the limit of resolution the amplitude distributions of the two point sources should intersect at their points of inflection. It was Sparrow who pointed out that the Rayleigh criterion as originally proposed was not, as many researchers assumed, intended as a measure of the actual limit of resolution; it was instead intended as a figure of merit for different instruments. According to Sparrow, the resolution limit for two point sources of equal brightness is given by the undulation condition, that is, by the condition that the central minimum (dip) in the composite intensity distribution just disappears. Then both central maxima and the minimum between them just coincide. In this case, a central dip in the composite intensity distribution could never be detected, not even if visual inspection were replaced by a hypothetical perfect intensity measurement instrument. The reason is simply that there is no such dip anymore. The Sparrow limit is, therefore, often referred to as the natural or physical limit to resolution. A clear description of the previously mentioned and several other classical resolution criteria is given by Ramsay and coauthors (1941). It is worth mentioning that although in their original context classical resolution criteria such as Rayleigh's and Sparrow's tacitly assume incoherent illumination, they have later been generalized to include partially as well as fully coherent illumination (Born and Wolf, 1980; Lummer and Reiche, 1910; Luneberg, 1944; McKechnie, 1972).
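The 0.81 ratio quoted above is easy to verify numerically. The following minimal Python sketch, added purely as an illustration, assumes the one-dimensional sinc-square point spread function of a rectangular aperture and measures distance in units of the Rayleigh separation:

```python
import numpy as np

# Sinc-square intensity PSF of a rectangular aperture; its first zero lies
# at x = 1, so x is measured in units of the Rayleigh distance.
def psf(x):
    return np.sinc(x) ** 2          # np.sinc(x) = sin(pi x) / (pi x)

x = np.linspace(-3.0, 3.0, 6001)

# Two equally bright incoherent sources at the Rayleigh separation: the
# maximum of one intensity pattern falls on the first zero of the other.
composite = psf(x - 0.5) + psf(x + 0.5)

dip = composite[np.argmin(np.abs(x))]   # central value, at x = 0
print(dip / composite.max())            # approximately 0.81 (exactly 8 / pi^2)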
C. Proposals to Improve the Rayleigh Resolution

All classical resolution criteria express resolution in terms of the width of the main lobe of the point spread function of the diffraction-limited imaging
system. The narrower the point spread function, the better the resolution. Even recently, it was pointed out once more that a good image resolution measure should be proportional to the width of the point spread function of the imaging system (Wang and Li, 1999). The shape of the point spread function depends on the shape of the system's aperture and the spatial transmittance in the plane containing this aperture. The function describing the spatial transmittance is called the pupil function (Castleman, 1979). A uniform transmittance over the aperture corresponds to a pupil function that is equal to one at points in the aperture and equal to zero elsewhere. This gives rise to conventional point spread functions such as Airy functions and sinc-square functions for circular and rectangular apertures, respectively (Castleman, 1979). However, it is possible to modify the transmittance over the aperture by implementing a suitable filter. Such a modification of the uniform amplitude transmission over the aperture (or pupil) is known as apodization. In the literature, many attempts to decrease the width of the central band of the system's point spread function by means of apodization have been described (Barakat, 1962; Barakat and Levin, 1963; Jacquinot and Roizen-Dossier, 1964; Osterberg, 1950; Osterberg and Wilkins, 1949; Osterberg and Wissler, 1949; Wolf, 1951). Wilkins (1950) has shown that, in principle, apodization can reduce the width of the main lobe of the point spread function indefinitely, so that, theoretically, unlimited resolution in the Rayleigh sense can be attained. There is, however, no practical interest in carrying the apodization process to extremes, since a narrowing of the main lobe will generally have the side effect of a considerable rise of the level of the side lobes of the modified point spread function (McKechnie, 1972). Furthermore, as shown by Luneberg (1944), the point spread function with the highest central maximum corresponds to uniform transmittance, so that any transmittance other than uniform gives rise to a lower central maximum of the point spread function. This may be undesirable. For example, it can be shown that obstructing the central part of the aperture narrows the main lobe of the system's point spread function and thus increases the resolution in the Rayleigh sense. However, this effect is accompanied not only by an increasing loss of light in the image as more of the aperture is obstructed but also by a considerable rise in the level of the side lobes of the point spread function, since a larger amount of light is diffracted from its proper geometrical position. In spectral analysis, such a leak of power from the main lobe to the side lobes is called leakage. Obviously, leakage makes the interpretation of the images more difficult. Because of these practical limits, later work on apodization focused on the problems of finding, for a specified Rayleigh limit, the pupil function (and associated point spread function) having the maximum central irradiance (Kibe and Wilkins, 1983, 1984; Wilkins, 1950) and the pupil function corresponding to a point spread function having as much of its energy as possible concentrated in a circle of a specified radius (Clements and Wilkins, 1974).
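A one-dimensional sketch may make this trade-off concrete. In the following Python fragment, a minimal illustration whose aperture geometry and sample counts are arbitrary choices rather than values from the references above, the intensity point spread function is computed as the squared modulus of the Fourier transform of a pupil function, once for a uniform aperture and once with an opaque central obstruction:

```python
import numpy as np

n = 1 << 14                          # heavy zero padding for a smooth PSF
aperture = np.zeros(n)
aperture[:256] = 1.0                 # uniform transmittance over the aperture

obstructed = aperture.copy()
obstructed[96:160] = 0.0             # opaque central obstruction

def lobe_stats(pupil):
    # Intensity PSF = squared modulus of the Fourier transform of the pupil.
    psf = np.abs(np.fft.fftshift(np.fft.fft(pupil))) ** 2
    psf /= psf.max()
    c = int(np.argmax(psf))
    lo, hi = c, c
    while psf[lo] > 0.5:             # walk out to the half-maximum points
        lo -= 1
    while psf[hi] > 0.5:
        hi += 1
    side = max(psf[:lo + 1].max(), psf[hi:].max())   # highest side lobe
    return hi - lo - 1, side         # main-lobe FWHM (samples), side-lobe level

for name, pupil in (("uniform", aperture), ("obstructed", obstructed)):
    fwhm, side = lobe_stats(pupil)
    print(f"{name:10s}: FWHM = {fwhm:3d} samples, peak side lobe = {side:.3f}")
```

With these arbitrary numbers the obstructed aperture yields a main lobe that is roughly 10 to 15 percent narrower but side lobes several times higher, in line with the leakage effect described above.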
D. Conclusions

It may be concluded from the foregoing subsections that classical resolution criteria in fact do not concern detected images but calculated images (Ronchi, 1961), that is, noise-free images exactly describable by a known parameterized two-component mathematical model. The shape of the component function, that is, the point spread function, is assumed to be exactly known. However, the classical resolution criteria disregard the possibility of using this a priori knowledge about the point spread function to extract analytic results from observations by means of deconvolution or by model fitting using parameter estimation methods. Obviously, in the absence of noise, numerically fitting the known two-component model to the images with respect to the component locations and amplitudes would result in a perfect fit. The resulting solutions for these locations and amplitudes would be exact, and despite diffraction no limit to resolution would exist, no matter how closely spaced the two point sources. In reality, however, calculated images do not occur. Instead, one has to deal with detected images. In this case, the shape of the point spread function will never be known exactly. There will always be irregularities or aberrations that are unknown or incorrectly described by the mathematical model adopted. This means that systematic errors are introduced. Furthermore, detected images will never be noise free, so that nonsystematic errors will also be present. It is these errors, both systematic and nonsystematic, that ultimately limit the resolution (den Dekker and van den Bos, 1997). Although classical resolution criteria are still widely used as a measure of the relative merit of different imaging systems, it should be recognized, and is in fact recognized by many, that such criteria do not provide absolute limits to resolution. If there is an ultimate limit to resolution, it must be a consequence of the fact that, as a result of systematic errors (modeling errors) and nonsystematic errors (statistical fluctuations of the observations), detected images are never exactly described by the model adopted.
III. Other Resolution Criteria

A. Introduction

As mentioned in Section II, the classical Rayleigh resolution criterion is based on presumed limitations to the resolving capabilities of the human visual system. Since Rayleigh's days, visual inspection has been supplemented with intensity measuring instrumentation and, above all, digital computing facilities. These new facilities brought about a reconsideration of the notion of resolution, and as a result many resolution criteria different from the classical
Rayleigh-like criteria have been proposed in the literature. In this section, a selection of these more modern criteria will be reviewed. Furthermore, a number of methods to enhance resolution will be discussed. The aim is to show that these resolution criteria and resolution enhancing methods are based both on the object and on the noise in the observations. It will be shown that it is only in the absence of any a priori knowledge about the object that resolution is limited by diffraction. In other cases, resolution is not limited by diffraction but by the noise in the observations.
B. Extrapolation and Superresolution

1. Introduction

Generally, lenses and other imaging systems may be treated as two-dimensional shift-invariant linear systems. For coherent illumination, the imaging system is linear in complex amplitude, whereas for incoherent illumination the system is linear in intensity, which is the square of the modulus of the amplitude. The characteristics of a shift-invariant linear system are defined by its point spread function or, equivalently, by its transfer function, which is the Fourier transform of the point spread function. Transfer functions are particularly useful in light and electron microscopy and in camera applications, whereas point spread functions are more often used in the quality assessment of telescopes or spectroscopic instruments. The coherent point spread function is merely the Fourier transform of the system's pupil function. The transfer function of a coherent system, which is also known as the amplitude transfer function or coherent transfer function, therefore has the same shape as the system's pupil function. The incoherent point spread function is the squared modulus of the coherent point spread function, which is equal to the power spectrum of the pupil function. The incoherent transfer function, which in optics is known as the optical transfer function, is the autocorrelation function of the pupil function. This implies that both coherent and incoherent transfer functions are strictly bandlimited; that is, spectral components with frequencies beyond the cutoff frequency are not transferred by the system and therefore do not appear in the images. Furthermore, it has been shown that partially coherent imaging systems are also strictly bandlimited (Born and Wolf, 1980). The cutoff frequency of the imaging system is often used as a measure of resolution. This resolution measure, which is also known as the diffraction limit, is directly related to the Rayleigh-like classical resolution limits, which are a measure of the width of the main lobe of the point spread function involved. This can be seen as follows. Since the transfer function of the imaging system is the Fourier transform of the system's point spread function, a point
spread function with a narrow main lobe corresponds to a transfer function with a high cutoff frequency. In fact, the product of the Rayleigh resolution limit and the cutoff frequency of the system involved is equal to a constant that is close to or even equal to one (Castleman, 1979). Transfer functions in electron microscopy are bandlimited as well. Resolution measures used in electron microscopy, such as the fringe resolution, also known as line resolution, and the information-limit resolution of the microscope, are likewise based on this bandwidth limitation (O'Keefe, 1992; Spence, 1988).

2. Extrapolation

Due to the bandwidth limitation, spatial frequencies above the cutoff frequency seem to be irrevocably lost. Under certain circumstances, however, superresolution, that is, resolution beyond the diffraction limit, turns out to be possible. The key to superresolution is a priori knowledge. The incorporation of prior knowledge may make it possible to reconstruct the object spectrum beyond the diffraction limit from that within the diffraction limit. Such prior knowledge is usually available. For example, a very realistic assumption for a real object is that it is spatially bounded; that is, it is nonzero only in a region of finite extent. This single condition can be shown to be sufficient to guarantee that the object spectrum is analytic. A well-known property of an analytic function is that if it is known on a certain interval, it is known everywhere, since there cannot exist two analytic functions that agree over a given interval and yet disagree outside the interval (Castleman, 1979). Consequently, given the values of an analytic function over a specified interval, the analytic function can always be reconstructed in its entirety. This process of reconstruction, by means of certain mathematical operations, is called analytic continuation. If the noise is ignored, inverse filtering of the image spectrum by the known transfer function of the imaging system would result in an exact reconstruction of the object spectrum within the passband of the imaging system, that is, up to the diffraction limit. Analytic continuation would then make it possible to extrapolate the object spectrum beyond the diffraction limit (Barnes, 1966; Harris, 1964a). This would result in an exact reconstruction of the total object spectrum and thus, after computation of the inverse Fourier transform, the exact object. Consequently, the finite extent of the object is a sufficient condition to guarantee that the resolution is not limited by diffraction. In practice, however, the images are disturbed by noise. Furthermore, the transfer function of the imaging system will rarely be exactly known. It is these errors, both nonsystematic and systematic, that limit the performance of superresolution techniques and limit the resolution in the process. In fact, at one time the feasibility of superresolution was seriously doubted by many. Although analytic continuation shows that it is, in principle, possible to
retrieve object properties beyond the diffraction limit if the available a priori knowledge about the object is used, it does not yield a practical way to achieve superresolution. This is because analytic continuation requires computation of derivatives, but the presence of noise in real images makes an accurate computation of derivatives hopeless (Frieden, 1967). Considerations such as this even led to the assertion that superresolution is no more than a "myth" (Andrews and Hunt, 1977). It is now generally known that superresolution is not a myth. Its practical feasibility, even in the presence of noise, has clearly been demonstrated by several superresolution algorithms (Biraud, 1969; Dempster et al., 1977; Frieden, 1972, 1975; Frieden and Burke, 1972; Gerchberg, 1974; Holmes, 1988; Lucy, 1974; Richardson, 1972; Schell, 1965; Sementilli et al., 1993; Shepp and Vardi, 1982; Walsh and Nielsen-Delaney, 1994). All superresolution algorithms are based on the same principles as the concept of analytic continuation, namely the following (B. R. Hunt, 1995):

- The spatial frequencies that are captured by image formation below the diffraction limit contain information needed to reconstruct spatial frequencies beyond the diffraction limit.
- Using additional knowledge about the object makes it possible to reconstruct object spatial frequencies beyond the diffraction limit from the image spatial frequencies below that limit.
Both empirically and theoretically, it has been shown that there are certain necessary conditions to be satisfied by a superresolution algorithm for it to be effective (B. R. Hunt, 1994). First, the algorithm must contain nonlinear operations, since a purely linear operation can only modify the amplitudes of the Fourier harmonics that are already present in the image but cannot create object spatial frequencies beyond the diffraction limit. It should be noted, however, that although any trivial nonlinear operation, such as squaring the image gray levels, will cause the generation of spatial frequencies beyond the diffraction limit, such an operation will not necessarily construct spatial frequencies that bear any relation to the spatial frequencies of the true object. Therefore, a second condition is that the algorithm should explicitly utilize a mathematical description of the image formation process that relates object and image via the point spread function of the imaging system. This condition is necessary to prevent misconstruing an arbitrary nonlinear operation as superresolving (B. R. Hunt, 1994). A third condition is that, for Nyquist sampled images, aliasing distortion caused by the reconstruction of frequency components beyond the bandwidth be avoided by reconstructing the object on a finer grid than the image. The algorithm should therefore contain some suitable form of interpolation. Obviously, such an interpolation is not needed if the image is sufficiently oversampled. A fourth condition follows directly from the principles on which superresolution is based; that is, the algorithm should
incorporate a priori knowledge about the object. Examples of such additional knowledge used by superresolution algorithms are as follows:

- Finite extent of the object. Finite extent implies that the object possesses an analytic Fourier transform. However, as indicated previously, superresolution algorithms based on this one piece of a priori knowledge (Barnes, 1966; Frieden, 1967; Harris, 1964a; Slepian and Pollak, 1961) have been found to be highly sensitive to noise in the image (Pask, 1976; Rushforth and Harris, 1968).
- Positivity of the object. Positivity has been found to be a very important constraint. Remarkable results have been gained by algorithms that make use of the fact that the object is known to be positive and that therefore force the reconstructed object to be positive as well (Biraud, 1969; Frieden, 1972; Frieden and Burke, 1972; Gerchberg, 1974; Janson et al., 1970; Schell, 1965; Walsh and Nielsen-Delaney, 1994). The way the positivity constraint is realized differs from algorithm to algorithm. In the Biraud algorithm, for example, positivity is ensured in a rather ad hoc manner, namely through representing the object as the square of an unknown function (Biraud, 1969). In other methods, the positivity constraint follows logically from some prior meaningful principle, such as maximum entropy (Frieden, 1972; Frieden and Burke, 1972). The maximum entropy solution can be regarded as the most uniform image consistent with the observed data and the noise. It is the positivity constraint inherent in the entropy, though, which makes superresolution possible (Burch et al., 1983).
- Upper or lower bounds on object intensity. Obviously, the positivity of the object can be forced in the algorithm as a lower bound. A second bound on the object may be its maximum intensity, which is usually more difficult to fix precisely. However, by looking at an image, the observer can always guess at some upper bound to the unknown object. The importance of incorporating this extra knowledge has been established empirically by Janson and coauthors (1970).
- Object statistics. In this case, not only the image, but also the object is modeled as a stochastic variable, which is, by definition, characterized by a probability density function. An illustrative example of a superresolution algorithm incorporating object statistics is the maximum a posteriori (MAP) estimator, which we will discuss briefly now. Suppose that the linear image formation is represented in terms of a matrix-vector formulation (Banham and Katsaggelos, 1997)

g = Hf + n     (1)
where g is the image vector, f is the object vector, and n is the noise vector. The matrix H represents the point spread function, which is supposed to
be known. The MAP estimate of the object is found by maximizing the probability density function of the object f conditioned on the image g:

arg max_f p(f | g)     (2)

which, according to Bayes' rule, is equivalent to

arg max_f p(g | f) p(f) / p(g)     (3)
where p(f) is the prior probability density function of the object, whereas p(g | f) is the probability density function of the image conditioned on the object. The image probability density, p(g), is not a function of f and is therefore irrelevant to the maximization process. Obviously, the virtue of a Bayes' MAP estimator is that it allows one to incorporate a priori knowledge about the object into the solution via the prior probability density function, p(f). Notice, for example, that assuming Poisson statistics for the object automatically imposes a positivity constraint. One should, however, always be aware of the fact that if incorrect a priori information is incorporated, this will give rise to biased results, according to the principle "garbage in, garbage out." In the literature, different choices of mathematical models for the probability density functions p(f) and p(g | f) have been described (Frieden, 1980). For example, in the derivation of the Poisson MAP estimator of B. R. Hunt and Sementilli (1992), both the object and the image probability density functions, p(f) and p(g | f), were chosen to be Poisson densities. Geman and Geman (1984) developed a MAP superresolution algorithm for a Poisson image probability density function and a Markov random field governing the object statistics. In the special case of normal distributions p(f) and p(g | f), it can easily be shown that the MAP estimator reduces to the linear minimum mean-square-error (LMMSE) estimator (Sezan and Tekalp, 1990). The LMMSE estimator is equivalent to the well-known Wiener filter, which is based on a priori second-order statistical information about the object and the noise. It should be noted, however, that the Wiener filter will not provide superresolution because it involves only linear operations.

3. Inverse Filtering

At this stage, it is worthwhile to consider the case in which the prior probability p(f) in Expr. (3) is assumed to be uniform. This reflects the absence of any a priori knowledge about the object. In this case, the MAP estimator Expr. (3) becomes a maximum likelihood estimator. The maximum likelihood estimate
of the object is then

arg max_f p(g | f)     (4)
where the quantity p(g | f) is usually referred to as the likelihood function. It can be shown that for the special case of independent and identically normally distributed noise, the maximum likelihood estimator becomes a least squares estimator (B. R. Hunt and Andrews, 1973):

arg min_f ||g − Hf||^2     (5)
which, if the matrix product H^T H is nonsingular, produces the least squares solution

f̂_LS = (H^T H)^{-1} H^T g     (6)
where the superscript T denotes transposition. This is recognized as a generalized inverse filter. Without loss of generality, from now on we will assume that the dimension of the image vector g is equal to that of the object vector f. The matrix H then becomes a square matrix. For a nonsingular square matrix H, Eq. (6) reduces to

f̂_LS = H^{-1} g     (7)
Inverse filtering can be implemented by discrete Fourier transform. It then consists of dividing the Fourier transform of the image by the known transfer function of the imaging system and next computing the inverse Fourier transform of the result. Alternatively, inverse filtering can be performed purely by convolution in the spatial domain. Unfortunately, inverse filtering has been found to be extremely sensitive to noise. Depending on the nature of H and the noise, the inverse filtering solution may possess an excessively high variance, even though it is unbiased under normality assumptions (B. R. Hunt, 1973). This may be understood as follows. It follows from Eqs. (1) and (7) that the object estimate obtained by inverse filtering deviates from the actual value of the object f by the additional error term H^{-1} n, where n represents the additive noise. Since the matrix H, representing the point spread function, is mainly filled with zeros and small elements near the diagonal, which causes H^{-1} to have very large elements, the error term H^{-1} n manifests itself as amplified noise (Frieden, 1975). Notice that if the point spread function is such that H is singular, computation of the least squares solution by Eq. (7) becomes impossible. A unique least squares solution then no longer exists. Unfortunately, it can be shown that matrices H associated with bandlimited systems, which are the systems of interest in this article, will be singular. This becomes clear if we consider the counterpart of the inverse operator H^{-1} in the spectral frequency
domain, which is given by the inverse of the transfer function. For bandlimited systems, the inverse of the transfer function is undefined for frequencies beyond the cutoff frequency. This means that for bandlimited systems unmodified inverse filtering cannot be carried out. Coping with this problem often involves carrying out inverse filtering up to a so-called processing frequency, which is chosen within the strictly limited bandwidth of the system transfer function. The result of this modified inverse filtering is that the image is effectively filtered by a product of the inverse system transfer function and a transfer function uniform up to the processing frequency. If the noise is disregarded, the final result of modified inverse filtering is the object convolved with a sinc function, which is the point spread function corresponding to the uniform transfer. A simple calculation shows that the first zero crossing of this point spread function is located at half the reciprocal of the processing frequency. This location may be considered to be the resolution in the reconstructed object. For the purposes of this article, it is important to mention that thus a twofold improvement of the Rayleigh resolution in the reconstructed object over that in the measured image may be achieved. However, even in the absence of noise, modified inverse filtering will not provide us with the exact inverse solution, which is the exact object. The method is strictly bandlimited: it does not extrapolate outside the processing frequency and will therefore not provide superresolution. This is not surprising, since inverse filtering is a linear method and it does not incorporate any a priori information about the object, whereas, as mentioned previously, nonlinearity and the use of prior knowledge are necessary conditions for superresolution. For non-normally distributed noise, the least squares estimate described by Eq. (6) is generally not the maximum likelihood estimate since it does not maximize the pertinent likelihood function. Let us consider, for example, Poisson distributed image observations. Then it can easily be shown that the maximum likelihood estimator Expr. (4) becomes a nonlinear estimator. In this case, a closed form maximum likelihood estimator does not exist. Hence, the maximum likelihood solution, which is the location of the maximum of the likelihood function, can be found only by an iterative, numerical optimization process. For this purpose, several nonlinear recursive algorithms have been developed, such as the Richardson–Lucy algorithm, which has been derived independently by Richardson (1972) and Lucy (1974) and has been rediscovered in a different context by Dempster and coauthors (1977) and Shepp and Vardi (1982). It should be noted that unconstrained maximum likelihood estimation described by Expr. (4) does not assume any a priori knowledge of the object to be reconstructed. It can be shown, however, that the aforementioned iterative nonlinear algorithms developed to maximize the likelihood function for Poisson distributed observations share the properties that the estimates at each stage satisfy the positivity constraint for the object and that the total
energy is preserved. The algorithms, therefore, implicitly impose constraints on the solution for the object that represent additional a priori knowledge. Notice that the imposing of such extra constraints may, of course, diminish the likelihood of the final solution; effectively, a smaller likelihood may be accepted for the benefit of a solution that is in agreement with the a priori knowledge. Nevertheless, the algorithms satisfy the necessary conditions to achieve superresolution as described previously, and their ability to achieve superresolving image restoration has been reported (Holmes, 1988; Meinel, 1986; Snyder et al., 1993). It is outside the scope of this article to describe the methods of superresolution in detail. For a review, we refer the reader to Frieden (1975), B. R. Hunt (1994), and Meinel (1986). What we should keep in mind is that, obviously, the performance of any superresolving algorithm will always be limited by noise. This is clearly demonstrated by Lucy (1992a, 1992b), who empirically investigated the limits imposed by photon statistics on the degree of superresolution achievable with image restoration techniques. For this purpose, he applied the Richardson–Lucy algorithm for maximum likelihood image restoration to synthetic, Poisson distributed images of a pair of identical point sources and empirically investigated the degree to which Rayleigh's resolution limit for diffraction-limited images can be surpassed. Finally, it should be noted that often the available a priori information about the object consists of a parametric model. Then, only the relatively small number of unknown parameters characterizing the object needs to be computed from the available observations. The image reconstruction problem thus becomes a parameter estimation problem. Parametric restoration methods can, in principle, achieve far higher resolution than nonparametric methods, as will become clear in the remainder of this article.
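For concreteness, the Richardson–Lucy iteration mentioned above can be written in a few lines. The following one-dimensional Python sketch is only an illustration: the Gaussian kernel, source positions, count levels, and iteration number are arbitrary choices and do not reproduce any published experiment. Note how positivity is preserved automatically, since each update multiplies the current nonnegative estimate by a nonnegative correction factor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized Gaussian blur kernel standing in for the point spread function.
x = np.arange(-15, 16)
h = np.exp(-x**2 / (2 * 3.0**2))
h /= h.sum()

# Two-component object: two point-like sources of equal strength.
f_true = np.zeros(201)
f_true[95] = f_true[105] = 2000.0

g = rng.poisson(np.convolve(f_true, h, mode="same")).astype(float)

# Richardson-Lucy: f <- f * H^T(g / Hf); for a symmetric kernel, H^T is
# again convolution with h.  Every factor is nonnegative, so f stays >= 0.
f = np.full_like(g, g.mean())        # flat, strictly positive start
for _ in range(200):
    ratio = g / np.maximum(np.convolve(f, h, mode="same"), 1e-12)
    f *= np.convolve(ratio, h, mode="same")

# Report the local maxima of the restored object (expected near 95 and 105).
peaks = np.where((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:]) &
                 (f[1:-1] > 0.1 * f.max()))[0] + 1
print("restored component locations:", peaks)
```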
C. Approaches Based on Information Theory

In the literature, authors have also discussed resolution in the framework of information theory, relating it to the number of degrees of freedom in the image. Most of this work is based on the concept of channel capacity as developed by Shannon (1949). By application of the sampling theorem, Shannon showed that the number of points M needed to completely specify a one-dimensional signal of duration T and bandwidth B_T is given by

M = 2T B_T + 1     (8)
It should be mentioned that in Shannon's original papers the unity term is mentioned, but not included, since 2T B_T is usually much greater than one (Shannon, 1949). Furthermore, if it is assumed that the detected signal has an
average energy s and additive noise energy n, then the number of levels that can be distinguished reasonably well is given by

[(s + n)/n]^{1/2}     (9)
so that the total number of possible distinct signals is given by

m = (1 + SNR)^{M/2}     (10)
with SNR = s/n representing the signal-to-noise ratio. The information capacity (in bits) of the system is then defined as (Shannon, 1949)

N = log_2 m = (1/2)(1 + 2T B_T) log_2(1 + SNR)     (11)
Several researchers applied this concept to optics to estimate the resolution limit of an imaging system for a given SNR (Fried, 1979; Toraldo di Francia, 1955). Cox and Sheppard (1986) derived an expression for the information capacity of an optical system in which the signal is not limited simply to a single temporal dimension but possesses three spatial dimensions and two independent states of polarization as well. Their work extended not only the work of Fellgett and Linfoot (1955), by including temporal properties of the optical system, but also that of Lukosz (1966, 1967), by including the signal-to-noise ratio. The main result is that the information capacity of an imaging system, that is, the number of degrees of freedom that the system can transmit, is invariant and is given by

N_F = (1 + 2L_x B_x)(1 + 2L_y B_y)(1 + 2L_z B_z)(1 + 2T B_T) log_2(1 + SNR)     (12)

where L_x and L_y are the lengths of the sides of the rectangular image area; L_z is the depth of the field of view of the system; and B_x, B_y, and B_z are the spatial bandwidths of the system in the x, y, and z directions, respectively. T is the observation time, B_T is the temporal bandwidth of the system, and SNR is the signal-to-noise ratio. The agreement of Eq. (12) with Eq. (11) is clear if we take into account the incorporation of a factor 2 that is due to the two possible states of polarization. According to this invariance theorem, the parameters in Eq. (12) may be varied as long as N_F remains constant. A priori knowledge can be used to determine if the actual information received is less than theoretically possible. For example, if it is known that an object is independent of time, the temporal bandwidth of the system, which seems useless at first sight, can be used to transmit encoded additional high spatial frequency information. This information can then be decoded again at the receiver, that is, the detector, after which a superresolution image can be formed. In this way, the spatial bandwidths of the system can be extended beyond the diffraction limit by a proportional reduction of another constituent parameter
in Eq. (12), namely the temporal bandwidth, B_T. Furthermore, the fact that the SNR is included in Eq. (12) makes it possible to analyze the practical limits of superresolution methods. This may be seen as follows. Analytic continuation may be regarded as a trade-off between the bandwidth parameters and the SNR parameter in Eq. (12): in order to increase the spatial bandwidth parameters, the SNR parameter in Eq. (12) must be decreased proportionally. It is then possible to derive the maximum resolution improvement attainable by means of analytic continuation given a minimum acceptable image SNR (Cox and Sheppard, 1986). On the basis of Shannon's concept of information capacity, Kosarev (1990) derived an absolute limit for resolution enhancement in comparison with the diffraction limit. He showed that optimum superresolution is, in principle, determined by noise and may be computed via Shannon's result, Eq. (10). Kosarev found that the resolution limit δ for a signal in the form of equidistant lines, all having equal amplitudes, can be described by

Δ/δ = (1/2) C log_2(1 + SNR)     (13)

where Δ is the width of the point spread function, and the constant C is equal to ΔW, with W the spatial bandwidth of the signal. The ratio Δ/δ is called the superresolution coefficient. Note that Kosarev considered the case of a one-dimensional and time-independent signal, but his results may be extended to higher-dimensional as well as time-dependent imaging systems. It is worth noting that, as recognized by Kosarev, the resolution limit could be exceeded if the unknown signal, which is the object, could be parameterized. Parametric methods can, in principle, attain a better resolution than the Shannon limit. Such methods will be addressed in detail in the remainder of this article. In conclusion, in agreement with what was pointed out in the previous subsections, the work on information theory also shows that, ultimately, resolution is not limited by diffraction but by the unavoidable presence of noise.
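A short computation, again with illustrative values for Δ and W, shows how slowly the superresolution coefficient of Eq. (13) grows with the signal-to-noise ratio:

```python
import numpy as np

Delta, W = 1.0, 2.0            # PSF width and signal bandwidth (illustrative)
for snr in (10, 100, 1000):
    coeff = 0.5 * Delta * W * np.log2(1 + snr)   # Eq. (13) with C = Delta * W
    print(f"SNR = {snr:4d}: superresolution coefficient = {coeff:.2f}")
```

Raising the SNR by a factor of ten thus buys only a roughly fixed additive increment in the achievable resolution enhancement.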
D. Approaches Based on Signal-to-Noise Ratio

In this subsection, other attempts to determine the resolution limit of an imaging system as a function of the signal-to-noise ratio (SNR) will be discussed. A vast amount of literature on this subject is available. For example, Idell and Webster (1992) define a resolution limit based on an expression for the SNR in the frequency domain. They consider a coherent imaging process and view this as a procedure for estimating the image's Fourier spectrum. Using a continuous detection model to describe the operation of photo-count noise-limited image recording and known statistical properties of laser speckle patterns, they compute the SNR of the Fourier spectrum of a detected, coherent
image, defined as

SNR_D(f) = |E[D(f)]| / [Var{D(f)}]^{1/2}     (14)

where D(f) is the estimate of the Fourier transform of the detected, coherent image. This SNR expression can now be used to quantify the effective spatial frequency resolution limit achievable with a given coherent imaging system. For this purpose, one should first establish a minimum frequency domain SNR for which D(f) may still be considered useful. The resolution limit is then defined as the highest spatial frequency value for which the system achieves this SNR value. The magnitude of spectra of real object scenes tends to drop off at higher frequencies. This means that for many object scenes, in practice, effective cutoff frequencies can be defined which provide a practical measure of frequency domain resolution (Idell and Webster, 1992).

An alternative SNR-based definition of the resolution limit has been proposed by Falconi (1967), who studied the two-point resolution of an imaging system in the presence of noise. For this purpose, he considered an object consisting of two closely located point sources of known separation imaged by an imaging system with a known point spread function. Furthermore, he assumed that the noise consists of fluctuations in the number of detected photons only. In order to find the limit, θ_m, up to which the angular separation of the two sources can be measured, Falconi assumed the separation to be slightly decreased from θ to θ_a, where the angle θ is a distance measured in the focal plane and divided by the focal length of the objective lens. An infinitesimal detector, of width dθ, at the focal plane will then detect a photon flux change, or signal change, (I − I_a) dθ, with I and I_a the photon intensities in the composite image of the two point sources at separations θ and θ_a, respectively. The standard deviation of the noise seen by the same detector is just (I dθ)^{1/2}. Falconi then defined the measurement limit, θ_m, as the minimum angular change in separation of the two sources to give an overall SNR equal to one, where the overall SNR is found from

SNR^2 = ∫_{−∞}^{∞} (I − I_a)^2 / I dθ     (15)

According to Falconi, the measurement limit thus obtained also defines the standard deviation of the error of measurement of the separation of the two sources after the detection of a given number of photons. The resolution limit is then defined as the separation of both sources for which the standard deviation of the measurement error, which is the measurement limit, is equal to the actual separation. This resolution limit depends on the total dose of photons detected, the shape and the width of the system's point spread function, and the intensity ratio of both sources.
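Eq. (15) is straightforward to evaluate numerically. In the sketch below, the two point images are modeled by Gaussian profiles of unit width as a stand-in for the true point spread function; the photon budget, separations, and integration grid are all illustrative assumptions:

```python
import numpy as np

theta = np.linspace(-10.0, 10.0, 20001)
dtheta = theta[1] - theta[0]

def composite(sep, n_photons=1e4, sigma=1.0):
    # Photon intensity of two equal Gaussian point images at separation sep.
    g = lambda t: np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return 0.5 * n_photons * (g(theta - sep / 2) + g(theta + sep / 2))

I  = composite(1.0)                  # intensity at the current separation
Ia = composite(0.9)                  # separation slightly decreased
snr = np.sqrt(np.sum((I - Ia) ** 2 / I) * dtheta)   # Eq. (15), discretized
print("overall SNR for this change in separation:", snr)
```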
A similar approach is based on the finding that the measurement precision with which a single point source can be measured can be expressed as the ratio of some constant of the order of the Rayleigh limit, which is called the resolution scale, to the SNR (Fried, 1979, 1980). Analogous results have been found in radar theory, where a resolution scale is used which, as a fraction of the SNR, equals the precision with which a single target position can be measured (Dunn et al., 1970). Fried studied the general problem of measuring simultaneously the midpoint location, separation, and relative intensities of a pair of point sources. Treating this problem as a signal processing problem and applying a matched filter, Fried (1979, 1980) calculated the root mean square (rms) precision with which these parameters can be estimated. In addition, Fried calculated the rms precision with which the location of a single point source can be measured, and found, in agreement with earlier results, that this precision can be defined as the ratio of a resolution scale to the SNR. In examining quantitative results for measurement of the parameters of a pair of point sources, Fried found that there is no fundamental impediment to measuring these parameters, even when the separation of both point sources is significantly smaller than the Rayleigh limit, other than the absence of the SNR required to achieve the desired level of precision. The results of Fried also show that a significant increase of the SNR, in the sense of a factor of two or more, to counter the complicating effect of small separation, is not needed until the separation approximates the resolution scale. From these results, Fried concluded that, apparently, the resolution scale defines not only the precision with which the position of a single point source can be measured but also the minimum separation of a pair of point sources that is required to avoid significant interference of the estimates of the parameters of both sources. This inspired Fried to suggest that the resolution scale, rather than the Rayleigh limit, ought to be considered basic to the resolution of an optical system.

The resolution criteria discussed in this subsection have in common that they relate resolution to the SNR. However, they differ from one another in the assumed amount of available a priori knowledge. Furthermore, being more or less heuristic, they provide little insight into what the ultimate and exact limits to resolution are.
E. Approaches Based on Decision Theory

Several authors have discussed the concept of resolution in the framework of classical decision theory (Cunningham and Laramore, 1976; Harris, 1964b; Helstrom, 1969; Nahrstedt and Schooley, 1979). In this framework, resolving two identical objects such as point sources is defined as deciding whether two
objects are present in the field of view or only one. A measure of resolution is then the probability of making this binary decision correctly. Generally, the procedure is as follows. It is assumed that the set of possible objects is known a priori and that it consists of two alternatives only. For two-point resolution, these alternatives are one point source located at a specified position and two point sources located at specified positions, respectively. Given the noise statistics, the likelihood that two point sources are present and the likelihood that one point source is present are computed from the image observations. The ratio of both likelihoods is then used as a decision function. Obviously, this ratio will be a stochastic variable with a certain probability distribution. This distribution will depend on the two specified object alternatives, the object that is actually present, and the noise statistics. From this distribution, it is possible to calculate the probability that the application of the decision function would result in a correct decision. Accordingly, Harris (1964b) considered the case in which there are actually two point objects with a known separation while the observations are disturbed by additive normally distributed noise. The probability that application of the decision function would result in a correct decision then corresponds to the probability of resolution. The results of Harris clearly indicate that "although diffraction increases the difficulty of rendering a correct binary decision, it does not prevent a correct binary decision from being made, no matter how closely spaced the two points may be" (Harris, 1964b; p. 610). In the absence of noise, the probability of making the binary decision correctly would be equal to one. It is therefore only the limited precision of the observations, and not diffraction, that ultimately limits the resolution. Notice, however, that the preceding analysis breaks down if the true object is not a member of the a priori specified set of possible objects, which may be a serious restriction in practice.
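In the simplest Gaussian-noise setting, the decision function described above reduces to comparing residual sums of squares under the two fully specified alternatives. The following Python sketch is a hypothetical illustration: the sinc-square component, positions, amplitudes, and noise level are arbitrary choices, not values taken from Harris (1964b):

```python
import numpy as np

rng = np.random.default_rng(1)

# The two a priori specified alternatives (known positions and amplitudes).
x = np.linspace(-5.0, 5.0, 201)
psf = lambda c: np.sinc(x - c) ** 2
one_source  = 2.0 * psf(0.0)                 # single source of double strength
two_sources = psf(-0.4) + psf(0.4)           # sub-Rayleigh separation

# Simulate an image of the two-source object with additive Gaussian noise.
sigma = 0.05
image = two_sources + rng.normal(0.0, sigma, x.size)

# For i.i.d. Gaussian noise, the log-likelihood ratio reduces to a
# difference of residual sums of squares.
llr = (np.sum((image - one_source) ** 2) -
       np.sum((image - two_sources) ** 2)) / (2 * sigma ** 2)
print("decide two sources" if llr > 0 else "decide one source", llr)
```

Repeating the simulation over many noise realizations would estimate the probability of a correct decision, which is the resolution measure used in this framework.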
F. Approaches Based on Asymptotic Estimation

Most of the resolution criteria and resolution enhancing methods discussed in the preceding subsections dealt with nonparameterized objects. However, the available a priori information about the object is often a parametric model. For instance, the notion of two-point resolution clearly implies a two-component model parametric in the locations, and possibly the amplitudes, of the components. An example of such a model is the sinc-square Rayleigh model. Resolving the components may then be regarded as a parameter estimation problem in which the most important parameters are the locations of the components. These and any other parameters of interest have to be estimated from a set of noisy observations. Usually, the number of parameters to be estimated is relatively small compared with the number of observations.
In Section IV, it will be shown how the parameters enter the probability density function of the statistical observations. From this parameterized probability density function, the Cramér–Rao lower bound may be computed. This is a lower bound on the variance with which the parameters can be estimated. A detailed description of the Cramér–Rao lower bound can also be found in Section IV. Furthermore, from the probability density function of the observations the maximum likelihood estimator of the parameters can be derived. An important property of the maximum likelihood estimator is that it achieves the Cramér–Rao lower bound asymptotically, that is, for an infinite number of observations. Therefore, it is asymptotically most precise (Stuart et al., 1999).

In the literature, many attempts to express resolution in terms of precision computed by use of statistical parameter estimation theory are found (Bettens et al., 1999; Cathey et al., 1984; Farrell, 1966; Helstrom, 1969, 1970; Orhaug, 1969). Some of these references consider single-source resolution, which is defined as the capability of an imaging system, including digital computing facilities, to determine the position of a point-source object (Cathey et al., 1984; Farrell, 1966; Helstrom, 1969, 1970; Orhaug, 1969). Other references treat differential (two-source) resolution, which is defined as the system's capability to determine the separation of two point sources (Bettens et al., 1999; Orhaug, 1969). For single-source and differential resolution, the parameters of interest are the location of the single point source and the separation and midpoint position or, equivalently, the locations of the two point sources, respectively.

Farrell (1966) derived expressions for the Cramér–Rao lower bound for estimators of the location of a point source in terms of its intensity, the shape of the point spread function, and the noise characteristics. He found that the Cramér–Rao lower bounds decrease with increasing width of the imaging aperture and tend to zero for an increasing SNR. This was also found by Helstrom (1969), who determined the Cramér–Rao lower bounds for estimators of object parameters such as radiance and position for a uniformly radiating circular object, not necessarily a point source, and a circular aperture. Furthermore, Helstrom found that for a given total energy received from the object, the Cramér–Rao lower bounds increase with increasing radius of the object, that is, with decreasing degree of coherence of the light received at the imaging aperture. Orhaug (1969) and Cathey and coauthors (1984) used results obtained in a radar context (Kelly et al., 1960; Root, 1962; Swerling, 1964) to derive expressions for the variances of pertinent maximum likelihood estimators. In particular, Cathey and coauthors (1984) considered single-source resolution for one-dimensional coherent imaging in the presence of additive normally distributed noise. Taking the variance of the maximum likelihood estimator of the location of a point source as a measure of resolution, they showed that
resolution is improved as the curvature of the point spread function at its center is increased. This result corresponds to the guideline provided by the classical resolution criteria: the narrower the main lobe of the point spread function, the better the resolution. Furthermore, Cathey and coauthors found that, in agreement with earlier results, the variance is inversely proportional to the SNR. They also compared the result obtained when the aperture is a perfect low-pass spatial filter with that obtained when the aperture is a perfect bandpass filter with the same total bandwidth. It was found that enhanced resolution, that is, a lower variance, may be achieved by moving the passband away from the origin, since this results in a point spread function with a narrower main lobe. This is in essence an apodization procedure as described in Section II. Theoretically, the main lobe of the point spread function can thus be made arbitrarily narrow. However, as the apodization process is carried to extremes, selecting a secondary lobe rather than the main lobe eventually ceases to be an exception. Of course, this produces a large error in the location estimate. Then the preceding analysis breaks down, as was pointed out by the authors themselves, who stated that their analysis is valid only for large SNRs and that it is restricted to the main lobe of the point spread function (Cathey et al., 1984).

Applying statistical parameter estimation theory, Orhaug (1969) expressed both the single-source and the differential source resolution of an imaging system in terms of the system and the noise parameters. He also discussed the relation between both kinds of resolution, which is found to be determined directly by the shape of the system's point spread function, and he found that for separations that are large compared with the width of the point spread function, the variance with which the separation can be estimated is twice the variance with which the position of a single point source can be estimated.

At this stage, it should be emphasized that the statistical parameter estimation theory as applied in the references discussed in this subsection is valid only if the parametric image model used is correctly specified. This means that exact knowledge of the point spread function is required. Furthermore, computation of the Cramér–Rao lower bound and application of maximum likelihood estimation both require the probability density function of the observations to be known. In addition, the results are asymptotic, since maximum likelihood estimators generally attain the Cramér–Rao lower bound only asymptotically. (See, for instance, Bettens et al. (1999).) The three assumptions of a correct model, a large number of observations, and a known probability density function of the observations are often not realistic in practice. Therefore, the results of statistical parameter estimation theory should be treated with caution. On the other hand, the assumptions just listed are not made in the alternative parameter estimation based approach to the concept of two-point resolution proposed in this article and described in detail in Section VII.
G. Conclusions

Several approaches to the concept of resolution different from the classical Rayleigh-like approach have been discussed in this section. It has been pointed out that, in principle, superresolution, that is, resolution beyond the diffraction limit, is possible. A condition necessary to achieve superresolution is the availability of a priori knowledge about the object. This a priori knowledge may be knowledge about finite object extension, positivity of the object, upper or lower bounds on the object intensity, object statistics, or a parametric model of the object. Image restoration methods that incorporate a priori knowledge may, in principle, attain superresolution, but their performance will always be limited by noise. These limits have been analyzed in various ways, by using, for instance, information theory, decision theory, or statistical parameter estimation theory. This has resulted in a variety of alternative resolution criteria. These criteria have in common that they define the attainable resolution limit in terms of the signal-to-noise ratio, but they differ from one another in the assumed amount and nature of available a priori knowledge.

The subsequent sections of this article will focus on a parameter estimation approach to resolution. This approach is concerned with two-point resolution, and all it requires is that a parametric model for the observations and a log-likelihood function or, equivalently, a criterion of goodness of fit are chosen. Since this approach requires neither known error distributions and asymptoticity nor a correct model, it is essentially different from, and a practical alternative to, the approaches found in the literature and described in this section and Section II.
IV. Modeling and Parameter Estimation

A. Introduction

In this section, parametric statistical models of observations will be introduced. Specifically, these will be used in this article to model statistical-error-corrupted observations made on two-component models such as the Rayleigh sinc-square two-point resolution model. For the purposes of this article, the most important parameters of such models are the locations of the components. It will be shown how these and other parameters enter the probability density function of the statistical observations. This parameterized probability density function is used for two purposes. First, from it, the Cramér–Rao lower bound may be computed. This is a lower bound on the variance with which the parameters can be estimated. Second, from the probability density function the maximum likelihood estimator of the parameters is derived. This estimator actually achieves the Cramér–Rao lower bound asymptotically, that is, for an
infinite number of observations. Therefore, it is asymptotically most precise. For this and other reasons, the maximum likelihood estimator is very important and is often used in practice. In this article, emphasis will be on the analysis of maximum likelihood estimators of the parameters of two-component models. However, it will be seen later that the results of this analysis also apply to estimators that are not always maximum likelihood, such as the widely used least squares estimator. The section is concluded by a numerical example. Its purpose is to show that for finite numbers of observations the properties of maximum likelihood and comparable estimators may differ essentially from the asymptotic properties and that this may have important consequences for the ability of these estimators to resolve the location parameters and, hence, to distinguish the components.
B. Parametric Statistical Models of Observations

Any applied physicist will readily admit that his or her observations "contain errors." For the purposes of this article, these errors must be specified. This specification is the subject of this subsection. Generally, sets of observations made under the same conditions nevertheless differ from experiment to experiment. The usual way to describe this behavior is to model the observations as stochastic variables. The reason is that there is no viable alternative and that it has been found to work. Stochastic variables are defined by probability density functions. In this article, the observations are either counting results or continuous measurements. If the observations are counting results, they are nonnegative integers and the probability density function defines the probability of occurrence of each of these integer outcomes. If the observations are continuous, the probability density function describes the probability of occurrence of an observation on a particular interval. Useful books about distributions and their properties are those of Chatfield (1995), Mood et al. (1987), Papoulis (1965), and Stuart and Ord (1994).

Suppose that a set of observations $w_n$, $n = 1, \ldots, N$, is available. Then the $N \times 1$ vector $w$ defined as
\[
w = (w_1 \ \ldots \ w_N)^T \tag{16}
\]
represents a point in the Euclidean $N$-space having $w_1, \ldots, w_N$ as coordinates. This will be called the space of the observations throughout. The expectations of the observations are defined by their probability density function. The vector of expectations
\[
E[w] = (E[w_1] \ \ldots \ E[w_N])^T \tag{17}
\]
is also a point in the space of observations and the observations are distributed about this point.
These ingredients are sufficient for a definition of nonsystematic, or statistical, errors in the observations. These are the differences of each observation and its expectation. Therefore, the expectation of the nonsystematic errors is equal to zero. In this article, the expectations of the observations in every measurement point are function values of a function of one or more independent variables which is parametric in the quantities to be measured. This is illustrated in the following example.

Example IV.1 (Observations of Poisson Distributed Biexponential Decay) Suppose that observations $w_n$, $n = 1, \ldots, N$, are made on a biexponential decay process at the measurement points $x_1, \ldots, x_N$. Here and in what follows, these measurement points are assumed to be exactly known. This is realistic in the many applications where the measurement points are time instants or locations. Furthermore, in this example, it is assumed that the observations are statistically independent and have a Poisson distribution. This implies that the probability that the observation $w_n$ is equal to $\omega_n$ is equal to
\[
\exp(-\lambda_n) \, \frac{\lambda_n^{\omega_n}}{\omega_n!} \tag{18}
\]
where the parameter $\lambda_n$ is equal to the expectation $E[w_n]$. Since the $w_n$ are assumed to be independent, the probability $p(\omega)$ of a set of observations $\omega = (\omega_1 \ \ldots \ \omega_N)^T$ is the product of all probabilities described by Eq. (18):
\[
p(\omega) = \prod_n \exp(-\lambda_n) \, \frac{\lambda_n^{\omega_n}}{\omega_n!} \tag{19}
\]
with $n = 1, \ldots, N$. Since the parameter $\lambda_n$ in this expression is equal to the expectation $E[w_n]$, it is, by definition, equal to the value of the biexponential function in the measurement point $x_n$:
\[
E[w_n] = \lambda_n = \alpha_1 \exp(-\beta_1 x_n) + \alpha_2 \exp(-\beta_2 x_n) \tag{20}
\]
where the amplitudes $\alpha_1$ and $\alpha_2$ and the decay constants $\beta_1$ and $\beta_2$ are the unknown parameters of the function which have to be estimated from the observations. The quantities
\[
w_n - \lambda_n \tag{21}
\]
are the nonsystematic errors in the observations. Their expectations are equal to $E[w_n] - \lambda_n$ and are, therefore, equal to zero.

Example IV.1 shows that the expectation of the observations is an accurate description of what experimenters usually call the model underlying the observations. Also, substitution of $\lambda_n$ as described by Eq. (20) in Eq. (19) shows how the probability density of $w_n$ depends on the parameters $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$.
This dependence will be extensively used later for the design of estimators of these parameters from a set of observations. Of course, such an estimation procedure is correct only if the expectation model is an accurate description of the true expectations.

Example IV.2 (Wrong Model of the Expectations of the Observations) Suppose that in Example IV.1 the expectations of the observations are described by
\[
\lambda_n = \gamma_1 + \gamma_2 x_n + \alpha_1 \exp(-\beta_1 x_n) + \alpha_2 \exp(-\beta_2 x_n) \tag{22}
\]
with $\gamma_1, \gamma_2 > 0$. In practice, a function like $\gamma_1 + \gamma_2 x$ usually represents a background function. It constitutes a systematic contribution to the expectations and thus to the observations. Then, if this function is not included in the expectation model of the observations and $\lambda_n$ described by Eq. (20) is substituted in the probability density function instead of $\lambda_n$ described by Eq. (22), the functional dependence of this probability density function on $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$ is wrong. In what follows, this will be called a systematic error, or modeling error. It will be shown later that the modeling error has consequences for the estimates of these parameters and, therefore, for distinguishing estimates of closely located parameters $\beta_1$ and $\beta_2$.
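A minimal sketch of both examples (Python; all parameter values are arbitrary illustrations): it generates Poisson distributed observations whose true expectations follow Eq. (22) and contrasts the nonsystematic errors of Eq. (21) with the systematic error committed by adopting the background-free model of Eq. (20).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1, 22) * 0.2                     # measurement points (assumed)
a1, a2, b1, b2 = 10.0, 5.0, 1.0, 0.3           # alpha_1, alpha_2, beta_1, beta_2
g1, g2 = 1.0, 0.5                              # background gamma_1, gamma_2

lam_true = g1 + g2 * x + a1 * np.exp(-b1 * x) + a2 * np.exp(-b2 * x)  # Eq. (22)
w = rng.poisson(lam_true)                      # Poisson observations

lam_model = a1 * np.exp(-b1 * x) + a2 * np.exp(-b2 * x)               # Eq. (20)

print("nonsystematic errors, Eq. (21):", w - lam_true)       # zero expectation
print("systematic (modeling) error:", lam_true - lam_model)  # the ignored background
```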
C. Parameter Estimation

In the previous subsection, examples were presented of introducing the parameters to be measured into the probability density function of the observations. In this subsection, these results will first be somewhat generalized. Next, it will be shown how the probability densities thus parameterized may be used to define the Fisher score concept and to compute the Cramér–Rao lower bound, which is a lower bound on the variance of any unbiased estimator. Then, it is discussed how, from the parameterized probability density functions, the maximum likelihood estimator of the parameters may be derived. This subsection is concluded by a numerical example demonstrating the limits to resolution of maximum likelihood estimates of the parameters of two-component models like those used in optics.

1. Dependence of the Probability Density Function of the Observations on the Parameters

In this subsection, examples will be presented illustrating the dependence of the probabilities or probability density functions of the observations on the parameters. To simplify the analysis, in what follows we will use the logarithm
of the probabilities or probability density functions of the observations instead of the probabilities or probability density functions themselves.

Example IV.3 (Poisson Distributed Observations) The logarithm of the probability of the Poisson distributed observations described by Eq. (19) becomes
\[
\sum_n \left( -\lambda_n + \omega_n \ln \lambda_n - \ln \omega_n! \right) \tag{23}
\]
Suppose that the expectation of the $n$th observation is described by
\[
E[w_n] = f_n(\theta) = f(x_n; \theta) \tag{24}
\]
where $x_n$ is the measurement point, which is supposed to be known and which may be vector valued. The vector $\theta = (\theta_1 \ \ldots \ \theta_K)^T$ is the vector of parameters of the function. Then Expr. (23) may be written as
\[
q(\theta) = \sum_n \left( -f_n(\theta) + \omega_n \ln f_n(\theta) - \ln \omega_n! \right) \tag{25}
\]
where the last term does not depend on $\theta$.

Example IV.4 (Normally Distributed Observations) The probability density function of normally distributed observations is defined by
\[
p(\omega) = \frac{1}{(2\pi)^{N/2} (\det W)^{1/2}} \exp\left[ -\tfrac{1}{2} (\omega - E[w])^T W^{-1} (\omega - E[w]) \right] \tag{26}
\]
where $\omega = (\omega_1 \ \ldots \ \omega_N)^T$, $W$ is the $N \times N$ covariance matrix of the observations with as its $(n_1, n_2)$th element $\mathrm{cov}(w_{n_1}, w_{n_2})$, and where $\det W$ and $W^{-1}$ are the determinant and the inverse of $W$, respectively. The $N \times 1$ vector $E[w]$ of expectations is defined by Eq. (17). The logarithm of $p(\omega)$ is equal to
\[
q(\theta) = -\frac{N}{2} \ln 2\pi - \frac{1}{2} \ln \det W - \frac{1}{2} (\omega - f(\theta))^T W^{-1} (\omega - f(\theta)) \tag{27}
\]
where $f(\theta) = (f_1(\theta) \ \ldots \ f_N(\theta))^T$. Notice that both first terms of Eq. (27) are independent of the parameters $\theta$. If the $w_n$ are uncorrelated,
\[
W = \mathrm{diag}\left( \sigma_{w_1}^2 \ \ldots \ \sigma_{w_N}^2 \right) \tag{28}
\]
and
\[
q(\theta) = -\frac{N}{2} \ln 2\pi - \sum_n \ln \sigma_{w_n} - \frac{1}{2} \sum_n \left( \frac{\omega_n - f_n(\theta)}{\sigma_{w_n}} \right)^2 \tag{29}
\]
where $\sigma_{w_n}^2$ is the variance of $w_n$. If, in addition, $\sigma_{w_n}^2 = \sigma_w^2$ for all $n$,
\[
q(\theta) = -\frac{N}{2} \ln 2\pi - N \ln \sigma_w - \frac{1}{2\sigma_w^2} \sum_n \left[ \omega_n - f_n(\theta) \right]^2 \tag{30}
\]
Example IV.5 (Binomially Distributed Observations) Next, consider independent and binomially distributed observations. Then the probability that an observation $w_n$ is equal to $\omega_n$ is defined as
\[
\binom{M}{\omega_n} p_n^{\omega_n} (1 - p_n)^{M - \omega_n} \tag{31}
\]
where
\[
\binom{M}{\omega_n} \tag{32}
\]
is the binomial coefficient; $M$ is the total number of trials, which is assumed to be equal in all points; $\omega_n$ is the number of successes; and $p_n$ is the probability of success. The expectation of a binomially distributed $w_n$ is equal to $M p_n$ and, therefore, $p_n = f_n(\theta)/M$. Then the logarithm of the probability of $w$ is equal to
\[
q(\theta) = \sum_n \ln \binom{M}{\omega_n} - NM \ln M + \sum_n \left[ \omega_n \ln f_n(\theta) + (M - \omega_n) \ln\left( M - f_n(\theta) \right) \right] \tag{33}
\]
Notice that only the last sum depends on the parameters θ.
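As a computational counterpart to Examples IV.3 through IV.5, the sketch below (Python; the biexponential expectation model is reused purely as a placeholder for $f_n(\theta)$, and $M$ is an assumed trial count) evaluates the log-probabilities (25) and (30) and the parameter dependent part of (33).

```python
import numpy as np
from scipy.special import gammaln

def f(x, theta):
    """Expectation model f(x_n; theta), Eq. (24); a placeholder biexponential."""
    a1, a2, b1, b2 = theta
    return a1 * np.exp(-b1 * x) + a2 * np.exp(-b2 * x)

def q_poisson(theta, x, w):
    """Log-probability of Poisson observations, Expr. (25); ln w! = gammaln(w + 1)."""
    fn = f(x, theta)
    return np.sum(-fn + w * np.log(fn) - gammaln(w + 1.0))

def q_normal(theta, x, w, sigma):
    """Log-density for independent, identically normally distributed errors, Eq. (30)."""
    n = x.size
    r = w - f(x, theta)
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - np.sum(r**2) / (2 * sigma**2)

def q_binomial(theta, x, w, M):
    """Parameter dependent part of the binomial log-probability, Eq. (33).
    Requires 0 < f_n(theta) < M."""
    fn = f(x, theta)
    return np.sum(w * np.log(fn) + (M - w) * np.log(M - fn))
```

In each case only the terms containing $\theta$ matter for estimation; the remaining terms shift the log-likelihood by a constant.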
These three examples show that it may be relatively easy to establish the functional relationship of the probability or the probability density function of the observations to the parameters. This relationship will be used for two purposes. First, in Section IV.C.2, it will be used to compute a lower bound on the variance of estimators of the parameters, the Cramér–Rao lower bound. Then, in Section IV.C.3, it will be used for the construction of the most important type of estimator, the maximum likelihood estimator.

2. Limits to Precision: The Cramér–Rao Lower Bound

In this subsection, the Cramér–Rao lower bound will be introduced. This is a lower bound on the variance of any unbiased estimator of a parameter. An estimator is said to be unbiased if its expectation is equal to the true value of the parameter. Stated differently, an unbiased estimator has no systematic error.
First, the Fisher score vector (Jennrich, 1995) is introduced. This is defined as the $K \times 1$ vector
\[
s(\theta) = \frac{\partial q(\theta)}{\partial \theta} \tag{34}
\]
The expectation of the Fisher score vector is equal to the null vector. This follows from
\[
\int \cdots \int p(\omega) \, d\omega = 1 \tag{35}
\]
and hence
\[
\int \cdots \int \frac{\partial p(\omega)}{\partial \theta} \, d\omega = \int \cdots \int \frac{\partial \ln p(\omega)}{\partial \theta} \, p(\omega) \, d\omega = o \tag{36}
\]
and this is equivalent to
\[
E\!\left[ \frac{\partial q(\theta)}{\partial \theta} \right] = E[s(\theta)] = o \tag{37}
\]
where $o$ is the $K \times 1$ null vector. Notice that this result applies to any allowable parameterization of the probability density function since it is a direct consequence of the volume of any probability density function being equal to one. Since the expectation of the Fisher score is equal to zero, its $K \times K$ covariance matrix is described by
\[
F = E[s(\theta) s^T(\theta)] = E\!\left[ \frac{\partial \ln p(\omega)}{\partial \theta} \frac{\partial \ln p(\omega)}{\partial \theta^T} \right] \tag{38}
\]
This covariance matrix is called the Fisher information matrix. It can be shown (Dhrymes, 1970; Goodwin and Payne, 1977) that under general conditions the covariance matrix $\mathrm{cov}(t)$ of any unbiased estimator $t$ of $\theta$ satisfies
\[
\mathrm{cov}(t) \ge F^{-1} \tag{39}
\]
This inequality expresses that the difference of $\mathrm{cov}(t)$ and $F^{-1}$ is positive semidefinite. Since the diagonal elements of $\mathrm{cov}(t)$ represent the variances of $t_1, \ldots, t_K$ and since the diagonal elements of a positive semidefinite matrix are nonnegative, these variances are larger than or equal to the corresponding diagonal elements of $F^{-1}$. In this sense, $F^{-1}$ represents a lower bound to the variances of all unbiased $t$. The matrix $F^{-1}$ is called the Cramér–Rao lower bound on the variance of $t$. The Cramér–Rao lower bound can be extended to include unbiased estimators of vectors of functions of the parameters instead of the parameters proper.
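Before this extension is given, a numerical sketch may be helpful (Python; the model, the parameter values, and the finite difference step are assumptions). For independent Poisson observations with expectations $\lambda_n(\theta)$, Eq. (38) specializes to the standard form $F = \sum_n \lambda_n^{-1} (\partial \lambda_n / \partial \theta)(\partial \lambda_n / \partial \theta^T)$, which the sketch approximates by finite differences.

```python
import numpy as np

def lam(theta, x):
    a1, a2, b1, b2 = theta
    return a1 * np.exp(-b1 * x) + a2 * np.exp(-b2 * x)   # Eq. (20)

def jacobian(theta, x, h=1e-6):
    """Finite-difference derivatives of lambda_n with respect to theta."""
    J = np.empty((x.size, len(theta)))
    for k in range(len(theta)):
        tp = np.array(theta, dtype=float); tp[k] += h
        tm = np.array(theta, dtype=float); tm[k] -= h
        J[:, k] = (lam(tp, x) - lam(tm, x)) / (2 * h)
    return J

x = np.arange(1, 22) * 0.2                 # assumed measurement points
theta = [10.0, 5.0, 1.0, 0.3]              # assumed alpha_1, alpha_2, beta_1, beta_2
J = jacobian(theta, x)
F = J.T @ (J / lam(theta, x)[:, None])     # Fisher matrix for Poisson observations
crlb = np.linalg.inv(F)                    # Cramer-Rao lower bound, Eq. (39)
print("lower bounds on the standard deviations:", np.sqrt(np.diag(crlb)))
```

For the biexponential model the columns of the Jacobian become nearly linearly dependent as $\beta_1$ approaches $\beta_2$, so $F$ approaches singularity and the bounds grow without limit; this is the estimation-theoretic face of the resolution problem studied in this article.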
Let
\[
\rho(\theta) = (\rho_1(\theta) \ \ldots \ \rho_L(\theta))^T \tag{40}
\]
be such a vector and let $r = (r_1 \ \ldots \ r_L)^T$ be an unbiased estimator of $\rho(\theta)$. Then it can be shown that
\[
\mathrm{cov}(r) \ge \frac{\partial \rho}{\partial \theta^T} F^{-1} \frac{\partial \rho^T}{\partial \theta} \tag{41}
\]
This expression will be used in Section IV.C.4.

3. Maximum Likelihood Estimation

The maximum likelihood method for estimation of the parameters assumes the probability or probability density function of the observations to be known. It also assumes the dependence of the probability density function on the unknown parameters of the observations to be known. Examples of how this dependence may be established have been presented in Section IV.C.1. The maximum likelihood procedure consists of the following three steps:

1. The available observations $w = (w_1 \ \ldots \ w_N)^T$ are substituted for the corresponding independent variables $\omega = (\omega_1 \ \ldots \ \omega_N)^T$ in the probability or the probability density function of the observations. Since the observations are numbers, the resulting expression depends only on the elements of $\theta = (\theta_1 \ \ldots \ \theta_K)^T$.

2. The elements of $\theta = (\theta_1 \ \ldots \ \theta_K)^T$, which are the hypothetical true parameters, are considered to be variables. To express this, we replace them by $t = (t_1 \ \ldots \ t_K)^T$. The resulting function $q(t)$ is called the log-likelihood function of the parameters $t$ for the observations $w$. The pertinent log-likelihood functions for the three distributions discussed earlier in this section are described by Eqs. (25), (27), and (33) with $\omega_n$ replaced by $w_n$ and $\theta$ by $t$, respectively.

3. The maximum likelihood estimates $\hat{t}$ of the parameters $\theta$ are computed. These are defined as the values of the elements of $t$ that maximize $q(t)$, or
\[
\hat{t} = \arg\max_t q(t) \tag{42}
\]
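The three steps are easily traced in a sketch (Python; the optimizer, the starting values, and the parameter values are assumptions) for the Poisson biexponential model of Example IV.1.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.arange(1, 22) * 0.2
theta_true = np.array([10.0, 5.0, 1.0, 0.3])

def lam(t):
    return t[0] * np.exp(-t[2] * x) + t[1] * np.exp(-t[3] * x)

w = rng.poisson(lam(theta_true))           # step 1: the observations enter the density

def neg_q(t):                              # step 2: log-likelihood q(t), Expr. (25),
    fn = lam(t)                            # up to the term -ln w_n! that is free of t
    if np.any(fn <= 0):
        return np.inf
    return -np.sum(-fn + w * np.log(fn))

# step 3: t_hat = arg max q(t), Eq. (42), computed by numerical minimization
res = minimize(neg_q, x0=np.array([8.0, 6.0, 1.2, 0.2]), method="Nelder-Mead")
print("maximum likelihood estimate:", res.x)
```

The derivative-free simplex method is used here only for brevity; any numerical optimizer would serve.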
An important consideration is that maximum likelihood estimators are usually relatively easily found. All that needs to be done is to establish the dependence of the supposed probability or probability density function of the observations on the parameters. Examples of this process have been presented in Section IV.C.1. In these examples, the dependence on the parameters is established via the expectation of the observations.
The theory of maximum likelihood estimation can be found in Norden (1972, 1973), Stuart et al. (1999), Stuart and Ord (1994), and Zacks (1971). The content of these references may quickly convince any reader that a rigorous description of the statistical properties of maximum likelihood estimators is far outside the scope of this article. For the purposes of this article, their most important properties may generally be described as follows:

• Consistency. Generally, an estimator is said to be consistent if the probability that an estimate deviates more than a specified amount from the true value of the parameter can be made arbitrarily small by increasing the number of observations used.

• Asymptotic normality. If the number of observations increases, the probability density function of a maximum likelihood estimator tends to a normal distribution.

• Asymptotic efficiency. The asymptotic covariance matrix of a maximum likelihood estimator is equal to the Cramér–Rao lower bound. In this sense, the maximum likelihood estimator is most precise.
Also notice that unless the likelihood function $q(t)$ is quadratic in the elements of $t$, the actual computation of the maximum likelihood estimate is a nonlinear optimization problem that cannot be solved in closed form. Finally, Eq. (30) shows that the likelihood function for independent and identically normally distributed observations is described by
\[
q(t) = -\frac{N}{2} \ln 2\pi - N \ln \sigma_w - \frac{1}{2\sigma_w^2} \sum_n \left[ w_n - f_n(t) \right]^2 \tag{43}
\]
From this expression, it follows that under these conditions the value of $t$ minimizing the sum in the last term is identical to the maximum likelihood estimate. To simplify the terminology, in what follows we will also use the term likelihood function for the nonlinear least squares criterion
\[
\sum_n \left[ w_n - f_n(t) \right]^2 \tag{44}
\]
on the understanding that it has to be minimized instead of maximized. Minimizing Eq. (44) with respect to $t$ and using the value of $t$ at the minimum as an estimate of $\theta$ is called nonlinear least squares estimation. For an overview of this method and numerical methods to implement it, see van den Bos (1982). The nonlinear least squares method is often used in practice, irrespective of or in the absence of knowledge of the probability density function of the observations. If it is applied to observations other than those that are independent and identically normally distributed, it is generally not a maximum likelihood estimator. Therefore, in what follows, so that estimators that are not necessarily maximum likelihood can be included, the probability or probability density
function of the observations will not necessarily correspond to the likelihood function used. Consequently, the theory developed will apply to all probability density functions of the observations on the one hand and likelihood functions and criteria of goodness of fit, such as the least squares criterion, on the other.

The three properties of maximum likelihood estimators listed previously are asymptotic. For example, if the observations are described by
\[
w_n = f_n(\theta) + v_n \tag{45}
\]
with the $v_n$ independent and identically normally distributed errors with an expectation equal to zero and variance $\sigma_w^2$, the least squares estimates of the elements of $\theta$ may be expected to be normally distributed about $\theta$ with covariance matrix
\[
\sigma_w^2 \left( \frac{\partial f^T}{\partial \theta} \frac{\partial f}{\partial \theta^T} \right)^{-1} \tag{46}
\]
with $f = (f_1(\theta) \ \ldots \ f_N(\theta))^T$ if the number of observations is sufficiently large. If this condition is not met, the properties of the solutions may be essentially different from the asymptotic properties. This is illustrated by a numerical example in the next subsection.

4. Limits to Resolution of Parameters: A Numerical Example

Suppose that observations $w = (w_1 \ \ldots \ w_N)^T$ have been made with bi-Gaussian expectations described by
\[
f_n(\theta) = \alpha_1 \exp\left[ -\tfrac{1}{2} (x_n - \beta_1)^2 \right] + \alpha_2 \exp\left[ -\tfrac{1}{2} (x_n - \beta_2)^2 \right] \tag{47}
\]
where $f_n(\theta) = f(x_n; \theta)$ and the unknown parameters $\theta$ are the locations $(\beta_1 \ \beta_2)^T$. For simplicity, the parameters $\alpha_1$ and $\alpha_2$ are supposed to be known and to be positive. Furthermore, assume that the component functions of $f(x; \theta)$ strongly overlap. Since the half-width of the components of $f(x; \theta)$ is approximately equal to one, this means $|\beta_1 - \beta_2| \ll 1$. Let the observations be described by Eq. (45) with the $v_n$ independent and identically normally distributed errors with a standard deviation $\sigma_w$ and an expectation equal to zero. Next define $\alpha_1' = \alpha_1/(\alpha_1 + \alpha_2)$ and $\alpha_2' = \alpha_2/(\alpha_1 + \alpha_2)$. Then, the functions of the parameters $\rho_1 = \alpha_1' \beta_1 + \alpha_2' \beta_2$ and $\rho_2 = \beta_1 - \beta_2$ are a measure of the overall location of $f(x; \theta)$ and the difference of the locations of its components, respectively. By Eq. (41), the Cramér–Rao lower bound on the variance of unbiased estimators of $\rho = (\rho_1 \ \rho_2)^T$ is described by
\[
\frac{\partial \rho}{\partial \theta^T} F^{-1} \frac{\partial \rho^T}{\partial \theta} \tag{48}
\]
Figure 1. Measurement points and expectations of the observations in the numerical example of Section IV.C.4. The expectations are the sum of two overlapping Gaussian functions.
where $F^{-1}$ is described by Eq. (46) and
\[
\frac{\partial \rho}{\partial \theta^T} = \begin{pmatrix} \alpha_1' & \alpha_2' \\ 1 & -1 \end{pmatrix} \tag{49}
\]
Suppose that the values of the parameters are, respectively, $\alpha_1 = 0.4$, $\alpha_2 = 0.6$, and $\beta_1 = -\beta_2 = -0.05$. Furthermore, assume that there are 21 measurement points $x_n = (n - 11) \times 0.4$, $n = 1, \ldots, 21$. Figure 1 shows the two-component function and the location of the measurement points. The standard deviation $\sigma_w$ of the observations is taken equal to 0.004. If the asymptotic covariance matrix described by Eq. (48) is used as covariance matrix, it yields a standard deviation of 0.003 in the estimates of the location $\rho_1 = 0.4\beta_1 + 0.6\beta_2$, a standard deviation of 0.09 in the estimates of the distance $\rho_2 = \beta_1 - \beta_2$, and a low correlation coefficient of the estimates of $-0.01$. If the corresponding asymptotic normal distribution is taken as the distribution of the estimates, the estimates of $\rho_2$ are marginally normally distributed with an expectation $-0.1$ and a standard deviation 0.09. These asymptotic considerations predict a probability of about .74 of an estimate on the interval $(-0.2, 0)$. Furthermore, it is easily shown that the probability of a positive estimate, which causes the components to change places, is about .13. Since this implies a probability of .87 that the components do not change places, the conclusion from these asymptotic considerations is that with a probability of .87 an
Figure 2. Histogram of estimates of the difference of the locations (horizontal) and the overall location of the components (diagonal) of the overlapping Gaussian functions in the numerical example of Section IV.C.4.
estimate of the distance is obtained that is perhaps not very precise but may be called meaningful.

To check these results, we next carry out the following simulation experiment (den Dekker, 1992). One hundred thousand sets of 21 normally distributed observations as described are generated, and for each set the values of the estimates of $\beta_1$ and $\beta_2$ minimizing the least squares criterion are computed. Under the described conditions, this is a maximum likelihood estimate. The results, computed by use of a numerical minimization procedure, are collected into the histogram shown in Figure 2. This histogram may be explained as follows. The horizontal coordinate is $\hat{b}_1 - \hat{b}_2$, which is the difference of the estimates of the locations. Therefore, at the origin of this coordinate $\hat{b}_1 = \hat{b}_2$. This corresponds to coinciding components or, equivalently, to one single component at a location $b = \hat{b}_1 = \hat{b}_2$ and with an amplitude $\alpha = \alpha_1 + \alpha_2$. To the left of the origin, $\hat{b}_1 < \hat{b}_2$. Therefore, this is where the exact difference $\beta_1 - \beta_2$ is found. The diagonal coordinate is equal to $\alpha_1' \hat{b}_1 + \alpha_2' \hat{b}_2$, which is an estimate of the overall location of the two-component model. The horizontal plane is divided into classes as follows. The interval $(0, 0.02)$ of the $(\alpha_1' \hat{b}_1 + \alpha_2' \hat{b}_2)$-axis and the interval $(-0.28, 0.28)$ of the $(\hat{b}_1 - \hat{b}_2)$-axis are divided into 20 and 17 equal subintervals, respectively.
Thus the 100,000 solutions are divided into 340 different classes, with the exception of 11 solutions which were found outside the intervals mentioned. The vertical coordinate represents the frequency, that is, the number of times a solution in a particular class occurs. The resulting histogram is trimodal. The peak in the middle represents solutions $\hat{b}_1 - \hat{b}_2$ distributed about zero. Together, the classes having $\hat{b}_1 - \hat{b}_2 = 0$ as their midpoint contain 29,782 solutions. Of these, 29,323 solutions exactly coincide, that is, $\hat{b}_1 = \hat{b}_2$. On the other hand, the additional lower peaks to the left and to the right of $\hat{b}_1 - \hat{b}_2 = 0$, by definition, correspond to solutions with distinct values for $\hat{b}_1$ and $\hat{b}_2$. The solutions represented by the left-hand peak are more or less distributed about the true values:
\[
(\hat{b}_1 - \hat{b}_2, \ \alpha_1' \hat{b}_1 + \alpha_2' \hat{b}_2) = (\beta_1 - \beta_2, \ \alpha_1' \beta_1 + \alpha_2' \beta_2) \tag{50}
\]
The location of the component with amplitude $\alpha_1$ relative to that with amplitude $\alpha_2$ agrees with the relative location of the component with amplitude $\alpha_1$ in the exact model. In the solutions represented by the right-hand peak of the histogram, the components have changed places. The number of solutions in the right-hand peak is almost equal to that in the left-hand peak.

The conclusion drawn from these simulation results is that there are substantial and essential differences between the asymptotic predictions and the actual simulation results. The asymptotic prediction is a unimodal normal probability density function around the true values of the parameters. This probability density function is such that with a probability of .13 the sign of the difference of the locations will be opposite that with exact observations. Then the components change places. The simulations, on the other hand, produce a trimodal probability density function. The probability of a solution in the neighborhood of the true values of the locations is about .35. This probability is more or less equal to that of a similar solution but with the locations reversed. The probability of exactly coinciding solutions is about .3. Thus the solution from two-component observations is a one-component model. This means that with a probability of .3 it is not possible to resolve the locations of the components. The resolution limit thus encountered will be the subject of the sections to follow.
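A scaled-down version of this experiment can be reproduced with the sketch below (Python; 2000 instead of 100,000 trials, and the choice of optimizer and of starting values is an assumption, since the numerical minimization procedure used above is not specified in detail).

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
x = (np.arange(1, 22) - 11) * 0.4                 # the 21 measurement points
a1, a2 = 0.4, 0.6
b1, b2 = -0.05, 0.05                              # beta_1, beta_2
sigma = 0.004

def model(b, x):
    return a1 * np.exp(-0.5 * (x - b[0])**2) + a2 * np.exp(-0.5 * (x - b[1])**2)

f_true = model([b1, b2], x)
diffs = []
for _ in range(2000):
    w = f_true + rng.normal(0.0, sigma, x.size)
    # Starting values at the exact locations (an assumption of this sketch).
    fit = least_squares(lambda b: model(b, x) - w, x0=[b1, b2])
    diffs.append(fit.x[0] - fit.x[1])

diffs = np.asarray(diffs)
print("fraction of (near) coinciding solutions:", np.mean(np.abs(diffs) < 1e-4))
print("fraction with the components interchanged:", np.mean(diffs > 1e-4))
```

With enough trials the three modes of Figure 2 emerge: solutions near the true separation, solutions with the components interchanged, and coinciding one-component solutions.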
D. Conclusions

In this section, maximum likelihood estimators for measurement of parameters of statistical observations have been introduced, with emphasis on estimation from noise-corrupted observations made on two-component models such as the biexponential and bi-Gaussian function. By definition, these estimators maximize the likelihood function of the observations. The favorable properties
of maximum likelihood estimators are asymptotic, that is, for an infinite number of observations. A numerical example has been presented showing that for finite numbers of observations these properties may essentially differ and obstruct the resolving of the location parameters of the two-component models. It will be shown later in this article that this behavior is caused by a change of the structure of the likelihood function under the influence of the fluctuations in the observations. For an explanation of this structural change, elements of a branch of mathematics called singularity theory are needed. These will be presented in Section V.
V. Elements of Singularity Theory

A. Introduction

Section IV was concluded by an example illustrating how fluctuations in the observations may obstruct the resolving of the components of a two-component model. In Sections VI and VII, singularity theory will be used to explain this phenomenon. The relevant results and terminology from this theory will be presented in this section. The combination of singularity theory and its applications is also called catastrophe theory (Arnol'd, 1992). For the purposes of this article, useful texts on catastrophe theory are those of Arnol'd (1992), Gilmore (1981), Poston and Stewart (1978), and Saunders (1980). In Thompson (1982), fascinating examples of the use of singularity theory in theoretical and applied science are described.

Generally, singularity theory deals with structural changes of parametric functions of one or more independent variables as a result of changes of the parameters. In this context, the word structural relates to the structure of the function, that is, the number and nature of its stationary points. These are the points where the gradient of the function vanishes. The nature of stationary points may be different: they may be, absolute or relative, maxima or minima, or be saddle points. Saddle points are minima in one or more coordinates and maxima in the others. A simple example of a saddle point is a mountain pass. Thus structural change of a function may be defined as a change of the number and nature of the stationary points.

Central in singularity theory is the insight that structural change of a function may occur only if, as a result of a change of the parameters of the function, one of the stationary points becomes degenerate. Whether a stationary point is degenerate depends on the Hessian matrix in that stationary point. This is the matrix of the second-order partial derivatives of the function with respect to its independent variables. At a maximum, all eigenvalues of the Hessian matrix
are negative; at a minimum they are all positive; while at a saddle point they are partly negative, partly positive. If at a stationary point one or more eigenvalues vanish, the Hessian matrix is called singular and the stationary point is said to be degenerate. It is clear that the nature of a stationary point remains the same as long as changes of the values of the parameters do not cause one or more eigenvalues to vanish. If this applies to all stationary points, the structure of the function remains the same and the function is called structurally stable. On the other hand, if as a result of changing parameters an eigenvalue vanishes, further change of the parameters may change the structure of the function. Therefore, for parameter values corresponding to vanishing eigenvalues, the function is said to be structurally unstable. In the Euclidean space of the parameters, the parameter values for which a particular eigenvalue is equal to zero may be collected into sets. These sets are called bifurcation sets. An important consideration in what follows is that one may freely move in parameter space without altering the structure of the function as long as no bifurcation set is crossed.

In the applications studied in this article, at the bifurcation set two stationary points merge to form a single degenerate stationary point and, subsequently, vanish. If the bifurcation set is crossed in the opposite direction, two new stationary points appear. In Sections VI and VII, it will be shown that this merging and subsequent vanishing of stationary points also explains the vanishing of distinct solutions for the model parameters of the numerical example of Section IV.C.4.

The outline of this section is as follows. In Section V.B, relevant notions and definitions are introduced. Then, in Section V.C, the representation of a function for parameter values close to a bifurcation set is discussed. Finally, in Section V.D, an algorithm is described for the systematic derivation of such a representation. Conclusions are drawn in Section V.E.
B. Definitions and Notions

In this section, $f(x; \lambda)$ is a real scalar function of the elements $x_p$ of the $P \times 1$ vector of independent variables
\[
x = (x_1 \ \ldots \ x_P)^T \tag{51}
\]
The function is parametric in the elements $\lambda_q$ of the $Q \times 1$ vector of parameters
\[
\lambda = (\lambda_1 \ \ldots \ \lambda_Q)^T \tag{52}
\]
In these expressions, the superscript $T$ denotes transposition. It will be assumed throughout that $f(x; \lambda)$ is a smooth function of the $x_p$; that is, all its partial derivatives exist and are continuous. For the validity of some results to follow, this assumption is too severe, but it simplifies the mathematical formalism.
Moreover, the likelihood functions and criteria of goodness of fit that the theory will be applied to in Sections VI and VII are smooth functions.

The $P \times 1$ vector of first-order partial derivatives of $f(x; \lambda)$ defined as
\[
\frac{\partial f}{\partial x} = \left( \frac{\partial f}{\partial x_1} \ \cdots \ \frac{\partial f}{\partial x_P} \right)^T \tag{53}
\]
is called the gradient of $f(x; \lambda)$ with respect to the vector $x$. A point $x$ is called stationary if in that point the gradient is equal to the null vector. The Hessian matrix is defined as
\[
\frac{\partial^2 f}{\partial x \, \partial x^T} =
\begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_P} \\
\dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \vdots \\
\vdots & & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_P \partial x_1} & \cdots & \cdots & \dfrac{\partial^2 f}{\partial x_P^2}
\end{pmatrix} \tag{54}
\]
If, at a stationary point, the eigenvalues of the Hessian matrix are nonzero, that stationary point is called nondegenerate or Morse. At a Morse maximum all eigenvalues are strictly negative. At a Morse minimum all eigenvalues are strictly positive. A Morse stationary point is called a Morse R-saddle point if R of the eigenvalues of the Hessian matrix concerned are strictly negative and the remaining ones are strictly positive. The representation of functions in and around stationary points, degenerate and nondegenerate, is the subject of the next subsection.
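Numerically, this classification amounts to inspecting the eigenvalues of the Hessian matrix of Eq. (54) at a stationary point. A minimal sketch (Python; the test function and the finite difference step are arbitrary choices):

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Finite-difference Hessian, Eq. (54), of a scalar function f at x."""
    P = x.size
    H = np.empty((P, P))
    for i in range(P):
        for j in range(P):
            e_i = np.zeros(P); e_i[i] = h
            e_j = np.zeros(P); e_j[j] = h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h * h)
    return H

f = lambda x: x[0]**2 - x[1]**2 + 0.5 * x[1]**4   # stationary point at the origin
eig = np.linalg.eigvalsh(hessian(f, np.zeros(2)))
R = int(np.sum(eig < 0))
print("eigenvalues:", eig)                         # one strictly negative
print(f"Morse {R}-saddle point" if 0 < R < eig.size else "Morse extremum")
```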
C. Functions around Stationary Points

1. The Morse Lemma and the Splitting Lemma

It can be shown (Poston and Stewart, 1978) that in the neighborhood of a Morse $R$-saddle point a smooth function $f(x; \lambda)$ may be represented in local coordinates $\xi = (\xi_1 \ \ldots \ \xi_P)^T$ as
\[
\xi_1^2 + \xi_2^2 + \cdots + \xi_{P-R}^2 - \xi_{P-R+1}^2 - \cdots - \xi_P^2 \tag{55}
\]
where the saddle point has been chosen as the origin and the function has been translated so as to make it vanish at the origin. The result Expr. (55) is called the Morse lemma (for parametric families of functions). Notice that the Morse lemma implies that the form Expr. (55) is valid for all values of λ concerned. If all stationary points of the function f (x; λ) are Morse, each of them is locally represented by an expression of the form Expr. (55) and the structure of f (x; λ)
does not change if the elements of $\lambda$ are changed. Under these conditions, the function is called structurally stable.

Next consider a function $f(x; \lambda)$ that has, for a particular value $\breve{\lambda}$ of the parameters $\lambda$, a degenerate stationary point $\breve{x}$. That is, one or more, say $S$, eigenvalues of the Hessian matrix at the stationary point are equal to zero. Then the splitting lemma (for parametric families of functions) states that in the neighborhood of $(\breve{x}, \breve{\lambda})$ the function $f(x; \lambda)$ may be represented as
\[
d_1 \xi_1^2 + d_2 \xi_2^2 + \cdots + d_{P-S} \xi_{P-S}^2 + g(\xi_{P-S+1}, \ldots, \xi_P; \lambda) \tag{56}
\]
where the $d_p$, $p = 1, \ldots, P - S$, are either equal to $1$ or to $-1$ (Poston and Stewart, 1978). In this expression, the $\xi_p$ are local coordinates with the stationary point as origin, $S$ is the number of eigenvalues equal to zero, and $g(\xi_{P-S+1}, \ldots, \xi_P; \lambda)$ is a function of the coordinates $\xi_{P-S+1}, \ldots, \xi_P$ depending on $\lambda$. The quadratic terms in Expr. (56) are called the nondegenerate or Morse part of the representation of the function $f(x; \lambda)$ in the neighborhood of $(\breve{x}, \breve{\lambda})$, and $g(\xi_{P-S+1}, \ldots, \xi_P; \lambda)$ is the degenerate part.

It is clear that $\xi_1, \xi_2, \ldots, \xi_{P-S}$ are the coordinate directions in which nothing happens if $\lambda$ is varied. They are, therefore, called the inessential variables. If through variation of $\lambda$ the structure of the function $f(x; \lambda)$ changes, it is the degenerate part $g(\xi_{P-S+1}, \ldots, \xi_P; \lambda)$ that changes. This is why $\xi_{P-S+1}, \ldots, \xi_P$ are called essential variables.

2. Simple Examples of Singular Representations

In this subsection, two simple examples of representations of functions in the neighborhood of singular stationary points will be introduced.

Example V.1 (The Fold Catastrophe) The first example is the function
\[
f(x; \lambda) = x_1^2 + x_2^2 + \cdots + x_{P-1}^2 + \lambda x_P + \tfrac{1}{3} x_P^3 \tag{57}
\]
(57)
(58)
Its Hessian matrix is the diagonal matrix
\[
2 \, \mathrm{diag}(1 \ \ 1 \ \ \ldots \ \ 1 \ \ x_P) \tag{59}
\]
Equations (58) and (59) show that the function has two stationary points:
\[
\left( 0 \ \ 0 \ \ \ldots \ \ 0 \ \ \pm(-\lambda)^{1/2} \right) \tag{60}
\]
for $\lambda < 0$ and that the Hessian matrices of the function at these stationary points are equal to
\[
2 \, \mathrm{diag}\left( 1 \ \ 1 \ \ \ldots \ \ 1 \ \ \pm(-\lambda)^{1/2} \right) \tag{61}
\]
Figure 3. The fold catastrophe if its parameter is (a) negative, (b) equal to zero, and (c) positive, respectively.
These stationary points merge to form one degenerate stationary point if $\lambda$ vanishes. For $\lambda > 0$ the function has no stationary points at all. It is clear that
\[
g(x_P; \lambda) = \lambda x_P + \tfrac{1}{3} x_P^3 \tag{62}
\]
is the degenerate part of the function. Figure 3 shows $g(x_P; \lambda)$ for $\lambda = -1$, $\lambda = 0$, and $\lambda = 1$, respectively. The function has one minimum and one maximum if $\lambda$ is negative. For increasing $\lambda$, the minimum and the maximum get closer and closer and merge to form one degenerate stationary point for $\lambda$ equal to zero. If $\lambda$ becomes positive, the function has no stationary points anymore. Therefore, the bifurcation set, that is, the set of all points in the Euclidean space of the parameters where the function $g(x_P; \lambda)$ changes its structure, is simply
\[
\lambda = 0 \tag{63}
\]
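The structural change expressed by Eq. (63) can be traced directly from the stationarity condition $g'(x_P) = \lambda + x_P^2 = 0$; a small sketch (Python; the three $\lambda$ values mirror Figure 3):

```python
import numpy as np

def fold_stationary_points(lmbda):
    """Real roots of g'(x) = lmbda + x^2 for the fold, Eq. (62)."""
    if lmbda < 0:
        r = np.sqrt(-lmbda)
        return [(-r, "maximum"), (r, "minimum")]   # sign of g'' = 2x at the root
    if lmbda == 0:
        return [(0.0, "degenerate")]               # merged stationary point
    return []                                      # lmbda > 0: no stationary points

for lmbda in (-1.0, 0.0, 1.0):
    print(lmbda, fold_stationary_points(lmbda))
```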
In catastrophe theory, the function described by Eq. (62) is called the (canonical) fold catastrophe. Applying these results to the function $f(x; \lambda)$ defined by Eq. (57) shows that this function changes from a function with a one-minimum, one one-saddle point structure into a function without any stationary point, or the reverse, if the sign of $\lambda$ changes.

Example V.2 (The Cusp Catastrophe) The second example of a function representation in the neighborhood of a singular stationary point is
\[
f(x; \lambda) = x_1^2 + x_2^2 + \cdots + x_{P-1}^2 + g(x_P; \lambda) \tag{64}
\]
with
\[
g(x; \lambda) = \lambda_1 x_P + \tfrac{1}{2} \lambda_2 x_P^2 + \tfrac{1}{4} x_P^4 \tag{65}
\]
where $\lambda = (\lambda_1 \ \lambda_2)^T$ is the vector of parameters. The gradient of $f(x; \lambda)$ is described by
\[
\left( 2x_1 \ \ 2x_2 \ \ \ldots \ \ 2x_{P-1} \ \ \lambda_1 + \lambda_2 x_P + x_P^3 \right)^T \tag{66}
\]
and its Hessian matrix is equal to
\[
\mathrm{diag}\left( 2 \ \ 2 \ \ \ldots \ \ 2 \ \ \lambda_2 + 3x_P^2 \right) \tag{67}
\]
From Expr. (66), it follows that the number of stationary points is equal to the number of real roots of the cubic polynomial
\[
\lambda_1 + \lambda_2 x_P + x_P^3 \tag{68}
\]
Define the discriminant $D$ of this polynomial as
\[
D = 4\lambda_2^3 + 27\lambda_1^2 \tag{69}
\]
Then the polynomial has one real root if $D > 0$, three real roots of which at least two are equal if $D = 0$, and three different real roots if $D < 0$, respectively. This is a standard result from algebra (Selby, 1971). Moreover, it is easily shown that, as a result of the absence of a quadratic term in Expr. (68), three equal real roots can occur only if both $\lambda_1$ and $\lambda_2$ are equal to zero. In that case, Expr. (68) reduces to $x_P^3$ and, consequently, the three roots are equal to zero. Since the coefficient of the quartic term $x_P^4$ in Eq. (65) is positive, the single real root occurring for $D > 0$ represents a minimum of $g(x; \lambda)$. Furthermore, the three different real roots occurring for $D < 0$ correspond to two minima with a maximum between. Finally, if for $D = 0$ two of the three real roots coincide, the coinciding roots correspond to a degenerate stationary point resulting from the merging of one of the minima with the maximum of $g(x; \lambda)$, while the remaining root corresponds to a minimum. If for $D = 0$ the three real roots coincide, the resulting stationary point is a degenerate minimum. From these considerations, it is clear that the function $g(x; \lambda)$ changes from a two-minima, one-maximum structure into a single-minimum structure if the sign of the discriminant changes from negative to positive. Therefore, the bifurcation set is defined by
\[
4\lambda_2^3 + 27\lambda_1^2 = 0 \tag{70}
\]
It is shown in Figure 4. In the literature, this bifurcation set is called the cusp, and the function $g(x; \lambda)$ described by Eq. (65) is called the (canonical) cusp catastrophe. The same function but with the coefficient of the quartic term $-\tfrac{1}{4}$ instead of $\tfrac{1}{4}$ is called the dual (canonical) cusp catastrophe.
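The root counting behind Eqs. (69) and (70) takes only a few lines (Python sketch; the sample parameter pairs are arbitrary, the third lying exactly on the bifurcation set):

```python
import numpy as np

def cusp_structure(l1, l2):
    """Classify g(x; lambda) of Eq. (65) via the discriminant D, Eq. (69)."""
    D = 4 * l2**3 + 27 * l1**2
    if D > 0:
        return "one minimum"                  # one real root of Expr. (68)
    if D < 0:
        return "two minima, one maximum"      # three distinct real roots
    return "degenerate (on the bifurcation set)"

for l1, l2 in [(0.0, 1.0), (0.0, -1.0), (2 / np.sqrt(27), -1.0)]:
    print((l1, l2), cusp_structure(l1, l2))
```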
Figure 4. The bifurcation set for the cusp catastrophe (thick line), and the catastrophe itself for three different parameter combinations (insets).
Three subplots in Figure 4 show $g(x; \lambda)$ for different $(\lambda_1, \lambda_2)$-combinations. Inside the cusp, $g(x; \lambda)$ has a two-minima, one-maximum structure. On the cusp, one of the minima and the maximum have just merged to form a degenerate stationary point. Outside the cusp, the function has the one-minimum structure. Applying these results to the function $f(x; \lambda)$ defined by Eqs. (64) and (65) shows that the structure of this function changes from a two-minima, one-saddle point structure into a one-minimum structure, or the reverse, if in parameter space the bifurcation set is crossed. The fold and cusp catastrophes are analyzed in detail in Arnol'd (1992), Poston and Stewart (1978), and Saunders (1980).

3. Bifurcation Sets

The fold catastrophe and the cusp catastrophe have been chosen as examples in this section since it can be shown that the singularities exhibited by these functions are generic. That is, the occurrence of possible alternatives has measure zero (Whitney, 1955). In general, more complex singularities may always be split into folds and cusps. Besides, in the applications of singularity theory to parameter estimation and resolution to be described later, the singularities
of likelihood functions occurring will be folds and cusps. In particular, the bifurcation sets of these catastrophes play a dominant part since they are the subsets of the Euclidean space of the parameters where the structure, that is, the number and nature of the stationary points, changes. The maximum of the likelihood function represents the solution of the estimation problem. Being a stationary point, this solution will be seen to be affected by structural change.

Structural change occurs only if at a stationary point the Hessian matrix of the function concerned becomes singular, that is, if one or more of the eigenvalues of this matrix vanish. Therefore, bifurcation sets are found by studying these eigenvalues. This study is considerably simplified by the observation that in the applications discussed in this article, no more than one eigenvalue will vanish at the same time. The subsequent analysis will be limited to this particular case. Readers interested in singularity in more than one coordinate are referred to Gilmore (1981), Poston and Stewart (1978), and Saunders (1980).

The splitting lemma described by Eq. (56) shows that, under the assumption made, the function in the neighborhood of a degenerate stationary point $(\xi, \lambda) = (0, \breve{\lambda})$ may be represented as
\[
d_1 \xi_1^2 + d_2 \xi_2^2 + \cdots + d_{P-1} \xi_{P-1}^2 + g(\xi_P; \lambda) \tag{71}
\]
where the $d_p$, $p = 1, \ldots, P-1$, are either equal to $1$ or to $-1$. In any case, the Taylor expansion of the degenerate part $g(\xi_P; \breve{\lambda})$ has no linear term since the origin $\xi = 0$ is a stationary point. Also, it has no quadratic term since this stationary point is degenerate. The question now arises as to whether the third-order and, possibly, higher order derivatives of $g(\xi_P; \breve{\lambda})$ are equal to zero as well. The answer to this question is important since it determines the number of terms to be included in the representation $g(\xi_P; \lambda)$ in the neighborhood of $(\xi, \lambda) = (0, \breve{\lambda})$. An algorithm for the computation of this representation will be presented in the next subsection.

D. Functions near Singularities

1. Derivation of the Reduction Algorithm

In this subsection, an algorithm will be presented for the computation of the representation of arbitrary smooth functions in the neighborhood of a singular stationary point. It is the reduction algorithm described in Poston and Stewart (1978) but slightly modified and specialized to corank 1, that is, the case that the Hessian matrix in the stationary point has one eigenvalue equal to zero. For an alternative approach, see G. W. Hunt (1981).

Since the functions $f(x; \lambda)$ are supposed to be smooth, they may be represented by their Taylor expansion. Without loss of generality, suppose that $\breve{x}$, the stationary point concerned, is the origin and that this stationary point
is degenerate for $\lambda = \breve{\lambda}$. Also suppose that $f(x; \lambda)$ is translated such that the constant term of the Taylor expansion is equal to zero. Then the Taylor expansion may be written as follows:
\[
\tfrac{1}{2} x^T H x + \text{cubic and higher degree terms in the elements of } x \tag{72}
\]
where $H$ is the $P \times P$ Hessian matrix of $f(x; \lambda)$ with respect to $x$ evaluated at $x = 0$ and defined by Eq. (54). Notice that in Expr. (72) the linear terms are absent since the expansion is about a stationary point. The Reduction Algorithm consists of the following steps:

Step 1 The $P \times 1$ vector $x$ is linearly transformed into the $P \times 1$ vector $y$:
\[
y = Ax \tag{73}
\]
where $A$ is a nonsingular $P \times P$ matrix such that
\[
\tfrac{1}{2} x^T H x = y^T G y \tag{74}
\]
where the $P \times P$ matrix
\[
G = \tfrac{1}{2} A^{-T} H A^{-1} \tag{75}
\]
is diagonal:
\[
G = \mathrm{diag}(\delta_1 \ \ldots \ \delta_{P-1} \ \ \varepsilon_{P2}) \tag{76}
\]
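Step 1 is an ordinary diagonalization of a symmetric matrix. In the sketch below (Python; a random symmetric matrix stands in for an actual Hessian), $A$ is built from the eigenvectors of $H$, after which $G$ of Eq. (75) is diagonal and Eq. (74) holds.

```python
import numpy as np

rng = np.random.default_rng(4)
P = 4
B = rng.normal(size=(P, P))
H = B + B.T                                  # a symmetric stand-in Hessian

eigval, V = np.linalg.eigh(H)                # H = V diag(eigval) V^T
A = V.T                                      # the transformation y = A x, Eq. (73)
G = 0.5 * np.linalg.inv(A).T @ H @ np.linalg.inv(A)   # Eq. (75)
print(np.round(G, 10))                       # diagonal: G = diag(eigval) / 2, Eq. (76)

x = rng.normal(size=P)
y = A @ x
print(np.isclose(0.5 * x @ H @ x, y @ G @ y))          # Eq. (74) holds
```

In the corank 1 situation of the text, one eigenvalue of $H$ is zero at $\breve{\lambda}$; half of it plays the role of $\varepsilon_{P2}$, the diagonal entry that vanishes on the bifurcation set.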
Since, by assumption, $H$ has corank 1, Expr. (72) may be written
\[
\delta_1 y_1^2 + \cdots + \delta_{P-1} y_{P-1}^2 + \varepsilon_{P2} y_P^2 + p(y) \tag{77}
\]
where the $\delta_p$ are either strictly positive or strictly negative for all values of $\lambda$. The coefficient $\varepsilon_{P2}$, on the other hand, may as a function of $\lambda$ become positive or negative and is equal to zero if $\lambda$ is equal to $\breve{\lambda}$. The function $p(y)$ represents all cubic and higher degree terms in the elements of $y$. Furthermore, all elements of the vector $x$ present in the cubic and higher degree terms in Expr. (72) have been replaced by elements of the vector $y$ by using
\[
x = A^{-1} y \tag{78}
\]
This completes the first step of the algorithm.

Step 2 The purpose of this step is removing all cubic terms with the exception of that in $y_P^3$. This is done as follows. First, it is observed that the sum of all cubic terms of $p(y)$ containing $y_1$ may be described as
\[
2\delta_1 y_1 f_1(y_1, \ldots, y_P) \tag{79}
\]
while the sum of the remaining terms containing $y_2$ is equal to
\[
2\delta_2 y_2 f_2(y_2, \ldots, y_P) \tag{80}
\]
and so on, up to
\[
2\delta_{P-1} y_{P-1} f_{P-1}(y_{P-1}, y_P) \tag{81}
\]
Therefore, the general form of these sums is
\[
2\delta_p y_p f_p(y_p, \ldots, y_P) \tag{82}
\]
with $p = 1, \ldots, P - 1$. Notice that the description Expr. (82) of the cubic terms implies a division by $\delta_p$ of the cubic terms originally present. This is allowed since the $\delta_p$ never vanish, as opposed to $\varepsilon_{P2}$. The $f_p(\cdot)$ are polynomials consisting of quadratic terms only. Subsequently combining this sum with $\delta_p y_p^2$ and completing the square yields
\[
\delta_p y_p^2 + 2\delta_p y_p f_p(y_p, \ldots, y_P) = \delta_p \left[ y_p + f_p(y_p, \ldots, y_P) \right]^2 - \delta_p f_p^2(y_p, \ldots, y_P) \tag{83}
\]
Next define
\[
z_p = y_p + f_p(y_p, \ldots, y_P) \tag{84}
\]
with $p = 1, \ldots, P - 1$ and substitute this in Eq. (83). The result is
\[
\delta_p z_p^2 - \delta_p f_p^2(y_p, \ldots, y_P) \tag{85}
\]
Substituting these results in Expr. (77) yields
\[
\delta_1 z_1^2 + \cdots + \delta_{P-1} z_{P-1}^2 + \varepsilon_{P2} y_P^2 + \varepsilon_{P3} y_P^3 - \sum_p \delta_p f_p^2(y_p, \ldots, y_P) + q(y_1, \ldots, y_P) \tag{86}
\]
where $p = 1, \ldots, P - 1$, and $q(y_1, \ldots, y_P)$ is the sum of all quartic and higher degree terms present in $p(y_1, \ldots, y_P)$. The term
\[
\varepsilon_{P3} y_P^3 \tag{87}
\]
is unaffected by the substitution. Two remarks are in order. First, the coordinates $z_1, \ldots, z_{P-1}$ appearing in the quadratic part of Expr. (86) are coordinates that are curvilinear in the original coordinates $y_1, \ldots, y_P$. Second, Expr. (86) does not contain cubic terms in $z_1, \ldots, z_{P-1}$ anymore since the $f_p(\cdot)$ are quadratic and, therefore, their squares are quartic. Thus, the goal of Step 2, the removal of all cubic terms with the exception of that in $y_P^3$, has been achieved.
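For $P = 2$ the mechanics of Step 2 can be verified symbolically (Python with sympy; the cubic terms of the toy expansion are arbitrary):

```python
import sympy as sp

y1, y2 = sp.symbols('y1 y2')
eps = sp.Symbol('eps')
d1 = 1  # delta_1

# A toy expansion of the form (77) with P = 2, S = 1, plus cubic terms:
expr = d1*y1**2 + eps*y2**2 + y1**2*y2 + y1*y2**2 + y2**3/3

# Cubic terms containing y1, written as 2*d1*y1*f1 as in Expr. (82);
# f1 consists of quadratic terms only:
f1 = (y1*y2 + y2**2) / (2*d1)
z1 = y1 + f1                                   # Eq. (84)

# Completing the square, Eq. (83): d1*y1**2 + 2*d1*y1*f1 = d1*z1**2 - d1*f1**2
new_expr = d1*z1**2 - d1*f1**2 + eps*y2**2 + y2**3/3
print(sp.expand(new_expr - expr))              # 0: the same function
# In the (z1, y2) coordinates the only cubic term left is y2**3/3;
# the remainder -d1*f1**2 is quartic, as stated in the text.
```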
Finally, notice that the treatment of the coordinates $y_1, \ldots, y_{P-1}$ is essentially different from that of the coordinate $y_P$. The reason is that the representation Expr. (86) has to be valid for all values of $\lambda$, including $\breve{\lambda}$, for which the coefficient $\varepsilon_{P2}$ vanishes and, therefore, the quadratic term in $y_P$ is absent. Then the first term in $y_P$ present is cubic or of a higher degree.

Step 3 A difficulty with the result Expr. (86) is that the quadratic terms are a function of the new coordinates $z_1, \ldots, z_{P-1}$ but the quartic and higher degree terms are still a function of the old coordinates $y_1, \ldots, y_P$. The purpose of Step 3 is to find a polynomial expression in $z_1, \ldots, z_P$ only. Therefore, expressions are needed for $y_1, \ldots, y_P$ as a function of $z_1, \ldots, z_P$. Substituting these expressions in Expr. (86) then yields an expression in $z_1, \ldots, z_P$ only. Simplest is the expression for $y_P$. It is
\[
y_P = z_P \tag{88}
\]
since this coordinate is not changed. The remaining expressions for y1 , . . . , yP−1 may be derived from Eq. (84) as follows. First, write, for each of the yp separately, a polynomial expression in z 1 , . . . , zP containing all possible terms from all linear ones up to and including all those of a speciÞed maximum degree and leave for the moment the coefÞcients of the polynomial unspeciÞed. What the maximum degree required is will be addressed in Section V.D.2, where examples of substitutions for the yp will be described. Then these polynomials are substituted for the yp in each of the expressions for z 1 , . . . , zP−1 described by Eq. (84). The values of the polynomial coefÞcients then follow from the identity of the left-hand member and the right-hand member. All that is left to be done is to substitute the resulting polynomials yp(z) for the corresponding yp in Eq. (86). The result is a polynomial expression in the coordinates z 1 , . . . , zP only described by δ1 z 12 + · · · + δ P−1 z 2P−1 + ε P2 z 2P + ε P3 z 3P + r (z) where z = (z1 . . . zP)T and r (z) = − δ p f p2 (y p (z), . . . , y P (z)) + q(y1 (z), . . . , y P (z))
(89)
(90)
p
Step 4 The purpose of Step 4 is to remove the quartic terms with the exception of that in z 4P from Expr. (89). The approach is very similar to that followed in Step 2 to remove the cubic terms. Here, for the sum of all quartic terms of r(z) containing z1 the following form is chosen: 2δ1 z 1 g1 (z 1 , . . . , z P )
(91)
288
A. VAN DEN BOS AND A. J. DEN DEKKER
and so on, with as general expression 2δ p z p g p (z p , . . . , z P )
(92)
where p = 1, . . . , P − 1. Combining this sum with the terms δ p z 2p and completing the square removes all quartic terms if as new coordinates u p = z p + g p (z p , . . . , z P )
(93)
are chosen. Substituting the polynomial expressions z p (u), p = 1, . . . , P − 1, then yields δ1 u 21 + · · · + δ P−1 u 2P−1 + ε P2 u 2P + ε P3 u 3P + ε P4 u 4P + s(u)
(94)
where s(u) is the sum of all quintic and higher order terms in the elements of u = (u 1 . . . u P )T . Step 5 and Possible Further Steps In Step 5, all quintic terms with the exception of that in u 5P are removed, and in further steps, all higher degree terms if needed. This concludes the derivation of the Reduction Algorithm. The result will be of the form δ1 w12 + · · · + δ P−1 w2P−1 + ε P2 w2P + ε P3 w3P + ε P4 w 4P + · · · + ε P K w PK + t(w) (95) where K is arbitrary and t(w) is the sum of all (K + 1)-th and higher degree terms in the elements of w = (w1 . . . w P )T . The Þrst P − 1 terms of Expr. (95), representing the Morse part of the function, will remain the same, independent of the value of K. These terms will neither change sign nor vanish as a result of changes in the parameters λ. It is also seen that terms in w1 , . . . , w P−1 of any degree higher than two will be removed as K increases. The quadratic term in w P , on the other hand, may as a result of changes in ÷ If the degree K is the parameters change sign and vanishes for λ equal to λ. increased, the number of terms of the polynomial in w P and the number of the ÷ corresponding coefÞcients increase. Suppose that for the parameter vector λ the coefÞcients ε P3 , . . . , ε P,K −1 also vanish but that ε PK is the Þrst coefÞcient that does not vanish for any allowable parameter vector λ. Then it can be shown that a polynomial coordinate transformation of w P may be derived removing all terms in w P having a degree higher than K (Gilmore, 1981). The result ÷ is is that the function around the degenerate stationary point (x, λ) = (0, λ) represented by δ1 v12 + · · · + δ P−1 v 2P−1 + ε P2 v 2P + ε P3 v 3P + ε P4 v 4P + · · · + ε P K v PK
(96)
RESOLUTION RECONSIDERED
289
where v1 , . . . , v P are the coordinates after removal of the terms t(w) by the described polynomial transformation of w1 , . . . , w P−1 and that of w P . It is emphasized that in the local coordinates v1 , . . . , v P , the representation is exact. No approximations have been made. Therefore, for us to Þnd the number and ÷ nature of the stationary points of f (x; λ) in the neighborhood of (x, λ) = (0, λ), it is sufÞcient to investigate the number and nature of the stationary points of the polynomial ε P2 v 2P + ε P3 v 3P + ε P4 v 4P + · · · + ε P K v PK
(97)
2. Useful Polynomial Substitutions As an example, in this subsection polynomial substitutions will be derived that are needed in Step 3 of the Reduction Algorithm described in Section V.D.1. The results are used in Section VI of this article. Suppose that the Taylor expansion of the function to be represented is described by δ1 y12 + δ2 y22 + ε32 y32 + γ300 y13 + γ210 y12 y2 + γ201 y12 y3 + γ120 y1 y22 + γ111 y1 y2 y3 + γ102 y1 y32 + γ030 y23 + γ021 y22 y3 + γ012 y2 y32 + ε33 y33 + q(y1 , y2 , y3 )
(98)
where q(y1 , y2 , y3) is the sum of all quartic and higher degree terms and the coefÞcient ε32 may vanish as a result of changes of the parameters λ as opposed to the coefÞcients δ1 and δ2 . Here, and in what follows, the coefÞcient of y1k y2l y3m in the Taylor expansion will be denoted as γklm . Expression (98) may be rewritten as follows: δ1 y12 + 2δ1 y1 × 12 γ300 y12 + γ210 y1 y2 + γ201 y1 y3 + γ120 y22 + γ111 y2 y3 + γ102 y32 / δ1 + δ2 y22 + 2δ2 y2 × 12 γ030 y22 + γ021 y2 y3 + γ012 y32 /δ2 +ε32 y32 + ε33 y33 + q(y1 , y2 , y3 )
(99)
Then, in the notation of Section V.D.1, f 1 = f 1 (y1 , y2 , y3 )
′ ′ ′ ′ ′ ′ = γ300 y12 + γ210 y1 y2 + γ201 y1 y3 + γ120 y22 + γ111 y2 y3 + γ102 y32 (100)
and ′′ ′′ ′′ y22 + γ021 y2 y3 + γ012 y32 f 2 = f 2 (y2 , y3 ) = γ030
(101)
′ = 21 γ300 /δ1 ). A double where a single prime denotes a division by 2δ1 (e.g., γ300
290
A. VAN DEN BOS AND A. J. DEN DEKKER
prime denotes a division by 2δ2 . Therefore, Eq. (99) may be written δ1 z 12 + δ2 z 22 + ε32 z 32 + ε33 z 33 − δ1 f 12 − δ2 f 22 + q(y1 , y2 , y3 )
(102)
with z 1 = y1 + f 1
(103)
z 2 = y2 + f 2
(104)
z 3 = y3
(105)
and
In Expr. (102), f 1 , f 2 , and q(y1 , y2 , y3 ) are still functions of y1 , y2 , and y3 instead of functions of z 1 , z 2 , and z3. Therefore, Eqs. (100), (101), and (103)Ð (105) are used to derive polynomial expressions for y1 , y2 , and y3 in terms of z 1 , z 2 , and z3 to be substituted in Expr. (102). These expressions are obtained as follows. Suppose that it is decided that quadratic expressions in z 1 , z 2 , and z3 are sufÞcient. Then substitution of these quadratic expressions for y1 , y2 , and y3 in f 1 , f 2 , and q(y1 , y2 , y3) of Expr. (102) produces a function representation in z 1 , z 2 , and z3 only. Inspection of f 12 and f 22 after this substitution shows that the degree of all terms containing z1 or z2 and resulting from the quadratic terms of the polynomial substitutions for y1 and y2 is Þve or higher. After substitution, the same is true for all terms of q(y1 , y2 , y3) containing y1 or y2 . Therefore, these quadratic terms do not affect the quartic terms of the function representation. The conclusion is that for a quartic function representation, it is sufÞcient to substitute z1 for y1 and z2 for y2 , respectively. In addition, Eqs. (100) and (101) show that substituting z3 for y3 and subsequent squaring of f1 and f2 produces in Expr. (102) two quartic terms in z 3 , respectively described by ′2 4 −δ1 γ102 z3
(106)
′′2 4 −δ2 γ012 z3
(107)
′2 ′′2 ε34 = γ004 − δ1 γ102 − δ2 γ012
(108)
δ1 z 12 + δ2 z 22 + ε32 z 32 + ε33 z 33 + ε34 z 44
(109)
and Then ε34 , the coefÞcient of z 34 in the function representation, becomes where γ004 is the coefÞcient of y34 in the original Taylor expansion. Since the remaining quartic terms containing z1 or z2 may be removed in the next step, the quartic function representation becomes
with ε32 = γ002 , ε33 = γ003 , and ε34 deÞned by Eq. (108). Notice that Exprs. (108) and (109) are easily generalized to any number of inessential variables. SpeciÞcally, let there be a number of M + 3 variables z 1 , . . . , z M+3 of which
RESOLUTION RECONSIDERED
291
z M+3 is essential. Furthermore, let ψ/4! be the coefÞcient of z 4M+3 before the removal of the inessential cubic terms. Denote the coefÞcients of z m z 2M+3 , m = 1, . . . , M + 2, by ϕm and the nonvanishing eigenvalues of H by δm , m = 1, . . . , M + 2. Then, by Expr. (108), the coefÞcient of z M+3 after removal of the inessential cubic terms is described by 1 (110) ψ − 6ϕ T diag δ1−1 . . . δ −1 M+2 ϕ 4!
where ϕ is the (M + 2) × 1 vector of the ϕm . Expression (110) will be used in Section VI. The procedure may, of course, be continued to produce representations of a degree higher than four. For example, the quintic function representation may be shown to be δ1 z 12 + δ2 z 22 + ε32 z 32 + ε33 z 33 + ε34 z 34 + ε35 z 35
(111)
with ′ ′ ′ ′ ′′ ′ ′′ ′′ ′′ ′′ (γ103 − γ102 γ201 − γ012 γ111 ) − δ2 γ012 (γ013 − γ012 γ021 ) ε35 = γ005 − δ1 γ102 (112)
However, for the purposes of this article, the quartic representation described by Eq. (109) sufÞces.
E. Conclusions In this section, a relatively simple polynomial representation has been derived for functions in the neighborhood of singular stationary points. Here, the concept neighborhood refers both to the independent variables and to the parameters of the function. The parameter values have to be in the neighborhood of a bifurcation set representing all parameter combinations for which the stationary point concerned is singular. The parameter neighborhood in which the representation is valid consists of all parameter combinations between the bifurcation set concerned and the next one encountered in parameter space. The function representations analyzed have been restricted to functions that may become degenerate in only one of the independent variables at a time. This is sufÞcient for the purposes of this article. It has been shown that under these conditions the representation may always be put in the form described by Eq. (96): δ1 v12 + · · · + δ P−1 v 2P−1 + ε P2 v 2P + ε P3 v 3P + ε P4 v 4P + · · · + ε P K v PK
(113)
In this expression, the sign of each of the coefÞcients δ, . . . , δ P−1 remains the same for all parameter values considered. Therefore, the structure of the
292
A. VAN DEN BOS AND A. J. DEN DEKKER
function does not change in the directions of the corresponding independent variables v1 , . . . , v P−1 . This is the reason that these variables are called inessential. The coefÞcient ε P2 , on the other hand, may become equal to zero as may ε P3 up to and including ε P,K −1 , but ε P K is the Þrst coefÞcient that cannot vanish under the inßuence of the parameters. The variable v P represents the direction in which structural change may occur and, therefore, v P is called an essential variable. It has been shown that the ε Pk are equal to or are relatively simple algebraic expressions in the coefÞcients of the original Taylor expansion of the function about the stationary point concerned. The analysis of the structure of the function using the representation Expr. (113) is much simpler than that of the function from which Expr. (113) is derived, for the following reasons: r
r
It requires an analysis in only one variable, the essential variable v P . The inessential variables v1 , . . . , v P−1 may be left out of consideration. The expression in the essential variable v P , the degenerate part of the function, is the polynomial ε P2 v 2P + ε P3 v 3P + ε P4 v 4P + · · · + ε P K v PK
r
(114)
All that is required is a study of the nature and number of the stationary points of this polynomial. Combining the results of this study with the signs of δ1 , . . . , δ P−1 determines the structure of the function in the neighborhood of the stationary point concerned unambiguously. The degree K of the polynomial (114) is such that the ε P K is the Þrst coefÞcient that cannot vanish. In many practical applications, this degree is low, which further simpliÞes the analysis of the structural change.
In Section VI, the elements of singularity theory described in this section will be used to explain the occurrence of singularities in a general class of likelihood functions of parameters of two-component observations.
VI. Singularity of Likelihood Functions A. Introduction In this section, it is shown that likelihood functions for two-component models may have singular stationary points. These singularities are caused by the observations. Thus the observations play the part of the function parameters in Section V. For different sets of observations, the structure of the likelihood function may be different. It will also be shown that these singularities may cause the values of the quantities to be measured to coincide exactly at the maximum of the likelihood function. For example, if error-disturbed observations are made on two seriously overlapping spectral peaks, the maximum
RESOLUTION RECONSIDERED
293
TABLE 1 Corresponding Notions in Singularity Theory and Estimation Singularity theory
Estimation
Function Independent variables Parameters of the function Absolute maximum
Likelihood function Parameters of the model (Errors in) observations Solution for parameters
likelihood estimates of the locations of the peaks may be distinct for the one set of observations and exactly coinciding for the other. In the latter case, both quantities can, of course, no longer be resolved from the available observations. To avoid confusion, at this point, we introduce the terminology to be used in this section and compare it with that of Section V. In Section V, parametric functions of a number of independent variables were studied. In this section, the function is the likelihood function which is a function of the model parameters. The observations act as the parameters of the likelihood function. By deÞnition, the location of the absolute maximum of the likelihood function is the solution for the model parameters. This corresponds to the absolute maximum of the function in Section V. Table 1 summarizes these corresponding notions. The outline of this section is as follows. In Section VI.B, two-component models are introduced and their parameterization, that is, dependence on the quantities to be measured, is discussed. Next, it is explained how these parametric models enter the probability density function of the observations and thus deÞne the likelihood function of the parameters. In Section VI.C, the results of Section VI.B are used to show that for all distributions of the observations and all types of component functions, the corresponding likelihood functions may be expected to have similar structures. Numerical examples in Section VI.D conÞrm this. In Sections VI.E and VI.F, a remarkably simple polynomial representation is derived describing and explaining the two different structures likelihood functions for two-component models may have. Conclusions are drawn in Section VI.G. B. Likelihood Functions for Two-Component Models Suppose that the elements wn of the vector w = (w1 . . . w N )T
(115)
are the available observations and that their expectation is described by the
294
A. VAN DEN BOS AND A. J. DEN DEKKER
two-component model E[wn ] = f (xn ; θ ) = g(xn ; γ ) + α1 h(xn ; β1 ) + α2 h(xn ; β2 )
(116)
In this expression, h(x; βk ), k = 1, 2, are the component functions with nonlinear parameters βk and amplitudes αk . For example, in an exponential decay model h(x; βk ) = exp(−βk x) while for Gaussian spectral peaks h(x; βk ) = exp{− 12 (x − βk )2 }. The values xn of the independent variable x are the measurement points and are supposed to be known. For example, they may be time instants or spatial coordinates. The background function g(x; γ ) may be used to model contributions to the two-component model such as a trend g(x; γ ) = γ 1x or the sum of a trend and an offset g(x; γ ) = γ1 x + γ2 . The function g(x; γ ) may also model additional components αk h(x; βk ), k = 3, . . . The M × 1 vector γ is the vector of unknown parameters of g(x; γ ). The vector θ is deÞned as θ = (γ T α1 α2 β1 β2 )T
(117)
The model f (xn ; θ ) described by Eq. (116) is called the generalized twocomponent model. The addition generalized refers to the inclusion of the function g(x; γ ). For reasons to become clear later, f (xn ; θ) is Þrst reparameterized as follows: f (xn ; θ) = g(xn ; γ ) + α[λ1 h(xn ; β1 ) + λ2 h(xn ; β2 )]
(118)
with λ1 = λ and λ2 = 1 − λ. Then the new vector of unknown parameters θ is deÞned as θ = (γ T α λ β1 β2 )T
(119)
These are the parameters to be estimated from the observations. Notice that the parameter λ distributes the amplitude α over the components. Next suppose that the joint probability density function of the observations wn with expectation f (xn ; θ) may be described as p(w; f (θ ))
(120)
f (θ ) = ( f 1 (θ) . . . f N (θ))T
(121)
where
with f n (θ ) = f (xn ; θ). Then the corresponding log-likelihood function of the parameters t given the observations w is equal to q(w; f (t)) = ln[ p(w; f (t))]
(122)
where the elements of the parameter vector t = (c T a ℓ b1 b2 )T
(123)
RESOLUTION RECONSIDERED
295
correspond to those of Eq. (119) while the vector f (t) is deÞned as f (t) = ( f 1 (t) . . . f N (t))T
(124)
and f n (t) = f (xn ; t) = g(xn ; c) + a[ℓ1 h(xn ; b1 ) + ℓ2 h(xn ; b2 )]
(125)
with ℓ1 = ℓ and ℓ2 = 1 − ℓ. SimpliÞcation of the analysis involves replacing maximum likelihood estimation of θ through maximizing q(w; f (t)) with respect to the parameter vector t described by Eq. (123) by the following hypothetical procedure. For a Þxed value of ℓ, the likelihood function is maximized with respect to the newly deÞned parameter vector t = (c T
a
b1
b2 )T
(126)
obtained by leaving the parameter ℓ out of the parameter vector Eq. (123). This is repeated for all allowable values of ℓ, and the optimal values ℓ÷ and t÷ are subsequently selected. The parameterization Eq. (126) will be used in what follows. The maximum likelihood estimation of the parameters of the generalized two-component model may be summarized as follows. First, the generalized two-component model Eq. (118) is substituted for the expectations of the observations in the logarithm of the probability density function of the observations. Then the observations are substituted for the corresponding independent variables. Next, the exact parameters θ are replaced by the variable parameters t described by Eq. (123). Finally, the log-likelihood function thus obtained is maximized with respect to the parameters t described by Eq. (126) for all allowable values of ℓ, and the optimal ℓ÷ and t÷ are selected. The latter procedure has two advantages. First, as a result of keeping ℓ Þxed in every step, the number of parameters decreases by one, which substantially simpliÞes the analysis. Second, by using ℓ, the ratio of the amplitudes of the components may be limited to values agreeing with the available a priori knowledge. Since it will be assumed that α1 and α2 are known to have the same sign, in what follows only values of ℓ will be chosen on subintervals of (0, 1). See also the remark in Section III.B.3 about incorporating a priori knowledge in estimation.
C. Stationary Points of the Likelihood Functions In this subsection, Þrst the gradient of log-likelihood functions for the special case of two-component models is presented. Equating each of the elements of the gradient to zero yields a set of nonlinear equations called the normal equations. Among the solutions of the normal equations is the maximum of
296
A. VAN DEN BOS AND A. J. DEN DEKKER
the likelihood function which represents the maximum likelihood solution for the parameters. Furthermore, the solutions of the normal equations deÞne the structure of the likelihood function, which is deÞned as the pattern of its stationary points. Such solutions are, therefore, essential to the study of structural changes of the likelihood functions to be carried out later. The gradient of the log-likelihood function Eq. (122) with respect to the parameters described by Eq. (126) has three types of elements: ∂q ∂ f n ∂q ∂gn ∂q = = ∂cm ∂ f n ∂cm ∂ f n ∂cm n n
m = 1, . . . , M
(127)
∂q ∂ f n ∂q ∂q = = (ℓ1 h n (b1 ) + ℓ2 h n (b2 )) ∂a ∂ f n ∂a ∂ fn n n
(128)
∂q ∂ f n ∂q ∂q = = aℓk h (1) n (bk ) ∂bk ∂ f ∂b ∂ f n k n n n
(129)
and k = 1, 2
where q denotes the log-likelihood function q(w; f (t)), and f n is deÞned as f n (t) described by Eq. (125), while gn and h n (bk ) are deÞned as g(xn ; c) and h(xn ; bk ), respectively, which are both present in Eq. (125). Furthermore, h (np) (bk ) is deÞned as the pth-order derivative of h(xn ; bk ) with respect to bk . Then the normal equations are deÞned as ∂q ∂gn =0 ∂ f n ∂cm n
m = 1, . . . , M
(130)
∂q (ℓ1 h n (b1 ) + ℓ2 h n (b2 )) = 0 ∂ fn n
(131)
∂q h (1) n (bk ) = 0 ∂ f n n
(132)
and k = 1, 2
Notice that Eqs. (130)Ð(132)constitute M + 3 nonlinear equations that can, in principle, be solved for the M + 3 unknown parameters c1 , . . . , c M , a, b1 , and b2 . However, since the equations are nonlinear, more than one solution is possible, as will now be demonstrated. Suppose that, instead of the parameters of the two-component model, the parameters t = (c T a b)T
(133)
RESOLUTION RECONSIDERED
297
of the one-component model of the same nonlinearly parametric family f n (t) = f (xn ; t) = g(xn ; c) + ah(xn ; b)
(134)
would be estimated from the same observations w. Then, the normal equations of the likelihood function q for this model become ∂q ∂gn =0 m = 1, . . . , M (135) ∂ f n ∂cm n
and
∂q h n (b) = 0 ∂ fn n
(136)
∂q h (1) n (b) = 0 ∂ f n n
(137)
ö T is a solution for the parameters (c1 . . . c M a b)T Suppose that (ö c1 . . . cöM aö b) ö T is substituted for t = in Eqs. (135)Ð(137). Then, if tö = (ö c1 . . . cöM aö bö b) T (c1 . . . c M a b1 b2 ) in the gradient equations Eqs. (127)Ð(129)for the twocomponent model Eq. (125), this gradient becomes equal to the null vector. That is, the parameter values tö generated by the solution of the normal equations for the one-component model satisfy the normal equations for the two-component model. Therefore, tö = (ö c1 . . . cöM
öT aö bö b)
(138)
is a stationary point of the log-likelihood function for the two-component model. This stationary point will be called a one-component stationary point since it has been generated by a stationary point of the log-likelihood function for a one-component model. Conclusion VI.1 The Þrst conclusion of this section is that log-likelihood functions for generalized two-component models always have a onecomponent stationary point. This is true for any component function, background function, and log-likelihood function. Furthermore, let, for the moment, ℓ = ℓ1 = ℓ2 = 0.5. Then, for symmetry reasons, (b2, b1) is the absolute maximum of the log-likelihood function if (b1, b2) is. This means that in this special case there are two equivalent absolute maxima located symmetrically with respect to the line b1 = b2. If, subsequently, ℓ is somewhat changed, a slight asymmetry in the locations of the maxima occurs and one maximum becomes relative but, for continuity reasons, does not disappear. This leads to the second conclusion.
298
A. VAN DEN BOS AND A. J. DEN DEKKER
Conclusion VI.2 The second conclusion of this section is that the log-likelihood function for two-component models has two maxima.
D. Examples of Stationary Points In this subsection, examples are presented illustrating the structure of the loglikelihood function for simple two-component models. Example VI.1 (A Biexponential Model and Normally Distributed Observations) In this example, it is assumed that the observations w = (w1 . . . w N )T have a normal probability density function. In Section IV.C.1, it has been shown that then the log-likelihood function is described by 1 1 N ln 2π − ln det W − (w − f (t))T W −1 (w − f (t)) (139) 2 2 2 where W is the covariance matrix of the observations and f (θ ) is the N × 1 vector of the expectations of the wn . In this example, it will be supposed that these expectations are biexponential and contain a linear background function q=−
f n (θ) = γ1 + γ2 xn + α1 exp(−β1 xn ) + α2 exp(−β2 xn )
(140)
or, in the parameterization of Eq. (118), f n (θ ) = γ1 + γ2 xn + α[λ1 exp(−β1 xn ) + λ2 exp(−β2 xn )]
(141)
The function g(x; γ ) introduced in Section VI.B is equal to γ1 x + γ2 and h(x; βk ) = exp(−βk x), k = 1, 2. Then, in agreement with Section VI.B, the model substituted for the expectations in the likelihood function is described by f n (t) = c1 + c2 xn + a[ℓ1 exp(−b1 xn ) + ℓ2 exp(−b2 xn )]
(142)
with ℓ1 = ℓ and ℓ2 = 1 − ℓ. From Eqs. (130)Ð(132),it then follows that the normal equations for the two-component model are described by − wmn dn xm = 0 m≥n
m≥n
m≥n
m≥n
− wmn dn = 0
(143) − wmn dn [ℓ1 exp(−b1 xm ) + ℓ2 exp(−b2 xm )] = 0 − wmn dn xm exp(−bk xm ) = 0
k = 1, 2
RESOLUTION RECONSIDERED
299
with dn = wn − c1 − c2 xn − a[ℓ1 exp(−b1 xn ) + ℓ2 exp(−b2 xn )]
(144)
− while the wmn are the elements of W −1 . The normal equations for the onecomponent model are described by − wmn dn xm = 0 m≥n
m≥n
m≥n
m≥n
with
− wmn dn = 0
(145) − wmn dn exp(−bxm ) = 0 − wmn dn xm exp(−bxm ) = 0
dn = wn − c1 xn − c2 − a exp(−bxn )
(146)
Notice that Eqs. (145) and (146) are directly obtained by substituting b1 = b2 = b in Eqs. (143) and (144). In the numerical example to be described now, the observations are supposed to be uncorrelated and to have equal variance σw2 . Then W = σw2 I , with I the identity matrix of order N. As shown in Section IV.C.1, under these conditions the log-likelihood function described by Eq. (139) simpliÞes to 1 N [wn − f n (t)]2 (147) − ln 2π − N ln σw − 2 2σw2 n Consider as a numerical example the following model:
f n (θ ) = 0.005 + 0.01xn + 0.7 exp(−xn ) + 0.3 exp(−0.8xn )
(148)
Then θ = (γ1 γ2 α β1 β2 )T = (0.005 0.01 1 1 0.8)T , λ1 = 0.7 and, therefore, λ2 = 0.3. For the xn , the values 0.4 × n, n = 1, . . . , 10, are chosen. Figure 5 shows f n (θ) in these points. Next, in the log-likelihood function, the observations are replaced by their expectations. Generally, the structure of likelihood functions and criteria of goodness of Þt as functions of the parameters with the observations replaced by their expectations provides the experimenter with a reference structure to which the structure for all other realizations of the observations can be compared. Notice that in this example with these Òobservations,Óall dn = wn − f n (t) = E[wn ] − f n (t) = f n (θ) − f n (t)
(149)
300
A. VAN DEN BOS AND A. J. DEN DEKKER
Figure 5. Measurement points and expectations of the observations in Example VI.1. The expectations are the sum of a biexponential function and a linear background.
for ℓ = λ are equal to zero if t is equal to θ. Then the normal equations given in Eq. (143) are always satisÞed while the log-likelihood function Eq. (147) is seen to be absolutely maximal. The log-likelihood function Eq. (147) is a function of the elements of t = (c1 c2 a b1 b2 )T . To visualize this function in spite of its high dimensionality, we adopt the following procedure. The Þrst three equations of Eq. (143) are solved for the linear parameters c1, c2, and a. Then the solution for these linear parameters is a function of the nonlinear parameters b1 and b2. This solution is subsequently substituted in the log-likelihood function Eq. (147), which makes it a function of b1 and b2 only. Contours of this log-likelihood function are shown in Figure 6 as a function of b1 − b2 and λ1 b1 + λ2 b2 , that is, for ℓ = λ. These transformed coordinates have been chosen to make the contours easier to interpret. The contours show an absolute maximum at b1 − b2 = 0.2 and λ1 b1 + λ2 b2 = 0.94 corresponding to b1 = 1 and b2 = 0.8 being the exact values β1 and β2 . There is an additional, relative, maximum at λ1 b1 + λ2 b2 = 0.941 and b1 − b2 = −0.191 corresponding to b1 = 0.884 and b2 = 1.075. More or less between both maxima of Figure 6, there is a saddle point at λ1 b1 + λ2 b2 = 0.94 and b1 − b2 = 0 corresponding to b1 = b2 = 0.944. This is the one-component stationary point predicted in the previous subsection. For obvious reasons, in what follows the structure of the likelihood function thus found will be called the two-maxima, one saddle point structure.
RESOLUTION RECONSIDERED
301
Figure 6. Contours of the normal log-likelihood function of Example VI.1.
Example VI.2 (A Bi-Gaussian Model and Independent Poisson Distributed Observations) In this example, it is assumed that the observations w = (w1 . . . w N )T are independent and Poisson distributed. In Section IV.C.1, it has been shown that then their log-likelihood function is described by − f n (t) + wn ln f n (t) − ln wn ! (150) n
where, as before, t is the vector of parameters with respect to which the likelihood function has to be maximized. Furthermore, f n (θ) is the expectation of the nth observation wn . In this example, it will be assumed that this expectation is described by f n (θ) = α[λ1 h n (β1 ) + λ2 h n (β2 )] where
6 5 h n (βk ) = exp − 12 (xn − βk )2
(151)
(152)
with k = 1, 2. The components of this bi-Gaussian function are described by αλk exp{− 21 (xn − βk )2 }, k = 1, 2, located at β1 and β2 and having peak values αλ1 = αλ and αλ2 = α(1 − λ), respectively. Therefore, the model f n (t) to be substituted in the likelihood function is f n (t) = a[ℓ1 h n (b1 ) + ℓ2 h n (b2 )]
(153)
302
A. VAN DEN BOS AND A. J. DEN DEKKER
with ℓ1 = ℓ and ℓ2 = 1 − ℓ. Then the normal equations for the two-component model are described by dn (t)[ℓ1 h n (b1 ) + ℓ2 h n (b2 )] = 0 (154) n
and
n
with
dn (t)(xn − bk )h n (bk ) = 0
dn (t) =
k = 1, 2
wn −1 f n (t)
Similarly, the normal equations for the one-component model are dn (t)h n (b) = 0
(155)
(156)
(157)
n
and
n
with
dn (t)(xn − b)h n (b) = 0
dn (t) =
wn −1 f n (t)
(158)
(159)
and 6 5 f n (t) = ah n (b) = a exp − 21 (xn − b)2
(160)
Again, it is seen that if b is the solution of the normal equations for the onecomponent model, the point (b, b) is a solution of the normal equations for the two-component model. Consider as a numerical example the following model: 6 5 6 5 (161) f n (θ) = 0.8 exp − 12 (xn − 0.05)2 + 0.2 exp − 21 (xn + 0.05)2
Then θ = (α β1 β2 )T = (1 0.05 −0.05)T and λ = λ1 = 0.8 and, therefore, λ2 = (1 − λ) = 0.2. The measurement points are taken as xn = −1.8 + (n − 1) × 0.4, n = 1, . . . , 10. Figure 7 shows f n (θ). Next the observations wn in the likelihood function Eq. (150) are replaced by their expectations f n (θ ) described by Eqs. (151) and (152). It is observed that then for ℓ = λ the deviations E[wn ] f n (θ) wn −1= −1= −1 (162) dn = f n (t) f n (t) f n (t)
RESOLUTION RECONSIDERED
303
Figure 7. Measurement points and expectations of the observations in Example VI.2. The expectations are the sum of two overlapping Gaussian functions.
are equal to zero if t is equal to θ. Then the normal equations Eqs. (154) and (155) are always satisÞed. The Poisson log-likelihood function Eq. (150) is a function of the three elements of t = (a b1 b2 )T . To eliminate a, we solve this parameter from the normal equation Eq. (154) where it is present in the dn(t): a=
$n w n $n ℓ1 h n (b1 ) + ℓ2 h n (bk )
(163)
where h n (bk ) = exp{− 12 (xn − bk )2 }. If this expression for a is substituted in Eq. (153) and, subsequently, the resulting expression is substituted for fn(t) in the likelihood function Eq. (150), the result is 5 6 − wn 1 ln ℓ1 h n2 (b1 ) + ℓ2 h n2 (b2 ) + wn ln{ℓ1 h n (b1 ) + ℓ2 h n (b2 )} n1
n2
n
(164)
where terms depending on the wn only have been omitted. Contours of the likelihood function Eq. (164) as a function of b1−b2 and λ1b1 + λ2b2, that is, for ℓ = λ, are shown in Figure 8. There is a maximum at (0.1, 0.03) corresponding to the exact parameter values b1 = β 1 = 0.05, b2 = β 2 = −0.05, and α = 1. There is an additional maximum at (−0.1, 0.03) corresponding to b1 = 0.01 and b2 = 0.11. Between these maxima, there is a saddle point at (0, 0.03) corresponding to b1 = b2 = 0.03 which is the one-component stationary point.
304
A. VAN DEN BOS AND A. J. DEN DEKKER
Figure 8. Contours of the Poisson log-likelihood function of Example VI.2.
Conclusion VI.3 Example VI.1 of this section concerns a normal likelihood function. The model of the expectations of the observations in this example is the sum of a biexponential function and a linear background function. The likelihood function of Example VI.2, on the other hand, is Poisson while the model of the expectations consists of the sum of two overlapping Gaussian peaks. In spite of these differences, the log-likelihood functions in both examples have the same two-maxima, one saddle point structure. This conÞrms the conclusion of the preceding subsection that this particular structure is characteristic of all log-likelihood functions for which the expectation of the observations consists of a generalized two-component function.
E. The Hessian Matrix of the Log-Likelihood Function In this subsection, an expression is derived for the Hessian matrix of the log-likelihood function of the parameters of the generalized two-component model. This is a Þrst step in the assessment of the behavior of the stationary points under the inßuence of the observations, which, as has been explained in Section V, requires Taylor expansion of the log-likelihood function with the model parameters as variables. In the preceding two subsections, the onecomponent stationary point of the likelihood function for two-component models has been introduced. This point, located between both two-component stationary points, will be chosen as the origin of the Taylor expansion. This choice will be seen to simplify the analysis substantially.
RESOLUTION RECONSIDERED
305
For the computation of the elements of the Hessian matrix of the loglikelihood function q with respect to the parameters t = (c T a b1 b2 )T of the generalized two-component model, use is made of the expressions Eqs. (127)Ð (129) for the gradient with respect to these parameters. Straightforward differentiation yields Þve different types of second-order derivatives: ∂ 2 q ∂gn 1 (c) ∂gn 2 (c) ∂q ∂ 2 gn ∂ 2q = + ∂cm 1 ∂cm 2 ∂ f n 1 ∂ f n 2 ∂cm 1 ∂cm 2 ∂ f n ∂cm 1 ∂cm 2 n1 n2 n ∂ 2 q ∂gn 1 (c) ∂ 2q = ℓ1 h n 2 (b1 ) + ℓ2 h n 2 (b2 ) ∂cm ∂a ∂ f n1 ∂ f n 2 ∂cm n1 n2
(165)
(166)
∂ 2q ∂ 2q = ℓ1 h n 1 (b1 ) + ℓ2 h n 1 (b2 ) ℓ1 h n2 (b1 ) + ℓ2 h n2 (b2 ) 2 ∂a ∂ fn1 ∂ fn2 n1 n2 (167) 2 2 ∂ q ∂gn 1 (c) (1) ∂ q = aℓk h (bk ) (168) ∂cm ∂bk ∂ f n 1 ∂ f n2 ∂cm n2 n1 n2 ∂ 2q ∂ 2q = aℓk h (1) n 1 (bk1 ) ℓ1 h n 2 (b1 ) + ℓ2 h n 2 (b2 ) ∂a ∂bk ∂ fn1 ∂ fn2 n1 n2 + ℓk
∂q h (1) n (bk ) ∂ f n n
(169)
and ∂ 2q (1) ∂ 2q = aℓk1 ℓk2 h (1) n 1 bk1 h n 2 bk2 ∂bk1 ∂bk2 ∂ fn1 ∂ fn2 n1 n2 + δk1 ,k2 aℓk1
∂q h (2) n bk1 ∂ fn n
(170)
where δk1 ,k2 is the Kronecker symbol deÞned as δk1 ,k2
1 = 0
if k1 = k2 otherwise
(171)
1. Change of Coordinates For reasons explained later, the coordinates t = (c1 . . . c M a b1 b2 )T are transformed into t † = (c1 . . . c M a b1† b2† )T with b1† = ℓ1 b1 + ℓ2 b2 and
306
A. VAN DEN BOS AND A. J. DEN DEKKER
b2† = b1 − b2 :
⎞ c1 ⎜ .. ⎟ ⎜ . ⎟ % ⎜c ⎟ I ⎜ M⎟ t† = ⎜ a ⎟ = O ⎜ ⎟ ⎜ †⎟ ⎝ b1 ⎠ ⎛
b2†
where
⎛
⎞ c1 ⎜ . ⎟ & ⎜ .. ⎟ ⎟ O ⎜ ⎜c M ⎟ = diag(I ⎟ L ⎜ ⎜a⎟ ⎝b ⎠
L)t
(172)
1
b2
% ℓ L= 1 1
& ℓ2 −1
(173)
and I is the identity matrix of order M + 1. Then, if H denotes the (M + 3) × (M + 3) Hessian matrix deÞned by Eqs. (165)Ð(170),the Hessian matrix H † with respect to the new coordinates t † is described by H † = diag(I
L −T )H diag(I
where the inverse L −1 of L is deÞned as % & 1 ℓ2 −1 L = 1 −ℓ1
L −1 )
(174)
(175)
where use has been made of the fact that ℓ1 + ℓ2 = 1. Next suppose that the original matrix H is partitioned as follows: H11 H12 (176) H= T H22 H12 where the dimensions of H11, H12, and H22 are (M + 1) × (M + 1), (M + 1) × 2, and 2 × 2, respectively. Then it follows from Eqs. (174) and (176) that † † H12 L −1 H11 H11 H12 † H = = (177) † †T (H12 L −1 )T L −T H22 L −1 H22 H12 † † † , H12 , and H22 of H † are now successively The three different partitions H11 computed:
r
† , is, of course, Equation (177) shows that the upper diagonal partition, H11 unchanged: † H11 = H11
(178)
307
RESOLUTION RECONSIDERED r
r
† , is seen to be equal to The upper off-diagonal partition, H12 ⎛ ⎞ ℓ2 h 11 − ℓ1 h 12 h 11 + h 12 .. .. ⎜ ⎟ † H12 = H12 L −1 = ⎝ ⎠ (179) . . h m+1,1 + h M+1,2 ℓ2 h M+1,1 − ℓ1 h M+1,2 12
† The lower diagonal matrix, H22 , is equal to † = L −T H22 L −1 H22 h 11 + 2h 12 + h 22 = ℓ2 h 11 + (ℓ2 − ℓ1 )h 12 − ℓ1 h 22
ℓ2 h 11 + (ℓ2 − ℓ1 )h 12 − ℓ1 h 22 ℓ22 h 11 − 2ℓ1 ℓ2 h 12 + ℓ21 h 22
22
(180) where use has been made of the fact that H22 is symmetric. This completes the description of the Hessian matrix H † after the coordinate change Eq. (172) and in terms of the elements of the original matrix H. 2. The Hessian Matrix at the One-Component Stationary Point The next step is the evaluation of H † at the one-component stationary point ö T described by Eq. (138). Recall that, by deÞnition, the tö = (ö c1 . . . cöM aö bö b) ö T satisfy the normal equations Eqs. (135)Ð(137) elements of (ö c1 . . . cöM aö b) of the likelihood function for the estimation of the parameters of the one† † † of H † are , and H22 , H12 component model. Again the three partitions H11 addressed consecutively: r
r
† Consider Þrst the elements of the partition H11 described by Eqs. 165Ð ö 167. Since b1 = b2 = b and ℓ1 + ℓ2 = 1, these elements are easily seen to be equal to the corresponding elements of the Hessian matrix of the log-likelihood function for the parameters of the one-component model, ö T. evaluated at the stationary point (ö c1 . . . cöM aö b) † Next the elements of the partition H12 deÞned by Eq. (179) are addressed. These elements are functions of the elements of the matrix H that are deÞned by Eqs. (168) and (169). Simple computations then show that the † are equal to the second-order Þrst M elements of the Þrst column of H12 partial derivatives of the log-likelihood function for the one-component model with respect to cm and b. Therefore, they are described by
∂ 2 q ∂gn1 (ö ∂ 2q c) (1) ö = aö h n2 (b) ∂cm ∂b ∂ f ∂ f ∂c n1 n2 m n1 n2
(181)
308
A. VAN DEN BOS AND A. J. DEN DEKKER
Similarly, the (M + 1)-th element of the Þrst column is the second-order partial derivative with respect to a and b. It is, therefore, equal to ∂q ∂ 2q ∂ 2q ö ö ö h (1) = aö h (1) (b) n 1 (b)h n 2 (b) + ∂a ∂b ∂ fn1 ∂ fn2 ∂ fn n n n1 n2 = aö
r
n1
n2
∂ 2q ö n2 (b) ö h (1) (b)h ∂ fn1 ∂ fn2 n1
(182)
where the last step follows from Eq. (137). Equally simple computations † are all equal to zero. show that the elements of the second column of H12 † Finally, computation of the elements of H22 using Eqs. (170) and (180) shows that the upper diagonal element of this matrix is the second-order derivative of the log-likelihood function for the one-component model with respect to the parameter b: ∂q ∂ 2q ∂ 2q (1) ö (1) ö ö ö ö h ( b)h ( b) + a = a h (2) n1 n2 n (b) ∂b2 ∂ f ∂ f ∂ f n n n 1 2 n n1 n2
(183)
Furthermore, the off-diagonal elements are equal to zero while the lower diagonal element is equal to
ö − ℓ) ρ = aℓ(1
∂q ö h (2) n (b) ∂ f n n
(184)
Hence, in summary, the particularly simple outcome for the Hessian matrix of the log-likelihood function for two-component models evaluated at the one-component stationary point is
H† =
%
G oT
o ρ
&
= diag(G
ρ)
(185)
where G is the (M + 2) × (M + 2) Hessian matrix of the log-likelihood function for the one-component model and o is the (M + 2) × 1 null vector. Before the results Eqs. (184) and (185) are discussed, two examples illustrating the computation of the quantity ρ are presented. These examples correspond to those of Section VI.D.
RESOLUTION RECONSIDERED
309
Example VI.3 (The Quantity ρ for a Biexponential Model Fitted to Normally Distributed Observations) For the normal log-likelihood function and the biexponential model with background, described by Eqs. (147) and (142), respectively, the quantity ∂q/∂ f n becomes 1 ∂q = 2 (wn − f n ) ∂ fn σw
(186)
with f n = f n (t) = a[ℓ1 exp(−b1 xn ) + ℓ2 exp(−b2 xn )] + c1 + c2 xn
(187)
Therefore, ρ=
ö − ℓ) aℓ(1 ö n) (wn − f n )xn2 exp(−bx σw2 n
(188)
ö T is the least squares solution ö T . Notice that (ö at tö = (ö c1 cö2 aö bö b) c1 cö2 aö b) for the parameters when the one-component model c1 + c2 xn + a exp(−bxn )
(189)
is Þtted to the observations. Therefore, the procedure for computation of the quantity ρ from given observations w is this: ö T for the parameters 1. Compute the least squares solution (ö c1 cö2 aö b) T (c1 c2 a b) of the one-component model Eq. (189). 2. Substitute the solution in the expression Eq. (188) for ρ. Example VI.4 (The Quantity ρ for a Bi-Gaussian Model and Independent Poisson Distributed Observations) In this example, the quantity ρ is computed for the Poisson log-likelihood function and the bi-Gaussian expectation model described by Eq. (150) and Eqs. (151) and (152), respectively. From Eq. (150), it follows that wn ∂q = −1 ∂ fn fn
(190)
2 In this example, h n (b) = exp{− 21 (xn − b)2 } and hence h (2) n (b) = [(x n − b) − 1]h n (b). Therefore, & % wn ö 2 − 1]h n (b) ö ö − ℓ) − 1 [(xn − b) ρ = aℓ(1 f n n & % wn ö 2 h n (b) ö ö − ℓ) − 1 (xn − b) (191) = aℓ(1 f n n
310
A. VAN DEN BOS AND A. J. DEN DEKKER
ö T where, in the last step, use has been made of Eq. (157), at (a b)T = (aö b) which is the normal equation for the one-component model. 3. Summary, Discussion, and Conclusions In Sections VI.E.1 and VI.E.2, the expression Eq. (185), %
G H = T o †
o ρ
&
= diag(G
ρ)
(192)
has been derived for the (M + 3) × (M + 3) Hessian matrix of the log-likeliö T. hood function q(t) at the one-component stationary point (ö c1 . . . cöM aö bö b) In Eq. (192), the matrix G is the (M + 2) × (M + 2) Hessian matrix of the likelihood function for the one-component model with the same observations ö T . The vector o is the and evaluated at the stationary point (ö c1 . . . cöM aö b) (M + 1) × 1 null vector. The quantity ρ is the scalar deÞned by Eq. (184), ö − ℓ) ρ = aℓ(1
∂q ö h (2) n (b) ∂ f n n
(193)
ö T . The scalar quantity ℓ is discussed in Secevaluated at (ö c1 . . . cöM aö bö b) tion VI.B. It is a parameter of the two-component model f n (t) = gn (c) + a[ℓ1 h n (b1 ) + ℓ2 h n (b2 )]
(194)
with gn (c) and h n (bk ) the background function and the component function, respectively, both in the point xn . It is important to realize that under these conditions H † is fully deÞned by the ö T , that is, the stationary point of the log-likelihood elements of (ö c1 . . . cöM aö b) function for the one-component model f n (t) = gn (c) + ah n (b)
(195)
From these considerations, the usefulness of transforming the coordinates b1 and b2 into b1† = ℓ1 b1 + ℓ2 b2 and b2† = b1 − b2 is now apparent: this transformation partly diagonalizes the Hessian matrix at the one-component stationary point and produces the diagonal element ρ. Furthermore, up to now, no particular assumptions have been made with respect to the nature of the ö T of the log-likelihood function for the onestationary point (c1 . . . cöM aö b) component model. However, if it is assumed that it is the maximum likelihood solution for the parameters (c1 . . . c M a b)T of the one-component model, ö T is, by deÞnition, a maximum and, therefore, the Hessian ma(ö c1 . . . cöM aö b) trix G in this point is negative deÞnite. Then the nature of the one-component
311
RESOLUTION RECONSIDERED
ö T of the log-likelihood function of the twostationary point (ö c1 . . . cöM aö bö b) component model parameters is determined by the sign of ρ. If ρ is positive, this point is a saddle point. This saddle point is clearly seen in the reference structures of Figures 6 and 8. On the other hand, if ρ is negative, it is a maximum. Finally, if ρ is equal to zero, the point is degenerate. Therefore, in any case, the coordinates c1 . . . , c M , a, and b1† are inessential variables. On the other hand, if ρ might vanish, b2† would be essential. (For an explanation of the terms essential and inessential, see Section V.C.1.) For the moment, assume that ρ may vanish. It will be shown later that this may happen under the inßuence of ßuctuations of the observations. Then, for a study of the structures of the log-likelihood function that may occur, the degenerate part of this function is needed, that is, the Taylor polynomial in b2† up to and including the Þrst term with a nonvanishing coefÞcient such as described in Section V.D.1. The derivation of this polynomial will be the subject of the next subsection.
F. The Degenerate Part of the Likelihood Function 1. The Quadratic Term Equation (192) describes the Hessian matrix of the log-likelihood function for the two-component model in the one-component stationary point (ö c1 . . . ö T . Then the quadratic terms of the Taylor expansion of the logcöM aö bö b) likelihood function about this point are described by 1 [(c 2!
− cö)T
ö a − aö b1† − b]G[(c − cö)T
öT+ a − aö b1† − b]
2 1 ρb2† 2!
(196) ö T is the maxAs in Section VI.E.3, it is now assumed that (ö c1 . . . cöM aö b) imum likelihood solution for the parameters of the one-component model. ö T is a maximum of the likelihood function for the oneThen (ö c1 . . . cöM aö b) component model. Since the matrix G is the Hessian matrix at this maximum, it is negative deÞnite. This implies that G can always be diagonalized by a suitö T into an M + 2 able nonsingular transformation of ((c − cö)T a − aö b1† − b) ′ vector t so that the quadratic terms Eq. (196) become 1 M+2 δm tm′2 + 2! m=1
1 ρb2′2 2!
(197)
with b2′ = b2† where the δm , m = 1, . . . , M + 2, are strictly negative but ρ may be positive, zero, or negative. Then the splitting lemma presented in
312
A. VAN DEN BOS AND A. J. DEN DEKKER
Section V.C.1 shows that the Þrst M + 2 terms of Expr. (197) are the nondegenerate part of the representation of the log-likelihood function in the neigh′ borhood of the one-component stationary point. The coordinates t1′ , . . . , t M+2 ′2 are inessential. The term ρb2 is the Þrst term of the degenerate part. This part has to be investigated by computing the higher degree terms up to and including the Þrst term having a coefÞcient that cannot vanish. This computation will be carried out in the subsequent subsections. This subsection will be concluded by two examples illustrating the inßuence of errors in the observations on the sign of the coefÞcient ρ. Example VI.5 (The Sign of the Quantity ρ in the Presence of a Modeling Error) In this example, the quantity ρ is computed for the log-likelihood function described by Eq. (147): 1 N [wn − f n (t)]2 (198) − ln 2π − N ln σw − 2 2σw2 n
when, as in Example VI.1, the expectations of the observations are described by f n (θ ) = γ1 xn + γ2 + α[λ exp(−β1 xn ) + (1 − λ) exp(−β2 xn )]
(199)
but, different from Example VI.1, the model Þtted to the observations is taken as f n (t) = a[ℓ exp(−b1 xn ) + (1 − ℓ)exp(−b2 xn )]
(200)
That is, the background, present in the observations, is not present in the model. The following numerical values are chosen for the parameters: α = 1, λ = 0.7, β1 = 1, and β2 = 0.8. The measurement points are xn = 0.4 × n, n = 1, . . . , 10. The standard deviation σw may be any value but is taken equal to one. Next ρ is computed for γ1 = 0 and −0.003 ≤ γ2 ≤ 0. As described in Example VI.3, this is done by Þtting the model a exp(−bxn ) ö bö in the expression for ρ: to the wn and substituting the solution a, ö n )]xn2 exp(−bx ö n) ö − ℓ) [wn − aö exp(−bx ρ = aℓ(1
(201)
(202)
n
with ℓ = 0.7. The results are shown in Figure 9. The Þgure shows that in the absence of background, ρ is positive. It decreases for decreasing background until for γ ≈ −0.002 the quantity ρ becomes negative. Thus, this example shows the inßuence of a relatively minor modeling error upon the sign of the quantity ρ.
RESOLUTION RECONSIDERED
313
Figure 9. The quantity ρ as a function of the background modeling error in Example VI.5.
Example VI.6 (The Sign of the CoefÞcient ρ in the Presence of Statistical Errors) The purpose of this example is to illustrate the inßuence of statistical ßuctuations in the observations on the sign of the coefÞcient ρ. For that purpose, sets of independent, Poisson distributed observations wn , n = 1, . . . , N , are generated having expectations f (xn ; θ ) = f n (θ) = α[λ1 h n (β1 ) + λ2 h n (β2 )]
(203)
where 6 5 h n (βk ) = exp − 21 (xn − βk )2
(204)
with k = 1, 2. The reference structure of the likelihood function of this example is discussed in Example VI.2. The expression for the coefÞcient ρ is derived in Example VI.4 and is described by Eq. (191). As in Example VI.2, suppose that β1 = −β1 = 0.05 and λ = 0.8, while the measurement points are xn = −1.8 + (n − 1) × 0.4, n = 1, . . . , 10. The shape of f (x; θ) and the location of the measurement points are shown in Figure 7. The observations are generated for α = 625, 2500, 10,000, 40,000, and 160,000, respectively. In Figure 7, f (x5 ; θ) and f (x6 ; θ) are approximately equal to one. The relative standard deviation in these points is 4, 2, 1, 0.5, and 0.25% for the chosen values of α, respectively. The generation of a set of observations corresponding to each of these values of α has been repeated 100 times. For each of the resulting
314
A. VAN DEN BOS AND A. J. DEN DEKKER
Figure 10. Percentage of negative values of the quantity ρ as a function of the expected number of counts in Example VI.6.
500 sets of observations the coefÞcient ρ has been computed. Figure 10 shows the percentage of the negative values of ρ for each of the values of α. This percentage decreases for increasing α, that is, for decreasing relative standard deviation of the observations. The results of Example VI.6 may be commented upon as follows. The deÞnition by Eq. (184) shows that ρ is a linear combination of the quantities ∂q/∂ f n , n = 1, . . . , N . Here, f n is the one-component model f n (t) = ö T and, in the notation of Eq. (122), q is the logÐ ah n (b) at t = tö = (aö b) probability density function q(w; f (tö)) = log{ p(ω; f (tö))}, that is, for hypothetical observations with expectations f n (tö). Since these expectations are parameters of the probability density function p(ω), the quantity ∂q/∂ f n is equal to the Fisher score of these parameters. (The Fisher score concept was introduced in Section IV.C.2.) If the observations made and substituted in Eq. (184) had a probability density function p(ω; f (tö)) with expectation f n (tö), the expectation of ∂q/∂ f n would vanish since this is a property of the Fisher score (see Section IV.C.2). However, the true expectation of the observations wn is the two-component model f n (θ) = α[λh n (β1 ) + (1 − λ)h n (β2 )]
(205)
The conclusion is that ∂q/∂ f n tends to a stochastic variable with an expectation
RESOLUTION RECONSIDERED
315
equal to zero as f n (tö) tends to f n (θ ). Generally, this will happen as β1 and β2 are more closely located. This is illustrated in Examples VI.3 and VI.4 where 1 ∂q = 2 (wn − f n ) ∂ fn σw
(206)
wn ∂q = −1 ∂ fn fn
(207)
and
respectively, with f n = f n (tö). Since tö is the solution for the parameters of the one-component model, it differs little for different realizations of the observations and may be considered to be constant. Then the expectations of the right-hand members are approximately equal to 1 [ f n (θ) − f n ] σw2
(208)
f n (θ) −1 fn
(209)
and
respectively. On the other hand, if β 1 and β 2 are more widely apart, the approximate expectations Exprs. (208) and (209) are increasingly different from zero. ö The coefÞcients of the ∂q/∂ f n in Eq. (184) are proportional to h (2) n (b), n = 1, . . . , N . Again, since bö is an estimate of a parameter of the one-component model, it differs little for different realizations of the observations and may be considered to be constant. Then the expectation of ρ is a linear combination of the expectations of the ∂q/∂ f n and tends to zero if β 1 and β 2 approach each other. The variance of the ∂q/∂ f n is seen to be dominated by the variance of wn and, hence, the variance of ρ is dominated by the variances of the elements of w = (w1 . . . w N )T . Therefore, sample values of ρ may be expected to be increasingly frequently negative, for increasing variance of w, a fact demonstrated in Example VI.6. Conclusion VI.4 It is concluded that both systematic errors (wrong model) and nonsystematic errors (statistical ßuctuations) may change the sign of the coefÞcient ρ as compared with the sign in the absence of the errors. In the absence of errors, the one-component stationary point is a saddle point. After a change of sign of ρ as a result of the errors, the point becomes a maximum of the likelihood function of Examples VI.5 and VI.6. If this maximum would be
316
A. VAN DEN BOS AND A. J. DEN DEKKER
absolute, this would have important consequences since then the onecomponent stationary point would, in both examples, represent the solution.
2. The Cubic Term In Step 2 of the Reduction Algorithm, derived in Section V.D.1, all cubic terms of the Taylor expansion of the function investigated are removed with the exception of the cubic term in the essential variable, which, in the notation of the preceding subsection, is the term in b2′3 . It was found that the coefÞcient of this term was not affected by the removal procedure. Therefore, for us to compute the coefÞcient of the cubic term of the degenerate part of the representation of the likelihood function around the one-component stationary point, it is sufÞcient to compute the coefÞcient of b2′3 in the coordinates t ′ = ′ b2′ )T introduced in Section VI.F.1. However, this coefÞcient must (t1′ . . . t M+2 3 be the same as that of b2† in the coordinates t † = (c1 . . . c M a b1† b2† )T since t ′ has been obtained from t † by a linear transformation such that b2′ = b2† . Finally, t † is a linearly transformed version of the vector t = (c1 . . . c M a b1 b2 )T introduced in Section VI.E.1 and described by t † = diag(I
L)t
where I is the identity matrix of order M + 1 and % & ℓ1 ℓ2 L= 1 −1
(210)
(211)
It is clear that this transformation affects b1 and b2 only and that b1 = b1† + ℓ2 b2†
(212)
b2 = b1† − ℓ1 b2†
(213)
and
Generally, the cubic terms in b1 and b2 of the Taylor expansion of q about t = tö are in operator notation described by ∂− ∂− 3 1 ö ö + (b2 − b) q (214) (b1 − b) 3! ∂b1 ∂b2 By Eqs. (212) and (213) this is equal to % & &3 % 1 ö ∂− + ∂− + b2† ℓ2 ∂− − ℓ1 ∂− q (b1† − b) 3! ∂b1 ∂b2 ∂b1 ∂b2
(215)
RESOLUTION RECONSIDERED
317
Straightforward calculus shows that the generic expression for third-order partial derivatives of q with respect to b1 and b2 is ∂ 3q ∂ fn1 ∂ fn2 ∂ fn3 ∂ 3q = ∂bk ∂bℓ ∂bm ∂ f n 1 ∂ f n 2 ∂ f n3 ∂bk ∂bℓ ∂bm n1 n2 n3 +
n1
n2
∂ 2 fn1 ∂ fn2 ∂ 2q ∂ f n 1 ∂ f n2 ∂bk ∂bℓ ∂bm
∂ 2 f n1 ∂ f n2 ∂ 2 fn1 ∂ fn2 + ∂bk ∂bm ∂bℓ ∂bℓ ∂bm ∂bk 3 ∂q ∂ fn + ∂ f n ∂bk ∂bℓ ∂bm n
+
(216)
where k, ℓ, and m are equal to either 1 or 2. In this expression, f n is the generalized two-component model described by Eq. (125). Combining Eqs. (215) and (216) and some straightforward algebra then yield for the coefÞcient of b2′3 σ (217) 3! with ö 1 ℓ2 (ℓ2 − ℓ1 ) σ = aℓ
∂q ö h (3) n (b) ∂ f n n
(218)
This completes the computation of the cubic term of the degenerate part of the representation of the likelihood function q in the neighborhood of the one-component stationary point. Notice that in the important case that the amplitudes are equal, ℓ1 = ℓ2 = 0.5 and, hence, σ is exactly equal to zero. Example VI.7 (The CoefÞcient σ for the Least Squares Criterion and a Biexponential Model) For the normal log-likelihood function described by Eq. (147), 1 ∂q = 2 [wn − f n (t)] ∂ fn σw
(219)
Then for this log-likelihood function and the biexponential model with background described by Eq. (142), the quantity σ becomes σ =−
ö 1 ℓ2 (ℓ2 − ℓ1 ) aℓ ö n )]xn3 exp(−bx ö n) [wn − cö1 − cö2 xn − aö exp(−bx σw2 n (220)
ö T substituted for the parameters (a b)T . with the least squares solution (aö b)
318
A. VAN DEN BOS AND A. J. DEN DEKKER
Example VI.8 (The CoefÞcient σ for a Bi-Gaussian Model and Independent Poisson Distributed Observations) In this example, the quantity σ is computed for the Poisson log-likelihood function and the bi-Gaussian model described by Eq. (150) and Eqs. (152) and (153), respectively. From Eq. (150), it follows that wn ∂q = −1 ∂ fn fn
(221)
2 In this example, h n (b) = exp{− 21 (xn − b)2 } and hence h (3) n (b) = [(x n − b) − 3] (xn − b)h n (b). Therefore, & % wn ö 2 − 3](xn − b)h ö n (b) ö ö 1 ℓ2 (ℓ2 − ℓ1 ) σ = aℓ − 1 [(xn − b) f n n & % wn ö 3 h n (b) ö ö 1 ℓ2 (ℓ2 − ℓ1 ) − 1 (xn − b) (222) = aℓ f n n
ö T where, in the last step, use has been made of the normal at (a b)T = (aö b) equation for the one-component model Eq. (158). The deÞnition of σ by Eq. (218) is similar to the deÞnition of ρ by Eq. (184) since both are linear combinations of the Fisher scores ∂q/∂ f n . Therefore, the properties of σ are comparable to those of ρ as well. The quantity σ also has an expectation tending to zero as β1 tends to β2 . Its variance is dominated by and increases with that of the elements of w. As a consequence, sample values of both signs may occur with a frequency dependent on the distance of β1 and β2 and the statistical properties of the observations wn . Expression (108) of Section V.D.2 shows that for computing the coefÞcient of b2′4 in the degenerate part after removal of nonessential cubic terms, the coefÞcient of b2′4 and the coefÞcients of tm′ b2′2 , m = 1, . . . , M + 2, in the Taylor ex′ b2′ )T pansion with respect to the elements of t ′ = (t1′ . . . t M+2 ′ before the removal are needed. Since the elements tm , m = 1, . . . , M + 2, are liner combinations of the elements of (c1 . . . c M a b1† )T , the coefÞcients of the combinations of the coefÞcients of the terms in 2 tm′ b2′2 are 2 linear 2 2 terms c1 b2† , . . . , c M b2† , ab2† , and b1† b2† of the Taylor expansion with respect to (c1 . . . c M a b1† b2† )T . Therefore, for us to get an impression of the behavior of the coefÞcients of2 the terms in2 tm′ b2′22, it is sufÞcient to study the behavior of the 2 coefÞcients of c1 b2† , . . . , c M b2† , ab2† , and b1† b2† . This is the reason why the latter coefÞcients are now discussed. Since their computation is similar to that of σ described earlier in this subsection, only the results2 are given. These 2 2 are as follows. The coefÞcients of cm b2† , ab2† , and b1† b2† are described by,
respectively,
$$\frac{1}{2}\,\hat{a}\,\ell_1\ell_2 \sum_{n_1,n_2} \frac{\partial^2 q}{\partial f_{n_1}\,\partial f_{n_2}}\, h_{n_1}^{(2)}(\hat{b})\, \frac{\partial g_{n_2}(\hat{c})}{\partial c_m} \qquad (223)$$

$$\frac{1}{2}\,\hat{a}\,\ell_1\ell_2 \sum_{n_1,n_2} \frac{\partial^2 q}{\partial f_{n_1}\,\partial f_{n_2}}\, h_{n_1}^{(2)}(\hat{b})\, h_{n_2}(\hat{b}) + \tfrac{1}{2}\rho \qquad (224)$$

and

$$\frac{1}{2}\,\hat{a}^2\ell_1\ell_2 \sum_{n_1,n_2} \frac{\partial^2 q}{\partial f_{n_1}\,\partial f_{n_2}}\, h_{n_1}^{(2)}(\hat{b})\, h_{n_2}^{(1)}(\hat{b}) + \tfrac{1}{2}\sigma \qquad (225)$$
Study of these coefficients reveals that they are relatively simple expressions of the one-component solution (ĉᵀ â b̂)ᵀ. Furthermore, Eqs. (224) and (225) both contain a second term which is by definition much smaller than the first term and can in practice be left out. This is illustrated in the following examples.

Example VI.9 (Coefficients of Cubic Terms for the Normal Log-Likelihood Function and an Exponential Model) For the log-likelihood function of Eq. (147) and the exponential model with background described by Eq. (148), the coefficients Eqs. (223)–(225) become

$$-\frac{\hat{a}\,\ell_1\ell_2}{2\sigma_w^2}\sum_n x_n^2 \exp(-\hat{b}x_n) \quad\text{and}\quad -\frac{\hat{a}\,\ell_1\ell_2}{2\sigma_w^2}\sum_n x_n^3 \exp(-\hat{b}x_n) \qquad (226)$$

for m = 1, 2, respectively,

$$-\frac{\hat{a}\,\ell_1\ell_2}{2\sigma_w^2}\sum_n x_n^2 \exp(-2\hat{b}x_n) \qquad (227)$$

and

$$-\frac{\hat{a}\,\ell_1\ell_2}{2\sigma_w^2}\sum_n x_n^3 \exp(-2\hat{b}x_n) \qquad (228)$$
These coefficients are stable since the one-component solution (ĉ₁ . . . ĉ_M â b̂)ᵀ is stable in the sense that it does not differ very much from the one-component solution for exact observations.

Example VI.10 (Coefficients of Cubic Terms for the Poisson Log-Likelihood Function and a Bi-Gaussian Model) For the log-likelihood function Eq. (150) and the bi-Gaussian model described by Eqs. (151) and (152), the coefficient of the terms in cₘb₂†² is, of course, equal to zero since there is no background function. The expressions for the coefficients of the terms in ab₂†² and b₁†b₂†² are
as follows:

$$-\frac{\ell_1\ell_2}{\hat{a}}\sum_n w_n\bigl\{(x_n - \hat{b})^2 - 1\bigr\} \qquad (229)$$

and

$$-\frac{\ell_1\ell_2}{\hat{a}}\sum_n w_n\bigl\{(x_n - \hat{b})^2 - 1\bigr\}(x_n - \hat{b}) \qquad (230)$$
if the terms ½ρ and ½σ are left out. These are again stable coefficients.

The results of this subsection may be summarized as follows. Expressions have been derived for the coefficients of the cubic terms of the Taylor representation in the coordinates t′ = (t₁′ . . . t′_{M+2} t′_{M+3})ᵀ. These are the coordinates after full diagonalization of the Hessian matrix, and t′_{M+3} is the essential variable b₂′ which, in its turn, is equal to b₁ − b₂. The coordinates t′ are a linear transformation of the coordinates t† = (t₁† . . . t†_{M+3})ᵀ = (c₁ . . . c_M a ℓ₁b₁ + ℓ₂b₂ b₁ − b₂)ᵀ introduced earlier to partly diagonalize the Hessian matrix in order to show that b₁ − b₂ is essential. The coefficient of b₂′³ is the quantity σ/3! described by Eq. (218). This equation shows that σ is a linear combination of the quantities ∂q/∂fₙ discussed earlier. Therefore, σ may have a different sign for only slightly differing sets of observations. This behavior differs essentially from that of the coefficients of the cubic terms in tₘ′b₂′², m = 1, . . . , M + 2, to be used in the next subsection. These coefficients are linear combinations of the coefficients of the terms in tₘ†b₂′² since t′ is a linear transformation of t†. It has been shown that the latter coefficients are stable in the sense that they do not change very much and keep the same sign if the observations change. Therefore, the coefficients of tₘ′b₂′², m = 1, . . . , M + 2, are stable in the same sense, a fact that will be employed in the next subsection, which deals with the quartic terms. Notice that if in Eqs. (224) and (225) the terms ρ and σ are left out, Eqs. (223)–(225) can be compactly expressed as

$$\frac{1}{2}\,\hat{a}\,\ell_1\ell_2 \sum_{n_1,n_2} \frac{\partial^2 q}{\partial f_{n_1}\,\partial f_{n_2}}\, h_{n_1}^{(2)}(\hat{b})\, \frac{\partial f_{n_2}(\hat{t}\,)}{\partial t_m}, \qquad m = 1, \ldots, M + 2 \qquad (231)$$
where t̂ = (ĉ₁ . . . ĉ_M â b̂ b̂)ᵀ.

3. The Quartic Term

In this subsection, the quartic term of the degenerate part of the representation of the log-likelihood function in the neighborhood of the one-component stationary point is computed. For this purpose, use is made of the Reduction Algorithm described in Section V.D, of Eq. (110) in particular. This equation
shows that the coefficient τ/4! of the quartic term of the degenerate part after the removal of the inessential cubic terms is equal to the difference of the corresponding coefficient before the removal and a linear combination of the squares of the coefficients of a selection of the cubic terms. In the notation of the preceding subsection, this is the difference of the coefficient of b₂′⁴ and a linear combination of the squares of the coefficients of the cubic terms in tₘ′b₂′², m = 1, . . . , M + 2. Equations (223)–(225) describe these coefficients. Therefore, all that is left to be done is to compute the coefficient of b₂′⁴ before the removal of the pertinent cubic terms. This computation is highly similar to that of the coefficient σ described in Section VI.F.2 and will, therefore, be left out. The result is

$$\frac{\psi}{4!} = \frac{1}{4!}\left[\,12\,\hat{a}^2\ell_1^2\ell_2^2 \sum_{n_1,n_2} \frac{\partial^2 q}{\partial f_{n_1}\,\partial f_{n_2}}\, h_{n_1}^{(2)}(\hat{b})\, h_{n_2}^{(2)}(\hat{b}) + \hat{a}\,\ell_1\ell_2\bigl(\ell_1^3 + \ell_2^3\bigr)\sum_n \frac{\partial q}{\partial f_n}\, h_n^{(4)}(\hat{b})\right] \qquad (232)$$
The first term of this expression is a stable term not changing very much under the influence of small changes in the observations. The second term, on the other hand, is a linear combination of ∂q/∂fₙ, n = 1, . . . , N, and, as before, may be left out. This produces

$$\frac{\psi}{4!} \approx \frac{1}{2}\,\hat{a}^2\ell_1^2\ell_2^2 \sum_{n_1,n_2} \frac{\partial^2 q}{\partial f_{n_1}\,\partial f_{n_2}}\, h_{n_1}^{(2)}(\hat{b})\, h_{n_2}^{(2)}(\hat{b}) \qquad (233)$$
It is observed that Expr. (232) is the quantity ψ/4! used in Eq. (110) in Section V.D.2. The quantities ϕₘ, m = 1, . . . , M + 2, appearing in the same expression, are the coefficients of tₘ′b₂′², m = 1, . . . , M + 2, discussed in the preceding subsection. Then, by Eq. (110), the coefficient of b₂′⁴ after removal of the inessential terms becomes τ/4! with

$$\tau = \psi - 6\,\varphi^{\mathrm T}\,\mathrm{diag}\bigl(\delta_1^{-1} \ldots \delta_{M+2}^{-1}\bigr)\,\varphi \qquad (234)$$

It is straightforward to show that this is equal to

$$\tau = \psi - 6\,\varphi'^{\,\mathrm T} G^{-1} \varphi' \qquad (235)$$
where the elements of ϕ′ = (ϕ₁′ . . . ϕ′_{M+2})ᵀ are defined by Exprs. (223)–(225), respectively, and G is the Hessian matrix of the log-likelihood function for the one-component model. In other words, the coefficient of b₂′⁴ after removal of the nonessential cubic terms is equal to the difference of two terms. The first term is equal to the quantity ψ/4! described by Expr. (232). Expression (233) shows that it is approximately a quadratic form in second-order derivatives of
the one-component function with respect to its location parameter and evaluated at the maximum of the likelihood function for the one-component model. The second term is merely a correction of the coefficient ψ/4! for the introduction of the curvilinear coordinates removing the nonessential cubic terms. The term is seen to be a quadratic form in the coefficients of the cubic terms described by Eqs. (223)–(225). All these quantities are stable in the sense that they do not differ very much from their values for hypothetical exact observations. Typically, τ/4! is relatively insensitive to errors in the sense that its standard deviation is much smaller than the absolute value of its expectation and, therefore, it will be supposed to always have the same sign. In this respect, the coefficient τ/4! of b₂′⁴ = (b₁ − b₂)⁴ is essentially different from the coefficients ρ/2! and σ/3! which, being linear combinations of the derivatives ∂q/∂fₙ, n = 1, . . . , N, easily change sign if statistical fluctuations in the observations or systematic errors are present. Example VI.11 will present a convincing illustration of this behavior. Since the coefficient τ/4! is the first coefficient of the Taylor expansion of the degenerate part of the representation of the log-likelihood function about the one-component stationary point that is not supposed to vanish under the influence of errors, the degenerate part is supposed to be fully represented by

$$\frac{1}{2!}\,\rho\,\Delta^2 + \frac{1}{3!}\,\sigma\,\Delta^3 + \frac{1}{4!}\,\tau\,\Delta^4 \qquad (236)$$
where, for simplicity, Δ has been substituted for b₂′ = b₁ − b₂. If, in this expression, the variable is transformed by the Tschirnhaus transformation (Poston and Stewart, 1978),

$$\Delta' = \Delta + \frac{1}{4}\,\frac{\sigma/3!}{\tau/4!} \qquad (237)$$
the resulting polynomial in Δ′ will be different from Eq. (236) in two respects: the cubic term will vanish and a linear one will appear. This new polynomial is easily recognized as the dual canonical cusp catastrophe discussed earlier in Example V.2 in Section V.C.2. In this article, however, Expr. (236) is preferred to the canonical cusp since in the former representation the origin is the one-component stationary point. This makes the analysis simpler.

Example VI.11 (Statistics of Quadratic, Cubic, and Quartic Terms) The purpose of this example is to illustrate the difference in statistical behavior of the coefficients ρ/2! and σ/3!, on the one hand, and the coefficient τ/4!, on the other. Suppose that the observations available are the Poisson distributed observations with bi-Gaussian expectations described in Examples VI.2 and VI.6. Furthermore, suppose that the coefficient α is equal to 2500. This means that the peak value of the observed bi-Gaussian corresponds on the average to approximately 2500 counts with a standard deviation of 50 counts. For this value of α, the observations are generated 100 times and each time the maximum
likelihood estimates â and b̂ of the parameters α and β of the one-component model

$$\alpha \exp\bigl\{-\tfrac{1}{2}(x_n - \beta)^2\bigr\} \qquad (238)$$
are computed. From these values of â and b̂ the coefficients ρ/2!, σ/3!, and τ/4! are computed by use of Eqs. (184), (218), and (235), respectively. The histograms of the results for each of these coefficients are shown in Figure 11. They show that the coefficients ρ/2! and σ/3! are approximately distributed about zero with a standard deviation of about 10 and 3, respectively. The coefficient τ/4!, on the other hand, is distributed around −145 with a standard deviation of approximately 3.

G. Summary and Conclusions

In this section, the generalized two-component model has been introduced. It is described by

$$f(x; \theta) = g(x; \gamma) + \alpha\bigl[\lambda_1 h(x; \beta_1) + \lambda_2 h(x; \beta_2)\bigr] \qquad (239)$$
Figure 11. Histograms of the coefficients ρ/2! (a), σ/3! (b), and τ/4! (c) in Example VI.11. The quantities ρ and σ are distributed about zero but τ is always negative.
In this expression, the component function h(x; βₖ) is a function of x and is nonlinearly parametric in βₖ. For example, h(x; βₖ) may be the model of a spectral peak as a function of frequency or a point spread function as a function of a spatial coordinate located at βₖ, or it may be exponential with decay constant βₖ. The amplitude of the component h(x; βₖ) is the fraction λₖ of the parameter α. The background function g(x; γ) is parametric in the elements of the vector γ = (γ₁ . . . γ_M)ᵀ. Its purpose is to model contributions often occurring in practice such as trends or further components h(x; βₖ), k = 3, . . . . The elements of the (M + 3) × 1 parameter vector θ are α, β₁, β₂, and the elements of γ:

$$\theta = (\gamma^{\mathrm T} \;\; \alpha \;\; \beta_1 \;\; \beta_2)^{\mathrm T} \qquad (240)$$
Throughout, it has been assumed that observations w₁, . . . , w_N are made at the exactly known measurement points x₁, . . . , x_N, respectively. It has also been assumed that the observations wₙ fluctuate about the corresponding model values f(xₙ; θ) if the experiment is repeated, that is,

$$E[w_n] = f(x_n; \theta) \qquad (241)$$
and that the parameters θ enter the probability density function of the observations via these expectations. The log-likelihood function q(t) corresponding to this probability density function has also been introduced, where the elements of the (M + 3) × 1 parameter vector t correspond to those of the vector of hypothetical true parameters θ. In addition, the structure, that is, the pattern of stationary points, of the log-likelihood function has been analyzed. This analysis has shown that in addition to the absolute maximum representing the maximum likelihood solution, the log-likelihood function has a further, usually relative, maximum. The analysis has also revealed the presence of a stationary point representing the solution for the parameters of a one-component model of the same nonlinearly parametric family from the same observations. This stationary point is called the one-component stationary point. It has also been shown that under the influence of systematic errors or of statistical errors in the observations the likelihood function may become degenerate in or in the neighborhood of the one-component stationary point. The variable in which the log-likelihood function becomes degenerate is called the essential variable. Since degeneracy may change the structure of the log-likelihood function, it may change the solution. To investigate this, we have computed the degenerate part of the likelihood function in the form of its Taylor expansion in the essential variable up to and including the first term that cannot vanish under the influence of the observations. This degenerate part
will be employed in Section VII so that we can investigate structural changes of the likelihood function and their consequences for the resolution of the components.

VII. Singularity and Resolution

A. Introduction

In this section, in Section VII.B.1, the polynomial representation of the log-likelihood function derived in Section VI is analyzed. The most important conclusion from this analysis is that there exist two types of sets of observations: sets of observations producing distinct solutions for the location parameters and sets of observations producing coinciding solutions. From the latter observations, the parameters, of course, cannot be resolved. A criterion is derived showing to which type a given set of observations belongs. Then, in Section VII.B.2, the numerical computation of the criterion is explained and, in Section VII.B.3, the conditions for validity of the criterion are discussed. It is concluded that the criterion is remarkably general and simple. In a numerical example in Section VII.B.4, the criterion is applied to Poisson distributed observations made on a pair of overlapping sinc-square functions, which is the model originally used by Rayleigh to define his resolution limit. The results of Section VII.B.4 demonstrate that with statistical observations, resolving occurs with a certain probability only: the probability of resolution. Various aspects of this concept are investigated. In particular, in Sections VII.B.6 and VII.B.7, it is emphasized that the probability of resolution may also be seriously influenced by systematic errors. The purpose of Section VII.C is to show that the description of resolution presented in Section VII.B may easily be extended to components in two dimensions. In a numerical example, this is demonstrated by applying these results to the most important representative of the two-dimensional component functions in optics: the Airy function. Section VII.D is devoted to a discussion of the resolution criterion in the presence of coherence. Related literature is briefly surveyed in Section VII.E. Conclusions are drawn in Section VII.F.
B. Coinciding Scalar Locations

1. Conditions for Resolution

In Section VI.F.1, it has been shown that the nondegenerate part of the log-likelihood function in the neighborhood of the one-component stationary point
is described by

$$\frac{1}{2!}\sum_{m=1}^{M+2} \delta_m\, t_m'^{\,2} \qquad (242)$$
The coefficients δₘ in this expression are the eigenvalues of the Hessian matrix at the maximum of the log-likelihood function of the parameters of the one-component model. Since this maximum is supposed to be nonsingular, these coefficients are supposed to be strictly negative. The degenerate part of the likelihood function around the same point has been derived in Sections VI.F.1 through VI.F.3 and is described by Expr. (236):

$$\frac{1}{2!}\,\rho\,\Delta^2 + \frac{1}{3!}\,\sigma\,\Delta^3 + \frac{1}{4!}\,\tau\,\Delta^4 \qquad (243)$$
where Δ = b₁ − b₂ with b₁ and b₂ the locations of the components. The quantities ρ, σ, and τ in Eq. (243) are defined in Section VI by Eqs. (184), (218), and (235), respectively. In the same section, convincing reasons have been given to assume that, as opposed to the sign of τ, the sign of ρ and σ depends on the particular set of observations. Together, Eqs. (242) and (243) describe the possible structures of the log-likelihood function in the neighborhood of the one-component stationary point. In any case, these structures must include the two-maxima, one saddle point structure. A simple analysis then shows that τ must be negative. The stationary points of the degenerate part Expr. (243) satisfy

$$\rho\,\Delta + \tfrac{1}{2}\sigma\,\Delta^2 + \tfrac{1}{6}\tau\,\Delta^3 = 0 \qquad (244)$$
The second-order derivative of Expr. (243) at the stationary point Δ = 0 is equal to ρ. Therefore, if ρ is positive, this point is a minimum, if it is negative it is a maximum, and the origin is a degenerate stationary point if ρ is equal to zero. The remaining stationary points are the roots of

$$\rho + \tfrac{1}{2}\sigma\,\Delta + \tfrac{1}{6}\tau\,\Delta^2 \qquad (245)$$
Therefore, they are located at

$$\frac{3}{2}\,\sigma' \pm \frac{3}{2}\left(\sigma'^{\,2} + \frac{8}{3}\,\rho'\right)^{1/2} \qquad (246)$$
with ρ′ = −ρ/τ and σ′ = −σ/τ, where the minus signs ensure that ρ′ and σ′ have the same sign as ρ and σ, respectively. Since the roots must be real, there are three stationary points if the discriminant satisfies

$$\sigma'^{\,2} + \frac{8}{3}\,\rho' > 0 \qquad (247)$$
Otherwise, the origin is the only stationary point. Because ρ′ = −ρ/τ, σ′ = −σ/τ, and τ is negative,

$$\rho > 0 \qquad (248)$$
is an accurate approximation to the condition Ineq. (247) if |τ| ≫ |σ|, which, in practice, is typically true. Equation (218) shows that σ and, therefore, σ′ is equal to zero if ℓ is equal to 0.5. Then Ineqs. (247) and (248) are identical.

The simple condition Ineq. (248) is the key result of this article. If it is true, the origin is a minimum of the degenerate part and there are two additional stationary points. Since τ is negative, these must be maxima. Since the δₘ in the nondegenerate part of the log-likelihood function, described by Expr. (242), are all strictly negative, the origin is a saddle point of the log-likelihood function and both other stationary points are maxima with Δ ≠ 0. Of these, the absolute maximum represents the maximum likelihood solution. If Ineq. (248) is not true, the origin is a maximum and, since it is the only stationary point, this maximum is absolute and represents the maximum likelihood solution. In the special case that ρ is equal to zero, this maximum is degenerate. These considerations imply that the log-likelihood function has a maximum with Δ = b₁ − b₂ ≠ 0 if Ineq. (248) is true. Then the solutions for β₁ and β₂ are clearly distinct. On the other hand, if Ineq. (248) is not true, the origin, which is the one-component stationary point, is the absolute maximum. Then the solution is located at the origin, that is, Δ = b₁ − b₂ = 0 and, therefore, the solutions for β₁ and β₂ coincide exactly.

Conclusion VII.1 If ρ > 0 the parameters β₁ and β₂ can be resolved from the available observations. Otherwise, they cannot. Thus the sign of ρ is a criterion for the resolvability of the parameters.

2. Numerical Computation of the Resolution Criterion

To see how the criterion ρ is computed numerically for a given set of observations, we first repeat its definition by Eq. (184):

$$\rho = \hat{a}\,\ell(1 - \ell)\sum_n \frac{\partial q}{\partial f_n}\, h_n^{(2)}(\hat{b}) \qquad (249)$$
In this expression, q is the log-likelihood function, fₙ = fₙ(t) is the model fitted as a function of its parameters t = (c₁ . . . c_M a b₁ b₂)ᵀ, and hₙ(b) is the component function. Expression (249) is evaluated at the one-component stationary point t̂ = (ĉ₁ . . . ĉ_M â b̂ b̂)ᵀ, where (ĉ₁ . . . ĉ_M â b̂)ᵀ is the maximum likelihood estimate of the parameters of the one-component model

$$g_n(c) + a\,h_n(b) \qquad (250)$$
from the same observations. These considerations show that for a given set of observations, ρ may be computed as follows. First, the maximum likelihood solution (ĉ₁ . . . ĉ_M â b̂)ᵀ for the parameters (γ₁ . . . γ_M α β)ᵀ of the one-component model is computed. Then this solution, substituted in the expression Expr. (249), produces a numerical value for ρ.

3. Conditions for the Use of the Criterion

In this subsection, the conditions for the use of ρ as the criterion for resolvability of the parameters β₁ and β₂ are investigated. First, consider the approximation of Ineq. (247) by Ineq. (248). It is seen that both inequalities are identical if σ′ = 0. Equation (218) reveals that this is so if ℓ = 0.5. This means that the heights or amplitudes of the components are equal. Therefore, in this important case, the sign of ρ decides exactly whether the parameters of the two-component model may be resolved from the available observations. Furthermore, Eqs. (184) and (218) show that both ρ and σ are linear combinations of the quantities ∂q/∂fₙ, n = 1, . . . , N, which are stochastic variables with an expectation close to zero if β₁ and β₂ get close. This has been explained in Sections VI.F.1 and VI.F.2. They are, therefore, typically absolutely small as compared with τ, which, as Eqs. (234) and (235) and Eqs. (223)–(225) show, is dominated by definite quadratic forms in the derivatives of the component function in the point b̂.

Finally, the dependence of ρ, σ, and τ on the value of ℓ is addressed. The pertinent expressions reveal that these coefficients are equal to zero if ℓ is equal to zero or, equivalently, 1 − ℓ is equal to one. This justifies an analysis of these coefficients for ℓ approaching zero. Simple considerations then show that if ℓ → 0, ρ and σ become proportional to ℓ. The expressions Eqs. (232), (234), and (223)–(225) show that under the same condition the coefficient τ becomes approximately

$$\frac{1}{4!}\,\hat{a}\,\ell\sum_n \frac{\partial q}{\partial f_n}\, h_n^{(4)}(\hat{b}) \qquad (251)$$

since the other terms contributing to τ are, under the conditions concerned, proportional to ℓ². It is observed that Expr. (251) is a linear combination of the derivatives ∂q/∂fₙ, n = 1, . . . , N. Therefore, it could, like ρ and σ, change sign and become positive. This would violate the assumption that τ is strictly negative. Note, however, that this can occur only if ℓ² ≪ ℓ, that is, if ℓ ≪ 1. This is illustrated by the following example.

Example VII.1 (Validity of the Coefficient ρ as Discriminant under Extreme Conditions) In Example VI.11, histograms of ρ/2!, σ/3!, and τ/4! are
computed for a bi-Gaussian two-component model with Poisson distributed observations. The value of λ is 0.2 and ℓ is chosen accordingly. In this example, the simulations concerned are repeated but now with ℓ = 0.02 instead of 0.2. Under the experimental conditions chosen, this implies that in all measurement points the weakest component is equal to the standard deviation of the fluctuations in the observations or is smaller. The results of this simulation are as follows. In all of the 100 experiments, the value of τ was negative. Moreover, the sign of the discriminant Eq. (247) and that of ρ were the same in 99 experiments. In one experiment, producing comparatively small values for ρ and for the discriminant, the signs were different. Another 100 experiments produced similar results.

Conclusion VII.2 The preceding considerations justify the use of the simple discriminant

$$D = \sum_n \frac{\partial q}{\partial f_n}\, h_n^{(2)}(\hat{b}) \qquad (252)$$
to decide on the resolvability of the components from the observations. Under exceptional conditions, such as those in Example VII.1, the validity of D as discriminant can always be checked by computing the coefficients σ and τ in addition to the coefficient ρ and subsequently comparing the result of Expr. (247) with that of Expr. (248) and checking the sign of τ.

The resolution discriminant D has the following remarkably general and simple properties:

• It is valid for all component functions h(x; βₖ) and all background functions g(x; γ). Therefore, it is applicable not only to point spread functions and spectroscopic peaks but to exponential functions with unknown decay and sinusoids of unknown frequency as well.
• It is valid for all log-likelihood functions q. This means that the log-likelihood function used need not correspond to the probability density function of the observations.
• For a particular component function and background function, it depends only on the observations, because it is computed from the observations and the estimates â and b̂ of the parameters of the one-component model and the estimates ĉ of the parameters of the background function. However, these estimates have been computed from the same observations.
• It does not depend on the parameter ℓ.
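Because D is a function of the observations and the one-component fit only, it is straightforward to evaluate numerically. The sketch below does so for Poisson distributed observations and a Gaussian component function without background. It is only an illustration under stated assumptions: the one-component maximum likelihood fit is performed with a general-purpose optimizer (the text does not prescribe a fitting routine), and the measurement grid, random seed, and bi-Gaussian test expectations, patterned on Eq. (264), are hypothetical choices.

```python
import numpy as np
from scipy.optimize import minimize

x = -2.5 + 0.5 * np.arange(11) + 0.025             # hypothetical measurement points

def h(b):                                          # Gaussian component h_n(b)
    return np.exp(-0.5 * (x - b) ** 2)

def h2(b):                                         # h_n^(2)(b) = [(x_n - b)^2 - 1] h_n(b)
    return ((x - b) ** 2 - 1.0) * h(b)

def discriminant(w):
    """D of Eq. (252) for the Poisson log-likelihood, whose score is
    dq/df_n = w_n/f_n - 1 as in Eq. (221)."""
    def nll(p):                                    # negative Poisson log-likelihood, constants dropped
        f = np.maximum(p[0] * h(p[1]), 1e-12)
        return -np.sum(w * np.log(f) - f)
    a_hat, b_hat = minimize(nll, x0=[w.max(), x[np.argmax(w)]],
                            method="Nelder-Mead").x
    f = a_hat * h(b_hat)
    return np.sum((w / f - 1.0) * h2(b_hat))

rng = np.random.default_rng(0)
mean = 2500 * (0.7 * h(0.0) + 0.3 * h(0.15))       # two closely spaced Gaussian components
w = rng.poisson(mean)
print("resolvable" if discriminant(w) > 0 else "solutions coincide")
```

Only the sign of D carries the decision, so the positive factor âℓ(1 − ℓ) distinguishing ρ from D may be ignored.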
4. Application to Rayleigh's sinc-Square Model

In this subsection, the proposed resolution discriminant is applied to observations with an expectation described by a pair of overlapping sinc-square functions. This function is important for the purposes of this article since it is the function used by Rayleigh to define optical two-point resolution (Rayleigh, 1879).

Example VII.2 (Probability of Resolving the sinc-Square Components of the Rayleigh Model) In this example, the probability of resolving the components of the Rayleigh sinc-square model is studied as a function of the standard deviation of the observations and the distance of the components. It is assumed that the expectations of the observations w₁, . . . , w_N are described by

$$\alpha\bigl[\lambda\,\mathrm{sinc}^2\{2\pi(x_n - \beta_1)\} + (1 - \lambda)\,\mathrm{sinc}^2\{2\pi(x_n - \beta_2)\}\bigr] \qquad (253)$$
where sinc(2πx) = sin(2πx)/(2πx). The parameter λ is chosen equal to 0.5. Therefore, the heights of the peaks are equal. Three different choices of β₁ and β₂ are considered: β₁ = −β₂ = 0, β₁ = −β₂ = 0.0125, and β₁ = −β₂ = 0.025. Then, the distances Δβ of the peaks are 0, 0.025, and 0.05, respectively. The measurement points have been taken as xₙ = −1.45, −1.25, . . . , 1.55. Figure 12 shows the function and the location of the measurement points for Δβ = 0.05. The observations w₁, . . . , w_N are assumed to be independent and Poisson distributed. For each of the Δβ and a number of different values of α, 300 sets of observations thus distributed are generated. For each of these sets, the values â and b̂ of the parameters a and b of the one-component model

$$a\,\mathrm{sinc}^2[2\pi(x_n - b)] \qquad (254)$$
are computed that maximize the Poisson likelihood function concerned. Subsequently, these values are substituted in the expression Eq. (249) for ρ and the number of negative results is counted. This is the number of times the components cannot be resolved from the available observations since they coincide. These numbers are shown as frequencies in Figure 13, giving an impression of the probability of unresolved peaks. This is done as a function of the parameter α, which controls the expectation of the number of Poisson counts in any point. The values of α chosen were 225, 450, 900, 1800, and 3600. Since the standard deviation of a Poisson stochastic variable with expectation α is equal to α^{1/2}, the relative standard deviations corresponding to these values are equal to 6.6, 4.7, 3.3, 2.4, and 1.6%, respectively. Hence, since the expectations described by Eq. (253) are all smaller than or equal to α, these relative standard deviations are a lower bound on the relative standard deviations in any point.
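A minimal sketch of one run of this simulation follows, under the assumptions that the one-component Poisson fit of Eq. (254) is carried out with a general-purpose optimizer and that the second derivative of the sinc-square component is approximated by central differences; the seed and other implementation details are illustrative, not the authors'.

```python
import numpy as np
from scipy.optimize import minimize

x = np.arange(-1.45, 1.56, 0.2)                    # measurement points of Example VII.2

def sinc2(u):                                      # sinc^2(2*pi*u); np.sinc(t) = sin(pi t)/(pi t)
    return np.sinc(2.0 * u) ** 2

def d2_sinc2(u, eps=1e-4):                         # numerical second derivative of sinc^2
    return (sinc2(u - eps) - 2 * sinc2(u) + sinc2(u + eps)) / eps ** 2

def rho_sign(w):
    """Sign of rho, Eq. (249), up to the positive factor a-hat*l*(1-l)."""
    def nll(p):                                    # negative Poisson log-likelihood, constants dropped
        f = np.maximum(p[0] * sinc2(x - p[1]), 1e-12)
        return -np.sum(w * np.log(f) - f)
    a_hat, b_hat = minimize(nll, x0=[w.max(), 0.0], method="Nelder-Mead").x
    f = np.maximum(a_hat * sinc2(x - b_hat), 1e-12)
    return np.sum((w / f - 1.0) * d2_sinc2(x - b_hat))

rng = np.random.default_rng(1)
beta = 0.025                                       # half-distance of the peaks, so delta-beta = 0.05
for alpha in (225, 450, 900, 1800, 3600):
    mean = alpha * 0.5 * (sinc2(x - beta) + sinc2(x + beta))
    unresolved = sum(rho_sign(rng.poisson(mean)) <= 0 for _ in range(300))
    print(alpha, round(100 * unresolved / 300, 1), "% coinciding")
```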
Figure 12. Measurement points and expectations of the observations in Example VII.2. The expectations are the sum of two overlapping sinc-square functions.
Figure 13. Percentage of coinciding solutions for the locations of the sinc-square functions of Example VII.2 as a function of the expected number of counts for three different distances of the sinc-square functions.
These considerations show that the results of Figure 13 correspond to observations that are, from left to right, increasingly precise. From the results shown in Figure 13, the following conclusions may be drawn:

• First, for Δβ = 0, the model defined by Eq. (253) is a one-component model. Therefore, the stars in Figure 13 represent the frequency of occurrence of a one-component solution for one-component observations. Figure 13 shows that this frequency is approximately 50% for all values of α. This also implies that the probability that a two-component model will be obtained from the one-component observations is about 50%. This phenomenon will be addressed later in Example VII.8.
• Furthermore, for Δβ = 0.025, the frequency of nonresolution decreases from about 50% to 25% for an increasing number of counts. That is, for increasingly precise observations, the probability of resolution increases.
• Finally, for Δβ = 0.05, that is, for components more widely separated than before, the frequency of unresolved components decreases from about 30% to 0% as the number of counts increases. Comparison to the results for Δβ = 0.025 shows that, from equally precise observations, the components are more likely to be resolved if they are more widely apart in the model of the expectations. The frequency of 0% reflects the fact that the components are nearly always resolved if the observations are sufficiently precise and the components of the expectations are sufficiently widely apart.
Example VII.2 clearly shows the fundamental differences between the classical Rayleigh resolution criterion and the resolution criterion proposed in this article. From the point of view presented here, the Rayleigh resolution criterion is based on expectations of observations, not on observations themselves. It is a distance measured in terms of widths of the peaks. In the criterion presented here, the distance of the peaks in the expectation model only influences the probability of resolution. As the components of the model of the expectations get closer, a high probability of resolution requires increasingly precise observations. With hypothetical ideal observations, being equal to their expectations as in Rayleigh's considerations, the criterion presented here would, correctly, always classify the peaks as resolvable no matter how small their distance.

5. Resolution as a Property of Observations

The simple resolution discriminant D described by Eq. (252) is purely operational: from the observations the parameters â and b̂ of the one-component model are estimated, the resulting estimates are substituted in the expression for D, and from the sign of D it is decided whether the parameters of the
two-component model can be resolved. Thus resolvability is connected to the sign of the scalar D. In this subsection, however, the resolvability is directly connected to the observations. This will be done by use of bifurcation sets, which were discussed earlier in Section V.C.3. In singularity theory, bifurcation sets are the subsets of the Euclidean space of the function parameters where structural changes of the function occur. In the measurement problems discussed in this article, the observations play the role that the parameters play in singularity theory. Therefore, here the bifurcation sets are the subsets of the Euclidean space of the observations where the changes of structure of the likelihood function occur. The purpose of this subsection is to show how these bifurcation sets can be found and how they relatively simply separate observations from which the parameters can be resolved from those from which they cannot. To belong to the bifurcation set, a set of observations w = (w₁ . . . w_N)ᵀ must satisfy the equation D = 0, or

$$\sum_n \frac{\partial q}{\partial f_n}\, h_n^{(2)}(\hat{b}) = 0 \qquad (255)$$
Furthermore, since this equation has to be evaluated at the one-component stationary point t̂ = (ĉᵀ â b̂ b̂)ᵀ,

$$\sum_n \frac{\partial q}{\partial f_n}\,\frac{\partial f_n}{\partial t} = 0 \qquad (256)$$

or

$$\sum_n \frac{\partial q}{\partial f_n}\,\frac{\partial g_n}{\partial c_m} = 0, \qquad m = 1, \ldots, M \qquad (257)$$

$$\sum_n \frac{\partial q}{\partial f_n}\, h_n(\hat{b}) = 0 \qquad (258)$$

and

$$\sum_n \frac{\partial q}{\partial f_n}\, h_n^{(1)}(\hat{b}) = 0 \qquad (259)$$
It is observed that Eqs. (255) and (257)–(259) constitute M + 3 equations in the N + M + 2 variables w₁, . . . , w_N, ĉ₁, . . . , ĉ_M, â, b̂. This means that one equation in the observations w₁, . . . , w_N results after hypothetical elimination of the M + 2 variables ĉ₁, . . . , ĉ_M, â, b̂. This single equation represents the bifurcation set in the Euclidean N-space of the observations. Generally, the number of equations specifying an object is called codimension (Poston and Stewart, 1978; Saunders, 1980). The dimension of the object is the difference
of the dimension of the space in which it is located and the codimension. Therefore, the bifurcation set has codimension 1 and dimension N − 1. Objects of codimension 1 have the interesting property that they divide the space in which they are located into two parts. For example, a point divides a line into two parts and so does a plane divide the Euclidean 3-space.

Conclusion VII.3 The bifurcation set divides the Euclidean space of the observations into two parts. The points w = (w₁ . . . w_N)ᵀ in the one part represent sets of observations from which the parameters can be resolved. The points of the other part represent sets of observations from which the parameters cannot be resolved because the corresponding solutions exactly coincide. The bifurcation set is the border of both types of observations. The bifurcation set therefore represents the limit to resolution in terms of the observations.

Example VII.3 (Visualizing the Bifurcation Set) If there are N observations, the dimension of the bifurcation set is N − 1 and, therefore, it cannot usually be visualized. Instead, in this example, the intersections of the bifurcation set with a number of the two-dimensional coordinate planes in the space of the observations are computed. The origin is taken as the expected values of the observations. This means that in the computation of the intersections, N − 2 of the N observations are chosen equal to their expectations. This constitutes N − 2 equations. The bifurcation set itself is defined by one equation. Therefore, the intersection has codimension N − 1 since it is defined by N − 1 equations. Then, in the N-space of the observations the intersection with the two-dimensional coordinate plane has dimension 1: it is a curve in the plane. In this example, these curves are computed numerically for the Poisson likelihood function and observations with overlapping bi-Gaussian expectations without the background function. This is done as follows. Suppose that the intersection with the (wₙ₁, wₙ₂)-plane has to be computed. Then the points of the intersection have to satisfy the equations

$$\sum_n \left(\frac{w_n}{f_n(\hat{t}\,)} - 1\right) h_n(\hat{b}) = 0 \qquad (260)$$

$$\sum_n \left(\frac{w_n}{f_n(\hat{t}\,)} - 1\right) h_n^{(1)}(\hat{b}) = 0 \qquad (261)$$

and
$$\sum_n \left(\frac{w_n}{f_n(\hat{t}\,)} - 1\right) h_n^{(2)}(\hat{b}) = 0 \qquad (262)$$
The first two equations correspond to Eqs. (258) and (259), respectively, and are easily recognized as the normal equations of the likelihood function with
respect to the parameters a and b of the one-component model. The last equation puts the discriminant equal to zero. It is observed that the three equations Eqs. (260)–(262) are linear in the observations and, therefore, in wₙ₁ and wₙ₂. This will be used in the computation of the intersection of the bifurcation set with the (wₙ₁, wₙ₂)-plane as follows. The function fₙ(t) is equal to ahₙ(b) and, therefore, fₙ(t̂) is equal to âhₙ(b̂). Substituting this in Eq. (260) yields

$$\hat{a} = \frac{\sum_n w_n}{\sum_n h_n(\hat{b})} \qquad (263)$$
Substituting this result in Eqs. (261) and (262) and substituting fₙ(θ) for all fₙ(t) with n different from n₁ and n₂ then yields two equations linear in wₙ₁ and wₙ₂ and nonlinear in b̂. Solving these equations in closed form for wₙ₁ and wₙ₂ for a number of selected values of b̂ then yields the corresponding points of the intersection of the bifurcation set with the (wₙ₁, wₙ₂)-plane. This procedure has been carried out for observations with expectations

$$f_n(\theta) = \alpha\bigl[0.7\exp\bigl\{-\tfrac{1}{2}(x_n - \beta_1)^2\bigr\} + 0.3\exp\bigl\{-\tfrac{1}{2}(x_n - \beta_2)^2\bigr\}\bigr] \qquad (264)$$
with xₙ = −2.5 + 0.5(n − 1) + 0.025, n = 1, . . . , 11, and β₁ = 0 and β₂ = 0.15. A sketch of the computation of these intersection curves is given below.
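The following sketch traces such an intersection numerically. It assumes, as the text does, the Poisson likelihood, no background, and α = 1; eliminating â through Eq. (263) leaves Eqs. (261) and (262) linear in the observations, so for each chosen b̂ a 2 × 2 linear system yields the point (wₙ₁, wₙ₂). The plane indices and the b̂ grid are illustrative choices.

```python
import numpy as np

x = -2.5 + 0.5 * np.arange(11) + 0.025             # measurement points of Example VII.3
f_theta = 0.7 * np.exp(-0.5 * x**2) + 0.3 * np.exp(-0.5 * (x - 0.15)**2)  # Eq. (264), alpha = 1

def intersection_point(b_hat, n1, n2):
    """Point (w_n1, w_n2) of the bifurcation set for a chosen location b_hat.

    With f_n(t-hat) = a-hat h_n(b-hat) and a-hat eliminated by Eq. (263),
    Eqs. (261) and (262) reduce to two equations linear in w; the w_n with
    n outside {n1, n2} are fixed at their expectations f_n(theta)."""
    u = x - b_hat
    h = np.exp(-0.5 * u**2)                        # Gaussian component and its derivative ratios:
    row1 = u - (u * h).sum() / h.sum()             # h^(1)/h = u, from Eq. (261)
    row2 = (u**2 - 1) - ((u**2 - 1) * h).sum() / h.sum()  # h^(2)/h = u^2 - 1, from Eq. (262)
    free = [n1, n2]
    fixed = np.setdiff1d(np.arange(x.size), free)
    A = np.array([row1[free], row2[free]])
    rhs = -np.array([row1[fixed] @ f_theta[fixed],
                     row2[fixed] @ f_theta[fixed]])
    return np.linalg.solve(A, rhs)

# curve in the (w_3, w_9)-plane, expressed as errors omega_n = w_n - f_n(theta)
b_grid = np.linspace(0.0, 0.15, 31)
curve = np.array([intersection_point(b, 2, 8) for b in b_grid]) - f_theta[[2, 8]]
```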
Figure 14. Measurement points and expectations of the observations in Example VII.3. The expectations are the sum of two overlapping Gaussian functions.
Figure 15. Intersections of the bifurcation set with the coordinate planes in Example VII.3.
Figure 14 shows the function and the location of the measurement points. For simplicity, in this example the parameter α has been taken equal to one. The equations defining the bifurcation set then show that results for other α are proportional. Figure 15 shows four different intersections of the bifurcation set with coordinate planes. The quantities ωₙ in this figure are defined as

$$\omega_n = w_n - f_n(\theta) \qquad (265)$$
which is the difference of the observation wₙ and its expectation. In this sense, ωₙ represents the error in the observation. As a result, sets of observations on the same side of the bifurcation set as the origin correspond to the two-maxima, one saddle structure of the likelihood function and, hence, to distinct solutions for the parameters. Points on the other side of the bifurcation set represent sets of observations for which the solutions for the parameters coincide. Figures 15a and 15b show the difference in sensitivity of the resolvability of the parameters to different ωₙ. In particular, the error ω₄ has strikingly little influence on the resolvability. This may be explained by the fact that at x₄ the second-order
derivative hₙ⁽²⁾(b̂) appearing in Eq. (262) is very small. On the other hand, it is observed in Figure 15c that an error ω₁₁ equal to −0.004 suffices to make resolution impossible. Sufficiently negative values for ω₃ and ω₉ have the same effect, as shown in Figure 15d.

The intersections shown in Figure 15 are strikingly linear. This is characteristic of bifurcation sets for one-dimensional two-component models in general. Illustrative examples dealing with the normal likelihood function and exponential models are presented in van den Bos (1981). This linearity substantially simplifies the computation of the probability of resolution for a particular distribution of the observations. This is the probability that a point w = (w₁ . . . w_N)ᵀ is on the side of the bifurcation set corresponding to distinct solutions for the nonlinear parameters. The linearity also simplifies the computation of the critical errors. The vector of critical errors is defined as the vector from the origin to the nearest point of the bifurcation set. Probability of resolution and critical errors are addressed in Section VII.B.7. First, in the next subsection, the behavior of the solutions for the parameters as a function of the observations is described.

6. Resolution from Error Disturbed Observations

This subsection addresses the solutions for the parameters as a function of the observations. In particular, the solutions will be studied if in the Euclidean space of the observations the point representing the set of observations approaches the bifurcation set and eventually crosses it. As before, this is illustrated in a numerical example.
Figure 16. (a) The solutions for the locations of the Gaussian components in Example VII.4 as a function of the error in the sixth measurement point. (b) The corresponding values of the resolution discriminant.
the value of the discriminant D computed from it is seen, in Figure 16b, to change sign for about the same value of ω₆ for which b̂₁ and b̂₂ coincide, as could be expected. Figure 16 shows that for ω₆ = 0.005, the log-likelihood function must have the two-maxima, one saddle point structure. On the other hand, if ω₆ = 0.025 the structure must be a one-maximum structure. Figure 17 shows the corresponding contours of the log-likelihood function. In Figure 17a, there are two maxima and a saddle point between them. The saddle point is the point of intersection on the middle contour. It is the one-component stationary point and is, therefore, located on the b₁ = b₂ axis. In Figure 17b, the one-component stationary point is a maximum and is the only stationary point.

Example VII.4 is somewhat artificial: all observations are assumed to be exactly equal to the bi-Gaussian function values with the exception of one observation, which contains a background contribution. However, it is illustrative in three different respects. First, the example shows that disregarding a seemingly insignificant background contribution of approximately 2% in only one of eleven observations may obstruct the resolving of two peaks at a distance of 0.15 half-width. Second, the example shows the importance of the sign of
Figure 17. Contours of the log-likelihood function in Example VII.4 for the error in the sixth measurement point equal to (a) 0.005 and (b) 0.025, respectively.
the background contribution. A background contribution ω₆ = 0.02 impedes resolution while ω₆ = −0.02 does not. Third and most important, the example shows how detrimental systematic errors may be to resolvability. By definition, a systematic error is a modeling error. This means that the model fitted and used for the computation of the bifurcation set on the one hand and the model of the expectations of the observations on the other are members of different nonlinearly parametric families of functions. Then the origin, which represents the expectations of the observations, can no longer represent exact observations of the parametric family of the model fitted. Therefore, it necessarily represents error corrupted observations made on such a model and may as such, in the space of observations, be located anywhere relative to the bifurcation set. This means that the origin may even be located in the region beyond the bifurcation set where the solutions for the location parameters coincide. Since, by definition, the observations are distributed around the origin, this may drastically change the fraction of them corresponding to distinct solutions and, therefore, the probability of resolution. This is one of the topics addressed in the next subsection.

7. Probability of Resolution and Critical Errors

The subjects of this subsection are probability of resolution and critical errors. As mentioned previously, the vector of critical errors is defined as the vector from the origin of the space of the observations to the nearest point of the bifurcation set. In this definition, the bifurcation set is the one corresponding to the chosen likelihood function and model fitted. In general, the critical errors are the smallest change of the exact observations causing structural change of the
likelihood function employed. This definition may be shown to imply that their computation is a nonlinear minimization problem under equality constraints. This problem has been solved in van den Bos and Swarte (1993) for the ordinary least squares criterion, which, as has been explained earlier in this article, is equivalent to the log-likelihood function for independent and identically normally distributed errors. The result is remarkably simple: in this special case and in the absence of systematic errors, a very accurate approximation of the critical errors is the additive inverse of the difference of the exact observations and the one-component model best fitting them. Unfortunately, such a simple solution has not been found for other distributions of the observations. However, bifurcation sets as functions of the observations are strikingly linear if the component function is one-dimensional, in the sense of a function of one variable only, as were all functions studied up to now. Linear functions of codimension 1 are called hyperplanes. For bifurcation hyperplanes, the critical errors are relatively easily computed from the intersection points with the coordinate axes, as will be demonstrated now.

Let ω̄₁, . . . , ω̄_N be the points of intersection of the bifurcation hyperplane with the N coordinate axes of the space of observations, with the expectations of the observations as origin. Then, it is not difficult to show that the bifurcation hyperplane is described by

$$\sum_n \nu_n\,\omega_n = 1 \qquad (266)$$
where νₙ = 1/ω̄ₙ. Simple geometrical arguments show that the coordinates ωₙ′ of the point of this bifurcation set nearest the origin are described by

$$\omega_n' = \frac{\nu_n}{\|\nu\|^2} \qquad (267)$$

where ‖ν‖ is the Euclidean norm of the vector ν = (ν₁ . . . ν_N)ᵀ defined as

$$\|\nu\| = \left(\sum_n \nu_n^2\right)^{1/2} \qquad (268)$$

The distance of the point ω′ from the origin is

$$\frac{1}{\|\nu\|} \qquad (269)$$
The ωₙ′ are the critical errors since, of all increments causing structural change, they have the smallest Euclidean norm. This norm is given by Expr. (269). Errors smaller in this sense cannot cause structural change.

Example VII.5 (Critical Errors for Resolving the Components of a Bi-Gaussian from Poisson Distributed Observations) Let the expectations of the
observations, the measurement points, and the likelihood function be those of Example VII.3. Then simple computations such as those of Example VII.3 show that the coordinates of the points of intersection of the bifurcation set with the coordinate axes are given by

$$\bar{\omega} = (-0.0036\;\; -0.0060\;\; -0.0140\;\; -0.1680\;\; 0.0272\;\; 0.0192\;\; 0.0260\;\; 0.4000\;\; -0.0148\;\; -0.0064\;\; -0.0036)^{\mathrm T} \qquad (270)$$
if the bi-Gaussian expectations of the observations are chosen as the origin. Then it follows from Eq. (267) that the critical errors are equal to

$$\omega' = (-0.0013\;\; -0.0008\;\; -0.0003\;\; -0.0000\;\; 0.0002\;\; 0.0002\;\; 0.0002\;\; 0.0000\;\; -0.0003\;\; -0.0007\;\; -0.0013)^{\mathrm T} \qquad (271)$$
The norm of this vector, that is, the distance from the origin to the bifurcation set, is equal to 0.0021. Notice that this is smaller, and in most cases much smaller, than any of the intercepts described by Eq. (270). Next suppose that there is a background present in the observations in the form of the linear trend

$$\kappa\,(0.0\;\; 0.0005\;\; \ldots\;\; 0.005) \qquad (272)$$
where κ is a constant, and that this contribution is not included in the model fitted and used for the computation of the bifurcation set. First, let κ be equal to −0.25. Then it is not difficult to show that the distance from the origin, which is now the point representing the expectations of the observations including the trend, to the bifurcation set is equal to 0.0010. The conclusion is that the origin has moved toward the bifurcation set. On the other hand, had κ been equal to 0.25 instead of −0.25, the distance would have been 0.0033. This distance exceeds the distance in the absence of the trend. Next consider the case that κ = −1. Then the distance from the origin to the bifurcation set is 0.0025. This is farther from the bifurcation set than for κ = −0.25. However, substitution of the trend for this κ in Eq. (266) defining the bifurcation set shows that the origin is now located on the side of the bifurcation set where the solutions coincide.
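The hyperplane formulas (266)–(269) make this example easy to check numerically. The sketch below reproduces the critical errors (271) and the distance 0.0021 from the intercepts (270); it is a direct transcription of those equations, not the authors' own code.

```python
import numpy as np

# intercepts of the bifurcation hyperplane with the coordinate axes, Eq. (270)
omega_bar = np.array([-0.0036, -0.0060, -0.0140, -0.1680, 0.0272, 0.0192,
                       0.0260,  0.4000, -0.0148, -0.0064, -0.0036])

nu = 1.0 / omega_bar                 # normal vector of the hyperplane, Eq. (266)
omega_crit = nu / (nu @ nu)          # critical errors, Eq. (267)
distance = 1.0 / np.linalg.norm(nu)  # distance from origin to the set, Expr. (269)

print(np.round(omega_crit, 4))       # reproduces Eq. (271)
print(round(distance, 4))            # 0.0021
```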
Conclusion VII.4 The probability of resolution of the components, that is, the probability that the point representing the set of observations is on the side of the bifurcation set corresponding to distinct solutions for the parameters, is determined by both the systematic errors and the statistical errors in the observations.

The probability of resolution is simply the probability that a set of observations is on the side of the bifurcation set corresponding to distinct solutions for the parameters. Using this definition makes computing the probability of resolution relatively easy if the bifurcation set is linear in the observations. Let υ be the vector of critical errors. Then a set of observations ω is on the same side of the linear bifurcation set as the origin if its projection on the critical error vector υ falls short of υ itself, that is, if

$$\omega^{\mathrm T}\upsilon < \upsilon^{\mathrm T}\upsilon \qquad (273)$$
Notice that in Ineq. (273) the elements of υ are constants while those of ω are stochastic variables representing the fluctuations in the observations. Therefore, the left-hand member of Ineq. (273) is a linear combination of stochastic variables having an expectation equal to zero. For simplicity, assume that the fluctuations ωₙ are independent with a standard deviation σₙ and that the central limit theorem applies to the linear combination. Then the left-hand member of Ineq. (273) is a normally distributed variable with an expectation equal to zero and a variance

$$\sum_n \upsilon_n^2\,\sigma_n^2 \qquad (274)$$

It follows that the probability of resolution is equal to the probability that a standard normal variable is smaller than

$$\sum_n \upsilon_n^2 \,\Bigl/\, \Bigl(\sum_n \upsilon_n^2\,\sigma_n^2\Bigr)^{1/2} \qquad (275)$$
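A one-function sketch of this computation, assuming independent fluctuations with known standard deviations σₙ and a linear bifurcation set:

```python
import numpy as np
from scipy.stats import norm

def probability_of_resolution(upsilon, sigma):
    """Probability that the observations fall on the distinct-solutions side
    of a linear bifurcation set, Expr. (275); upsilon is the critical error
    vector and sigma the standard deviations of the fluctuations omega_n."""
    z = (upsilon @ upsilon) / np.sqrt(((upsilon * sigma) ** 2).sum())
    return norm.cdf(z)
```

For Poisson distributed observations, σₙ² equals the expectation fₙ(θ).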
Therefore, the procedure for computing the probability of resolution is as follows. First, compute the critical errors. Then compute the quantity described by Expr. (275). The probability of resolution is the value of the tabulated standard normal cumulative distribution function for this quantity. This procedure is illustrated in the following example, where both a trend and statistical fluctuations are present.

Example VII.6 (Probability of Resolution of the Components of a Bi-Gaussian) The purpose of this example is to illustrate the proposed procedure for computing the probability of resolution.
TABLE 2
Probability of Resolution (%)

    α        κ = 0.25    κ = 0    κ = −0.25    κ = −1
   625          60         57         53          42
 10,000         85         75         62          21
The model of the expectations is that of Example VII.5:

$$f_n(\theta) = \kappa(n - 1) \times 0.0005 + \alpha\bigl[0.7\exp\bigl\{-\tfrac{1}{2}(x_n - \beta_1)^2\bigr\} + 0.3\exp\bigl\{-\tfrac{1}{2}(x_n - \beta_2)^2\bigr\}\bigr] \qquad (276)$$
with n = 1, . . . , 11; xₙ = −2.5 + 0.5(n − 1) + 0.025; and β₁ = 0 and β₂ = 0.15. The model underlying the bifurcation set is

$$f_n(t) = a\bigl[\ell\exp\bigl\{-\tfrac{1}{2}(x_n - b_1)^2\bigr\} + (1 - \ell)\exp\bigl\{-\tfrac{1}{2}(x_n - b_2)^2\bigr\}\bigr] \qquad (277)$$
with ℓ = 0.7. In addition to the systematic errors represented by the trend, there are statistical fluctuations as a result of the Poisson statistics of the observations. In the cases considered, the parameter α controlling the expected number of counts is equal to 625 and 10,000, respectively. The chosen values of the slope κ of the trend are the same as those of Example VII.5. Table 2 shows the results. The first column of Table 2 shows the number of counts α. The other columns show the probability of resolution in percent. From left to right in the table, the distance of the point representing the expectations of the observations to the bifurcation set decreases until, in the last column, this point has passed the bifurcation set. (See Example VII.5.) This explains why the probability of resolution decreases from left to right. The standard deviations relative to the expectations for α = 625 are four times larger than those for α = 10,000. Therefore, in the latter case, the observations are more concentrated around their expectations than in the former. This explains why the probability of resolution for α = 625 is smaller than that for α = 10,000 if the expectation of the observations is on the side of the bifurcation set corresponding to distinct solutions, and why the opposite is true if the expectation is on the other side.

The last topic addressed in this subsection is the probability of resolution as a function of the distance of the peaks. Again this is addressed by using an example.
Figure 18. Expected number of counts in Example VII.7 required for a probability of resolution of 75%.
Example VII.7 (Number of Observations Needed for Resolving the Components of a Bi-Gaussian as a Function of the Distance of the Components) Again let the expectations of the observations, the measurement points, and the likelihood function be those of Example VII.3. Note that this implies that the only errors present are the statistical Poisson fluctuations. Then for various distances Δβ = β₂ − β₁ the intersections with the coordinate axes may be computed as has been done in Example VII.3. Substituting these intersections in Eq. (267) then yields the critical errors. Finally, Eq. (275) may be used to compute the standard deviations and, hence, the number of observations corresponding to a specified probability of resolution. In this example, for a number of values of Δβ the number of observations is computed that is required for a probability of resolution of 75%. Figure 18 shows the results. The figure shows that in this particular case the required number of observations α is inversely proportional to approximately the fourth power of the distance of the peaks Δβ.
C. Coinciding Two-Dimensional Locations

1. Conditions for Resolution

Unlike the Rayleigh resolution criterion, the criterion of Section VII.B may be extended to vector-valued parameters. For the purposes of this article, the
most important example of a vector-valued parameter is the two-dimensional location parameter (βₓ β_y)ᵀ, where βₓ and β_y represent the location of the component with respect to the x-axis and the y-axis, respectively. Then the components are functions of two variables x and y and are said to be resolved if their location vectors differ. In optics, probably the best known example of a two-dimensional component function is that describing the diffraction pattern of a circular aperture, the Airy diffraction pattern:

$$\alpha\left[2\,\frac{J_1(2\pi r)}{2\pi r}\right]^2 \qquad (278)$$

with r² = (x − βₓ)² + (y − β_y)², where J₁(·) is the Bessel function of the first kind of order one. The general expression for the generalized two-dimensional two-component model now becomes

$$f_n(t) = f(x_n, y_n; t) = g(x_n, y_n; c) + a\bigl[\ell_1 h(x_n - b_{x1}, y_n - b_{y1}) + \ell_2 h(x_n - b_{x2}, y_n - b_{y2})\bigr] \qquad (279)$$

where (xₙ, yₙ), n = 1, . . . , N, are the known measurement points; a is the amplitude, ℓ₁ = ℓ and ℓ₂ = 1 − ℓ with 0 < ℓ < 1 as before; and (b_{x1} b_{y1})ᵀ and (b_{x2} b_{y2})ᵀ are the coordinates of the peaks. The background function g(x, y; c) is now, of course, also a function of the two variables x and y and is parametric in the elements of the M × 1 vector c. The parameter vector t is defined as

$$t = (c^{\mathrm T} \;\; a \;\; b_{x1} \;\; b_{x2} \;\; b_{y1} \;\; b_{y2})^{\mathrm T} \qquad (280)$$
Furthermore, the one-component model corresponding to Eq. (279) is described by

$$f_n(t) = g(x_n, y_n; c) + a\,h(x_n - b_x, y_n - b_y) \qquad (281)$$
where the parameter vector t is now described by (cᵀ a bₓ b_y)ᵀ. In the following analysis of the structure of the likelihood function q(t) for the model described by Eq. (279), the steps will be analogous to those for the scalar model used in Sections VI.C and VI.E and defined by Eq. (125). A special case, the analysis of the structure of the least squares criterion in two dimensions, is presented in van den Bos (1992). Suppose that

$$(\hat{c}^{\mathrm T} \;\; \hat{a} \;\; \hat{b}_x \;\; \hat{b}_y)^{\mathrm T} \qquad (282)$$
is the maximum likelihood solution for the parameters of the one-component model described by Eq. (281). Then the two-dimensional equivalent of the
one-dimensional one-component stationary point is

$$\hat{t} = (\hat{c}^{\mathrm T} \;\; \hat{a} \;\; \hat{b}_x \;\; \hat{b}_x \;\; \hat{b}_y \;\; \hat{b}_y)^{\mathrm T} \qquad (283)$$
The value of the likelihood function at this point is equal to the maximum value of the likelihood function for the one-component model. To establish the nature of the one-component stationary point, we compute the Hessian matrix with respect to the parameters t described by Eq. (280). Next the coordinates t are linearly transformed into a vector t† by

$$t^{\dagger} = \mathrm{diag}(I \;\; L)\,t \qquad (284)$$
where I is the identity matrix of order M + 1 and L is defined as

$$L = \begin{pmatrix} \ell_1 & \ell_2 & 0 & 0 \\ 0 & 0 & \ell_1 & \ell_2 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix} \qquad (285)$$
Hence

t† = (c^T  a  ℓ_1 b_{x1} + ℓ_2 b_{x2}  ℓ_1 b_{y1} + ℓ_2 b_{y2}  b_{x1} − b_{x2}  b_{y1} − b_{y2})^T   (286)
Finally, transforming the Hessian matrix H with respect to t into the Hessian matrix H† with respect to t† yields

H† = L^{−T} H L^{−1}   (287)
Simple algebraic operations then show that

H† = ⎡ G    O ⎤
     ⎣ O^T  R ⎦   (288)
where G is the (M + 3) × (M + 3) Hessian matrix of the log-likelihood function for the one-component model, evaluated at its maximum defined by Eq. (282), O is the (M + 3) × 2 null matrix, and the 2 × 2 matrix R has as its elements

r_pq = â ℓ(1 − ℓ) Σ_n (∂q/∂f_n)(∂²h_n/∂b_p ∂b_q),   p, q = 1, 2   (289)
where b_1 = b_x and b_2 = b_y, and all derivatives are evaluated at t = t̂ = (ĉ^T â b̂_x b̂_x b̂_y b̂_y)^T. Since G is the Hessian matrix of the likelihood function for the one-component model and is evaluated at the maximum of this function, it is negative definite. Then the nature of the one-component stationary point

t̂ = (ĉ^T  â  b̂_x  b̂_x  b̂_y  b̂_y)^T   (290)
is fully determined by the matrix R or, equivalently, by its eigenvalues. If these are both positive, this point is a maximum in M + 3 directions and a minimum in both remaining ones. Then it is an M + 3 saddle. If the signs of the eigenvalues are different, the one-component stationary point is an M + 4 saddle. If both eigenvalues are strictly negative, the point is a maximum. Then t̂ represents the maximum likelihood estimate and, consequently, the solution vectors for the locations of both components coincide. Therefore, the two-dimensional component functions cannot be resolved from the observations available. The bifurcation set is composed of those sets of observations for which the largest eigenvalue just vanishes. Since the determinant of R is equal to the product of the eigenvalues and its trace to their sum, a set of observations w = (w_1 . . . w_N)^T belongs to the bifurcation set if

det(R) = r_11 r_22 − r_12² = 0   and   r_11 + r_22 < 0   (291)
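In practice, Eq. (291) is easily checked numerically. The following minimal sketch classifies a one-component stationary point from a given 2 × 2 matrix R; the numerical entries of R are hypothetical examples, not values derived from the models in this section.

```python
import numpy as np

def classify_stationary_point(R, tol=1e-12):
    """Classify the one-component stationary point from the 2x2 matrix R
    of Eq. (289): a maximum means coinciding locations (not resolved)."""
    det, tr = np.linalg.det(R), np.trace(R)
    if det > tol and tr < 0:        # both eigenvalues negative
        return "maximum: locations coincide, components not resolved"
    if abs(det) <= tol and tr < 0:  # largest eigenvalue just vanishes
        return "on the bifurcation set"
    return "saddle: locations distinct, components resolved"

# Hypothetical example values for R.
print(classify_stationary_point(np.array([[-2.0, 0.3], [0.3, -1.0]])))
print(classify_stationary_point(np.array([[-2.0, 0.3], [0.3, 0.5]])))
```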
Note that by Eq. (289), these conditions are independent of the value of ℓ if 0 < ℓ < 1. Then the bifurcation set is defined by the M + 3 normal equations of the log-likelihood function for the one-component model and the equation det(R) = 0. These are M + 4 equations in the N observations and the M + 3 elements of (ĉ^T â b̂_x b̂_y)^T. Hence, if, hypothetically, these elements are eliminated, one single equation in the observations results. The conclusion is that in the Euclidean space of the observations the bifurcation set has codimension 1. It therefore divides the space of the observations into two regions: one region is the collection of all sets of observations from which the components can be resolved, and the other is the collection of sets of observations from which this is not possible.

2. Application to the Airy Model

In this subsection, the resolution criterion for component functions of two variables is applied to the Airy component function, which is important for this article since it is the diffraction pattern for a circular aperture.

Example VII.8 (Probability of Resolving the Components of a Bi-Airian Model)

In this example, the probability of resolving the components of a pair of incoherent Airy patterns from error corrupted observations is illustrated. Both systematic and statistical errors are present to show their interaction. It is assumed that N observations w = (w_1 . . . w_N)^T are available and that the expectation of the nth observation w_n is described by

d(x_n, y_n) + α[λh(x_n − β_{x1}, y_n − β_{y1}) + (1 − λ)h(x_n − β_{x2}, y_n − β_{y2})]   (292)
In this expression,

h(x, y) = [2 J_1(2πr) / (2πr)]²   (293)
with r² = x² + y², is the function describing the two-dimensional Airy intensity pattern. See Born and Wolf (1980). Thus, the second term of Eq. (292) describes two Airy components located at the points (β_{x1}, β_{y1}) and (β_{x2}, β_{y2}) and having amplitudes αλ and α(1 − λ), respectively. The first term, d(x, y), is a function of x and y. It represents the systematic error, that is, the contribution to the expectations of the observations not included in the model fitted. The 49 measurement points (x_n, y_n) are described by (−0.75 + 0.25(s − 1), −0.75 + 0.25(t − 1)) with s, t = 1, . . . , 7. To avoid atypical conditions such as symmetrical location of measurement points, in this example we take the locations of the components as (β_{x1}, β_{y1}) = (0.0475, 0.0575) and (β_{x2}, β_{y2}) = (0.0699, 0.1022). Then the Euclidean distance of these locations is 0.05. Since the width at half maximum of the Airy component function is 0.5, the distance of the components is one tenth of the width. The heights of the components are taken equal, that is, λ = 0.5. It has been assumed that the observations have a Poisson distribution. Two cases are considered: α = 625 and α = 10,000. The systematic errors d(x_n, y_n) are generated as follows. A one-component model

a h(x_n − b_x, y_n − b_y)   (294)

with h(x, y) defined by Eq. (293) is fitted to exact two-component observations:

f_n(θ) = α[λh(x_n − β_{x1}, y_n − β_{y1}) + (1 − λ)h(x_n − β_{x2}, y_n − β_{y2})]   (295)

Suppose that the solution is (â b̂_x b̂_y)^T. Then the systematic errors are taken as

d(x_n, y_n) = −κ[f_n(θ) − â h(x_n − b̂_x, y_n − b̂_y)]   (296)
where κ is a gain factor. The values of κ are chosen as 0, 0.5, 1, and 1.5, respectively. Notice that κ = 1 means that the expectations of the observations

d(x_n, y_n) + f_n(θ)   (297)

are exactly equal to the one-component model

â h(x_n − b̂_x, y_n − b̂_y)   (298)
Then Eqs. (260)–(262) show that the observations are in the bifurcation set. For each of the chosen values of κ, 100 sets of observations are simulated and the relative frequency of occurrence of distinct solutions for the location parameters is computed. Table 3 shows the results.
TABLE 3
Frequency of Resolution (%)

α          κ = 0     κ = 0.5     κ = 1     κ = 1.5
625          92         82         62         42
10,000      100        100         51         12
The first column of Table 3 shows the value of α, the number of counts. The other columns show the measured frequency of occurrence of resolved components in percent. From left to right, κ = 0 corresponds to expected observations without systematic error, κ = 0.5 to expected observations between the origin for κ = 0 and the bifurcation set, κ = 1 to expected observations in the bifurcation set, and κ = 1.5 to expected observations beyond the bifurcation set.

The relative standard deviation of the observations for α = 625 is four times as large as that for α = 10,000. As a result, for α = 625 and κ = 0 or κ = 0.5, a number of 8 and 18 sets of observations, respectively, occur beyond the bifurcation set, while for α = 10,000 this does not occur at all for these values of κ. If the bifurcation set were a hyperplane, the probability of resolution for κ = 1 would be 50%, and so, approximately, would be the corresponding relative frequencies. The relative frequencies of 62 and 51% for α = 625 and α = 10,000, respectively, may be explained by the larger area over which the observations are distributed in the former case and the corresponding decrease of linearity of the bifurcation set. The relatively low percentage of resolved components for α = 10,000 and κ = 1.5 is caused by the fact that the point representing the expectations of a set of observations is now beyond the bifurcation set while the observations are distributed over a relatively small neighborhood of this point.

Example VII.8 shows, again, that the probability of resolution is determined by the combination of statistical errors and systematic errors. Precision, that is, low standard deviation of the observations, is always helpful if the expectation of the observations is on the same side of the bifurcation set as the origin. On the other hand, if the systematic errors are such that the expectation of the observations is not on the side of the bifurcation set corresponding to distinct solutions, improving the precision of the observations to improve the resolution is pointless.

Example VII.8 and the theory presented in Section VII.C.1 show that the concept of resolution adopted in this article is equally well applicable to one-dimensional problems as to two-dimensional problems. This is a further
difference from the classical Rayleigh resolution. Application of the Rayleigh resolution criterion to the one-dimensional intersection of overlapping two-dimensional Airy patterns with a vertical plane through their locations (see Orloff, 1997, p. 322) disregards all points outside this plane.
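Returning to Example VII.8, the following minimal simulation sketch generates Poisson observations of the bi-Airy expectation model of Eq. (292) on the 7 × 7 grid and fits the one-component model of Eq. (294) by least squares. The use of scipy.optimize.least_squares, the starting values, and the single realization shown are implementation choices of this sketch, not prescriptions from the text; the full experiment would repeat this for 100 realizations per κ and test each for distinct two-component solutions.

```python
import numpy as np
from scipy.special import j1
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

def h(x, y):
    """Airy intensity pattern of Eq. (293); the limit at r = 0 is 1."""
    r = np.hypot(x, y)
    out = np.ones_like(r)
    nz = r > 1e-12
    out[nz] = (2.0 * j1(2.0 * np.pi * r[nz]) / (2.0 * np.pi * r[nz])) ** 2
    return out

# 7 x 7 measurement grid and the parameter values of Example VII.8.
s = np.arange(7)
X, Y = np.meshgrid(-0.75 + 0.25 * s, -0.75 + 0.25 * s)
x, y = X.ravel(), Y.ravel()
alpha, lam = 625.0, 0.5
bx1, by1, bx2, by2 = 0.0475, 0.0575, 0.0699, 0.1022

# Exact two-component expectations, Eq. (295).
f = alpha * (lam * h(x - bx1, y - by1) + (1 - lam) * h(x - bx2, y - by2))

# Fit the one-component model of Eq. (294) by least squares.
def residuals(p):
    a, bx, by = p
    return f - a * h(x - bx, y - by)

fit = least_squares(residuals, x0=[alpha, 0.05, 0.08])
a_hat, bx_hat, by_hat = fit.x

# Systematic error of Eq. (296) and Poisson observations for one kappa.
kappa = 1.0
d = -kappa * (f - a_hat * h(x - bx_hat, y - by_hat))
w = rng.poisson(np.clip(d + f, 0.0, None))  # clip guards tiny negatives
print(a_hat, bx_hat, by_hat, w[:5])
```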
D. Nonstandard Resolution: Partial Coherence

Up to now, the model components were assumed to be incoherent. If the images are coherent, the expression for the expectation of the two-point image becomes (Born and Wolf, 1980)

α[λ²h²(x_n − β_{x1}, y_n − β_{y1}) + (1 − λ)²h²(x_n − β_{x2}, y_n − β_{y2}) + 2ρλ(1 − λ)h(x_n − β_{x1}, y_n − β_{y1})h(x_n − β_{x2}, y_n − β_{y2})]   (299)

where ρ is the real part of the complex degree of coherence and the remaining symbols have their usual meaning. Generally, this model differs from the standard two-component model used up to now as a result of the presence of the last term unless ρ = 0 or ρ = ±1. If ρ = 0, the components are incoherent. This model is covered by the two-component model assumed up to now with h(·, ·) replaced by h²(·, ·). If ρ = 1, the sources are fully coherent and Eq. (299) may be written

α[λh(x_n − β_{x1}, y_n − β_{y1}) + (1 − λ)h(x_n − β_{x2}, y_n − β_{y2})]²   (300)
Then, if this model is substituted for the expectations of the observations in the log-likelihood function, the theory developed in the preceding subsections is fully applicable since the likelihood function is now a function of the two-component model

α^{1/2}[λh(x_n − β_{x1}, y_n − β_{y1}) + (1 − λ)h(x_n − β_{x2}, y_n − β_{y2})]   (301)
The log-likelihood function is, therefore, a function of a standard two-component model, which is all that is required by the developed theory. For all other values of ρ, the theory described in the preceding subsections has to be modified to be applicable to the model described by Eq. (299). In van den Bos and den Dekker (1995, 1996), this is done for one- and two-dimensional location parameters for the uniformly weighted least squares criterion. In den Dekker (1997a), a detailed computation of the probability of resolution for least squares and scalar location parameters is presented. den Dekker (1997b) provides a detailed treatment of probability of resolution in one and two dimensions for the least squares criterion. These references show that, analogous to the least squares criterion for the incoherent model, the least squares criterion for the coherent model has a
one-component stationary point. The one-component model is now

a κ(ℓ, ρ) h(x_n − b_x, y_n − b_y)   (302)

with

κ(ℓ, ρ) = [ℓ² + (1 − ℓ)² + 2ρℓ(1 − ℓ)]   (303)
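Offered only as an illustration, the following small sketch evaluates the effective amplitude factor κ(ℓ, ρ) of Eq. (303) and the partially coherent two-point image of Eq. (299) along a line through the two locations; the one-dimensional slice, the peak separation, and the parameter values are arbitrary choices of this sketch.

```python
import numpy as np
from scipy.special import j1

def airy(x, y):
    # Airy intensity of Eq. (293), with the r -> 0 limit handled explicitly.
    r = np.hypot(np.asarray(x, float), np.asarray(y, float))
    out = np.ones_like(r)
    nz = r > 1e-12
    out[nz] = (2.0 * j1(2.0 * np.pi * r[nz]) / (2.0 * np.pi * r[nz])) ** 2
    return out

def kappa(l, rho):
    # Effective amplitude factor of Eq. (303).
    return l**2 + (1 - l) ** 2 + 2 * rho * l * (1 - l)

def coherent_image(x, y, alpha, l, rho, p1, p2):
    # Partially coherent two-point image of Eq. (299).
    h1 = airy(x - p1[0], y - p1[1])
    h2 = airy(x - p2[0], y - p2[1])
    return alpha * (l**2 * h1**2 + (1 - l) ** 2 * h2**2
                    + 2 * rho * l * (1 - l) * h1 * h2)

# Arbitrary illustrative slice through two peaks 0.25 apart.
x = np.linspace(-1.0, 1.0, 401)
for rho in (0.0, 0.5, 1.0):
    img = coherent_image(x, 0.0, 1.0, 0.5, rho, (-0.125, 0.0), (0.125, 0.0))
    print(f"rho = {rho:.1f}: kappa = {kappa(0.5, rho):.3f}, peak = {img.max():.3f}")
```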
Furthermore, analogous expressions for the coordinate transformation defined by Eq. (285) are

L = ⎡ ℓ′_1   ℓ′_2    0     0  ⎤
    ⎢  0      0    ℓ′_1   ℓ′_2 ⎥
    ⎢  1     −1     0     0  ⎥
    ⎣  0      0     1    −1  ⎦   (304)

with ℓ′_1 = ℓ_1(ℓ_1 + ρℓ_2)/κ(ℓ, ρ) and ℓ′_2 = ℓ_2(ℓ_2 + ρℓ_1)/κ(ℓ, ρ) and with, as before, ℓ_1 = ℓ and ℓ_2 = 1 − ℓ. Analogous expressions for the elements r_pq of the 2 × 2 matrix R described by Eq. (289) are

r_pq = (ℓ_1 ℓ_2 / κ(ℓ, ρ)) [ℓ_1 ℓ_2 (1 − ρ²) χ_pq + (ℓ_2 + ρℓ_1)(ℓ_1 + ρℓ_2) ψ_pq]   (305)

where

χ_pq = −4â Σ_n d_n (∂h_n/∂b_p)(∂h_n/∂b_q)   and   ψ_pq = −4â Σ_n d_n h_n (∂²h_n/∂b_p ∂b_q)   (306)

with

d_n = w_n − â κ(ℓ, ρ) h(x_n − b̂_x, y_n − b̂_y)   (307)
In Eqs. (305)–(307), b_1 = b_x and b_2 = b_y, while h_n = h(x_n − b_x, y_n − b_y) and its derivatives with respect to b_x and b_y are all evaluated at the one-component stationary point (â b̂_x b̂_x b̂_y b̂_y)^T. The results described by Eqs. (305) and (306) are specialized to the least squares criterion. It is not difficult to show that the pertinent results for a general likelihood function q are obtained if in Eq. (306) χ_pq and ψ_pq are replaced by

χ_pq = â Σ_n (∂q/∂f_n)(∂h_n/∂b_p)(∂h_n/∂b_q)   and   ψ_pq = â Σ_n (∂q/∂f_n) h_n (∂²h_n/∂b_p ∂b_q)   (308)
Conclusion VII.5

First, a comparison of the elements r_pq of the matrix R in the incoherent case defined by Eq. (289) with those in the coherent case shows that in the latter case the elements of R are dependent on ℓ. Therefore, in the coherent case the resolvability of the parameters (β_{x1}, β_{y1}) and (β_{x2}, β_{y2}) depends on ℓ as well. Second, the generalization of the concept of resolution introduced in this article to coherent images is relatively straightforward.
E. A Survey of Related Literature

In this subsection, a short survey of earlier work related to the results presented in this article is given. In an example illustrating a numerical method for the computation of the parameters of exponential decay models, Lanczos (1957) truncates his simulated triexponential observations to three decimals and concludes that, as a result of the truncation, his triexponential observations produce a biexponential solution. This conclusion is based on the fact that a set of linear equations that has to be solved in the process is ill-conditioned for three exponentials but is not for two. Although at first sight there is a certain similarity to the fundamental limits to resolution exposed in this article, Lanczos's problem is a numerical problem only. Applying the resolution criterion proposed in this article to his observations reveals that a triexponential least squares solution does exist.

The first publication reporting exactly coinciding least squares solutions for the nonlinear parameters of a multicomponent model and explaining this phenomenon using singularity theory is that of van den Bos (1980). The theory concerned is discussed in more detail in van den Bos (1981). The simultaneous estimation of the component amplitudes is added in van den Bos (1983). These publications are all restricted to nonlinear least squares estimation of the parameters, which is equivalent to maximum likelihood estimation for independent and identically normally distributed observations. In van den Bos (1984), the results obtained for least squares are extended to include the more general class of criteria of goodness of fit or likelihood functions that are functions of the differences of the observations and the model only. A background function is also added to the component model. The first application to optical resolution is van den Bos (1987), addressing the resolution of two overlapping incoherent intensity peaks of equal height. In van den Bos (1988), the results obtained so far are extended to include the nondifferentiable likelihood functions corresponding to least-absolute-values and minimax model fitting. The coincidence of the least squares solutions for the locations of more than two components is the subject of van den Bos and van der Werff (1990). Coincidence of two-dimensional location parameters, needed for the definition of optical or electron optical resolution in two dimensions, is analyzed in van den Bos (1992). In addition to a number of clarifying numerical examples and corresponding visualizations, den Dekker (1992) addresses component coincidence in the presence of coherence and for a generalized class of likelihood functions. In Swarte (1992), the developed resolution theory is applied to spectroscopic peaks. Critical errors, defined as the errors that have smallest energy among all errors causing coincidence of solutions, are introduced in van den Bos (1993). In the same reference, the concept probability of resolution is introduced. Critical
errors are also one of the main subjects of van den Bos and Swarte (1993). The coincidence of the solutions for one-dimensional and for two-dimensional component locations in the presence of partial coherence is studied in van den Bos and den Dekker (1995, 1996), respectively. In den Dekker (1997a), the probability of resolution in the presence of partial coherence is computed for least squares model fitting. Probability of resolution in one and two dimensions and in the presence of partial coherence is analyzed and discussed in detail in den Dekker (1997b). Bettens and coauthors (1999) compute the probability of resolution of a pair of Gaussian peaks when the observations are counting results. Finally, Sijbers and coauthors (1998) show that singularity theory may also be used to explain structural change of a likelihood function used in magnetic resonance imaging.

F. Summary and Conclusions

In this section, a Taylor polynomial representation of the likelihood function of the location parameters of the components of overlapping component functions has been used to derive a criterion for the resolvability of the components from error corrupted observations. This has been done for one-dimensional and for two-dimensional observations. The use of the resulting criteria has been demonstrated in numerical examples employing simulated observations made on a pair of sinc-square components such as used by Rayleigh in his criterion and a pair of overlapping Airy patterns, respectively.

The resulting criteria are simple and general. They clearly show that resolution is limited by errors. These are systematic errors in the sense of modeling errors and nonsystematic errors, which are statistical fluctuations of the observations. The criteria are also operational since they divide all possible sets of observations into two categories: sets of observations from which the components can be resolved and sets of observations from which they cannot be resolved. To which of the two categories a particular set of observations belongs is easily discovered without the need to actually try to resolve the components. The use of the criterion has also been extended to include resolution in the presence of coherence.

VIII. Summary and Conclusions

In this article, conventional resolution definitions and criteria have been reviewed, and an alternative definition and corresponding criterion have been proposed and explained. The proposed definition is this: Two overlapping component functions are said to be resolved from a given set of error disturbed observations if the estimates of their location parameters are distinct. The component functions are unresolved if the estimates of the location parameters exactly coincide.
Although exactly coinciding solutions for parameters may seem highly improbable, they regularly occur and do so more often as the component functions increasingly overlap. Singularity theory has been used to explain the occurrence of coinciding solutions and to show that they are a result of systematic errors or nonsystematic errors in the observations.

The corresponding resolution criterion operates as follows. First, the parameters t = (c^T a b)^T of the one-component model f(x; t) = a h(x; b) + g(x; c) are estimated from the available observations w = (w_1 . . . w_N)^T. In this expression, h(x; b) is the component model as a function of the variable x and located at b, and a is the amplitude of the component. The function g(x; c) is a background function with parameters c = (c_1 . . . c_M)^T. As estimates of t, the maximum likelihood estimates t̂ = (ĉ^T â b̂)^T are taken. These maximize the likelihood function chosen by the experimenter. This likelihood function may be different from the likelihood function corresponding to the actual probability density function of the observations w. Next the estimates t̂ and the observations w are substituted in the discriminant function

D = Σ_n (∂q/∂f_n) h_n^{(2)}(b̂)

where q = q(t) is the logarithm of the likelihood function, f_n = f(x_n; t), the expression h_n^{(2)}(b̂) is the second-order derivative of the component function h(x_n; b) with respect to b at b = b̂, and x_n is the nth measurement point, with n = 1, . . . , N. Notice that D is a function of the observations only since t̂ is fully defined by w. Also notice that D is specific to the component model, the background model, and the likelihood function chosen by the experimenter. It has been shown that if D > 0, the estimates of the location parameters are distinct and that if D ≤ 0, they coincide. This is why the sign of D has been proposed as the resolution criterion.

In this article, the observations have been modeled as stochastic variables since, in our opinion, there is no more useful alternative. In this model, the expectations of the observations represent the actual two-component and background function underlying the observations. This function need not be the same function as that adopted by the experimenter and used in the computation of the discriminant D. In any case, if the observations are stochastic variables, then the discriminant D and its sign become stochastic variables as well. Consequently, resolution of the components occurs with a certain probability only. This has been called the probability of resolution. For a chosen likelihood function, component function, and background function, the probability of resolution is determined by the expectations of the observations and the way the observations are distributed about these expectations. To explain this
further, we have introduced the notion of the Euclidean space of observations. In this space, the nth coordinate axis represents the nth observation w_n, and a point in this space represents a set of observations (w_1 . . . w_N)^T. It has been shown that all sets of observations corresponding to vanishing D form a hypersurface called the bifurcation set, which divides the Euclidean space of observations into two regions. From the observations in the one region, the components can be resolved. From the observations in the other region, this is not possible. The bifurcation set is characteristic of the component function, the background function, and the likelihood function chosen.

It has been shown that increasing the standard deviation of the observations may decrease the probability of resolution since it increases the probability of occurrence of a set of observations on the side of the bifurcation set corresponding to coinciding solutions for the locations. The probability of occurrence of a set of observations in either region also depends on the position of the point representing the expectations of the observations relative to the bifurcation set. Systematic errors, that is, differences between the model describing the expectations and the model adopted by the experimenter, may move the bifurcation set toward this point and even past it. This causes the probability of resolution to decrease and shows how systematic errors may adversely influence resolution. These considerations clearly show that errors, both systematic and nonsystematic, are the only limit to resolution.

The generality of the proposed definition and criterion has also been demonstrated. Different from conventional definitions, in the proposed definition the number of observations need not be such that the use of asymptotic statistical results is justified. Furthermore, the likelihood function used in the estimation of the location parameters and for the computation of the bifurcation set need not correspond to the probability density function of the observations. Also, as set out earlier in this section, the influence of both systematic and nonsystematic errors may be analyzed. Since all results derived depend on the location of the measurement points, the influence of these experimental parameters upon resolution may also be investigated. Furthermore, it has been shown how in this approach, different from conventional approaches, the resolution definition and criterion may be extended to include resolution in more dimensions. Extension to resolution in the presence of coherence has also been briefly outlined. Finally, a brief survey of relevant literature has been presented.
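As a concluding illustration, the following minimal sketch evaluates the sign of the discriminant D for simulated one-dimensional observations of two overlapping components under a least squares criterion. The Gaussian component, the zero background, the noise level, and the finite-difference second derivative are arbitrary choices of this sketch, not prescriptions of the article; taking q as the negative sum of squared residuals gives ∂q/∂f_n = 2(w_n − f_n).

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

def h(x, b, s=1.0):
    # Gaussian component located at b (arbitrary choice for this sketch).
    return np.exp(-0.5 * ((x - b) / s) ** 2)

x = np.linspace(-5.0, 5.0, 51)
# Simulated two-component expectations plus noise (zero background, g = 0).
expect = 50.0 * (0.5 * h(x, -0.4) + 0.5 * h(x, 0.4))
w = expect + rng.normal(scale=2.0, size=x.size)

# One-component least squares fit: f(x; t) = a h(x; b).
fit = least_squares(lambda p: w - p[0] * h(x, p[1]), x0=[50.0, 0.0])
a_hat, b_hat = fit.x
f_hat = a_hat * h(x, b_hat)

# Second derivative of h with respect to b at b_hat, by central differences.
eps = 1e-4
h2 = (h(x, b_hat + eps) - 2 * h(x, b_hat) + h(x, b_hat - eps)) / eps**2

# Discriminant D = sum_n (dq/df_n) h_n''(b_hat); for least squares,
# dq/df_n = 2 (w_n - f_n).
D = np.sum(2.0 * (w - f_hat) * h2)
print("resolved" if D > 0 else "coinciding locations")
```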
References

Andrews, H. C., and Hunt, B. R. (1977). Digital Image Restoration. Englewood Cliffs, NJ: Prentice Hall.
Arnol'd, V. I. (1992). Catastrophe Theory. Berlin: Springer-Verlag.
Banham, M. R., and Katsaggelos, A. K. (1997). Digital image restoration. IEEE Signal Processing Magazine 14, 24–41.
Barakat, R. (1962). Application of apodization to increase two-point resolution by the Sparrow criterion. I. Coherent illumination. J. Opt. Soc. Am. 52, 276–283.
Barakat, R., and Levin, E. (1963). Application of apodization to increase two-point resolution by the Sparrow criterion. II. Incoherent illumination. J. Opt. Soc. Am. 53, 274–282.
Barnes, C. W. (1966). Object restoration in a diffraction-limited imaging system. J. Opt. Soc. Am. 56, 575–578.
Bettens, E., Van Dyck, D., den Dekker, A. J., Sijbers, J., and van den Bos, A. (1999). Model-based two-object resolution from observations having counting statistics. Ultramicroscopy 77, 37–48.
Biraud, Y. (1969). A new approach for increasing the resolving power by data processing. Astronomy Astrophys. 1, 124–127.
Born, M., and Wolf, E. (1980). Principles of Optics. New York: Pergamon.
Burch, S. F., Gull, S. F., and Skilling, J. (1983). Image restoration by a powerful maximum entropy method. Comput. Vision, Graphics, Image Processing 23, 113–128.
Buxton, A. (1937). Note on optical resolution. Lond. Edinb. Dublin Philos. Magazine J. Sci. 23, 440–442.
Castleman, K. R. (1979). Digital Image Processing. London: Prentice Hall.
Cathey, W. T., Frieden, B. R., Rhodes, W. T., and Rushforth, C. K. (1984). Image gathering and processing for enhanced resolution. J. Opt. Soc. Am. A 1, 241–250.
Chatfield, C. (1995). Statistics for Technology. London: Chapman & Hall.
Clements, A. M., and Wilkins, J. E., Jr. (1974). Apodization for maximum encircled-energy ratio and specified Rayleigh limit. J. Opt. Soc. Am. 64, 23–27.
Cox, I. J., and Sheppard, C. J. R. (1986). Information capacity and resolution in an optical system. J. Opt. Soc. Am. A 3, 1152–1158.
Cunningham, D. R., and Laramore, R. D. (1976). Detection in image dependent noise. IEEE Trans. Inform. Theory IT-22, 603–610.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B. Methodological 39, 1–38.
den Dekker, A. J. (1992). Resolutie: Illustraties, experimenten en theorie. Master's thesis, Department of Physics, Delft University of Technology, Delft, The Netherlands (in Dutch).
den Dekker, A. J. (1997a). Model-based optical resolution. IEEE Trans. Instrum. Meas. 46(4), 798–802.
den Dekker, A. J. (1997b). Model-based resolution. Ph.D. thesis, Delft University of Technology. Delft, The Netherlands: Delft Univ. Press.
den Dekker, A. J., and van den Bos, A. (1997). Resolution: A survey. J. Opt. Soc. Am. A 14, 547–557.
Dhrymes, P. J. (1970). Econometrics: Statistical Foundations and Applications. New York: Harper & Row.
Dunn, J. H., Howard, D. D., and Pendleton, K. B. (1970). Tracking Radar. New York: McGraw-Hill.
Falconi, O. (1967). Limits to which double lines, double stars, and disks can be resolved and measured. J. Opt. Soc. Am. 57, 987–993.
Farrell, E. J. (1966). Information content of photoelectric star images. J. Opt. Soc. Am. 56, 578–587.
Fellgett, P. B., and Linfoot, E. H. (1955). On the assessment of optical images. Philos. Trans. R. Soc. Lond. A. Math. Phys. Sci. 247, 369–407.
Fried, D. L. (1979). Resolution, signal-to-noise-ratio, and measurement precision. J. Opt. Soc. Am. 69, 399–406.
Fried, D. L. (1980). Resolution, signal-to-noise-ratio, and measurement precision: Addendum. J. Opt. Soc. Am. 70, 748–749.
Frieden, B. R. (1967). Band-unlimited reconstruction of optical objects and spectra. J. Opt. Soc. Am. 57, 1013–1019.
Frieden, B. R. (1972). Restoring with maximum likelihood and maximum entropy. J. Opt. Soc. Am. 62, 511–518.
Frieden, B. R. (1975). Image Enhancement and Restoration, Vol. 6. New York: Springer-Verlag.
Frieden, B. R. (1980). Statistical models for the image restoration problem. Comput. Graph. Image Processing 12, 40–59.
Frieden, B. R., and Burke, J. J. (1972). Restoring with maximum entropy. II: Superresolution of photographs of diffraction-blurred impulses. J. Opt. Soc. Am. 62, 1202–1210.
Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. PAMI-6(6), 721–741.
Gerchberg, R. W. (1974). Super-resolution through error energy reduction. Optica Acta 21(9), 709–720.
Gilmore, R. (1981). Catastrophe Theory for Scientists and Engineers. New York: Wiley.
Goodwin, G. C., and Payne, R. L. (1977). Dynamic System Identification: Experiment Design and Analysis. New York: Academic Press.
Harris, J. L. (1964a). Diffraction and resolving power. J. Opt. Soc. Am. 54, 931–936.
Harris, J. L. (1964b). Resolving power and decision theory. J. Opt. Soc. Am. 54, 606–611.
Helstrom, C. W. (1969). Detection and resolution of incoherent objects by a background-limited optical system. J. Opt. Soc. Am. 59, 164–175.
Helstrom, C. W. (1970). Resolvability of objects from the standpoint of statistical parameter estimation. J. Opt. Soc. Am. 60, 659–666.
Holmes, T. J. (1988). Maximum-likelihood image restoration adapted for noncoherent optical imaging. J. Opt. Soc. Am. A 5(5), 666–673.
Houston, W. V. (1927). A compound interferometer for fine structure work. Phys. Rev. 29, 478–484.
Hunt, B. R. (1973). The application of constrained least squares estimation to image restoration by digital computer. IEEE Trans. Comput. C-22(9), 805–812.
Hunt, B. R. (1994). Prospects for image restoration. Int. J. Mod. Phys. C5, 151–178.
Hunt, B. R. (1995). Super-resolution of images: Algorithms, principles, performance. Int. J. Imaging Syst. Technol. 6, 297–304.
Hunt, B. R., and Andrews, H. C. (1973). Comparison of different filter structures for restoration of images, in Proceedings of the HICCS Conference, University of Hawaii, Honolulu, HI.
Hunt, B. R., and Sementilli, P. (1992). Description of a Poisson imagery super-resolution algorithm, in Astronomical Data Analysis Software and Systems, Vol. 25, edited by I. Worrall et al. San Francisco: Astronomical Society of the Pacific, pp. 196–199.
Hunt, G. W. (1981). An algorithm for the nonlinear analysis of compound bifurcation. Philos. Trans. R. Soc. Lond. 300(A 1455), 443–471.
Idell, P. S., and Webster, A. (1992). Resolution limits for coherent optical imaging: Signal-to-noise analysis in the spatial-frequency domain. J. Opt. Soc. Am. A 9, 43–56.
Jacquinot, P., and Roizen-Dossier, B. (1964). Apodisation, in Progress in Optics, Vol. 3, edited by E. Wolf. Amsterdam: North-Holland.
Janson, P. A., Hunt, R. H., and Plyler, E. K. (1970). Resolution enhancement of spectra. J. Opt. Soc. Am. A 60, 596–599.
Jennrich, R. I. (1995). An Introduction to Computational Statistics: Regression Analysis. Englewood Cliffs, NJ: Prentice Hall.
Kelly, E. J., Reed, I. S., and Root, W. L. (1960). The detection of radar echoes in noise. II. J. Soc. Ind. Appl. Math. 8, 481–507.
Kibe, J. N., and Wilkins, J. E., Jr. (1983). Apodization for maximum central irradiance and specified large Rayleigh limit of resolution. J. Opt. Soc. Am. 73, 387–391.
Kibe, J. N., and Wilkins, J. E., Jr. (1984). Apodization for maximum central irradiance and specified large Rayleigh limit of resolution. II. J. Opt. Soc. Am. A 1, 337–343.
Kosarev, E. L. (1990). Shannon's superresolution limit for signal recovery. Inverse Probl. 6, 55–76.
Lanczos, C. (1957). Applied Analysis. London: Pitman.
Lucy, L. B. (1974). An iterative technique for the rectification of observed distributions. Astronomical J. 79(6), 745–754.
Lucy, L. B. (1992a). Resolution limits for deconvolved images. Astronomical J. 104(3), 1260–1265.
Lucy, L. B. (1992b). Statistical limits to superresolution. Astronomy Astrophys. 261, 706–710.
Lukosz, W. (1966). Optical systems with resolving powers exceeding the classical limit. J. Opt. Soc. Am. 56, 1463–1472.
Lukosz, W. (1967). Optical systems with resolving powers exceeding the classical limit. II. J. Opt. Soc. Am. 57, 932–941.
Lummer, O., and Reiche, F. (1910). Die Lehre von der Bildentstehung im Mikroskop von Ernst Abbe. Braunschweig, Germany: Vieweg.
Luneberg, R. K. (1944). Mathematical Theory of Optics. Providence, RI: Brown Univ. Press.
McKechnie, T. S. (1972). The effect of condenser obstruction on the two-point resolution of a microscope. Optica Acta 19, 729–737.
Meinel, E. S. (1986). Origins of linear and nonlinear recursive restoration algorithms. J. Opt. Soc. Am. A 3, 787–799.
Mood, A. M., Graybill, F. A., and Boes, D. C. (1987). Introduction to the Theory of Statistics, 3d ed. Auckland, New Zealand: McGraw-Hill.
Nahrstedt, D. A., and Schooley, L. C. (1979). Alternative approach in decision theory as applied to the resolution of two point images. J. Opt. Soc. Am. 69, 910–912.
Norden, R. H. (1972). A survey of maximum likelihood. Int. Stat. Rev. 40(3), 329–354.
Norden, R. H. (1973). A survey of maximum likelihood, part 2. Int. Stat. Rev. 41(1), 39–58.
O'Keefe, M. A. (1992). Resolution in high-resolution electron microscopy. Ultramicroscopy 47, 282–297.
Orhaug, T. (1969). On the resolution of imaging systems. Optica Acta 16, 75–84.
Orloff, J., ed. (1997). Handbook of Charged Particle Optics. Boca Raton, FL: CRC Press.
Osterberg, H. (1950). Microscope imagery and interpretations. J. Opt. Soc. Am. 40, 295–303.
Osterberg, H., and Wilkins, J. (1949). The resolving power of a coated objective. J. Opt. Soc. Am. 39, 553–557.
Osterberg, H., and Wissler, F. C. (1949). The resolution of two particles in a bright field by coated microscope objectives. J. Opt. Soc. Am. 39, 558–566.
Papoulis, A. (1965). Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill.
Pask, C. (1976). Simple optical theory of super-resolution. J. Opt. Soc. Am. Lett. 66, 68–70.
Poston, T., and Stewart, I. N. (1978). Catastrophe Theory and Its Applications. London: Pitman.
Ramsay, B. P., Cleveland, E. L., and Koppius, O. T. (1941). Criteria and the intensity-epoch slope. J. Opt. Soc. Am. 31, 26–33.
Rayleigh, Lord. (1879). Investigations in optics, with special reference to the spectroscope. Lond. Edinb. Dublin Philos. Magazine J. Sci. 8(49), 261–274, 403–411, 477–486.
Rayleigh, Lord. (1902). Scientific Papers by John William Strutt, Baron Rayleigh, Vol. 3 (1887–1892). Cambridge, UK: Cambridge Univ. Press, Wave Theory of Light chapter, pp. 47–189.
Richardson, W. H. (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62, 55–59.
Ronchi, V. (1961). Resolving power of calculated and detected images. J. Opt. Soc. Am. 51, 458–460.
Root, W. L. (1962). Radar resolution of closely spaced targets. IRE Trans. Milit. Electronics MIL-6, 197–204.
Rushforth, C. K., and Harris, R. W. (1968). Restoration, resolution, and noise. J. Opt. Soc. Am. 58, 539–545.
Saunders, P. T. (1980). An Introduction to Catastrophe Theory. Cambridge, UK: Cambridge Univ. Press.
Schell, A. C. (1965). Enhancing the angular resolution of incoherent sources. The Radio and Electronic Engineer 29, 21–26.
Schuster, A. (1924). Theory of Optics. London: Arnold.
Selby, S. M. (1971). Standard Mathematical Tables. Cleveland: CRC Press.
Sementilli, P. J., Hunt, B. R., and Nadar, M. S. (1993). Analysis of the limit to superresolution in incoherent imaging. J. Opt. Soc. Am. A 10(11), 2265–2276.
Sezan, M. I., and Tekalp, A. M. (1990). Survey of recent developments in digital image restoration. Opt. Eng. 29(5), 393–404.
Shannon, C. E. (1949). Communication in the presence of noise. Proc. IRE 37, 10–21.
Shepp, L. A., and Vardi, Y. (1982). Maximum likelihood reconstruction for emission tomography. IEEE Trans. Med. Imaging MI-1(2), 113–122.
Sijbers, J., den Dekker, A. J., Scheunders, P., and Van Dyck, D. (1998). Maximum likelihood estimation of Rician parameters. IEEE Trans. Med. Imaging 17(3), 357–361.
Slepian, D., and Pollak, H. O. (1961). Prolate spheroidal wave functions, Fourier analysis and uncertainty: I. Bell Syst. Tech. J. 40, 43–64.
Snyder, D. L., Hammoud, A. M., and White, R. L. (1993). Image recovery from data acquired with a charge-coupled-device camera. J. Opt. Soc. Am. A 10(5), 1014–1023.
Sparrow, C. M. (1916). On spectroscopic resolving power. Astrophys. J. 44, 76–86.
Spence, J. C. H. (1988). Experimental High-Resolution Electron Microscopy, 2d ed. New York: Oxford Univ. Press.
Stuart, A., and Ord, J. K. (1994). Kendall's Advanced Theory of Statistics, Vol. 1: Distribution Theory, 6th ed. London: Arnold.
Stuart, A., Ord, J. K., and Arnold, S. (1999). Kendall's Advanced Theory of Statistics, Vol. 2A: Classical Inference and the Linear Model, 6th ed. London: Arnold.
Swarte, J. H. (1992). Precision and resolution in spectroscopic model fitting. Ph.D. thesis, Delft University of Technology. Delft, The Netherlands: Delft Univ. Press.
Swerling, P. (1964). Parameter estimation accuracy formulas. IEEE Trans. Inform. Theory 10, 302–314.
Thompson, J. M. T. (1982). Instabilities and Catastrophes in Science and Engineering. Chichester: Wiley.
Toraldo di Francia, G. (1955). Resolving power and information. J. Opt. Soc. Am. 45, 497–501.
van den Bos, A. (1980). A class of small sample nonlinear least squares problems. Automatica 16, 487–490.
van den Bos, A. (1981). Degeneracy in nonlinear least squares. IEEE Proc. 128(Part D), 109–116.
van den Bos, A. (1982). Handbook of Measurement Science, Vol. 1: Theoretical Fundamentals. Chichester: Wiley, Chapter 8: Parameter Estimation, pp. 331–377.
van den Bos, A. (1983). Limits to resolution in nonlinear least squares model fitting. IEEE Trans. Automatic Control AC-28, 1118–1120.
van den Bos, A. (1984). Resolution of model fitting methods. Int. J. Syst. Sci. 15, 825–835.
van den Bos, A. (1987). Optical resolution: An analysis based on catastrophe theory. J. Opt. Soc. Am. A 4, 1402–1406.
van den Bos, A. (1988). Nonlinear least-absolute-values and minimax model fitting. Automatica 24, 803–808.
van den Bos, A. (1992). Ultimate resolution: A mathematical framework. Ultramicroscopy 47, 298–306.
van den Bos, A. (1993). Critical errors associated with parameter resolvability. Comput. Phys. Commun. 76, 184–190.
van den Bos, A., and den Dekker, A. J. (1995). Ultimate resolution in the presence of coherence. Ultramicroscopy 60, 345–348.
van den Bos, A., and den Dekker, A. J. (1996). Coherent model-based optical resolution. J. Opt. Soc. Am. A 13, 1667–1669.
van den Bos, A., and Swarte, J. H. (1993). Resolvability of the parameters of multiexponentials and other sum models. IEEE Trans. Signal Processing SP-41, 313–322.
van den Bos, A., and van der Werff, T. T. (1990). Degeneracy in nonlinear least squares: Coincidence of more than two parameters. IEEE Proc. 137(Part D), 273–280.
Walsh, D. O., and Nielsen-Delaney, P. A. (1994). Direct method for superresolution. J. Opt. Soc. Am. A 11, 572–579.
Wang, G., and Li, Y. (1999). Axiomatic approach for quantification of image resolution. IEEE Signal Processing Lett. 6(10), 257–258.
Whitney, H. (1955). On singularities of mappings of Euclidean spaces. I. Mappings of the plane into the plane. Ann. Math. 62(3), 374–410.
Wilkins, J. E. (1950). The resolving power of a coated objective. J. Opt. Soc. Am. 40, 222–224.
Wolf, E. (1951). The diffraction theory of aberrations. Rep. Prog. Phys. 14, 95–120.
Zacks, S. (1971). The Theory of Statistical Inference. New York: Wiley.