ADVANCES IN IMAGING AND ELECTRON PHYSICS
VOLUME 140

EDITOR-IN-CHIEF
PETER W. HAWKES
CEMES-CNRS, Toulouse, France

HONORARY ASSOCIATE EDITORS
TOM MULVEY
BENJAMIN KAZAN
AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Academic Press is an imprint of Elsevier
525 B Street, Suite 1900, San Diego, California 92101-4495, USA
84 Theobald's Road, London WC1X 8RR, UK

This book is printed on acid-free paper.

Copyright © 2006, Elsevier Inc. All Rights Reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the Publisher.

The appearance of the code at the bottom of the first page of a chapter in this book indicates the Publisher's consent that copies of the chapter may be made for personal or internal use of specific clients. This consent is given on the condition, however, that the copier pay the stated per-copy fee through the Copyright Clearance Center, Inc. (www.copyright.com), for copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Law. This consent does not extend to other kinds of copying, such as copying for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. Copy fees for pre-2005 chapters are as shown on the title pages. If no fee code appears on the title page, the copy fee is the same as for current chapters. 1076-5670/2006 $35.00

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: [email protected]. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting "Support & Contact," then "Copyright and Permission," and then "Obtaining Permissions."

For information on all Elsevier Academic Press publications visit our Web site at www.books.elsevier.com

ISBN-13: 978-0-12-014782-3
ISBN-10: 0-12-014782-3

Printed in the United States of America
06 07 08 09   9 8 7 6 5 4 3 2 1
CONTENTS
CONTRIBUTORS . . . . . . . . . . . . . . . vii
PREFACE . . . . . . . . . . . . . . . ix
FUTURE CONTRIBUTIONS . . . . . . . . . . . . . . . xi
Recursive Neural Networks and Their Applications to Image Processing
MONICA BIANCHINI, MARCO MAGGINI, AND LORENZO SARTI

I. Introduction . . . . . . . . . . . . . . . 1
II. Recursive Neural Networks . . . . . . . . . . . . . . . 9
III. Graph-Based Representation of Images . . . . . . . . . . . . . . . 33
IV. Object Detection in Images . . . . . . . . . . . . . . . 39
References . . . . . . . . . . . . . . . 54
Deterministic Learning and an Application in Optimal Control
CRISTIANO CERVELLERA AND MARCO MUSELLI

I. Introduction . . . . . . . . . . . . . . . 62
II. A Mathematical Framework for the Learning Problem . . . . . . . . . . . . . . . 65
III. Statistical Learning . . . . . . . . . . . . . . . 69
IV. Deterministic Learning . . . . . . . . . . . . . . . 74
V. Deterministic Learning for Optimal Control Problems . . . . . . . . . . . . . . . 90
VI. Approximate Dynamic Programming Algorithms . . . . . . . . . . . . . . . 94
VII. Deterministic Learning for Dynamic Programming Algorithms . . . . . . . . . . . . . . . 99
VIII. Experimental Results . . . . . . . . . . . . . . . 104
References . . . . . . . . . . . . . . . 114
X-Ray Fluorescence Holography
KOUICHI HAYASHI

I. Introduction . . . . . . . . . . . . . . . 120
II. Theory . . . . . . . . . . . . . . . 122
III. Experiment and Data Processing . . . . . . . . . . . . . . . 138
IV. Applications . . . . . . . . . . . . . . . 159
V. Related Methods . . . . . . . . . . . . . . . 174
VI. Summary and Outlook . . . . . . . . . . . . . . . 180
References . . . . . . . . . . . . . . . 181
A Taxonomy of Color Image Filtering and Enhancement Solutions
RASTISLAV LUKAC AND KONSTANTINOS N. PLATANIOTIS

I. Introduction . . . . . . . . . . . . . . . 188
II. Color Imaging Basics . . . . . . . . . . . . . . . 190
III. Image Noise . . . . . . . . . . . . . . . 193
IV. Color Image Filtering . . . . . . . . . . . . . . . 199
V. Edge Detection . . . . . . . . . . . . . . . 244
VI. Conclusion . . . . . . . . . . . . . . . 257
References . . . . . . . . . . . . . . . 257
General Sweep Mathematical Morphology
FRANK Y. SHIH

I. Introduction . . . . . . . . . . . . . . . 265
II. Theoretical Development of General Sweep Mathematical Morphology . . . . . . . . . . . . . . . 268
III. Blending of Swept Surfaces with Deformations . . . . . . . . . . . . . . . 275
IV. Image Enhancement . . . . . . . . . . . . . . . 278
V. Edge Linking . . . . . . . . . . . . . . . 280
VI. Shortest Path Planning for Mobile Robot . . . . . . . . . . . . . . . 286
VII. Geometric Modeling and Sweep Mathematical Morphology . . . . . . . . . . . . . . . 288
VIII. Formal Language and Sweep Morphology . . . . . . . . . . . . . . . 291
IX. Representation Scheme . . . . . . . . . . . . . . . 292
X. Grammars . . . . . . . . . . . . . . . 297
XI. Parsing Algorithm . . . . . . . . . . . . . . . 300
XII. Conclusions . . . . . . . . . . . . . . . 303
References . . . . . . . . . . . . . . . 304
Further Reading . . . . . . . . . . . . . . . 306
INDEX . . . . . . . . . . . . . . . 307
CONTRIBUTORS
Numbers in parentheses indicate the pages on which the authors’ contributions begin.
MONICA BIANCHINI (1), Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, 53100 Siena, Italy

CRISTIANO CERVELLERA (61), Istituto di Studi sui Sistemi Intelligenti per l'Automazione, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy

KOUICHI HAYASHI (119), Institute for Materials Research, Tohoku University, Sendai 980-8577, Japan

RASTISLAV LUKAC (187), Multimedia Laboratory—BA 4157, The Edward S. Rogers Sr. Department of ECE, University of Toronto, Toronto, Ontario M5S 3G4, Canada

MARCO MAGGINI (1), Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, 53100 Siena, Italy

MARCO MUSELLI (61), Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy

KONSTANTINOS N. PLATANIOTIS (187), Multimedia Laboratory—BA 4157, The Edward S. Rogers Sr. Department of ECE, University of Toronto, Toronto, Ontario M5S 3G4, Canada

LORENZO SARTI (1), Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, 53100 Siena, Italy

FRANK Y. SHIH (265), Computer Vision Laboratory, College of Computing Sciences, New Jersey Institute of Technology, Newark, New Jersey 07102, USA
PREFACE
The five chapters that make up this volume cover several aspects of image processing, control theory, and a form of holography using X-rays.

First, we have an account by M. Bianchini, M. Maggini, and L. Sarti of the role of recursive neural networks in image processing. The authors begin with a very clear description of the reasons why this approach is so powerful for pattern recognition in real-world situations. Then they present the networks themselves, the graph-based representation of images and, finally, show how objects can be detected by means of these tools.

This is followed by a contribution on deterministic learning and its use in control theory by C. Cervellera and M. Muselli. Before describing deterministic learning in detail, they give a brief account of statistical learning in order to bring out the advantages of the former in certain situations. Dynamic programming algorithms are then set out, and the chapter concludes with some experimental results.

The subject of the third chapter is very different. Among the many forms of holography, X-ray fluorescence holography is proving very valuable. It is a recent addition to the family, owing to the difficulty of obtaining sufficiently strong signals, but the results obtained at the European Synchrotron Radiation Facility show how useful it can be. K. Hayashi describes the technique itself and illustrates it with numerous applications. Finally, the technique is compared briefly with related methods, such as γ-ray and neutron holography, and photon interference XAFS.

Processing color images is distinctly more complicated than processing black-and-white images, and several contributions on the question are planned for these Advances. In the fourth chapter, R. Lukac and K.N. Plataniotis first examine the task of filtering such images, and then discuss enhancement and edge detection.

Finally, we have another contribution in the area of mathematical morphology, which regularly appears in these pages. Here, F.Y. Shih introduces general sweep mathematical morphology, a branch of the subject useful in robotics and in automated construction and machining. After explaining what is meant by 'sweeping' in this context, the author gives formal definitions of the various operations required and then applies them to a wide variety of tasks. Three sections are devoted to a representation scheme, grammars, and parsing.
As always, I thank the authors most sincerely for their efforts to make their subjects clear to non-specialist readers. Contributions promised for future volumes in the series are listed in the next section.

Peter Hawkes
FUTURE CONTRIBUTIONS
G. Abbate
New developments in liquid-crystal-based photonic devices

S. Ando
Gradient operators and edge and corner detection

A. Asif
Applications of noncausal Gauss–Markov random processes in multidimensional image processing

C. Beeli
Structure and microscopy of quasicrystals

V.T. Binh and V. Semet
Cold cathodes

G. Borgefors
Distance transforms

A. Buchau
Boundary element or integral equation methods for static and time-dependent problems

B. Buchberger
Gröbner bases

J. Caulfield (vol. 142)
Optics and information sciences

T. Cremer
Neutron microscopy

H. Delingette
Surface reconstruction based on simplex meshes

A.R. Faruqi
Direct detection devices for electron microscopy

R.G. Forbes
Liquid metal ion source
C. Fredembach
Eigenregions for image classification

S. Fürhapter
Spiral phase contrast imaging

L. Godo and V. Torra
Aggregation operators

A. Gölzhäuser
Recent advances in electron holography with point sources

M.I. Herrera
The development of electron microscopy in Spain

D. Hitz (vol. 144)
Recent progress on high-frequency electron cyclotron resonance ion sources

D.P. Huijsmans and N. Sebe
Ranking metrics and evaluation measures

K. Ishizuka
Contrast transfer and crystal images

J. Isenberg
Imaging IR-techniques for the characterization of solar cells

K. Jensen
Field-emission source mechanisms

L. Kipp
Photon sieves

G. Kögel
Positron microscopy

T. Kohashi
Spin-polarized scanning electron microscopy

W. Krakow
Sideband imaging

R. Leitgeb
Fourier domain and time domain optical coherence tomography

B. Lencová
Modern developments in electron optical calculations

Y. Lin and S. Liu (vol. 141)
Grey systems and grey information
W. Lodwick
Interval analysis and fuzzy possibility theory

L. Macaire, N. Vandenbroucke, and J.-G. Postaire
Color spaces and segmentation

M. Matsuya
Calculation of aberration coefficients using Lie algebra

S. McVitie
Microscopy of magnetic specimens

S. Morfu and P. Morquié
Nonlinear systems for image processing

L. Mugnier, A. Blanc, and J. Idier (vol. 141)
Phase diversity

M.A. O'Keefe
Electron image simulation

D. Oulton and H. Owens
Colorimetric imaging

N. Papamarkos and A. Kesidis
The inverse Hough transform

K.S. Pedersen, A. Lee, and M. Nielsen
The scale-space properties of natural images

I. Perfilieva
Fuzzy transforms

E. Rau
Energy analysers for electron microscopes

H. Rauch
The wave-particle dualism

E. Recami
Superluminal solutions to wave equations

J. Řeháček, Z. Hradil, J. Peřina, S. Pascazio, P. Facchi, and M. Zawisky (vol. 142)
Neutron imaging and sensing of physical fields

G. Ritter and P. Gader (vol. 144)
Fixed points of lattice transforms and lattice associative memories
J.-F. Rivest (vol. 144)
Complex morphology

P.E. Russell and C. Parish
Cathodoluminescence in the scanning electron microscope

G. Schmahl
X-ray microscopy

G. Schönhense, C.M. Schneider, and S.A. Nepijko (vol. 142)
Time-resolved photoemission electron microscopy

R. Shimizu, T. Ikuta, and Y. Takai
Defocus image modulation processing in real time

S. Shirai
CRT gun design methods

N. Silvis-Cividjian and C.W. Hagen (vol. 143)
Electron-beam-induced nanometre-scale deposition

H. Snoussi
Geometry of prior selection

T. Soma
Focus-deflection systems and their applications

W. Szmaja (vol. 141)
Recent developments in the imaging of magnetic domains

I. Talmon
Study of complex fluids by transmission electron microscopy

G. Teschke and I. Daubechies
Image restoration and wavelets

M.E. Testorf and M. Fiddy
Imaging from scattered electromagnetic fields, investigations into an unsolved problem

M. Tonouchi
Terahertz radiation imaging

N.M. Towghi
lp norm optimal filters

D. Tschumperlé and R. Deriche
Multivalued diffusion PDEs for image regularization
E. Twerdowski
Defocused acoustic transmission microscopy

Y. Uchikawa
Electron gun optics

C. Vachier-Mammar and F. Meyer
Watersheds

K. Vaeth and G. Rajeswaran
Organic light-emitting arrays

M. van Droogenbroeck and M. Buckley
Anchors in mathematical morphology

M. Wild and C. Rohwer
Mathematics of vision

B. Yazici and C.E. Yarman (vol. 141)
Stochastic deconvolution over groups

J. Yu, N. Sebe, and Q. Tian (vol. 144)
Ranking metrics and evaluation measures
Recursive Neural Networks and Their Applications to Image Processing

MONICA BIANCHINI, MARCO MAGGINI, AND LORENZO SARTI
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, 53100 Siena, Italy
I. Introduction . . . . . . . . . . . . . . . 1
   A. From Flat to Structural Pattern Recognition . . . . . . . . . . . . . . . 1
   B. Recursive Neural Networks: Properties and Applications . . . . . . . . . . . . . . . 7
II. Recursive Neural Networks . . . . . . . . . . . . . . . 9
   A. Graphs . . . . . . . . . . . . . . . 9
   B. Processing DAGs with Recursive Neural Networks . . . . . . . . . . . . . . . 11
      1. Processing DPAGs . . . . . . . . . . . . . . . 14
      2. Processing DAGs-LE . . . . . . . . . . . . . . . 17
   C. Backpropagation Through Structure . . . . . . . . . . . . . . . 18
   D. Processing Cyclic Graphs . . . . . . . . . . . . . . . 22
      1. Recursive-Equivalent Transforms . . . . . . . . . . . . . . . 25
      2. From Cyclic Graphs to Recursive Equivalent Trees . . . . . . . . . . . . . . . 28
   E. Limitations of the Recursive Neural Network Model . . . . . . . . . . . . . . . 30
      1. Theoretical Conditions for Collision Avoidance . . . . . . . . . . . . . . . 31
III. Graph-Based Representation of Images . . . . . . . . . . . . . . . 33
   A. Introduction . . . . . . . . . . . . . . . 33
   B. Segmentation of Images . . . . . . . . . . . . . . . 33
   C. Region Adjacency Graphs . . . . . . . . . . . . . . . 36
   D. Multiresolution Trees . . . . . . . . . . . . . . . 38
IV. Object Detection in Images . . . . . . . . . . . . . . . 39
   A. Object Detection Methods . . . . . . . . . . . . . . . 39
   B. Recursive Neural Networks for Detecting Objects in Images . . . . . . . . . . . . . . . 42
      1. Learning Environment Setup . . . . . . . . . . . . . . . 42
      2. Detecting Objects . . . . . . . . . . . . . . . 47
References . . . . . . . . . . . . . . . 54
I. INTRODUCTION

A. From Flat to Structural Pattern Recognition

Pattern recognition algorithms and statistical classifiers, such as neural networks or support vector machines (SVMs), can deal with real-life noisy data in an efficient way, so that they can be successfully applied in several different
domains. However, the majority of such tools are restricted to processing real vectors of finite and fixed dimensionality. On the other hand, most real-world problems have no natural representation as a single "table"; in several applications, the information that is relevant for solving a problem is organized in entities and relationships among entities, so that applying traditional data mining methods implies that extensive preprocessing has to be performed on the data. For instance, categorical variables are encoded by one-hot encoding, time series are embedded into finite-dimensional vector spaces using time windows, preprocessing of images includes edge detection and the application of various filters, sound signals can be represented by spectral vectors, and chemical compounds are characterized by topological indices and physicochemical attributes.

However, other data formats and representations exist and can be exploited to represent patterns in a more natural way. Sets, without a specified order, can describe objects in a scene or a pool of measurements. Functions, evaluated at specific points, constitute a natural description for time series or spectral data. Sequences of any length also represent time series or spatial data. Tree structures describe terms, logical formulas, parse trees, or document images. Graph structures can be used to encode chemical compounds, images, and, in general, objects composed of atomic elements.

Feature encoding of such data can produce compact vectors, even if the encodings are often problem-dependent, time-consuming, and heuristic. Moreover, some information is usually lost when complex data structures such as sequences, trees, or graphs of arbitrary size are encoded into fixed-dimensional vectors. The need to deal with complex structures has focused researchers' efforts on developing methods able to process such types of data.
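As a purely illustrative sketch of two of the "flattening" encodings listed above (one-hot encoding of a categorical variable and time-window embedding of a series), the following fragment may help; the helper names are hypothetical, not taken from the chapter.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]

def time_windows(series, width):
    """Embed a time series into fixed-dimensional vectors via sliding windows."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))          # [0, 1, 0]
print(time_windows([1, 2, 3, 4, 5], 3))  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

Note how both encodings force a fixed dimensionality: the one-hot vector length is tied to the category list, and every window has the same width, regardless of the structure of the underlying data.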
However, this issue has given rise to a long-standing debate between proponents of artificial intelligence methods, based on symbols, and proponents of computational intelligence methods, which operate on numbers (Bezdek, 1994). As a matter of fact, in the last three decades, the emphasis in pattern recognition research has been swinging between decision-theoretic and structured approaches. Decision-theoretic models are essentially based on numerical features, which provide a global representation of patterns and are obtained using some sort of preprocessing (like those listed above). Many different decision-theoretic methods have been developed in the framework of connectionist models, which operate on subsymbolic, numerical pattern representations. On the other hand, syntactic and structural pattern recognition (and also artificial intelligence-based) methods have been developed that emphasize the symbolic, structured nature of patterns.

However, both purely decision-theoretic and syntactic/structural approaches have limited value when they are applied to many interesting real-life problems. In fact, syntactic and structural approaches can model the structure of patterns, but they are not well suited for dealing with patterns corrupted by
noise. This limitation was recognized early, and several approaches have been pursued to incorporate statistical properties into structured approaches. The data representations used for either syntactic or structural techniques have been enriched with attributes, that is, vectors of real numbers describing appropriate features of the patterns. These attributes are expected to allow some statistical variability in the patterns under consideration. A comprehensive survey on the embedding of statistical approaches into syntactic and structural pattern recognition can be found in Tsai (1990).

On the other hand, parametric or nonparametric statistical methods can nicely deal with distorted, noisy patterns, but they are severely limited when the patterns are strongly structured. The feature extraction process in those cases seems to be inherently ill-posed. In fact, a structured pattern can be regarded as an arrangement of elements that depends deeply on the interactions among the elements and on the intrinsic nature of each of them. Hence, the causal, hierarchical, and topological relations among the parts of a given pattern carry significant information.

In the past few years, some new models have been developed that exploit the above view of a pattern as an integration of symbolic and subsymbolic information. These models try to solve one of the most challenging tasks in pattern recognition: obtaining a flat representation for a given structure (or for each atomic element that belongs to a structure) in an automatic, and possibly adaptive, way. This flat representation, computed following a recursive computational schema, takes into account both the local information associated with each atomic entity and the information induced by the topological arrangement of the elements, inherently contained in the structure. Nevertheless, pattern recognition approaches to structured data are often upgrades of methodologies originally developed for flat vectorial data.
Among them are popular data mining methods, such as decision trees, supervised and unsupervised neural networks, Markov models, rule learners, and distance-based algorithms. With respect to Markov models, random walk (RW) techniques have recently been proposed (Gori et al., 2005a) that can compute a relevance value for each node in a graph. The relevance depends on the topological information collected in the graph structure and on the information associated with each node. Relevance values computed with RW models have been used to rank Web pages inside search engines; Google, probably the most popular one, uses a ranking technique based on a particular RW model. Classical problems related to graph theory, like graph or subgraph matching, can also be addressed in this framework. The RW approach can be used, for example, in problems of image retrieval.

On the other hand, support vector machines (Boser et al., 1992; Vapnik, 1995) are among the most successful recent developments within the machine learning
and the data mining communities. Along with some other learning algorithms, like Gaussian processes and kernel principal component analysis, they form the class of kernel methods (Müller et al., 2001; Schölkopf and Smola, 2002). The computational attractiveness of kernel methods comes from the fact that they can operate in high-dimensional feature spaces without the high cost of explicitly computing the mapped data. In fact, the kernel trick consists in defining a positive-definite kernel so that a set of nonlinearly separable data can be mapped onto a larger metric space, on which the data become linearly separable, without explicitly knowing the mapping between the two spaces (Gärtner, 2003). Using a different kernel corresponds to a different embedding and thus to a different hypothesis language.

Crucial to the success of kernel-based learning algorithms is the extent to which the semantics of the domain are reflected in the definition of the kernel. In fact, kernel functions that directly handle data represented by graphs are often designed a priori, or at least they allow only limited adaptation, so that they cannot grasp any structural property that has not been guessed by the designer. Some examples are convolution kernels (Collins and Duffy, 2002), which recursively take into account subgraphs; string kernels (Vishwanathan and Smola, 2002), which require that each tree be represented by the sequence of labels generated by a depth-first visit; and graph kernels (Gärtner et al., 2003), based on a measure of the walks in two graphs that share some labels. An alternative approach consists in adaptive kernel functions, which are able to adapt the kernel to the dataset. Kernel functions defined on structured data have recently received growing attention, since they are able to deal with many real-world learning problems in bioinformatics, natural language processing, or document processing.
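The kernel trick just described can be made concrete with a minimal, hedged example (not code from the chapter): for the degree-2 polynomial kernel k(x, y) = (x . y)^2 on R^2, the value computed directly in the input space equals the inner product that an explicit map into a 3-dimensional feature space would give, so the mapping never needs to be evaluated.

```python
import math

def poly2_kernel(x, y):
    """k(x, y) = (<x, y>)^2, evaluated directly in the input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """The implicit feature map for this kernel: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, 0.5)
direct = poly2_kernel(x, y)                            # (1*3 + 2*0.5)^2 = 16.0
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))  # same value via phi
print(direct, explicit)  # both 16.0 (up to floating-point rounding)
```

Here the feature space is still small; the point of the trick is that the same identity holds for kernels whose implicit spaces are huge or infinite-dimensional, where computing phi explicitly would be infeasible.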
In recent years, supervised neural networks have also been developed that are able to deal with structured data encoded as labeled directed positional acyclic graphs (DPAGs). These models are called recursive neural networks (RNNs) (Sperduti and Starita, 1997; Frasconi et al., 1998; Küchler and Goller, 1996). The essential idea of RNNs is to process each node of an input graph by a multilayer perceptron, and then to process the DPAG from its leaf nodes toward the root node (if no root exists, one must be suitably added (Sperduti and Starita, 1997)), using the structure of the graph to connect the neurons from one node to another. The output of the neurons corresponding to the root node can then be exploited to encode the whole graph. In other words, to process an input DPAG, the RNN is unfolded through the graph structure, producing the encoding network. Then, the computation of the state of the network is performed from the frontier of the graph to the root node, following, in the reverse direction, the arcs of the input structure. The state of the network at a generic node of the input structure depends on the label associated with the node and on the states of the children of the
node itself (this kind of computation establishes a sort of causal relationship between each node and its children). A gradient descent method is then used to learn the weights of the multilayer perceptrons. The models are simplified by assuming that all multilayer perceptrons, at each node and across the training set, share the same parameters. This approach basically consists of an extension to graph structures of the traditional "unfolding" process adopted by recurrent neural networks for sequences (Elman, 1990).

The main limitation of this model lies in the kind of structures that can be processed. In fact, it is not always easy to represent real data using DPAGs. In this kind of graph, each edge starting from a node has an assigned position, and any rearrangement of the children of a node produces a different graph. While such an assumption is useful for some applications, it may sometimes introduce an unnecessary constraint on the representation. For example, this hypothesis is not suitable for the representation of a chemical compound and might not be adequate for several pattern recognition problems.

Considering this limitation, some researchers have recently proposed several models aimed at processing more general classes of graphs. In Bianchini et al. (2001a), a new model able to process DAGs was presented. This model exploits a weight-sharing approach to relax the positional constraint. Although interesting from a theoretical point of view, this methodology has limited applications, since the complexity of the network architecture grows exponentially with the maximum outdegree of the processed structures. A different way of relaxing the positional constraint has been proposed (Bianchini et al., 2005a; Gori et al., 2003). This approach is based on processing directed acyclic graphs with labels also on the edges (DAGs-LE).
The state of each node depends on the label attached to the node and on a combination of the contributions of its children, weighted by the edge labels. This total contribution can be computed using a feedforward neural network or an ad hoc function, and is independent of both the number and the order of the children of the node. Therefore, the model allows processing of graphs with any outdegree. Moreover, since it is not always easy to determine useful features that can be associated with the edges of the structures, a procedure that allows a DPAG to be transformed into a DAG-LE has been presented (Bianchini et al., 2004a).

To process cyclic graphs as well, a collapse strategy was proposed in Bianucci et al. (2001): each cycle is represented by a single node that gathers all the information collected in the nodes belonging to the cycle. Unfortunately, this strategy cannot be carried out automatically, and it is intrinsically heuristic. A different technique for processing cyclic structures has been proposed (Bianchini et al., 2002, 2006). This method preprocesses the cyclic structures, transforming each graph into a forest of recursive-
equivalent trees. The forest of trees collects the same information contained in the cyclic graph. This method allows processing of both cyclic and undirected graphs. In fact, undirected structures can be transformed into cyclic directed graphs by replacing each undirected edge with a pair of directed arcs with opposite directions. Finally, the graph neural network (GNN) model (Gori et al., 2004, 2005b) is able to process general graphs, including directed and undirected structures, both cyclic and acyclic. In the GNN model, the encoding network can be cyclic, and nodes are activated until the network reaches a steady state.

An alternative approach for undirected structures has been proposed (Vullo and Frasconi, 2002). During a preprocessing phase, a direction for each edge of the structure is defined. Then the state of the RNN is computed for each node, considering a bidirectional computation: first, the RNN is unfolded following the chosen direction of the edges, then it is unfolded again considering the opposite direction. This model has proven to be particularly suited for bioinformatics applications.

Finally, a model that relaxes the causal relationship between each node and its children has been proposed (Micheli et al., 2004). In this approach, called cascade-correlated, the state of each node depends on the attached label and on the states of both its children and its parents. In fact, when real data are represented using structures, it is not always clear how the parent/children relationship must be established. The aim of this approach is to relax the causal relationship established by the original model, defining a new processing schema that allows the parent/children relationship to be disregarded.

All the models cited above were defined within the supervised learning paradigm. Supervised information, however, either may not be available or may be very expensive to obtain.
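The bottom-up recursive computation at the heart of these models (a shared transition computes the state of a node from its label and the states of its positionally ordered children; the root state encodes the whole structure) can be sketched as follows. The weights and dimensions here are fixed, hypothetical values chosen only for illustration; in a real RNN they are shared parameters learned by gradient descent.

```python
import math

STATE_DIM = 2
FRONTIER = [0.0] * STATE_DIM  # state assigned to missing children (the frontier)

def node_state(label, left_state, right_state):
    """Shared transition: distinct weights for the left and right children make
    the encoding positional, so swapping the children changes the result."""
    return [math.tanh(0.4 * label[i] + 0.7 * left_state[i]
                      + 0.2 * right_state[i] + 0.1)
            for i in range(STATE_DIM)]

def encode(tree):
    """tree = (label, left_subtree_or_None, right_subtree_or_None);
    returns a fixed-size vector encoding the whole subtree."""
    label, left, right = tree
    left_state = encode(left) if left is not None else FRONTIER
    right_state = encode(right) if right is not None else FRONTIER
    return node_state(label, left_state, right_state)

leaf_a = ([0.5, 0.5], None, None)
leaf_b = ([0.0, 1.0], None, None)
code1 = encode(([1.0, 0.0], leaf_a, leaf_b))
code2 = encode(([1.0, 0.0], leaf_b, leaf_a))  # same tree, children swapped
print(code1 != code2)  # True: the positional encoding distinguishes the order
```

Because the left and right children enter with different weights, the sketch mirrors the positional (DPAG) model; replacing the two child terms with a sum of contributions weighted by edge labels would instead mimic the DAG-LE variant, which is insensitive to the number and order of children.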
Thus, it is very important to develop models that are able to deal with structured data in an unsupervised fashion. In the past few years, some RNN models have been proposed in the framework of unsupervised learning (Hammer et al., 2004), and various unsupervised models for nonvectorial data are available in the literature. The approaches presented in Günter and Bunke (2001) and Kohonen and Sommervuo (2002) use a metric for self-organizing maps (SOMs) that works directly on structures: the basic distance computation of the neurons is extended to more expressive distance measures for sequences, trees, or graphs, such as the edit distance for strings of arbitrary length, so that each input structure is compared as a whole.

Early unsupervised recursive models, such as the temporal Kohonen map or the recurrent SOM, include the biologically plausible dynamics of leaky integrators (Chappell and Taylor, 1993; Koskela et al., 1998a, 1998b). This
RECURSIVE NEURAL NETWORKS
7
idea has been used to model direction selectivity in models of the visual cortex and for time series representation (Koskela et al., 1998a, 1998b; Farkas and Mikkulainen, 1999). Combinations of leaky integrators with additional features can increase the capacity of the models, as demonstrated in further proposals (Euliano and Principe, 1999; Hoekstra and Drossaers, 1993; James and Mikkulainen, 1995; Kangas, 1990; Vesanto, 1997). Recently, more general recurrences with richer dynamics have been proposed (Hagenbuchner et al., 2001, 2003; Strickert and Hammer, 2003a, 2003b; Voegtlin, 2000, 2002; Voegtlin and Dominey, 2001). These models transcend the simple local recurrence of leaky integrators and can represent much richer dynamic behavior, as demonstrated in many experiments. While the processing of tree-structured data has been discussed (Hagenbuchner et al., 2001, 2003), all the remaining approaches have been applied to time series.

B. Recursive Neural Networks: Properties and Applications

RNNs can compute maps from a graph space to an isomorphic graph space or to a vector space (for instance, Rn). Some of the models presented above were studied from a theoretical point of view to understand how the approximation capabilities of feedforward neural networks can be extended to RNNs. In fact, feedforward networks were proved to be universal approximators (Hornik et al., 1989), able to compute any function between vector spaces. In Hammer (1998), the pioneering work on the approximation capabilities of RNNs, the original RNN model, tailored to process DPAGs, was shown to behave, in probability, as a universal approximator for the space of positional trees, so that it is able to compute any function from such a space to Rn.
Subsequently, RNNs with linear neurons were also proved to be universal approximators for the domain of DPAGs (Bianchini et al., 2001a) and, finally, the universal approximation capability has recently been extended to cyclic graphs (Bianchini et al., 2002, 2006) and to DAGs-LE (Bianchini et al., 2005a). Moreover, some theoretical results have been stated on linear recursive networks (Bianchini et al., 2001a) to establish necessary and sufficient conditions that guarantee a unique vectorial representation for structures belonging to certain classes or, in other words, that avoid the collision phenomenon. In fact, collisions can always be avoided, provided that the dimension of the internal state grows exponentially with the height of the trees. Of course, this result considerably limits the range of applicability of recursive architectures in dealing with general large structures: the problem of recognizing general trees becomes intractable as soon as their height increases. Interestingly, such a negative conclusion can be directly extended to nonlinear recursive networks. Nevertheless, there are
8
BIANCHINI
et al.
some significant characteristics of trees that can be recognized by using a reasonable amount of resources. In fact, in Bianchini et al. (2001a), it has been proven that a simple class of linear recursive networks can count the number of nodes per level or the number of left and right branches per level. Thus, although the linear recursive model cannot be used to encode all trees, it can be useful, in some practical applications, to recognize different classes of trees. Such properties of RNNs have been exploited in many pattern recognition applications. In particular, RNNs have been applied to image analysis, chemistry, bioinformatics, Web searching, theorem proving, and natural language processing, for solving both classification and regression tasks. Actually, RNNs achieve state-of-the-art results in some bioinformatics problems; for the cases in which they do not achieve the best results, they still allow us to define general and nonheuristic techniques for pattern recognition applications. In the bioinformatics field, RNNs were applied to the prediction of protein topologies (Vullo and Frasconi, 2002; Pollastri et al., 2002). In this problem, the contact map of each protein is represented by an undirected graph, and the birecursive architecture presented in Vullo and Frasconi (2002) is used to predict the protein secondary structure. In chemistry, an RNN application to the quantitative structure-activity relationship (QSAR) problem of benzodiazepines has been presented (Bianucci et al., 2001). This application has also allowed the performance of the recursive cascade correlation architecture proposed in Micheli et al. (2004) to be evaluated.
With respect to natural language processing, RNNs were used (Sturt et al., 2003) to learn first-pass structural attachment preferences in sentences represented as syntactic trees, and, in regard to structural pattern recognition, they were also used, combined with SVM classifiers, to classify fingerprints (Yao et al., 2003). For Web searching, GNNs were applied (Scarselli et al., 2005) to implement an adaptive ranking, used for learning the importance of Web pages by examples. Finally, considering image analysis, RNNs were exploited for the classification of company logos (Diligenti et al., 2001), for the definition of a similarity measure useful for browsing image databases (de Mauro et al., 2003), and for the localization and detection of regions of interest in color images. Moreover, a combination of RNNs for cyclic graphs and RNNs for DAGs-LE was exploited (Bianchini et al., 2003a, 2003b) to locate faces, while an extension of the same model was proposed (Bianchini et al., 2004b, 2005a, 2005b) to detect general objects. In this chapter, the recursive neural network model is presented, paying attention to its evolution and, therefore, to its present capacity for processing general graphs. Moreover, the backpropagation through structure algorithm is briefly sketched, to allow the reader to grasp how learning takes place.
The computational capabilities of the recursive model are also assessed, to establish what kinds of tasks RNNs are able to face and how and when they are prone to failure. In Section III, the graph-based representation of images is described, starting from the segmentation process and defining several different types of structures that can appropriately collect the perceptual/topological information extracted from images. Finally, in Section IV, the capacity of RNNs to process images is definitively established, showing some interesting results on object detection problems.
II. RECURSIVE NEURAL NETWORKS

RNNs were conceived to process structured information coded as a graph. The term “recursive” reflects the fact that a local computation is recursively applied to each node in the input graph to yield a result that depends on the whole input data structure. In the proposed framework this local computation is performed by a neural network, and this choice allows us to extend the supervised learning paradigm of neural networks to structured domains. Backpropagation through structure (BPTS) (Sperduti and Starita, 1997) is a straightforward extension of the original backpropagation and backpropagation through time algorithms used to train feedforward and recurrent neural networks, respectively. Depending on the characteristics of the input graphs, different models of RNNs can be defined. The original RNN model was proposed to process DPAGs (Sperduti and Starita, 1997; Frasconi et al., 1998). Extensions of this model can process more general classes of graphs, like DAGs-LE or general cyclic graphs. In the following sections we will introduce the different architectures of RNNs and the related learning algorithm based on BPTS. Finally, some considerations on the approximation/classification capabilities of the recursive model are briefly sketched.

A. Graphs

A graph can encode structured data by representing elements or parts of the information as nodes and the relationships among them as arcs. A directed unlabeled graph is defined by the pair GU = (V, E), where V is the finite set of nodes and E ⊆ V × V represents the set of arcs. An arc from node u to node v is a directed link represented by the ordered pair (u, v) ∈ E, u, v ∈ V. In the following we will consider only directed graphs. An undirected graph can be conveniently represented as a directed graph by substituting each undirected edge with a pair of directed arcs: an edge between nodes u and v will correspond to the two directed arcs (u, v) and (v, u).
The pair (V, E) defines the structure of the graph by specifying the topology of the connections among the nodes. However, when representing structured data, each node can be characterized by a set of values assigned to a predefined group of attributes. For example, if a node represents a region in an image, features describing its perceptual and geometric properties can be stored in the node to characterize the region. Thus, the data representation can be enriched by attaching a label to each node in the graph. In general, a different set of attributes can be attached to each node, but, in the following, we will assume that the labels are chosen from a unique label space L (for instance, we can consider labels represented as vectors of rationals, i.e., L = Qm, or vectors of reals, i.e., L = Rm). Thus, we define a directed labeled graph as a triple GL = (V, E, L), where V and E are the set of nodes and arcs, respectively, and L : V → L is a node labeling function that defines the label L(v) ∈ L for each node v in the graph. Finally, the semantics of the arcs can also be enriched by associating a label with each arc (u, v) in the graph. A graph with labeled edges can encode attributes related to the relationships between pairs of nodes. For example, the arc between two nodes representing two regions in an image can encode the “adjacency” relationship, and a label attached to the arc can specify a set of features describing the mutual position of the two regions. We will assume that the labels for arcs belong to a given edge label space Le. A directed graph with labeled edges is defined by a quadruple GLE = (V, E, L, E), where the edge labeling function E : E → Le attaches a label E((u, v)) ∈ Le to the arc (u, v) ∈ E. Notice that, in general, the two arcs (u, v) and (v, u) can have different labels. The topology of the graph can be characterized by the following properties.
Given any node v ∈ V, pa[v] = {w ∈ V | (w, v) ∈ E} is the set of the parents of v, while ch[v] = {w ∈ V | (v, w) ∈ E} represents the set of its children. The outdegree of v, od[v] = |ch[v]|, is the cardinality of ch[v], and o = maxv od[v] is the maximum outdegree in the graph, while the indegree of v is the cardinality of pa[v] (|pa[v]|). Nodes having no parents (i.e., |pa[v]| = 0) are called sources, whereas nodes having no children (i.e., |ch[v]| = 0) are referred to as leaves. We denote the class of graphs with maximum indegree i and maximum outdegree o as #(i,o). Moreover, we denote the class of graphs with bounded (but unspecified) indegree and outdegree as #. Given a labeled graph GL, the structure obtained by ignoring the node and/or edge labels will be referred to as the skeleton of GL, denoted as skel(GL). Finally, the class of all data structures defined over the domain of the labeling function L and with skeleton in #(i,o) will be denoted as L#(i,o) and will be referred to as a structured space. A path from node u to node v in a graph G is a sequence of nodes (w1, w2, . . . , wp) such that w1 = u, wp = v, and the arcs (wi, wi+1) ∈ E,
i = 1, . . . , p − 1. If there is at least one path such that w1 = wp, the graph is cyclic. As we will discuss in the following sections, this property is crucial for defining the recursive schema used to process the graph. In fact, if the graph is acyclic we can define a partial ordering on the set of nodes V, such that u ≺ v if u is connected to v by a directed path. The set of the descendants of a node u, desc(u) = {v ∈ V | u ≺ v}, contains all the nodes that follow u in the partial ordering. We will focus on the class of DAGs since it allows a simple processing scheme for RNNs. In particular we will consider models designed to process the following subclasses of DAGs:

1. Directed Positional Acyclic Graphs (DPAGs). A subclass of DAGs for which an injective function ov : ch[v] → {1, . . . , o} assigns a position ov(c) to each child c of a node v. Therefore, a DPAG is represented by the tuple (V, E, L, O), where O = {o1, . . . , o|V|} is the set of functions defining the position of the children of each node. Since the range of each function ov(c) is {1, . . . , o}, if |ch[v]| < o holds for a node v, there will be some empty positions, which are considered as null pointers (NIL). Thus, in a more intuitive view, in a DPAG the children of each node v can be organized in a fixed size vector ch[v] = [ch1[v], . . . , cho[v]], where chk[v] ∈ V ∪ {NIL}. We denote with PTREEs the subset of DPAGs that contains graphs that are trees, i.e., such that each node in the structure has at most one parent.

2. Directed Acyclic Graphs with Labeled Edges (DAGs-LE). The subclass of DAGs for which an edge labeling function E is defined. In this case it is not necessary to define an ordering among the children of a given node. Finally, we denote with TREEs-LE the subset of DAGs-LE that contains graphs that are trees.
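These definitions translate directly into a small data structure. The following Python sketch (class and method names are our own, not from the chapter) stores a directed labeled graph and derives pa[v], ch[v], the outdegree, sources, and leaves; the sample graph is chosen to be consistent with the example discussed later in this section.

```python
from collections import defaultdict

class LabeledDigraph:
    """A directed labeled graph G_L = (V, E, L); edge labels optional (DAG-LE)."""

    def __init__(self):
        self.labels = {}                    # node labeling function L(v)
        self.children = defaultdict(list)   # ch[v]
        self.parents = defaultdict(list)    # pa[v]

    def add_node(self, v, label=None):
        self.labels[v] = label

    def add_arc(self, u, v, edge_label=None):
        # directed arc (u, v); an undirected edge would be stored as both (u, v) and (v, u)
        self.children[u].append(v)
        self.parents[v].append(u)

    def outdegree(self, v):
        return len(self.children[v])        # od[v] = |ch[v]|

    def sources(self):
        return [v for v in self.labels if not self.parents[v]]   # |pa[v]| = 0

    def leaves(self):
        return [v for v in self.labels if not self.children[v]]  # |ch[v]| = 0

# a small DPAG: a -> {b, d}, b -> {c, d}
g = LabeledDigraph()
for v in "abcd":
    g.add_node(v)
for u, v in [("a", "b"), ("a", "d"), ("b", "c"), ("b", "d")]:
    g.add_arc(u, v)
```

Here node a is the only source (and hence a supersource), c and d are the leaves, and the maximum outdegree is o = 2.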
When the result of the recursive processing is a single value for the whole graph, as is the case for graph classification and regression tasks, the DAG G is required to possess a supersource, that is, a node s ∈ V such that any other node in G can be reached by a directed path starting from s. Note that if a DAG does not have a supersource, it is still possible to define a convention for adding an extra node s with a minimal number of outgoing arcs, such that s is a supersource for the expanded DAG (Sperduti and Starita, 1997).

B. Processing DAGs with Recursive Neural Networks

We consider a processing scheme based on a set of state variables Xv that are defined for each node v. Each variable Xv is supposed to encode the information relevant for the overall computation, related to node v and all its descendants. The range for state variables is the state space X, whose
choice depends on the particular model we exploit. In the following, we will consider Xv ∈ Rn since neural networks can naturally compute functions on real vectors. The proposed state-based processing is closely related to the computation carried out by recurrent neural networks while analyzing a time series. The internal state of the recurrent network, which acts like an adaptive dynamic system, encodes the past history of inputs and collects all the information needed to define the future evolution of the computation. In the recursive model, the state can be computed locally at each node depending on the states of its children1 and on the input information available at the current node (the node label). This schema is analogous to the processing of recurrent neural networks where the new state of the network at time t is computed from the state at time t − 1 and the current input. Thus, this framework requires the definition of a state transition function f that is used to compute Xv given the states of the set of the children of node v, the labels possibly attached to the arcs connecting v to each child, and the label Uv stored in the node. In general, the function f can be implemented by a neural network that depends on a set of trainable parameters θf. Apart from the constraints on the number and type of inputs and outputs of the neural network, there are no other assumptions on its architecture (type of neurons, number of layers, etc.). More precisely, the state Xv is computed by the transition function

Xv = f(Xch[v], L(v,ch[v]), Uv, θf),    (1)
where Xch[v] = {Xchi[v] | i = 1, . . . , o(v)} is the set of the states of the children of node v and L(v,ch[v]) = {L(v,chi[v]) | i = 1, . . . , o(v)} is the set of edge labels attached to the arcs connecting v to its children. Given an input DAG G, the transition function of Eq. (1) is applied recursively to the nodes following their inverse topological order. Thus, the states of the leaves are computed first, and then the computation is propagated to the upper levels of the graph until the source nodes are reached (or the supersource, if there is only one source node). The use of the inverse topological order to process the nodes in a DAG guarantees that, when the state of node v is computed using Eq. (1), the states of its children have already been calculated. To apply this computational scheme, the requirement that the graph be acyclic is crucial. If a cycle were present in the graph, the state of a node belonging to the cycle would recursively depend on itself. In fact, by the recursive application of Eq. (1) the computation flows backward through the paths defined in the graph and the state of a given node v depends on all its descendants desc(v). The presence of cycles in the graph would make the

1 A child of a given node is a direct descendant in the partial ordering defined by the arcs.
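The inverse-topological-order scheme just described can be sketched in Python. The transition function f, the frontier state x0, and the dictionary-based graph encoding below are illustrative placeholders, not part of the chapter's formalism.

```python
def propagate(nodes, children, labels, f, x0):
    """Apply a transition function f to each node in inverse topological order."""
    # Kahn's algorithm: compute a topological order of the DAG
    indeg = {v: 0 for v in nodes}
    for v in nodes:
        for c in children[v]:
            indeg[c] += 1
    frontier = [v for v in nodes if indeg[v] == 0]   # the sources
    order = []
    while frontier:
        v = frontier.pop()
        order.append(v)
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                frontier.append(c)
    # leaves first: children's states always exist when a node is processed
    states = {}
    for v in reversed(order):
        child_states = [states[c] for c in children[v]] or [x0]
        states[v] = f(child_states, labels[v])       # Eq. (1), edge labels omitted
    return states

# toy run on the chain a -> b -> c, with integer states and f = sum of children + label
children = {"a": ["b"], "b": ["c"], "c": []}
labels = {"a": 1, "b": 2, "c": 3}
states = propagate(list(children), children, labels,
                   f=lambda cs, u: sum(cs) + u, x0=0)
```

In the toy run the leaf c is processed first (its only "child state" is the frontier value x0), then b, then the source a, exactly mirroring the bottom-up propagation described above.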
state computation undefined, unless a different scheme is used. A possible approach to extend the recursive neural network computation to cyclic graphs will be presented in a later section. The output of the state propagation is a graph having the same skeleton as the input graph G. In fact, the states Xv can be considered as new labels attached to the nodes of G. On the other hand, the computation can be viewed as a “copy” of the transition function f in each node v. Thus, moving from a local to a global view, the state computation can be seen as the application of a function that is obtained by combining different instances of the transition function following the topology of the input graph. This view yields the encoding network, which is obtained by unfolding the transition function on the input graph. By applying the same transition function to different input graphs, we obtain different encoding networks featuring the same building block but assembled with a different structure. When processing a graph G having a supersource s, the state Xs can effectively be considered as the encoding of the whole graph. Figure 1 depicts how the encoding network is obtained by unfolding on the input graph the recursive network that implements the transition function f. The function f is replicated for each node in the graph and the network inputs are properly connected following the topology of the arcs in the graph. As can be observed in Figure 1, the
FIGURE 1. The encoding and the output networks associated with a graph. The recursive network is unfolded through the structure of the graph.
encoding network resulting from the unfolding is essentially a multilayered feedforward network, whose blocks share the same weights θf . Finally, an output network g can be defined to map the states to the actual output of the computation: Yv = g(Xv , θg ),
where θg is a set of trainable parameters. Yv belongs to an output space and in the following we will consider Yv ∈ Rr. The function g can be computed for each node in the input graph G, thus yielding an output graph with the same skeleton as G and the nodes labeled with the values Yv. In this case the RNN realizes a transduction from a graph G to a graph G′, such that skel(G) = skel(G′). Otherwise, the output can be computed only for the supersource of the input graph, realizing a function ψ from the space of DAGs to Rr defined as ψ(G) = g(Xs, θg). This second approach can be used in classification and regression tasks. Figure 1 shows this latter case, where the output network is applied only at the supersource. As shown in Figure 1, the pair of functions f and g defines the RNN. In particular the recursive connections on the function f define the dependencies among the variables in the connected nodes. In fact, the recursive connections define the topology of the encoding network, establishing how the states of the children of each node are combined. The parameters θf and θg are the trainable connection weights of the network, θf and θg being independent of the node v.2 The parametric representations of f and g can be implemented by a variety of neural network models.

1. Processing DPAGs

When considering DPAGs, the set of children of a given node is ordered and can be conveniently represented using a vector. The position of each child can be significant in the computation of the output, and two graphs differing only in the order of the children of a given node v may yield a different output. Thus, the transition function f of Eq. (1) must be properly redefined to take into account the position of each child. Basically, the position can be considered as a label attached to each arc connecting the node to a child, but it is simpler to code the position by organizing the children in a vector as shown in Section II.A.
Thus, in the case of DPAGs the transition network is

Xv = f(Xch[v], Uv, θf),    (2)

with

Xch[v] = [X′ch1[v], . . . , X′cho[v]]′,    o = max v∈V od[v],

2 In this case, we say that the RNN is stationary.
and Xchi[v] equal to the frontier state X0, if node v lacks its ith child. For example, when processing a leaf node, the state depends only on its label Uv and on the frontier state X0, since Xleaf = f(X0, . . . , X0, Uv, θf). If the function f is implemented by a two-layer perceptron, with sigmoidal activation functions in the hidden units and linear activation functions in the output units, the state is calculated according to

Xv = V · σ( Σ_{k=1}^{o} Ak · Xchk[v] + B · Uv + C ) + D,    (3)
where σ is a vectorial sigmoid function and θf collects the pointer matrices Ak ∈ Rq,n, k = 1, . . . , o, B ∈ Rq,m, C ∈ Rq, D ∈ Rn, and V ∈ Rn,q. Here, m is the dimension of the label space, n the dimension of the state space, and q represents the number of hidden neurons. As can be observed from Eq. (3), the dependency of the propagation on the position of each child is obtained by using a different pointer matrix Ak for each position k. This solution has some limitations when the maximum outdegree in a graph is large but most of the nodes have a smaller number of children. In fact, even if the number of parameters grows linearly with the maximum graph outdegree, it can become quite large and, more importantly, for most of the nodes many pointer matrices are just used to propagate the NIL pointer value X0, thus carrying very little information. In this case, two different solutions can be pursued. If possible, the arcs in the input graph can be conveniently pruned to reduce the maximum graph outdegree. For example, when extracting the representation of images based on the region adjacency graph, the arcs corresponding to adjacent regions sharing a border having a length under a predefined threshold could be pruned. However, this approach results in the loss of part of the original information and thus is not always feasible. The second solution is to use a nonstationary transition network, in which different sets of parameters are used depending on the node outdegree. By using this approach, we avoid introducing noisy information due to the padding of empty positions with the frontier state, otherwise needed for nodes having lower outdegrees. A similar equation holds for the output function g:

Yv = W · σ(E · Xv + F) + G,

where θg collects E ∈ Rq′,n, F ∈ Rq′, G ∈ Rr, W ∈ Rr,q′. A simple neural network architecture implementing the transition function of Eq. (3) is shown in Figure 2. The recursive connections link the output of the state neurons to the network inputs, corresponding to each position in the child vector. This notation specifies how the network is assembled in the encoding network. Apart from the recursive connections, the transition function is implemented by a classical feedforward network with a layer of
FIGURE 2. Transition function realized with a multilayer perceptron. This network can process graphs with a maximum outdegree o = 2.
hidden units. The network has n · o + m inputs corresponding to the o states of the children (n components each) and to the node label (m components).

Example 1. Referring to Figure 1, where o = 2, and supposing that f and g are implemented with a three-layer perceptron, the state at each node and the output at the supersource are computed as

Xd = Vσ((A1 + A2)X0 + BUd + C) + D,
Xc = Vσ((A1 + A2)X0 + BUc + C) + D,
Xb = Vσ(A1 Xc + A2 Xd + BUb + C) + D,
Xa = Vσ(A1 Xb + A2 Xd + BUa + C) + D,
Ya = Wσ(E Xa + F) + G.
Note that the states are computed starting from the leaf nodes d and c up to the root node a. Remark 1. In the case of sequences (Figure 3), each node represents a time step t and the arcs represent the relationship “follows.” Using this representation for time series, recursive networks reduce to recurrent networks. In fact, the state updating described in Eq. (3) becomes Xt = V · σ (AXt−1 + BUt + C) + D. Matrix A weighs the recurrent connections, while matrix B weighs the external inputs. When the output is computed at the supersource, an RNN implements a function from the set of DPAGs to Rr , h : DPAGs → Rr , where h(G) = Ys .
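The computation of Example 1 can be traced with a short NumPy sketch of Eq. (3). The dimensions, the tanh squashing, and the random weights below are arbitrary choices for illustration, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q, r, o = 3, 2, 4, 1, 2          # state, label, hidden, output dims; max outdegree

# transition parameters theta_f of Eq. (3) and output parameters theta_g
A = [rng.standard_normal((q, n)) for _ in range(o)]   # pointer matrices A_k
B = rng.standard_normal((q, m)); C = rng.standard_normal(q)
D = rng.standard_normal(n);      V = rng.standard_normal((n, q))
E = rng.standard_normal((q, n)); F = rng.standard_normal(q)
G = rng.standard_normal(r);      W = rng.standard_normal((r, q))

sigma = np.tanh                         # vectorial sigmoid
X0 = np.zeros(n)                        # frontier state

def f(child_states, U):
    # pad missing children with the frontier state X0 (NIL pointers)
    child_states = child_states + [X0] * (o - len(child_states))
    s = sum(A[k] @ child_states[k] for k in range(o))
    return V @ sigma(s + B @ U + C) + D

def g(X):
    return W @ sigma(E @ X + F) + G

# graph of Figure 1: a -> {b, d}, b -> {c, d}; c and d are leaves
U = {v: rng.standard_normal(m) for v in "abcd"}
Xd = f([], U["d"]); Xc = f([], U["c"])
Xb = f([Xc, Xd], U["b"])
Xa = f([Xb, Xd], U["a"])
Ya = g(Xa)                              # output at the supersource a
```

Since X0 = 0 here, the leaf states reduce to Vσ(BU + C) + D, matching the first two equations of Example 1.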
FIGURE 3. A temporal sequence. The arcs code the “follows” relationship.
Formally, h = g ◦ f̃, where f̃, recursively defined as

f̃(G) = X0 if G is empty,
f̃(G) = f(f̃(G1), . . . , f̃(Go), Us) otherwise,
denotes the process that takes a graph and returns the state at the supersource, f̃(G) = Xs. In fact, the function f̃ depends both on the topology and on the labels of the DPAG. Following Eq. (2), the state transition function f, computed by the RNN, depends on the order of the children of each node, since the state of each child occupies a particular position in the list of the arguments of f. To overcome such a limitation, a weight-sharing approach was described in Bianchini et al. (2001b), able to relax the order constraint and to devise a neural network architecture suited for DAGs with a bounded outdegree. However, the weight-sharing technique used in this approach cannot be applied to DAGs with a large outdegree o, due to the factorial growth of the number of network parameters with respect to o. Even if the maximum outdegree can be bounded, for instance by pruning those connections that are heuristically classified as less informative, some important information may be discarded in this preprocessing phase.

2. Processing DAGs-LE
In many applications, like image processing, the assumptions required by the DPAG-based model defined in the previous section introduce unnecessary constraints. First, in many cases the definition of a position for each child of a node is arbitrary. Second, as also noted previously, the need to bound the maximum outdegree in the graph can cause the loss of important information. For example, when considering the image representation based on the region adjacency graph, the order of the adjacent regions may be significant, but assigning a specific position to them by choosing a starting direction may be arbitrary. Moreover, to bound the maximum graph outdegree some of the arcs must be pruned and, anyway, for each node v for which |ch[v]| < o the last positions in the child vector have to be arbitrarily padded with the frontier state X0 . In fact, the need to consider exactly o children is a limitation of the model and not a feature of the problem. When considering DAGs-LE such limitations can be effectively removed (Gori et al., 2003). In fact, the edge label can encode the relevant features of
the relationship represented by the arcs and the constraints on the number of children can be avoided. For DAGs-LE, we can define a transition function f that does not have a predefined number of arguments and that does not depend on their order. The different contribution of each child depends on the label attached to the corresponding arc. At each node v, the total contribution X(ch[v], L(v,ch[v])) ∈ Rp of the children is computed as

X(ch[v], L(v,ch[v])) = Σ_{i=1}^{|ch[v]|} φ(Xchi[v], L(v,chi[v]), θφ),    (4)
where L(v,chi[v]) ∈ Rk is the label attached to the arc (v, chi[v]) and the edge-weighting function φ : R(n+k) → Rp is a nonlinear function parameterized by θφ. Then, the state at node v is computed combining X(ch[v], L(v,ch[v])) and the node label Uv by a parametric function f̃, as

Xv = f(Xch[v], L(v,ch[v]), Uv, θf) = f̃(X(ch[v], L(v,ch[v])), Uv, θf̃).    (5)
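Equations (4) and (5) can be sketched as follows. For concreteness, φ and f̃ are taken here as single-layer tanh maps with made-up dimensions; the chapter leaves their architecture open, so these are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p, m = 3, 2, 3, 2        # state, edge-label, contribution, node-label dims

Wphi = rng.standard_normal((p, n + k))   # parameters theta_phi (illustrative)
Wf = rng.standard_normal((n, p + m))     # parameters theta_f~  (illustrative)

def phi(X_child, edge_label):
    # edge-weighting function phi : R^(n+k) -> R^p used in Eq. (4)
    return np.tanh(Wphi @ np.concatenate([X_child, edge_label]))

def transition(child_states, edge_labels, U):
    # Eq. (4): unordered sum over however many children the node has
    Xsum = sum(phi(x, l) for x, l in zip(child_states, edge_labels))
    # Eq. (5): combine the aggregated contribution with the node label U_v
    return np.tanh(Wf @ np.concatenate([Xsum, U]))
```

Because the children enter only through a sum, permuting them (together with their edge labels) leaves the state unchanged, which is exactly the order independence claimed above, and no padding to a fixed outdegree is needed.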
With this approach, the transition function f can be applied to nodes with any number of children, and it is also independent of the order of the children. The parametric functions φ, f˜, and g, involved in the recursive network, can be implemented by feedforward neural networks. For example, φ can be computed by a two-layer perceptron with linear outputs as φ(Xchi [v] , L(v,chi [v]) , θφ ) = Vσ (AXchi [v] + BL(v,chi [v]) + C) + D,
where θφ collects A ∈ Rq,n, B ∈ Rq,k, C ∈ Rq, D ∈ Rp, and V ∈ Rp,q, with q the number of hidden neurons. On the other hand, the function φ can also be realized by an ad hoc model. In the following, we will consider the solution originally proposed in Gori et al. (2003), where φ is realized as

X(ch[v], L(v,ch[v])) = Σ_{i=1}^{|ch[v]|} Σ_{j=1}^{k} Hj L(j)(v,chi[v]) Xchi[v],    (6)

with H ∈ Rp,n,k the edge-weight matrix. In particular, Hj ∈ Rp,n is the jth layer of matrix H and L(j)(v,chi[v]) is the jth component of the edge label. In the following, the RNN that computes the state transition function defined in Eq. (5) will be referred to as RNN-LE (recursive neural network for DAGs-LE).

C. Backpropagation Through Structure

A learning task for RNNs requires specification of the network architecture, a learning environment Le that contains the data used for learning, and a cost
function to measure the error produced by the current network with respect to the target behavior. Once the learning dataset is chosen, the cost function depends only on the free parameters of the RNN, that is, the vectors θf and θg. By collecting all the network parameters in a unique vector, that is, θ = [θf′ θg′]′, we can write the cost function used in the learning process as E = E(θ). Hence, the learning task is reformulated as the optimization of a multivariate function. For a supervised learning task, the learning environment contains a set of graphs for which a supervisor provided target values for the network outputs at given nodes. More precisely, for each graph Gp, p = 1, . . . , P, in the learning set the supervisor provides a set of nodes SGp ⊆ VGp together with a target output for the network at each node in SGp, that is, the supervision is a set of pairs (v, Ytv), v ∈ SGp, and Ytv ∈ Rr. For graph classification or regression tasks, the supervision is provided only at the graph supersource, thus only one target is specified for each graph in the training set. In this latter case, the examples can be specified in a more compact form as pairs (Gp, Yt(p)). Using a quadratic cost function, the error function on the learning set can be defined as

E(θ) = (1/P) Σ_{p=1}^{P} EGp(θ) = (1/2P) Σ_{p=1}^{P} Σ_{v∈SGp} ‖Yv(Gp, θ) − Ytv‖²,    (7)
where Yv(Gp, θ) is the output produced by the RNN at node v, while processing the graph Gp, using the values θ for the neural network weights. If the functions f and g that define the RNN are differentiable with respect to the parameters θ, the cost function E(θ) is a continuous differentiable function and can be optimized by using a gradient descent technique. In particular, the simplest approach is to update the weights at each iteration k as

θk = θk−1 − ηk ∇θ E(θ)|θ=θk−1,    (8)
where the gradient of E(θ) is computed for θ = θk−1 and ηk is the learning rate. The weight vector is usually initialized at step k = 0 with random small values. The weight update equation (8) is iteratively applied until a stopping criterion is met. Usually the learning algorithm is stopped when the cost function assumes a value below a predefined threshold, when a maximum number of iterations (epochs) is reached, or when the gradient norm is smaller than a given value. Unfortunately, the training algorithm is not guaranteed to converge to a global optimum. For example, the learning procedure can be trapped in a local minimum of the function E(θ), which yields a suboptimal solution to the problem. In some cases suboptimal solutions can be acceptable, whereas in others a new learning procedure should be run starting from a
20
BIANCHINI
et al.
different initial set of weights θ0. However, more sophisticated gradient-based optimization techniques can be applied to increase the speed of convergence. Thus, at each iteration, the learning algorithm requires two different steps: the computation of the gradient and the update of the RNN weights. In the scheme proposed in Eq. (8) the weight update is performed in batch mode, that is, using the gradient computed on all the graphs in the learning set. This approach yields the correct optimization of the cost function of Eq. (7). Approximate versions can be defined by updating the weights after the presentation of each graph (pattern mode) or of a set of graphs (block mode). In any case, the most computationally intensive part is the computation of the gradient of the cost function. Since the total cost E(θ) is obtained by summing the contributions of the errors on each graph Gp, the total gradient can be obtained by accumulating the gradients computed for each example Gp in the training set. The gradient computation can be efficiently carried out by using the BPTS algorithm, which is derived by extending the original backpropagation algorithm for feedforward networks. The intuition is that the unfolding of the RNN on a given input graph Gp yields an encoding network that is essentially a multilayered feedforward network whose layers share the same set of weights. Thus the gradient computation can be performed in two steps as in the backpropagation algorithm: in the forward pass the recursive network outputs are computed and stored, building the encoding network; in the backward pass the errors are computed at the nodes where the targets are provided and they are backpropagated in the network to compute the gradient components. Let us consider a generic weight ϑ ∈ θ of the RNN. While processing the input graph Gp, the network is replicated at each node v ∈ VGp, yielding the encoding network.
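The update rule of Eq. (8), with the three stopping criteria listed above (cost threshold, epoch limit, gradient-norm threshold), can be sketched as follows. This is a hedged illustration in plain Python: the function names, the tolerances, and the toy quadratic cost are assumptions, not part of the original formulation.

```python
import math

def train_batch(theta0, cost, grad, lr=0.01, max_epochs=1000,
                cost_tol=1e-4, grad_tol=1e-6):
    # Batch-mode gradient descent: at each epoch the gradient is the one
    # accumulated over the whole learning set (abstracted here by grad()).
    theta = list(theta0)
    for _ in range(max_epochs):                              # epoch limit
        g = grad(theta)
        if math.sqrt(sum(gi * gi for gi in g)) < grad_tol:   # flat gradient
            break
        theta = [t - lr * gi for t, gi in zip(theta, g)]     # Eq. (8)
        if cost(theta) < cost_tol:                           # cost threshold
            break
    return theta

# Toy usage: E(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = train_batch([1.0, -2.0],
                    cost=lambda t: 0.5 * sum(x * x for x in t),
                    grad=lambda t: list(t),
                    lr=0.1, max_epochs=500)
```

With a learning rate of 0.1 the toy cost shrinks geometrically, so the cost threshold is met after a few dozen epochs rather than at the epoch limit.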
Thus, we can consider each replica of the recursive neural network as an independent instance having a different set of weights θ(v). Under this assumption we can compute the derivative of the partial cost function EGp(θ) with respect to the unfolded parameters θ(v). More precisely, we consider the cost function as EGp(θ(v1), . . . , θ(vNp)), where VGp = {v1, . . . , vNp}. Since the replicas of the RNN share the same set of weights, that is, θ(v1) = · · · = θ(vNp) = θ, by applying the rule for the derivative of a composition of functions, we obtain the following rule for each weight ϑ:

∂EGp/∂ϑ = Σv∈VGp ∂EGp/∂ϑ(v).   (9)
From a practical point of view, Eq. (9) states that the complete derivative with respect to a given weight can be obtained by accumulating the contributions of the derivatives with respect to the instance of the weight at each node. The
local derivatives ∂EGp/∂ϑ(v) can be computed using the same approach as in the original backpropagation algorithm. Basically, the derivative with respect to each local weight ϑ(v) is expanded by considering the local variable that is directly affected by a change in ϑ(v). In particular, backpropagation considers the neural unit that uses the weight ϑ(v) and rewrites the derivative as

∂EGp/∂ϑ(v) = (∂EGp/∂zk(v)) · (∂zk(v)/∂ϑ(v)),
where zk(v) is the output of the affected neural unit. The term δkz(v) = ∂EGp/∂zk(v) is the generalized error that is computed recursively by the backpropagation algorithm, whereas ∂zk(v)/∂ϑ(v) is a factor that depends on the model of the considered unit. For example, for a linear unit, this last derivative equals the value of the input to the connection corresponding to the weight ϑ(v). Since the functions f and g can be implemented by very diverse architectures, it is difficult at this level of abstraction to detail the backpropagation algorithm. In any case, assuming that the classical backpropagation procedure is implemented both for the transition network and the output network, it suffices to propagate the generalized errors available for the outputs of these networks to their inputs. The generalized errors for the outputs of each replica of the transition and output function in the unfolding graph can be propagated from the sources to the leaf nodes. In particular, let us consider the replica of the transition network corresponding to node v. Given any state variable xj(v) available at the output of this function, it affects the values of the inputs of the replicas of the transition function at the parents of v. The actual input affected by xj(v) for each node u ∈ pa[v] depends on the architecture of the transition network and on the role of v among the children of u (e.g., its position for DPAGs). Thus, we will indicate the affected input as xj(u,v)(u) [i.e., the variable depends on node u and is identified by the arc (u, v)]. The backpropagation procedure applied to the replica of the transition function in u yields a generalized error δjx,(u,v)(u). Since a variation in xj(v) affects all the corresponding inputs of the parent nodes, we derive

δj(v) = ∂EGp/∂xj(v) = Σu∈pa[v] (∂EGp/∂xj(u,v)(u)) · (∂xj(u,v)(u)/∂xj(v)) = Σu∈pa[v] δjx,(u,v)(u).   (10)
If we consider the output function g, the generalized errors for its output variables yj(v) at node v can be computed directly from the cost function. In fact, if there is no supervision at node v, δjy(v) = 0; otherwise δjy(v) = [yj(v) − yjt(v)]. These generalized errors can be backpropagated through the output network g to yield the generalized errors for its inputs, δjg(v) =
∂EGp/∂xj(v) corresponding to the state variables at node v. This generalized error is an additional contribution to the generalized error δj(v) and, thus, Eq. (10) can be completed as

δj(v) = Σu∈pa[v] δjx,(u,v)(u) + δjg(v).   (11)
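To make Eqs. (9)–(11) concrete, the following hedged sketch implements the forward and backward passes for a deliberately simplified model: a scalar linear transition x(v) = a·Σc x(c) + b·u(v), with an identity output function supervised only at the supersource. The architecture and all names are illustrative assumptions, not the chapter's general networks f and g.

```python
def forward(dag, labels, a, b, order):
    # `dag` maps each node to its children; `order` lists nodes children-first.
    x = {}
    for v in order:
        x[v] = a * sum(x[c] for c in dag[v]) + b * labels[v]
    return x

def backward(dag, labels, x, a, supersource, target, order):
    # Generalized errors per Eq. (11): the supervised term enters only at the
    # supersource; each node then collects contributions from its parents.
    # Pushing a * delta[v] down to every child realizes Eq. (10), since for
    # this scalar transition the factor dx(u)/dx(c) equals a.
    delta = {v: 0.0 for v in x}
    delta[supersource] = x[supersource] - target      # d(0.5*(x - t)^2)/dx
    grad_a, grad_b = 0.0, 0.0
    for v in reversed(order):                         # sources to leaves
        for c in dag[v]:
            delta[c] += a * delta[v]
        # Eq. (9): accumulate this replica's contribution to the shared weights
        grad_a += delta[v] * sum(x[c] for c in dag[v])
        grad_b += delta[v] * labels[v]
    return grad_a, grad_b

# A supersource with two leaf children.
dag = {'s': ['l1', 'l2'], 'l1': [], 'l2': []}
labels = {'s': 0.5, 'l1': 1.0, 'l2': -2.0}
order = ['l1', 'l2', 's']
x = forward(dag, labels, 0.3, 0.7, order)
ga, gb = backward(dag, labels, x, 0.3, 's', 1.0, order)
```

The two accumulated gradients agree with central finite differences of the cost, which is a convenient sanity check for any BPTS implementation.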
Once the generalized errors δj(v) are computed, the backpropagation procedure for the replica of the transition function at node v is executed, yielding both the partial gradients ∂EGp/∂ϑ(v) and the generalized errors at the network inputs. Hence, the BPTS algorithm can be implemented with a very modular structure, by realizing the backpropagation function for the transition and output networks and by combining the contributions at each node with a simple backpropagation of the generalized errors along the graph arcs following Eq. (11). This computation can be carried out by processing the nodes from the sources to the leaves.

D. Processing Cyclic Graphs

The model presented in the previous sections cannot be directly applied to the processing of cyclic graphs, because the unfolding of the recursive network would yield an infinite encoding network. Nevertheless, the computation in the presence of cycles could be extended by iterating the state transition function, starting from given initial values of the states, until the state values converge to a fixed point. To overcome the problems of cyclic structure processing, some techniques have been proposed based on the idea of collapsing each cycle into a unique unit that encodes the information corresponding to the nodes belonging to the cycle (Bianucci et al., 2001). Unfortunately, the strategy for collapsing the cycles cannot be carried out automatically, and it is intrinsically heuristic. Therefore, the effect on the resulting structures and the possible loss of information are almost unpredictable. In Bianchini et al. (2002, 2003c) a different approach is proposed for the case of graphs having a supersource and whose nodes store distinct labels (directed graphs with unique labels, DUGs). These requirements are needed to assess the computational capabilities of this class of RNNs, but they do not pose actual limitations in real applications.
For example, in image processing tasks, each node may represent a region and the label is a real valued vector. In this case it is quite unlikely that two nodes share exactly the same label. According to this framework, the encoding network has the same topology as the graph: if the graph is cyclic, the encoding network is also cyclic (see
FIGURE 4. The encoding and the output networks for a cyclic graph.
Figure 4). In fact, a replica of the transition network “replaces” each node of the graph and the connections between the transition networks are defined by the topology of the arcs in the graph. The computation is carried out by setting all the initial states Xv to X0. Then, the copies of the transition network are repeatedly activated to update the states. According to Eq. (1), the transition network attached to node v produces the new state Xv. After a given number of updates, the computation can be stopped. The value of the output function is the result of the whole recursive processing. This general procedure is formalized in the following algorithm.

Algorithm 1. CyclicRecursive(G)
begin
  for each v ∈ V do Xv = X0;
  repeat
    <Select v ∈ V>;
    Xv = f(Xch[v], Uv, θf);
  until stop();
  return g(Xs, θg);
end

More precisely, Algorithm 1 is a generic framework that describes a class of procedures. To implement a particular procedure, we should decide the strategy adopted to select the nodes and the criterion used to halt the iterations. In fact, the theoretical results show that no particular ordering must be imposed on the sequence of node activations, provided that each arc in the graph is considered at least once during the processing (Bianchini et al., 2003c, 2006). The nodes can be activated following any ordering, and also random sequences are admitted. Algorithm 1 can also be extended by adopting a synchronous activation strategy. The stopping criterion implemented in the function “stop()” should guarantee that Algorithm 1 halts after a “sufficient”
FIGURE 5. An artificial image (a), its RAG (b), and the corresponding directed graph (c).
number of iterations. A more precise definition of the stopping criterion will be given in the following, clarifying when the performed iterations are sufficient. In the following, Algorithm 1 is further discussed using an example. Example 2. An image can be represented by its region adjacency graph (RAG, see Section III.C), which is extracted from the image by associating a node to each homogeneous region and linking the nodes corresponding to adjacent regions (see Figure 5). Each node is labeled by a real vector that represents perceptual and geometric features of the region (perimeter length, area, average color, texture, etc.). Since RNNs can process only directed graphs, the undirected edges of RAGs must be transformed into a pair of directed arcs (see Figure 5c). When Algorithm 1 is applied to a RAG, the computation appears to follow an intuitive scheme. At each time step, a region is selected. Then, the state of the corresponding node is computed, based on the states of the adjacent nodes using the function f (see Figure 6). According to the recursive paradigm, the state of a node is an internal representation of the object denoted by that node. Thus, at each step, the algorithm adjusts the representation of a region using the representations of the adjacent regions. After some steps, the computation is stopped. Then, the output function g is applied to the
FIGURE 6. An application of Algorithm 1 to a RAG. (a) Paying attention at the door: X4 = f(X2, X5, . . .). (b) Paying attention at the roof: X8 = f(X3, X5, . . .). (c) Paying attention at the sky: X1 = f(X2, . . .). (d) Computing the output: Y = g(X5, . . .).
state at the supersource3 to produce the output of the recursive network (see Figure 6d).

Remark 2. Algorithm 1 is an extension of the original recursive processing for acyclic structures. In fact, the execution of Algorithm 1 on DAGs produces the same state at each node as the classical approach, provided that the states are updated a sufficient number of times and the order of activation is such that no node remains nonactivated for an infinite number of steps. Obviously, for acyclic graphs, Algorithm 1 merely describes a generic way to activate each node, whereas the ad hoc processing model constitutes the most efficient strategy.

1. Recursive-Equivalent Transforms

The intuitive idea that supports the application of Algorithm 1 to DUGs is that in this case the presence of unique labels allows us to encode the presence of cycles with a finite unfolding of the graph. In fact, the presence of a cycle will be evidenced by the presence of nodes with the same label in the unfolding. Hence, we can introduce the concept of recursive equivalence between DUGs and trees, stating that the computation of a recursive function on a given DUG yields the same result on a recursive-equivalent tree. From a theoretical point of view, this observation allows us to assess the computational capabilities of RNNs when processing DUGs (Bianchini et al., 2003c, 2006). In particular, it can be proved that an appropriate RNN can approximate in probability, up to any degree of precision, any real valued measurable function on the space of DUGs, given any recursive-equivalent transform. The processing carried out by Algorithm 1 represents a recursive-equivalent transform if appropriate node selection and halting criteria are chosen. Thus, the theoretical results support the fact that any cyclic graph G can be transformed into a tree T such that Algorithm 1 applied on G produces the same output as when the recursive network is fed with T.
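For concreteness, the loop of Algorithm 1 can be sketched as follows, under deliberately simple, illustrative assumptions: a scalar linear transition (a contraction, so the states converge to a fixed point), an identity output function, random node selection, and a fixed iteration budget as the stopping criterion. None of these choices is prescribed by the text.

```python
import random

def cyclic_recursive(graph, labels, supersource, x0=0.0,
                     a=0.2, b=1.0, iters=200, seed=0):
    rng = random.Random(seed)
    x = {v: x0 for v in graph}                  # Xv = X0 for each v
    nodes = list(graph)
    for _ in range(iters):                      # stop(): fixed budget
        v = rng.choice(nodes)                   # <Select v in V>
        # Xv = f(Xch[v], Uv, theta_f): scalar linear transition
        x[v] = a * sum(x[c] for c in graph[v]) + b * labels[v]
    return x[supersource]                       # g is the identity here

# A three-node directed cycle: each node has exactly one child.
graph = {0: [1], 1: [2], 2: [0]}
labels = {0: 1.0, 1: 2.0, 2: 3.0}
y = cyclic_recursive(graph, labels, supersource=0)
```

Because |a| < 1, the asynchronous updates converge to the unique fixed point of the system x0 = a·x1 + b·u0, x1 = a·x2 + b·u1, x2 = a·x0 + b·u2, regardless of the activation order.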
This possibility also provides a mechanism for training RNNs on cyclic graphs. In fact, a learning set A containing cyclic graphs can be transformed into a set of recursive-equivalent trees B. Then, the recursive network can be trained using BPTS (Küchler and Goller, 1996) on the learning set B. After training, the RNN can be applied to unseen cyclic graphs by executing Algorithm 1 using the same criteria that were applied to obtain the trees in B. The concept of recursive equivalence of two graphs can be defined formally.

3 In this case, the supersource can be any node of the graph, since in RAGs there is a path between any pair of nodes.
Definition 1. Two arcs a = (v1, w1), b = (v2, w2) are said to be recursive equivalent, a ≈r b, if the labels of v1 and v2 and those of w1 and w2 are equal. Moreover, two graphs G1, G2 are recursive equivalent, G1 ≈r G2, if, for each arc a in G1, there exists an arc b in G2 such that a ≈r b and, vice versa, for each arc a in G2, there exists an arc b in G1 such that a ≈r b.

Definition 2. A function F from directed graphs to trees is said to be a recursive-equivalent transform if F(G) ≈r G, for each G.

Intuitively, two graphs are recursive equivalent if they have the same arcs, where we assume that arcs can be distinguished only by looking at the labels of their delimiting nodes. Figure 7 shows a cyclic graph and two recursive-equivalent trees. It is easy to show that if G is a DUG and F is a recursive-equivalent transform, then G can be uniquely reconstructed from F(G). In fact, the nodes of a DUG are identified by their labels, and thus they can be obtained by merging together all the nodes in F(G) having the same label. Therefore, let us suppose that a given procedure implements a recursive-equivalent transform F. Then, any cyclic DUG can be processed by an RNN after a preprocessing phase carried out using such a procedure, since F is injective. Now let us discuss the computational properties of Algorithm 1. Notice that the state Xv(t) computed by Algorithm 1 at a given time step t depends on the states of the children of v, computed at the previous steps. Thus, due to the
FIGURE 7. (a) A cyclic graph. (b and c) Two trees that are recursive equivalent to the graph. A covering tree of the graph is represented by the solid arcs.
recursive processing, the dependence of Xv(t) on the previously calculated states can be represented by a computation tree Tv(t), which collects the nodes visited by the algorithm to compute Xv(t). In fact, computation trees are unfoldings of the input graphs. For example, Figure 7b and c shows two computation trees for the node v1. The following definition formalizes the above concept.

Definition 3. The computation tree Tv(t) for node v at time t is defined by

Tv(t) = ({v}, ∅) if t = 0,
Tv(t) = Tree(v, Tch1[v](t − 1), . . . , Tcho[v](t − 1)) if v is active at t > 0,
Tv(t) = Tv(t − 1) otherwise,
where Tree( ) is a function that builds a tree from its root and subtrees.
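Definition 3 translates almost literally into code. In the hedged sketch below, a tree is a pair (root label, list of subtrees), and `schedule[t]` gives the set of nodes active at step t; both representations are illustrative assumptions.

```python
def computation_tree(graph, v, t, schedule):
    if t == 0:
        return (v, [])                     # ({v}, empty) at t = 0
    if v in schedule.get(t, ()):           # v is active at step t
        # Tree(v, T_ch1[v](t-1), ..., T_cho[v](t-1))
        return (v, [computation_tree(graph, c, t - 1, schedule)
                    for c in graph[v]])
    return computation_tree(graph, v, t - 1, schedule)   # Tv(t) = Tv(t-1)

# A two-node cycle with synchronous activation at every step.
graph = {'v1': ['v2'], 'v2': ['v1']}
schedule = {1: {'v1', 'v2'}, 2: {'v1', 'v2'}}
tree = computation_tree(graph, 'v1', 2, schedule)
```

The result unfolds the cycle twice starting from v1, mirroring how a finite unfolding makes a cycle visible through repeated labels.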
In Algorithm 1 an important role is played by the state Xs(tH), where tH is the time when the algorithm halts and s is the supersource of the input graph. In fact, the output of the algorithm is g(Xs(tH)).

Definition 4. Xs(tH) is the principal state of Algorithm 1. The principal unfolding function is the map H that takes a graph and returns the computation tree for the supersource s at time tH, that is, H(G) = Ts(tH).

Basically there is no difference between computing the output of the recursive network on the computation tree Ts(tH) or on the original input graph G. In particular, if Algorithm 1 is such that for any G and any arc a in G there exists an arc b in H(G) such that a ≈r b (H(G) contains a copy of a), then H is a recursive-equivalent transform. In practice, Algorithm 1 merges the construction of a recursive-equivalent representation of the input graph G and the computation of the RNN into a unique procedure. Thus, by defining an appropriate policy for visiting the graph nodes and by halting the visit after “enough” steps, so as to obtain an unfolding that is recursive equivalent to the input DUG, we can choose an RNN that is able to approximate any real valued measurable function on DUGs, up to a given degree of precision.

Remark 3. In Hammer (1999) it is also proved that it suffices to choose n = 2 (i.e., each state attached to a node is a vector with two entries) to reach any degree of approximation, whereas some hints are also given on how to choose the architecture of the transition network (number of layers and neurons per layer). Those results on the structural parameters of the architecture also apply to this case. Nevertheless, in practice, the selection of the optimal values for n and for the number of hidden units, both in the transition and in the output networks, is commonly a trial-and-error procedure.
FIGURE 8. Some graphs and their unfoldings. Gray nodes represent supersources.
Remark 4. The theoretical results on the approximation capabilities of RNNs provide a hint about the design of the halt function “stop()” and on the method for the selection of the active node in Algorithm 1. In fact, to guarantee the universal approximation property, the main loop of Algorithm 1 should be repeated until H(G) becomes recursive equivalent to G.4 Thus, for example, a solution can be to activate the nodes randomly, halting the algorithm only after many iterations, when G ≈r H(G) holds with high probability. Another solution consists of activating all the nodes at the same time and stopping the algorithm after |V| steps. In general, even admitting that some nodes share the same labels, an RNN is able to distinguish between two different graphs G1 and G2 provided it is possible to define an unfolding H such that H(G1) ≠ H(G2). Depending on the topology of the graph and the labeling of nodes, this requirement might not be satisfied. In Figure 8, the two graphs (a) and (b) yield the same unfolding tree for any H, and thus no RNN is able to distinguish between them. Hence, Algorithm 1 can approximate, in probability, any function on general (possibly cyclic) graphs, provided that the stop function and the selection method are designed so that H produces a different unfolding for each graph of the input domain. The set of trees is a straightforward example where, even if the graphs may have shared labels, this hypothesis is satisfied. In fact, the principal unfolding of a tree is the tree itself, and H is injective on this domain, provided that all the nodes are activated a sufficient number of times.

2. From Cyclic Graphs to Recursive-Equivalent Trees

To train the recursive neural network to be used in Algorithm 1, the graphs in the learning set S must be preprocessed by a recursive-equivalent transform F, that is, the actual learning set is TS = {F(G) | G ∈ S}.
4 Notice that the loop is not necessarily halted as soon as H(G) ≈r G becomes true. In fact, the algorithm can continue for any number of steps after the condition is satisfied.

More
precisely, the RNN should be trained using a set of examples that is a significant sample of the trees yielded by the principal unfolding H, that is, it is required that F ≈ H. More generally, if Algorithm 1 is nondeterministic, then F(G) ⊇ H(G) must hold and the probability distribution of the trees in F(G) should approximate the distribution of the trees in H(G).

Remark 5. Note that H(G) may contain a number of recursive-equivalent trees having different depths, for example, when the preprocessing unfolds G a random number of times. In this case, the network output g(Xs(t)) should reach the target value and remain stable after that, that is, the output at the supersource is the same for all the different unfoldings of G. To achieve this stable behavior, the training set must contain several unfoldings of the same graphs.

In the following, an example of a procedure that implements a recursive-equivalent transform is shown.

Algorithm 2. CyclicGraphToTree(G)
MarkedArcs = Q;
for each v set GetOriginalCopyOf(v) = v;
repeat
  select an arc (v, w) in G and a node v2 in T s.t.
    – v = GetOriginalCopyOf(v2);
    – (v2, w2) ∉ T, where w = GetOriginalCopyOf(w2);
  set w2 = NewCopy(w) and GetOriginalCopyOf(w2) = w;
  extend T with node w2 and arc (v2, w2);
  MarkedArcs = MarkedArcs ∪ {(v, w)};
until (MarkedArcs = E AND stop(T, G, . . .));
return T;

Algorithm 2 builds a covering tree5 T of graph G6 and then iteratively extends T with other copies of nodes in G. The extension is carried out by the main loop, where GetOriginalCopyOf is a function that keeps the relationship between the nodes in G and the corresponding copies in T, and NewCopy(v) produces a new node having the same label as v. At each step, the procedure looks for a node v2 in T that lacks children and produces a copy w2 of one of its children. The copy is then inserted into T.

5 A tree is a covering for G if it contains the same set of nodes V and its arcs Q are such that Q ⊆ E.
Covering trees can be easily computed by visiting the graph (Aho et al., 1983).

6 The construction of the covering tree is not fundamental to the definition of a recursive-equivalent transform. However, this initialization is useful to reduce the number of steps required to reach the halt condition on MarkedArcs.
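A hedged Python sketch of Algorithm 2 in its simplest instantiation: the covering tree is built by a breadth-first visit, and stop returns true as soon as every arc of G has been visited once, so each remaining arc contributes exactly one extra node copy. The dictionary-based data structures are illustrative assumptions.

```python
from collections import deque

def cyclic_graph_to_tree(graph, root):
    tree = {}            # tree node id -> list of child ids
    original = {}        # GetOriginalCopyOf: tree node id -> graph node
    counter = [0]

    def new_copy(v):     # NewCopy(v): fresh tree node carrying v's label
        nid = counter[0]; counter[0] += 1
        tree[nid] = []
        original[nid] = v
        return nid

    # Covering tree by BFS (footnotes 5 and 6: any graph visit would do).
    copy_of = {root: new_copy(root)}
    marked = set()       # MarkedArcs
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in copy_of:
                copy_of[w] = new_copy(w)
                tree[copy_of[v]].append(copy_of[w])
                marked.add((v, w))
                queue.append(w)
    # Main loop: add one fresh copy per arc not yet represented in T.
    for v in graph:
        for w in graph[v]:
            if (v, w) not in marked:
                wid = new_copy(w)
                tree[copy_of[v]].append(wid)
                marked.add((v, w))
    return tree, original

def arcs_as_label_pairs(adj, label_of):
    # Definitions 1 and 2: arcs distinguished only by endpoint labels.
    return {(label_of[v], label_of[w]) for v in adj for w in adj[v]}

# A three-node cycle with unique labels (node name doubles as label).
g = {'a': ['b'], 'b': ['c'], 'c': ['a']}
t, orig = cyclic_graph_to_tree(g, 'a')
```

By construction the result satisfies F(G) ≈r G: the set of label pairs of the tree's arcs equals that of the graph, so a DUG can be uniquely reconstructed from its unfolding.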
The loop halts when all the arcs of G have been visited and the function stop returns true. For our purpose, which consists of implementing a recursive-equivalent transform, we can use any function stop that depends on some or all of the program variables. In the simplest case, stop = true when all the arcs have been visited once. When stop is a more complex condition, Algorithm 2 can visit each arc and each node many times: it continues to add nodes to T until stop becomes true. Figure 7b and c shows examples of trees that can be constructed by a stop function that always returns true and by a more complex one, respectively. In fact, there is experimental evidence, at least for particular applications (Bianchini et al., 2003a, 2004b, 2005b), that repeatedly visiting some nodes in the graph makes it possible to better capture the information encoded in the cycles and to improve the network performance.

E. Limitations of the Recursive Neural Network Model

The approximation capability of recursive models is highly related to their ability to store a significant representation of the input structure into the internal state at the supersource. Let G1 and G2 be two distinct graphs. If the RNN fed with G1 and G2 reaches the same state at the supersource, it will produce the same output response: in the following, such a phenomenon, which sometimes cannot be avoided, will be referred to as a collision.

Definition 5. Given a recursive network, we say that a collision occurs for graphs G1 and G2 if the state corresponding to the supersource of G1 and G2 is the same, that is, Xs1 = Xs2.

Remark 6. In some applications, the presence of collisions may be undesirable, since it limits the approximation capabilities of an RNN. On the other hand, in pattern recognition, collisions may be useful when G1 and G2 are different representations of the same object7 or representations of similar objects in the application domain.
In those cases, the collision is desirable, since the network yields the same result for G1 and G2. As a matter of fact, the presence of collisions indicates a sort of robustness with respect to noise and, moreover, collisions can be exploited to capture similarities between different objects. In this section, to guarantee collision avoidance, we investigate some conditions that are of central interest to understand what the recursive model

7 For instance, because of the noise, the same object can be represented by different graphs.
FIGURE 9. A graph and a tree that are output equivalent.
is expected to realize from an approximation point of view. For the sake of simplicity, the results on the computational power of the recursive model are obtained for linear architectures. Nevertheless, they can be simply extended to the general nonlinear case.

1. Theoretical Conditions for Collision Avoidance

In the simplest case, collisions happen because the symbolic representations that define the output responses are the same for different graphs.

Example 3. Referring to Figure 9, where A1, A2 are the network weights and a, b, c, d are the labels, the state responses at nodes 4, 8, and 9 obviously coincide. Therefore, if the signal is propagated bottom-up (from the leaves to the supersource), the states at the supersources (nodes 1 and 5) also coincide, producing the same output response for the two graphs:

Ys = C(A1 A2 X0 + A2 A1 X0 + Ba + A1 Bb + A2 Bc + A1 A2 Bd + A2 A1 Bd).
Graphs that have the same output expression always cause the network to produce a collision at the supersource, regardless of the values assigned to the labels and to the frontier state. Those graphs are completely indistinguishable and are “equivalent” for the purpose of linear recursive processing. The definition of the symbolic output-equivalence ≈o captures this idea.

Definition 6. For all G1, G2, we say that G1 ≈o G2, provided that ∀Ak, k = 1, . . . , o, B (parameters), ∀Uv (labels), and ∀X0 (frontier state), a collision occurs for graphs G1 and G2.

Since it is always possible to transform a DPAG into an output-equivalent tree (nodes with many parents must be replicated to produce an instance for each parent), from now on, theoretical conditions for collision avoidance will
be stated on trees, as more general structures, like graphs, may always be reduced to output-equivalent trees. Therefore, let us address the problem of distinguishing trees with nonnull real labels. It can be proved that in this case collisions can always be avoided, provided that a linear recursive network with a sufficiently large internal state is used. To this purpose, an enumeration of all the paths that can appear in a tree is considered. In fact, such an enumeration makes it possible to represent trees uniquely, since each tree is exactly defined by the list of all paths it contains.

Example 4. A possible way to enumerate paths is based on ordering the nodes of the complete tree of height p. In fact, each node in the complete tree identifies one of the paths, so that a one-to-one relationship exists between the numbering of the paths and the numbering of the nodes. Figure 10 shows a complete ternary tree (a) where the nodes have been ordered by a breadth-first visit. The tree in Figure 10b can then be represented by the set of nodes {1, 2, 3, 4, 5, 6, 7, 9}. An alternative representation uses a binary vector [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0], with the ith element equal to 1 when (b) contains the ith path of (a), and 0 otherwise.

As suggested by Example 4, the dimension of the state representation grows exponentially with the height of the trees, limiting the practical use of recursive architectures for structures of large dimension. On the other hand (see Bianchini et al., 2001a), avoiding collisions on general trees means that the weights of the network must be coded with a number of bits that grows at least exponentially in the tree height. Therefore, the general problem of recognizing different trees becomes intractable when the height of the trees
FIGURE 10. An example of a complete ternary tree along with the enumeration of the nodes that induces an enumeration on the paths.
increases because the number of trees grows so rapidly that an exponential number of bits is needed to distinguish all trees at the root. Nevertheless, even if an exponentially large state is required to avoid collisions, important classes of trees can be distinguished with a reasonable amount of resources. In fact, in most of the problems, the objective does not consist of recognizing every tree, but instead of distinguishing some classes. Thus, the state at the root must be able to store only a coding of the classes, not of the whole input tree. In Bianchini et al. (2001a), linear RNNs with a small number of parameters are proved to be able to recognize interesting properties of trees: for example, the number of nodes in each level, the number of leaves, and the number of left and right branches in the paths of binary trees.
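The enumeration of Example 4 can be reproduced with a few lines of Python; the breadth-first numbering of the complete k-ary tree and the helper names below are illustrative assumptions.

```python
def path_vector(present_nodes, height=2, k=3):
    # Binary path vector over the complete k-ary tree of the given height:
    # entry i is 1 iff node i+1 (1-based breadth-first index) is present.
    total = sum(k ** h for h in range(height + 1))
    return [1 if i + 1 in present_nodes else 0 for i in range(total)]

def child_index(i, j, k=3):
    # Breadth-first index of the j-th child (j = 0..k-1) of node i (1-based).
    return k * (i - 1) + 2 + j

# The tree of Figure 10b corresponds to the node set {1, 2, 3, 4, 5, 6, 7, 9}.
vec = path_vector({1, 2, 3, 4, 5, 6, 7, 9})
```

For the complete ternary tree of height 2 the vector has 1 + 3 + 9 = 13 entries, matching the representation given in Example 4 and illustrating why the state dimension grows exponentially with the height.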
III. GRAPH-BASED REPRESENTATION OF IMAGES

A. Introduction

The neural network models presented in Section II are assumed to process structured data. To exploit such models to perform tasks on images (classification, localization, or detection of objects, etc.), a preprocessing phase that allows each image to be represented by a graph is needed. In the last few years, graph-based representations of images have received growing attention, since, as a matter of fact, they allow both symbolic and structural information to be collected in a unique “pattern.” To obtain a graph-based representation, a preprocessing phase, which extracts a set of homogeneous regions from the image, has to be performed, and a set of attributes that describes each region must be chosen. In the following, we will describe several segmentation methods that are used to extract the set of homogeneous regions, and some graph structures particularly suited to represent images.

B. Segmentation of Images

The preliminary phase for obtaining a graph-based representation of images consists in their segmentation. The name “segmentation” can refer both to the process of extracting a set of regions with homogeneous characteristics and to the process of determining the boundary of the objects depicted in an image. Clearly, the second meaning relates to a very complex task, and probably a correct description of this process would be “object segmentation.” In this section, the word “segmentation” refers instead to the first interpretation.
34
BIANCHINI
et al.
The segmentation phase is crucial for image analysis and pattern recognition systems, and very often it determines the quality of the final results. Segmenting an image means dividing it into different regions such that each region is homogeneous with regard to some relevant characteristics, while the union of any pair of adjacent regions is not. A theoretical definition of segmentation (Pat, 1993) is as follows: if P() is a homogeneity predicate defined on groups of connected pixels, then a segmentation is a partition of the whole set of pixels F into connected subsets, or regions, S1, S2, ..., Sn, such that

S1 ∪ S2 ∪ · · · ∪ Sn = F,    Si ∩ Sj = ∅ (i ≠ j).

The predicate P(Si), which measures the homogeneity of the set Si, is true for each region, while P(Si ∪ Sj) is false if Si and Sj are adjacent. Unfortunately, according to Fu and Mui (1981), "the image segmentation problem is basically one of psychophysical perception, and therefore not susceptible to a purely analytical solution." Thus, there is no universal theory of image segmentation yet. All of the existing methods are, by nature, ad hoc and strongly application dependent; that is, there are no general algorithms that can be considered effective for all images. However, since the late 1970s, a wide variety of methods has been proposed; here we report a classification presented in a recent survey (Cheng et al., 2001). Segmentation algorithms can be classified into two main categories, according to the kind of image they process: monochrome and color segmentation. Color segmentation has attracted more attention in the past few years, because color images provide more information than gray-level images; however, color segmentation is computationally expensive, even if the rapid increase in the computational capabilities of computers allows such a limitation to be overcome. The main color image segmentation methods can be classified as follows:

• Histogram thresholding: This technique is widely used for gray-level images, but can be directly extended to the more general case of color images. The color space is divided with regard to each color component, and a threshold is considered for each component. However, since the color information is represented by the tristimulus values R, G, and B, or by their linear/nonlinear transformations, representing the histogram of a color image and selecting effective thresholds are very challenging tasks (Haralick and Shapiro, 1985).

• Color space clustering: The methods belonging to this class generally exploit one or more features to determine separate clusters in the considered color space.
“Clustering of characteristic features applied to image segmentation is the multidimensional extension of the concept of thresholding” (Fu and Mui, 1981). Applying the clustering approach to color images is a straightforward idea, because the colors tend to form clusters in the color space. The main problem of these methods is how to determine the number of clusters in an unsupervised scheme.

• Region-based approaches: Region-based approaches, including region growing, region splitting, region merging, and their combinations, attempt to group pixels into homogeneous regions. In the region-growing approach, a seed region is first selected and then expanded to include all homogeneous neighbors. One problem with region growing is its dependence on the choice of the seed region and on the order in which pixels are examined. In contrast, in the region-splitting approach, the initial seed region is the whole image: if the seed region is not homogeneous, it is divided, generally into four square subregions, which become the new seed regions. The main disadvantage of this approach is that the resulting regions tend to be too blocky. The region-merging approach is often combined with region growing and splitting, with the aim of obtaining homogeneous regions that are as large as possible.

• Edge detection: In monochrome image segmentation, an edge is defined as a discontinuity in the gray level, and can be detected only when there is a sharp difference in brightness between two regions. In color images, however, the information about edges is much richer than in the monochrome case. For example, an edge between two objects with the same brightness but different hue can easily be detected (Macaire et al., 1996). As in monochrome image segmentation, edge detection in color images can be performed by defining a discontinuity in a three-dimensional color space. The main disadvantage of edge detection techniques is that the result of the segmentation process can be particularly affected by noise.
• Other techniques: Many other segmentation methods have been proposed, based on fuzzy techniques, physics approaches, and neural networks. Fuzzy techniques exploit fuzzy logic to model uncertainty. For instance, if fuzzy theory is combined with a clustering method, each pixel is assigned a score for each candidate region, which represents its "degree of membership." Physics approaches aim at solving the segmentation problem by employing physical models to locate the objects' boundaries, while eliminating the spurious edges due to shadows or highlights. Among the physics models, the dichromatic reflection model (Shafer, 1985) and the approximate color-reflectance model (Healey and Binford, 1989) are the most commonly used. Finally, neural network approaches exploit a wide variety of network architectures (Hopfield neural networks, self-organizing maps, feedforward neural networks). Generally, unsupervised approaches are preferable, since providing the target class for each pixel of an image is very difficult.
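To make the histogram-thresholding idea concrete, the following sketch selects a threshold for a single gray-level channel using Otsu's between-class variance criterion (a standard choice, named here explicitly since the text does not fix one); the function name and the toy pixel data are our own assumptions. For a color image, one such threshold would be computed per component.

```python
# A minimal sketch of histogram thresholding on one gray-level channel,
# using Otsu's criterion (maximize the between-class variance). For color
# images the same scheme would be applied to each color component.

def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing the between-class variance,
    where class 0 holds values <= t and class 1 holds values > t."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    w0 = cum0 = 0
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w0 += hist[t]                    # pixel count of class 0
        if w0 == 0:
            continue
        w1 = total - w0                  # pixel count of class 1
        if w1 == 0:
            break
        cum0 += t * hist[t]
        mu0 = cum0 / w0                  # class means
        mu1 = (total_sum - cum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two well-separated pixel populations: the threshold falls between them.
pixels = [10] * 50 + [12] * 30 + [200] * 40 + [210] * 20
assert 12 <= otsu_threshold(pixels) < 200
```

As the text notes, such per-component thresholds become hard to choose on tristimulus color data, which is what motivates the clustering and region-based alternatives above.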
In the following, we describe the segmentation method we used to represent images. The proposed segmentation algorithm is independent of the chosen color space, which can be selected according to the particular application at hand. The segmentation algorithm can be sketched as follows:

• A K-means clustering (Duda and Hart, 1973) of the pixels belonging to each image is performed; the clustering algorithm minimizes the Euclidean distance (defined in the chosen color space) of each pixel from its centroid.

• At the end of the K-means step, a region-growing procedure is carried out to reduce the number of regions.

In practice, the number of initial clusters k is chosen to be approximately equal to the number of regions into which the image should be correctly divided. Nevertheless, this initial choice is not crucial for the whole segmentation process, due to the subsequent region-growing phase, during which the number of regions with homogeneous features decreases. In fact, the number of regions computed via the K-means algorithm is greater than the number of clusters, since each cluster is divided into a certain number of connected components (regions). After the segmentation process, a structure that represents the arrangement of the resulting regions can be extracted. Such a structure normally also collects information associated with each node, which describes the geometric and visual properties of the associated region, while the edges that link the nodes of the structure describe the topological arrangement of the extracted regions. The resulting graph can be directed or undirected; moreover, the presence of an edge can represent adjacency or some hierarchical relationship. In the following, two kinds of structures, particularly suited to represent images, are described: RAGs and multiresolution trees.

C. Region Adjacency Graphs

The segmentation method yields a set of regions, each region being described by a vector of real-valued features. Moreover, the structural information related to the spatial relationships between pairs of regions can be coded by an undirected graph. Two connected regions R1, R2 are adjacent if, for each pair of pixels a ∈ R1 and b ∈ R2, there exists a path connecting a and b that lies entirely in R1 ∪ R2. The RAG is extracted from the segmented image by (see Figure 11):

1. Associating a node with each region; the real vector of features represents the node label.

2. Linking the nodes associated with adjacent regions with undirected edges.
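A possible sketch of this extraction, assuming the clustering phase has already assigned a cluster index to every pixel: connected components of equal index become the regions (nodes), and an undirected edge links every pair of 4-adjacent regions. All names and the tiny input matrix are illustrative assumptions, not the authors' implementation.

```python
# A hedged sketch of RAG extraction from a matrix of cluster indices
# (as produced, e.g., by the K-means step): connected components become
# regions, and undirected edges link pairs of 4-adjacent regions.
from collections import deque

def regions_and_rag(labels):
    h, w = len(labels), len(labels[0])
    region = [[-1] * w for _ in range(h)]     # region id of each pixel
    n_regions = 0
    for i in range(h):
        for j in range(w):
            if region[i][j] != -1:
                continue
            # flood-fill one connected component of equal cluster index
            q = deque([(i, j)])
            region[i][j] = n_regions
            while q:
                a, b = q.popleft()
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    x, y = a + da, b + db
                    if (0 <= x < h and 0 <= y < w and region[x][y] == -1
                            and labels[x][y] == labels[i][j]):
                        region[x][y] = n_regions
                        q.append((x, y))
            n_regions += 1
    edges = set()                             # undirected adjacency edges
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (0, 1)):
                x, y = i + di, j + dj
                if x < h and y < w and region[i][j] != region[x][y]:
                    edges.add(frozenset((region[i][j], region[x][y])))
    return region, n_regions, edges

# Two clusters, three regions: cluster 0 splits into two separate components.
labels = [[0, 0, 1],
          [1, 1, 1],
          [0, 0, 1]]
region, n, edges = regions_and_rag(labels)
assert n == 3 and len(edges) == 2
```

The per-region feature vectors (node labels) would then be computed from the pixels of each region in the chosen color space.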
FIGURE 11. The original image, the segmented image, and the extracted RAG.
A RAG takes into account both the topological arrangement of the regions and the symbolic visual information. Moreover, the RAG connectivity is invariant under translations and rotations (while labels are not), which is a useful property for a high-level representation of images. The information collected in each RAG can be further enriched by associating with each undirected edge a real vector of features (an edge label), which describes the mutual position of the regions associated with the linked nodes. This kind of structure is called a region adjacency graph with labeled edges (RAG-LE). For example, given a pair of adjacent regions i and j, the label of the edge (i, j) can be defined as the vector [D, A, B, C] (see Figure 12), where

• D represents the distance between the two barycenters.
• A measures the angle between the two principal inertial axes.
• B is the angle between the principal inertial axis of i and the line connecting the barycenters.
FIGURE 12. Features stored in the label of each edge. The features describe the relative position of the two regions.
• C is the angle between the principal inertial axis of j and the line connecting the barycenters.

D. Multiresolution Trees

Multiresolution trees (MRTs) are hierarchical data structures that are generated during the segmentation process, as are, for instance, quad-trees (Hunter and Steiglitz, 1979). While quad-trees can be used to represent a region-splitting process, MRTs describe the region-growing phase of the segmentation algorithm presented in the previous section. Other hierarchical structures, like monotonic trees (Song and Zhang, 2002) or contour trees (Morse, 1969; Roubal and Peucker, 1985; van Kreveld et al., 1997), can be exploited to describe the set of regions obtained at the end of the segmentation process, representing the inclusion relationships established among the region boundaries. However, MRTs represent both the final result of the segmentation and the sequence of steps that produces the final set of regions. An MRT is built by performing the following steps (see Figure 13):

• Each region obtained at the end of the clustering phase is associated with a leaf of the tree.

• During the region-growing phase, any time two regions are merged together, a new node is added to the tree as the father of the nodes corresponding to the merged regions.

• At the end of the region-growing step, a virtual node is added as the root of the tree. Nodes corresponding to the set of regions obtained at the end of the segmentation process become the children of the root node.

Each node of the MRT, except the root, is labeled by a real vector that describes the geometric and visual properties of the associated region. Moreover, each edge can be labeled by a vector that collects information regarding the merging process.
Considering a pair of nodes joined by an edge, the region associated with the child node is completely contained in the region associated with the father, and it is useful to associate some features with the edge to describe how the child contributes to the creation of the father. Useful features are, for instance, the color distance between the regions associated with the father and the child, the distance between their barycenters, and the ratio obtained by dividing the area of the region that corresponds to the child by the area of the region associated with the father. Note that MRTs do not directly describe the topological arrangement of the regions, which, however, can be inferred considering both the geometric features associated with each node (for instance, the coordinates
of the bounding box of each region can be stored in the node label) and the MRT structure. In the following section, both RAGs and MRTs are examined as possible graph representations of images when trying to solve an object detection problem using RNNs.

FIGURE 13. Multiresolution tree generation: red nodes represent vertices added to the structure when a pair of similar regions is merged together.
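The three MRT construction steps can be sketched as follows, assuming the region-growing phase is summarized by a log of merge operations; representing the tree as a parent map, and the merge sequence used in the example, are illustrative choices of ours.

```python
# A minimal sketch of multiresolution tree (MRT) construction from a
# region-growing trace: each initial region becomes a leaf, every merge
# adds a father node, and a virtual root joins the surviving regions.

def build_mrt(n_initial, merges):
    """n_initial: number of regions after the clustering phase.
    merges: sequence of (region_a, region_b) pairs merged during growing.
    Returns (parent, root): parent[i] is the father of node i."""
    parent = {i: None for i in range(n_initial)}
    alive = set(range(n_initial))        # regions not yet merged away
    next_id = n_initial
    for a, b in merges:
        parent[a] = parent[b] = next_id  # new node is father of both
        parent[next_id] = None
        alive -= {a, b}
        alive.add(next_id)
        next_id += 1
    root = next_id                       # virtual root node
    parent[root] = None
    for r in alive:                      # final regions become its children
        parent[r] = root
    return parent, root

# Four initial regions; regions 0 and 1 merge, then the result merges with 2.
parent, root = build_mrt(4, [(0, 1), (4, 2)])
assert parent[0] == parent[1] == 4 and parent[4] == 5
assert parent[5] == root and parent[3] == root
```

Node and edge labels (region features, color distances, area ratios) would be attached to the corresponding entries during the same pass.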
IV. OBJECT DETECTION IN IMAGES

A. Object Detection Methods

The ever-increasing performance of image acquisition techniques implies that computer vision systems can be deployed in desktop and embedded systems (Pentland, 2000). On the other hand, the useful exploitation of computer vision software requires understanding the content of the images that must be processed. Thus, researchers' efforts have recently been focused on understanding image content, with the aim of creating several
software tools that automatically annotate images stored in a database or that help robots understand the environment around them. A preliminary step in any image understanding system is locating significant objects. However, object detection is a challenging task because of the variability in scale, location, orientation, and pose of the instances of the object in which we are interested. Moreover, occlusions and lighting conditions also change the overall appearance of objects in images. A definition of the object detection problem, which extends the definition of face detection reported in Yang et al. (2002), is: "Given an arbitrary image, and assuming that we are interested in locating a particular object, the goal of object detection is to determine whether or not there is any object of interest and, if present, return the image location and extent of each instance of the object." The challenges associated with the object detection problem can be attributed to the following factors:

• Pose. The images of an object can vary because of the relative camera–object position, and some object features may become partially or wholly occluded.

• Object deformation. Nonrigid objects can appear deformed due to forces applied to them.

• Occlusions. Objects may be partially occluded by other objects.

• Image orientation. The images of an object vary under different rotations and translations with respect to the camera axis.

• Imaging conditions. When the image is acquired, factors such as lighting and camera characteristics affect the appearance of the objects.

There are many related problems derived from object detection. Object localization aims at determining the position of a single object in an image (Moghaddam and Pentland, 1997); this is a simplified detection problem under the assumption that an input image contains only one object. In object recognition or object identification, an input image is compared to a database and matches, if any, are reported.
Finally, object tracking methods continuously estimate the location and, possibly, the orientation of an object in an image sequence, in real time. Consequently, object detection is the preliminary step in any automated system that solves the above problems, and it can be seen as a two-class recognition problem in which each region of an image is classified either as an object (or part of one) or as an uninteresting region. Object detection methods can be classified into four main categories (Yang et al., 2002):

• Knowledge-based;
• Feature invariant;
• Template matching;
• Appearance-based.

Knowledge-based methods exploit human knowledge about the searched objects and use some rules to describe the object models. Those rules are then used to detect and localize objects that match the predefined models. A possible drawback of these approaches is the difficulty of translating human knowledge into well-defined rules: if the rules are detailed (i.e., strict), they may fail to detect objects that do not match all the rules; if the rules are too general, they may yield many false positives. Moreover, it is difficult to extend this approach to detect objects in different poses, due to the inability to enumerate all possible cases. In contrast, the aim of feature-invariant approaches (McKenna et al., 1998; Leung et al., 1998) is to define a set of features that is invariant with regard to object orientation, lighting conditions, dimension, etc. The underlying assumption is based on the observation that humans can effortlessly detect objects in different poses and lighting conditions, so there must exist properties or features that are invariant over these variabilities. Template matching methods store several patterns of objects and describe each pattern by visual and geometric features. The correlation between an input image and the stored patterns is computed to detect objects (Shina, 1995). However, this class of techniques has often proved inadequate for object detection in images, since it cannot effectively deal with variations in scale and pose. Finally, in contrast to template matching methods, appearance-based methods (Moghaddam and Pentland, 1997; Schneiderman and Kanade, 2000) learn the templates from examples. In general, appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of images that either contain or do not contain a certain object. Many appearance-based methods can be understood in a probabilistic framework.
An image, or a representation of it, is viewed as a random variable x, which is characterized by the class-conditional density functions p(x|object) and p(x|nonobject). Bayesian or maximum likelihood classifiers can be used to decide whether a candidate image location represents an object. Unfortunately, a straightforward implementation of Bayesian classification is not possible because of the high dimensionality of x. Generally, image patterns are projected onto a lower dimensional space and then a discriminant function is used for classification, or a nonlinear decision surface can be learned using multilayer neural networks (Carleson et al., 1999). Recently, methods based on SVMs were also proposed (Papageorgiou et al., 1998). Those models project the patterns onto a higher dimensional space and then form a decision surface between the projected object and nonobject patterns, under the assumption that the determination of the decision surface is easier in
higher dimensional space than in the original pattern space. Among the object detection methods, those based on learning algorithms have recently attracted much attention and have demonstrated excellent results (Yang et al., 2002). In the following, we present a machine learning technique, based on RNNs, that allows us to detect objects in images.

B. Recursive Neural Networks for Detecting Objects in Images

Recently, RNNs have been proposed as a tool for object detection in images. These models allow us to exploit a structured representation of images in a paradigm based on learning from examples.

1. Learning Environment Setup

The proposed object detection method assumes a graph-based representation of images, which can be obtained by segmenting the images as described in Section III. Both the training of RNNs and the subsequent exploitation of trained networks to detect objects depend on the kind of graph structure used to represent images. Thus, in the following, we describe how these tasks are performed when images are represented by RAGs or MRTs.

a. Region Adjacency Graphs. If a RAG (or a RAG-LE) is extracted to represent an image, a target equal to 1 is attached to each node of the RAG that corresponds to a part of the object in which we are interested, whereas a target equal to 0 is attached otherwise. The target association is sketched in Figure 14. In this example, we want to localize the "toy car": the black nodes correspond to parts of the car and have target 1, while the white nodes correspond to parts of other objects and have target 0. The target association is a crucial step since, during this phase, we provide the RNN with the information that defines the model of the object. During
FIGURE 14. The extracted RAG and the associated targets.
the segmentation, some spurious regions may be associated with an area of the image that only partially corresponds to the object of interest. If the target association is performed manually, the supervisor who prepares the training set chooses, from each segmented image, the set of regions that belong to the object. Otherwise, if an automatic association is performed, ground-truth information is exploited to associate the targets. In the latter case, the ratio between the area of the spurious region that intersects the bounding box of the object and the whole area of the spurious region can be calculated to decide whether the region belongs to the object. After the target association, since all the RNN models described in Section II can process only directed graphs, each RAG must be transformed into one or more directed graphs. The transformation performed depends on the computation scheme realized by the RNN model. In fact, if the selected RNN realizes a transduction from a graph G to a graph G′ (see Section II.B), the RAG is transformed into a unique DAG, while if it realizes a supersource transduction, the RAG is converted into a forest of trees, exploiting the recursive-equivalent transform described in Section II.D.1. If we consider a transduction from a graph G to a graph G′, the RAG (RAG-LE) is transformed into a DPAG (DAG-LE) by applying the following steps (see Figure 15):

1. A starting region (root node) is chosen.

2. An ordering is imposed among the adjacent regions; for example, adjacent regions can be ordered by scanning the region boundary clockwise, starting from the vertical axis.

3. The graph is constructed recursively using a breadth-first visit of the nodes, starting from the root node; when a new node a is visited, the edges from a to the nodes bk that have not already been visited are considered; the direction of the edges is chosen to be from a to bk; moreover, the order of these arcs is defined using the ordering established by the previous rule;
FIGURE 15. (a) The original image. (b) The segmented image and the corresponding RAG. (c) The DPAG obtained from the RAG using the top-left region as the starting node.
finally, the target of node a is associated with the corresponding node in the DPAG (DAG-LE).

Even if this transformation can be performed very efficiently, it presents some limitations. First, the arbitrary choice of the starting region affects the DPAG generation. Moreover, since the RNN computation proceeds from the frontier of the DPAG to the root node, the network predictions associated with the leaves are performed considering only their labels (i.e., only the visual and geometric properties of the regions they represent), since the leaves have no descendants and the topological arrangement of the corresponding regions is unknown. However, this limitation can be partially overcome by transforming each RAG into a set of DPAGs, obtained by considering a random set of nodes belonging to the original RAG as root nodes. When considering an RNN model that performs supersource transductions, the transformation procedure takes an RAG R, along with a selected node n, as input, and produces a tree T having n as its root. The method must be repeated for each node of the RAG or, more practically, for a random set of nodes. It can be proved that the forest of trees built from R is recursive equivalent to R, that is, the RNN behavior is the same whether the network processes R or the forest of trees (Bianchini et al., 2002, 2006). The first step of the procedure is a preprocessing phase that transforms R into a directed RAG G by replacing each undirected edge with a pair of directed edges. If the original undirected graph is a RAG-LE, each edge in the pair is assigned the same label as the original undirected edge. G is unfolded into T by the following algorithm:

1. Insert a copy of n in T.

2. Visit G, starting from n, using a breadth-first strategy; for each visited node v, insert a copy of v into T, and link v to its parent node, preserving the information attached to each edge, if it exists.

3.
Repeat step 2 until a predefined stop criterion is satisfied and, in any case, until all edges have been visited at least once.

4. Attach the target associated with n to the root node of T.

The above procedure represents a possible implementation of Algorithm 2, which was presented in Section II.D.1 as a general framework for processing cyclic graphs. Note that the preprocessing step that transforms a RAG into a directed structure generates a directed cyclic structure, which cannot be directly processed by an RNN. The above unfolding strategy produces a recursive-equivalent tree that holds the same information contained in R. With respect to the chosen stop criterion, if the breadth-first visit is halted when all the arcs have been visited once, the minimal recursive-equivalent tree is obtained (minimal unfolding—see Figure 16a). However, other stop criteria are acceptable. For example, each edge can be visited once, then the visit
proceeds starting from each leaf node v ∈ T: if a stochastic variable xv is true, then all the children of v are added to T (probabilistic unfolding—see Figure 16b). Otherwise, we can replace the breadth-first visit with a random visit of the graph (random unfolding—see Figure 16c). In this case, starting from the current node v, the visit proceeds or not depending on a set of stochastic variables xv1, ..., xvo, one for each arc outgoing from v. The probability of visiting a given arc is uniform over the whole graph structure. In any case, each edge must be visited at least once to guarantee the recursive equivalence between R and T.

FIGURE 16. The transformation from an RAG-LE to a recursive-equivalent tree. The dimension of the recursive-equivalent tree depends on the stop criterion chosen during the unfolding of the directed RAG.

From a cognitive point of view, the unfoldings performed during the transformation from an RAG to a forest of trees seem to reproduce the behavior of a human observer who pays attention to the parts that constitute
the image to detect the possible presence of an object. However, humans usually do not need to analyze the whole image to detect a particular object. This observation suggests relaxing the recursive-equivalence constraint and halting the breadth-first visit before each edge has been visited once. For instance, for each node t belonging to a directed RAG, we can start the unfolding from t, halting the visit when all the nodes at a certain distance from t have been reached. This generates a forest of trees that are not recursive equivalent to the original structure while, at the same time, simulating the behavior of a human who pays attention to an object and to a certain area around it. Independently of the chosen unfolding strategy, we can consider a set of images as a training set, transform each image into a RAG (or a RAG-LE), associate the correct target with each node of the undirected graph, and finally extract the corresponding forest of trees for each RAG. Every RNN model presented in Section II is able to process a tree, and so we can train an RNN to predict whether the root node of each tree is a part of the object in which we are interested.

b. Multiresolution Trees. If images are represented by MRTs, no preliminary transformations are needed, since these structures are directed and can directly constitute an input for RNNs. As described previously in this section, when RNNs deal with transformed RAGs, they predict whether each region is a part of the object in which we are interested. If MRTs are exploited, instead, each node has an associated target that states whether the node represents a part of the object, whether it corresponds to a region that contains the object or part of it, or, finally, whether it corresponds to a region that does not contain the object (see Figure 17). Thus, RNNs that process MRTs solve a multiclass classification problem.
Moreover, since each node has an associated target, the RNN computes a transduction from a graph to another graph. The targets can also be associated with only a subset of the nodes. In particular, since RNNs process the leaves only on the basis of the visual and geometric features stored in the node labels, targets may be left unassigned until a certain distance from the frontier is reached. With this method, RNNs are guaranteed to perform a prediction that depends both on the symbolic information and on the topological arrangement of the nodes. RNNs process MRTs straightforwardly, and the predictions of a trained network locate the subtrees that contain the object. The targets can be associated by exploiting the same methods described for RAGs. The main limitation regarding MRTs is the height of the tree. In fact, the RNN training can suffer from the so-called "long-term dependencies," which were originally investigated with regard to recurrent neural networks (Bengio et al., 1994). The node states computed by an RNN are marginally affected by the information collected in far descendants: given a generic node, the contribution of a descendant to the state of the considered node decreases as the distance from the node increases. In practice, RNNs have difficulty extracting properties related to long-term memory if the processed structure is too deep. This limitation can be partially overcome by cutting MRTs at a certain depth, provided that the information needed to detect the objects is maintained. In particular, nodes that correspond to parts of the object must not be discarded. For example, if the object dimensions are known a priori (at least approximately), we can consider discarding nodes that are associated with regions much smaller than those representing the target object.

FIGURE 17. Targets associated with the nodes of an MRT. Gray nodes represent regions that contain the object or part of it, black nodes represent regions that correspond to part of the object, and white nodes represent regions that do not contain the object.

2. Detecting Objects

After the setup of the learning environment, we need to select the RNN architecture. Unfortunately, no rules exist to guide this choice, and a trial-and-error procedure must be carried out to determine the best RNN. Then, the RNN can be trained using the BPTS algorithm (see Section II.C). Given a trained RNN and an input image, the detection procedure differs depending on whether a RAG-based or an MRT-based representation is used. If the image is represented by a RAG, the detection of an object is obtained as follows:

1. The image is segmented and the corresponding RAG (RAG-LE) is built.

2. The RAG (RAG-LE) is unfolded, producing a forest of trees.
3. Each tree is processed by the trained RNN. The network predicts whether the root node of each tree is a part of the object or not.

4. Adjacent regions predicted as parts of the object are merged together to compute the minimum bounding boxes that contain the detected objects.

If the image is instead represented by an MRT, the detection is performed as follows:

1. The image is segmented and the corresponding MRT is built.

2. The MRT is processed by the trained RNN. The network predicts, for each node of the tree, whether it is a part of the object, whether it contains the object, or whether it does not contain the object.

3. Regions predicted as parts of the object are merged together to compute the minimum bounding boxes that contain the detected objects.

Even though the MRT related to an image usually contains more nodes than its RAG, MRT-based detection can be performed very efficiently. In fact, dealing with MRTs allows us to avoid the transformation from undirected to directed structures, which can be particularly time-consuming. Independent of the kind of structure used to represent images, the detection technique described has several advantages. First, the use of structures allows us to obtain a representation of the images and of the object that is invariant with regard to rotations, translations, and scale.8 Moreover, the user can define the model of the object in which he or she is interested by collecting a representative set of images as a training set and specifying the "concept" of the object to be detected (by associating a target with each node of the graphs in the training set). In this way, the method described is completely object independent: the RNN, during training, builds its model of the object, following the "concept" expressed by the supervisor.
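A hedged sketch of the RAG-based detection loop described above: each node of the directed RAG is unfolded into its minimal recursive-equivalent tree (a breadth-first visit halted when every directed edge has been visited once), a stand-in `predict` function plays the role of the trained RNN, and the bounding boxes of the positive regions are merged. All names, the toy graph, and the stub classifier are our own assumptions.

```python
# Sketch of RAG-based detection: minimal unfolding of each node, root
# classification (stubbed), and merging of positive regions' boxes.
from collections import deque

def minimal_unfolding(adj, n):
    """Unfold the directed RAG (adjacency lists `adj`) from node n into a
    tree, visiting each directed edge exactly once (breadth-first)."""
    tree = {0: (n, [])}                 # tree node -> (RAG node, children)
    next_id = 1
    visited_edges = set()
    q = deque([0])
    while q:
        t = q.popleft()
        v = tree[t][0]
        for u in adj[v]:
            if (v, u) in visited_edges:
                continue
            visited_edges.add((v, u))
            tree[next_id] = (u, [])
            tree[t][1].append(next_id)
            q.append(next_id)
            next_id += 1
    return tree

def detect(adj, boxes, predict):
    """Merge the bounding boxes (x0, y0, x1, y1) of the regions whose
    unfolded tree is classified as part of the object."""
    positive = [v for v in adj if predict(minimal_unfolding(adj, v))]
    if not positive:
        return None
    xs0, ys0, xs1, ys1 = zip(*(boxes[v] for v in positive))
    return (min(xs0), min(ys0), max(xs1), max(ys1))

# Three regions in a chain; the stub marks regions 0 and 1 as "object."
adj = {0: [1], 1: [0, 2], 2: [1]}
boxes = {0: (0, 0, 4, 4), 1: (4, 0, 8, 4), 2: (8, 0, 12, 4)}
predict = lambda tree: tree[0][0] in (0, 1)   # placeholder for a trained RNN
assert detect(adj, boxes, predict) == (0, 0, 8, 4)
```

In the actual system, `predict` would be the trained RNN applied to the labeled tree, and step 4's merge would be restricted to adjacent positive regions; the sketch merges all positives for brevity.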
During the development of the object detection technique described above, several experiments were carried out to evaluate the effectiveness of the proposed approach. Most of the experiments focused on detecting faces in images; nevertheless, the proposed method does not exploit any a priori information about the particular object model and, therefore, is independent of the problem at hand. For any kind of experimentation, the choice of the dataset plays a crucial role. Although several datasets have been proposed in the literature for evaluating face and object detection methods, none of them fits our requirements. In fact, benchmark datasets usually contain gray-level images, while our method works in the more general case of color images. Moreover, benchmark datasets very often collect images that include only one face or one object, so they are useful for evaluating face

8 This holds true if the change in scale does not alter the information contained in the image.
RECURSIVE NEURAL NETWORKS
49
FIGURE 18. Examples of faces localized by the proposed method in images acquired from TV video sequences.
or object localization methods. Moreover, the objects are often centered in the images and in a frontal pose; these controlled acquisition conditions do not allow us to evaluate the robustness of the methods with regard to variations in scale, position, and pose. Finally, ground-truth information is often unavailable for the benchmark datasets, so it is not always clear whether an object should be considered present, for instance, when it is only partially visible. For a complete list of benchmark datasets, the reader can refer to Yang et al. (2002). We performed our experimentation using three distinct datasets. Two datasets were chosen with the aim of using our detection method to locate faces: the first contains images acquired from TV video sequences, while the second includes images acquired by an indoor camera. The third dataset was created artificially using the objects of the COIL-100 dataset (Nene et al., 1996a). In the following, the three datasets are described, together with the main results obtained.

a. TV Video Sequences. The experimental dataset, acquired from TV video sequences, contains 201 images and 238 faces (each image contains at least one face). The appearance of faces in the images varies with respect to orientation, lighting conditions, dimension, etc. (see Figure 18). The images were divided into three sets: training, validation, and test sets. Both the training and validation sets contain 50 images, whereas 101 images (118 faces) constitute the test set. This dataset was used to understand which kind of unfolding strategy is most promising to transform an RAG or RAG-LE into a forest of trees9 (see Section II.D.1), and to compare the performances

9 These results were already discussed in Bianchini et al. (2003a, 2003b).
50
BIANCHINI
et al.
of standard RNNs for DPAGs, RNNs-LE, and feedforward neural networks.10 Each image was segmented in both the RGB and the HSV color spaces, obtaining RAGs with 90 nodes, on average. Each node of the extracted RAG has an associated label that collects some geometric information (area, perimeter, barycenter coordinates, bounding box coordinates, and moments) and color information. Moreover, for each RAG-LE, an edge label was added, whose elements are the distance between the barycenters of the regions and the angles formed by the intersection of their principal inertial axes. With respect to the transformation from undirected structures to forests of trees, the most promising unfolding strategy is probabilistic unfolding, which allows us to reach a recall of 90% and a precision of 72%, on average. As for the comparison among models, RNNs-LE outperform both the original RNNs defined for DPAGs and traditional feedforward neural networks. These results were evaluated considering only the accuracy of the RNNs, without merging adjacent regions predicted as parts of a face. RNNs-LE reached a global accuracy of 87% (94% on nonface regions and 81% on face regions), as opposed to accuracies of 81% and 79% reached by RNNs for DPAGs and feedforward neural networks, respectively. The approach does not use a priori heuristics specific to the face detection problem; from this point of view, it is completely different from other solutions described in the literature. Moreover, no postprocessing of the detected bounding boxes was performed. False positives, for instance, often correspond to very small bounding boxes, or to boxes whose height-to-width ratio is very far from the usual ratio obtained dividing the height of a face by its width.
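A geometric filter of the kind alluded to here can be sketched as follows; the minimum-area and aspect-ratio thresholds are illustrative assumptions, not values from the original experiments:

```python
# Naive geometric postprocessing over detected bounding boxes: discard very
# small boxes and boxes whose height/width ratio is implausible for a face.
# Both thresholds below are hypothetical, chosen only for illustration.

MIN_AREA = 100            # minimum box area, in pixels
RATIO_RANGE = (0.8, 2.0)  # plausible height-to-width ratios for a face

def plausible_face_box(box, min_area=MIN_AREA, ratio_range=RATIO_RANGE):
    xmin, ymin, xmax, ymax = box
    width, height = xmax - xmin, ymax - ymin
    if width <= 0 or height <= 0 or width * height < min_area:
        return False  # degenerate or very small box
    lo, hi = ratio_range
    return lo <= height / width <= hi

boxes = [(0, 0, 40, 50), (10, 10, 14, 13), (0, 0, 30, 200)]
kept = [b for b in boxes if plausible_face_box(b)]
```

Here the second box is rejected for being too small and the third for its extreme aspect ratio, so only the first survives.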
Thus, a naive postprocessing procedure that checks some geometric properties of the detected bounding boxes allows us to significantly improve the performance.

b. Images Acquired by an Indoor Camera. The experimental dataset contains 500 images and 348 faces (each image contains at most one face) and was acquired by an indoor camera placed in front of a door. One person at a time came in through the door and walked until he or she was out of the camera's field of view. Each image corresponds to a frame of the acquired scene. We are interested in detecting only the face position: no tracking of the faces was performed, and no information derived from the movement of the object was exploited. The faces appear in different orientations, dimensions, and positions (see Figure 19). Both the training and the cross-validation sets contain 100 images, whereas 300 images (199 faces) constitute the test set. This experimentation was performed to investigate how indoor light conditions affect the performance of our method. In fact, images acquired from

10 These results were presented in Bianchini et al. (2005b).
RECURSIVE NEURAL NETWORKS
51
FIGURE 19. Variability of face appearance in the indoor camera dataset. Faces vary with regard to dimension and pose and can be partially occluded. The images used in the experimentation were provided by ELSAG S.p.A.; all the images were used strictly for research purposes and are published with the permission of the persons depicted.
TV video sequences usually have controlled light conditions, which allow high-quality video to be obtained. In a TV studio, the lights are oriented in a way that minimizes shadows; as a result, in our TV video sequence dataset the skin color of the depicted persons is close to pink (except for dark-skinned persons). By contrast, the acquisition conditions used to collect our indoor dataset produce a skin color that ranges from green to gray, and several shadows are visible on faces and other objects. This particular skin color is due to the neon lighting, which causes a prevalence of green not only in the skin color but in the whole image. This situation limits the usefulness of color as a discriminative feature. Therefore, a fundamental contribution can be provided by exploiting the information derived from the mutual position of the regions, and the proposed object detection model actually uses this information to improve its performance. Moreover, in these experiments only RNNs-LE are used, since we are interested in determining how the implementation of the function φ (the function exploited to compute the average contribution of the children to the state of their parents; see Section II.B.2) affects the detection results. During this experimentation, each image was represented in the RGB color space and segmented, producing an RAG-LE with 100 nodes, on average. The geometric and visual features stored in the label associated with each node, and the mutual spatial position represented by the label associated with each edge, are exactly the same as described for the experimentation on the TV video sequence dataset. Each RAG-LE was subsequently unfolded using the probabilistic unfolding strategy described in Section IV.B. To obtain balanced training and cross-validation sets, each RAG-LE corresponding to a training or a validation image was unfolded starting the breadth-first visit from all the nodes belonging to a part of a face and from
52
BIANCHINI
et al.
a randomly chosen set of nodes that do not belong to parts of faces. We assume that the number of nodes corresponding to parts of faces is smaller than the number of other nodes, an assumption that always holds for our dataset. The test set, instead, is obtained by performing the probabilistic unfolding for all the nodes belonging to each RAG-LE. In fact, to locate faces, the trained RNN-LE must be able to predict whether each node in the recursive-equivalent trees represents a part of a face. Several RNNs-LE were trained to determine how the implementation of the function φ affects the detection results. In fact, the function φ can be implemented using a feedforward neural network [neural φ; see Eq. (4)] or using an ad hoc model (linear φ). In the neural case, φ can be obtained considering a two-layer feedforward neural network with sigmoidal hidden units and linear output units, while in the linear case a three-dimensional weight matrix can be considered [see Eq. (6)]. As discussed in Bianchini et al. (2004b, 2005a), choosing the neural φ allows us to obtain better performance. The accuracy rate obtained by RNNs with neural φ is 87%, on average, while the choice of linear φ yields an average accuracy of 82%. Moreover, all the tested RNN architectures succeeded in their learning task, showing that the proposed model is able to generalize even though the training is performed on perfectly balanced datasets. Since the focus of these experiments is mainly on the RNN-LE model, the postprocessing procedure was not carried out. However, the results achieved on the TV video sequence dataset show that the accuracy in the detection of the bounding boxes is generally greater than the accuracy reached by the RNN; the correct bounding boxes can, in fact, be computed even if some regions belonging to faces are not correctly classified.

c. Artificial Dataset Generated from COIL-100.
The COIL-100 dataset contains 7200 color images of 100 objects (72 images per object, one at every 5 degrees of rotation; see Figures 20 and 21) and has been used in the past to evaluate the performances of three-dimensional object recognition systems (Nene et al., 1996b; Pontil and Verri, 1998). The object appearance varies in both its geometric and reflectance characteristics. We created some artificial datasets by pasting onto a dark canvas, for each generated image, three COIL objects chosen randomly with regard to the depicted object and to its degree of rotation. The pasting position of the first object was chosen randomly, while the second and third objects were placed while checking that objects already present on the canvas were not completely occluded. The generated images have the same properties as the COIL collection; thus, objects can vary in appearance with regard
RECURSIVE NEURAL NETWORKS
53
FIGURE 20. Images of the 100 objects of the COIL database. (All trademarks remain the property of their respective owners and are used strictly for educational and scholarly purposes, without intent to infringe on the rights of the copyright owners.)
FIGURE 21. Twenty-four of the 72 images of a COIL object.
to orientation, lighting conditions, and scale; moreover, some objects can be partially occluded (see Figure 22). The above generation technique allows us to create, at the same time, a set of images and their associated ground truth.
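The generation procedure can be sketched as follows, with rectangular patches standing in for the COIL images; the canvas size, patch sizes, and retry limit are assumptions made only for the example. The labeled canvas doubles as the ground truth, as the text notes:

```python
import random
import numpy as np

# Illustrative sketch of the artificial dataset generation: three "objects"
# (rectangular patches here, standing in for COIL images) are pasted at
# random positions on a dark canvas, rejecting any placement that would
# completely occlude an object already on the canvas.

def paste_objects(canvas_shape=(128, 128),
                  sizes=((40, 40), (40, 40), (40, 40)),
                  seed=0, max_tries=100):
    rng = random.Random(seed)
    canvas = np.zeros(canvas_shape, dtype=int)  # 0 = dark background
    for obj_id, (h, w) in enumerate(sizes, start=1):
        for _ in range(max_tries):
            top = rng.randrange(canvas_shape[0] - h)
            left = rng.randrange(canvas_shape[1] - w)
            trial = canvas.copy()
            trial[top:top + h, left:left + w] = obj_id
            # Every object pasted so far must keep at least one visible pixel.
            if all((trial == k).any() for k in range(1, obj_id)):
                canvas = trial
                break
    # The canvas itself is the ground truth: pixel value = object identity.
    return canvas

canvas = paste_objects()
```

The first object is always accepted; later placements are retried until no earlier object is fully covered.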
FIGURE 22. Examples of images generated by pasting COIL objects on a black canvas. (All trademarks remain the property of their respective owners and are used strictly for research purposes, without intent to infringe on the rights of the copyright owners.)
To evaluate the ability of RNNs to compute transductions that take a graph as input and produce another graph as output, we generated a dataset whose images always contain a white piggybank. The dataset collects 250 images, and each image contains exactly one piggybank. The images were segmented, producing RAGs-LE with about 70 nodes. Subsequently, each RAG-LE was transformed into a DAG-LE, using the procedure described in Section IV.B.1, and a target was associated with each node, indicating whether the node corresponds to a part of the piggybank. Several RNNs-LE were trained and evaluated, varying their architecture. The accuracy obtained was, on average, equal to 87%. Moreover, most of the misclassified regions belonged to the leaves of the DAGs-LE. This situation is probably due to the absence of topological information for the regions associated with the leaves, and it shows that this type of information plays a crucial role in our object detection approach.
REFERENCES

Aho, A., Hopcroft, J., Ullman, J. (1983). Data Structures and Algorithms. Addison-Wesley, Reading, MA.
Bengio, Y., Frasconi, P., Simard, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 157–166.
Bezdek, J. (1994). What is computational intelligence? In: Computational Intelligence: Imitating Life. IEEE Press, New York, pp. 1–12.
Bianchini, M., Gori, M., Scarselli, F. (2001a). Theoretical properties of recursive networks with linear neurons. IEEE Trans. Neural Netw. 12 (5), 953–967.
Bianchini, M., Gori, M., Scarselli, F. (2001b). Processing directed acyclic graphs with recursive neural networks. IEEE Trans. Neural Netw. 12 (6), 1464–1470.
Bianchini, M., Gori, M., Scarselli, F. (2002). Recursive processing of cyclic graphs. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2002), pp. 154–159.
Bianchini, M., Mazzoni, P., Sarti, L., Scarselli, F. (2003a). Face spotting in color images using recursive neural networks. In: Gori, M., Marinai, S. (Eds.), IAPR—TC3 International Workshop on Artificial Neural Networks in Pattern Recognition (Florence, Italy).
Bianchini, M., Gori, M., Mazzoni, P., Sarti, L., Scarselli, F. (2003b). Face localization with recursive neural networks. In: Marinaro, M., Tagliaferri, R. (Eds.), Neural Nets—WIRN ’03, Vietri (Salerno, Italy). Springer, Berlin.
Bianchini, M., Gori, M., Sarti, L., Scarselli, F. (2003c). Backpropagation through cyclic structures. In: Cappelli, A., Turini, F. (Eds.), LNAI—AI*IA 2003: Advances in Artificial Intelligence (Pisa, Italy), LNCS. Springer, Berlin, pp. 118–129.
Bianchini, M., Maggini, M., Sarti, L., Scarselli, F. (2004a). Recursive neural networks for processing graphs with labelled edges. In: Proceedings of ESANN 2004 (Bruges, Belgium), pp. 325–330.
Bianchini, M., Maggini, M., Sarti, L., Scarselli, F. (2004b). Recursive neural networks for object detection. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2004), pp. 1911–1915.
Bianchini, M., Maggini, M., Sarti, L., Scarselli, F. (2005a). Recursive neural networks for processing graphs with labelled edges: Theory and applications. Neural Netw. 18, 1040–1050.
Bianchini, M., Maggini, M., Sarti, L., Scarselli, F. (2005b). Recursive neural networks learn to localize faces. Pattern Recognit. Lett. 26, 1885–1895.
Bianchini, M., Gori, M., Sarti, L., Scarselli, F. (2006). Recursive processing of cyclic graphs. IEEE Trans. Neural Netw. 17, 10–18.
Bianucci, A., Micheli, A., Sperduti, A., Starita, A. (2001). Analysis of the internal representations developed by neural networks for structures applied to quantitative structure–activity relationship studies of benzodiazepines. J. Chem. Inf. Comput. Sci. 41 (1), 202–218.
Boser, B., Guyon, I., Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In: Haussler, D. (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. ACM Press, New York, pp. 144–152.
Carlson, A., Cumby, C., Rosen, J., Roth, D. (1999). The SNoW learning architecture. Tech. Rep. UIUCDCS-R-99-2101, University of Illinois at Urbana–Champaign, Computer Science Department.
Chappell, G., Taylor, J. (1993). The temporal Kohonen map. Neural Netw. 6, 441–445.
Cheng, H.D., Yang, X.H., Sun, Y., Wang, J.L. (2001). Color image segmentation: Advances and prospects. Pattern Recognit. 34, 2259–2281.
Collins, M., Duffy, N. (2002). Convolution kernels for natural language. In: Dietterich, T., Becker, S., Ghahramani, Z. (Eds.), Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
de Mauro, C., Diligenti, M., Gori, M., Maggini, M. (2003). Similarity learning for graph based image representation. Pattern Recognit. Lett. 24 (8), 1115–1122.
Diligenti, M., Gori, M., Maggini, M., Martinelli, E. (2001). Adaptive graphical pattern recognition for the classification of company logos. Pattern Recognit. 34, 2049–2061.
Duda, R., Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Elman, J. (1990). Finding structure in time. Cog. Sci. 14, 179–211.
Euliano, N., Principe, J. (1999). A spatiotemporal memory based on SOMs with activity diffusion. In: Oja, E., Kaski, S. (Eds.), Kohonen Maps. Elsevier, Amsterdam.
Farkas, I., Miikkulainen, R. (1999). Modeling the self-organization of directional selectivity in the primary visual cortex. In: Proceedings of the International Conference on Artificial Neural Networks. Springer, pp. 251–256.
Frasconi, P., Gori, M., Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Trans. Neural Netw. 9 (5), 768–786.
Fu, K., Mui, J.K. (1981). A survey on image segmentation. Pattern Recognit. 13, 3–16.
Gärtner, T. (2003). A survey of kernels for structured data. SIGKDD Explorations 5 (1), 49–58.
Gärtner, T., Flach, P., Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In: Proceedings of the 16th Annual Conference on Computational Learning Theory and the 7th Kernel Workshop, pp. 129–143.
Gori, M., Maggini, M., Sarti, L. (2003). A recursive neural network model for processing directed acyclic graphs with labeled edges.
In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2003), pp. 1351–1355.
Gori, M., Hagenbuchner, M., Scarselli, F., Tsoi, A.-C. (2004). Graphical-based learning environment for pattern recognition. In: Proceedings of SSPR 2004.
Gori, M., Maggini, M., Sarti, L. (2005a). Exact and approximate graph matching using random walks. IEEE Trans. Pattern Anal. Mach. Intell. 27 (7), 1100–1111.
Gori, M., Monfardini, G., Scarselli, F. (2005b). A new model for learning in graph domains. In: Proceedings of IJCNN 2005, vol. 2, pp. 729–734.
Günter, S., Bunke, H. (2001). Validation indices for graph clustering. In: Jolion, J.-M., Kropatsch, W., Vento, M. (Eds.), Proceedings of the Third IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, pp. 229–238.
Hagenbuchner, M., Tsoi, A.-C., Sperduti, A. (2001). A supervised self-organizing map for structured data. In: Allinson, N., Yin, H., Allinson, L., Slack, J. (Eds.), Advances in Self-Organizing Maps. Springer, Berlin, pp. 21–28.
Hagenbuchner, M., Sperduti, A., Tsoi, A.-C. (2003). A self-organizing map for adaptive processing of structured data. IEEE Trans. Neural Netw. 14 (3), 491–505.
Hammer, B. (1998). On the approximation capability of recurrent neural networks. In: NC’98, International Symposium on Neural Computation (Vienna, Austria).
Hammer, B. (1999). Approximation capabilities of folding networks. In: ESANN ’99 (Bruges, Belgium), pp. 33–38.
Hammer, B., Micheli, A., Stricker, M., Sperduti, A. (2004). A general framework for unsupervised processing of structured data. Neurocomputing 57, 3–35.
Haralick, R., Shapiro, L. (1985). Image segmentation techniques. Comput. Vision Graph. Image Process. 29, 100–132.
Healey, G., Binford, T. (1989). Using color for geometry-insensitive segmentation. J. Opt. Soc. Am. 22 (1), 920–937.
Hoekstra, A., Drossaers, M. (1993). An extended Kohonen feature map for sentence recognition. In: Gielen, S., Kappen, B. (Eds.), Proceedings of the International Conference on Artificial Neural Networks. Springer, Berlin, pp. 404–407.
Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366.
Hunter, G.M., Steiglitz, K. (1979). Operations on images using quadtrees. IEEE Trans. Pattern Anal. Mach. Intell. 1, 145–153.
James, D., Miikkulainen, R. (1995). SARDNET: A self-organizing feature map for sequences. In: Tesauro, G., Touretzky, D., Leen, T. (Eds.), Advances in Neural Information Processing Systems, vol. 7. MIT Press, Cambridge, MA, pp. 577–584.
Kangas, J. (1990). Time-delayed self-organizing maps. In: Proceedings of IEEE/INNS IJCNN, vol. 2, pp. 331–336.
Kohonen, T., Somervuo, P. (2002). How to make large self-organizing maps for nonvectorial data. Neural Netw. 15 (8–9), 945–952.
Koskela, T., Varsta, M., Heikkonen, J., Kaski, K. (1998a). Recurrent SOM with local linear models in time series prediction. In: Verleysen, M. (Ed.), Proceedings of the 6th European Symposium on Artificial Neural Networks, pp. 167–172.
Koskela, T., Varsta, M., Heikkonen, J., Kaski, K. (1998b). Time series prediction using recurrent SOM with local linear models. In: Proceedings of the International Conference on Knowledge-Based Intelligent Engineering Systems, vol. 2(1), pp. 60–68.
Küchler, A., Goller, C. (1996). Inductive learning in symbolic domains using structure-driven recurrent neural networks. In: Görz, G., Hölldobler, S. (Eds.), Advances in Artificial Intelligence. Springer, Berlin, pp. 183–197.
Leung, T.K., Burl, M.C., Perona, P. (1998). Probabilistic affine invariants for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 678–684.
Macaire, L., Ultre, V., Postaire, J. (1996). Determination of compatibility coefficients for color edge detection by relaxation. In: Proceedings of the ICIP, pp. 1045–1048.
McKenna, S., Raja, Y., Gong, S. (1998). Tracking colour objects using adaptive mixture models. Image Vision Comput. 17 (3/4), 223–229.
Micheli, A., Sona, D., Sperduti, A. (2004). Contextual processing of structured data by recursive cascade correlation. IEEE Trans. Neural Netw. 15 (6), 1396–1410.
Moghaddam, B., Pentland, A. (1997). Probabilistic visual learning for object recognition. IEEE Trans. Pattern Anal. Mach. Intell. 19 (7), 696–710.
Morse, S. (1969). Concepts of use in computer map processing. Commun. ACM 12 (3), 147–152.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 12 (2), 181–201.
Nene, S., Nayar, S., Murase, H. (1996a). Columbia object image library (COIL-100). Tech. Rep. CUCS-006-96, Columbia University.
Nene, S., Nayar, S., Murase, H. (1996b). Real-time 100 object recognition system. In: Proceedings of the IEEE Conference on Robotics and Automation, vol. 3, pp. 2321–2325.
Papageorgiou, C., Oren, M., Poggio, T. (1998). A general framework for object detection. In: Proceedings of the 6th IEEE International Conference on Computer Vision, pp.
555–562.
Pal, N.R., Pal, S.K. (1993). A review on image segmentation techniques. Pattern Recognit. 26, 1277–1294.
Pentland, A. (2000). Perceptual intelligence. Commun. ACM 43 (3), 35–44.
Pollastri, G., Baldi, P., Vullo, A., Frasconi, P. (2002). Prediction of protein topologies using generalized IOHMMs and recursive neural networks. In: Proceedings of NIPS.
Pontil, M., Verri, A. (1998). Support vector machines for 3D object recognition. IEEE Trans. Pattern Anal. Mach. Intell. 20 (6), 637–646.
Roubal, J., Peucker, T. (1985). Automated contour labeling and the contour tree. In: Proceedings of AUTO-CARTO 7, pp. 472–481.
Scarselli, F., Yong, S., Gori, M., Hagenbuchner, M., Tsoi, A.-C., Maggini, M. (2005). Graph neural networks for ranking Web pages. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 666–672.
Schneiderman, H., Kanade, T. (2000). A statistical method for 3D object detection applied to faces and cars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 746–751.
Schölkopf, B., Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
Shafer, S. (1985). Using color to separate reflection components. Color Res. Appl. 10, 201–218.
Sinha, P. (1995). Processing and Recognizing 3D Forms. Ph.D. thesis, Massachusetts Institute of Technology.
Song, Y., Zhang, A. (2002). Monotonic tree. In: Proceedings of the 10th International Conference on Discrete Geometry for Computer Imagery (Bordeaux, France).
Sperduti, A., Starita, A. (1997). Supervised neural networks for the classification of structures. IEEE Trans. Neural Netw. 8, 714–735.
Strickert, M., Hammer, B. (2003a). Neural gas for sequences. In: Proceedings of WSOM ’03, pp. 53–57.
Strickert, M., Hammer, B. (2003b). Unsupervised recursive sequence processing. In: Verleysen, M. (Ed.), Proceedings of the European Symposium on Artificial Neural Networks, pp. 27–32. D-side Publications.
Sturt, P., Costa, F., Lombardo, V., Frasconi, P. (2003). Learning first-pass structural attachment preferences with dynamic grammars and recursive neural networks. Cognition 88 (2), 133–169.
Tsai, W. (1990). Combining statistical and structural methods. In: Syntactic and Structural Pattern Recognition: Theory and Applications. World Scientific, Singapore, pp. 349–366.
van Kreveld, M., van Oostrum, R., Bajaj, C., Pascucci, V., Schikore, D. (1997). Contour trees and small seed sets for iso-surface traversal. In: Proceedings of the 13th Annual Symposium on Computational Geometry, pp. 212–220.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, Berlin.
Vesanto, J. (1997). Using the SOM and local models in time-series prediction. In: Proceedings of the Workshop on Self-Organizing Maps, pp. 209–214.
Vishwanathan, S., Smola, A. (2002). Fast kernels for string and tree matching. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Advances in Neural Information Processing Systems, vol. 15. MIT Press, Cambridge, MA.
Voegtlin, T. (2000). Context quantization and contextual self-organizing maps. In: Proceedings of the IJCNN, vol. 5, pp. 20–25.
Voegtlin, T. (2002). Recursive self-organizing maps. Neural Netw. 15 (8–9), 979–992.
Voegtlin, T., Dominey, P.F. (2001). Recursive self-organizing maps. In: Allinson, N., Yin, H., Allinson, L., Slack, J. (Eds.), Advances in Self-Organizing Maps. Springer, Berlin, pp. 210–215.
Vullo, A., Frasconi, P. (2002). A bi-recursive neural network architecture for the prediction of protein coarse contact maps. In: Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (Stanford).
Yang, M.-H., Kriegman, D.J., Ahuja, N. (2002). Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 24 (1), 34–58.
Yao, Y., Marcialis, G.L., Pontil, M., Frasconi, P., Roli, F. (2003). Combining flat and structural representations for fingerprint classification with recursive neural networks and support vector machines. Pattern Recognit. 36 (2), 397–406.
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 140
Deterministic Learning and an Application in Optimal Control

CRISTIANO CERVELLERA(a) AND MARCO MUSELLI(b)

(a) Istituto di Studi sui Sistemi Intelligenti per l’Automazione, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy
(b) Istituto di Elettronica e di Ingegneria dell’Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy
I. Introduction 62
   Notation 64
II. A Mathematical Framework for the Learning Problem 65
III. Statistical Learning 69
IV. Deterministic Learning 74
   A. The Distribution-Free Case 75
   B. Ensuring a Bounded Variation 80
      1. Feedforward Neural Networks 83
      2. Radial Basis Functions 84
   C. Bounds on the Convergence Rate of the ERM Approach 85
   D. The Distribution-Dependent Case 87
   E. The Noisy Case 88
V. Deterministic Learning for Optimal Control Problems 90
VI. Approximate Dynamic Programming Algorithms 94
   A. T-SO Problems 94
   B. ∞-SO Problems 96
      1. Approximate Value Iteration 96
      2. Approximate Policy Iteration 97
   C. Performance Issues 98
VII. Deterministic Learning for Dynamic Programming Algorithms 99
   A. The T-SO Case 99
   B. The ∞-SO Case 102
VIII. Experimental Results 104
   A. Approximation of Unknown Functions 104
   B. Multistage Optimization Tests 107
      1. The Inventory Forecasting Model 108
      2. The Water Reservoir Network Model 109
References 114
ISSN 1076-5670/05 DOI: 10.1016/S1076-5670(05)40002-6
Copyright 2006, Elsevier Inc. All rights reserved.
62
CERVELLERA AND MUSELLI
I. INTRODUCTION

In a wide variety of real-world situations, a functional dependence y = g(x) has to be estimated from a set of observations (x_L, y_L) = {(x_l, y_l), l = 0, . . . , L − 1} concerning a phenomenon of interest. This is the case when the behavior of a continuous signal has to be forecast from its previous history, or when the value of an unmeasurable quantity has to be inferred from the measurements of other related variables. If an insufficient amount of a priori information about the form of the functional dependence g is available, its estimation must provide for two different actions:

1. first, a sufficiently large class Γ of functions must be properly selected (model selection);
2. then, the best element g ∈ Γ must be retrieved by adopting a suitable optimization algorithm (training phase).

The model selection task is usually performed by taking a very general paradigm, whose complexity can be controlled by acting on a small number of constant values. For example, the usual polynomial series expansion can approximate arbitrarily well every measurable function; that is, polynomials are universal approximators. However, by including in Γ only the functions whose polynomial series expansion does not contain terms with exponent greater than a prescribed maximum k, we can control the richness of the class Γ. In particular, if k = 1 only linear functions are included in Γ; if k = 2 the expansion can realize only linear and quadratic functions, and so on. Other general paradigms have been extensively used for model selection: neural networks have been shown to possess the universal approximation property (Cybenko, 1989; Hornik et al., 1989; Barron, 1993; Girosi et al., 1995) and have been successfully applied in many different fields. In this case, the complexity of the class Γ can be controlled by acting on the architecture of the network (the number of layers and the number of neurons in the feedforward structure).
Once the class Γ has been chosen, the optimization algorithm to be employed in the training phase is selected accordingly. For example, the backpropagation technique (and its variants) is often adopted to retrieve the function in Γ that best fits the collection (x_L, y_L) of observations at our disposal, usually called the training set. However, the basic goal to be pursued is to obtain a function g that generalizes well, that is, one that behaves correctly even at points of the domain not included in the training set. How can it be guaranteed that the element of Γ that best fits our observations also generalizes well? This is a fundamental question in the context of learning theory.
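The two actions can be illustrated with the polynomial paradigm mentioned above: the degree cap k fixes the richness of the class Γ (model selection), and a least-squares fit retrieves the best element of Γ on the training set (training phase). The target function, noise level, and sample sizes below are assumptions chosen only to make the generalization question visible:

```python
import numpy as np

# Sketch: polynomial model selection (degree cap k) followed by training
# (least-squares fit), with a held-out grid standing in for points of the
# domain not included in the training set. The target g is an assumption.

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * np.pi * x)          # unknown dependence y = g(x)
x_train = rng.uniform(0.0, 1.0, 30)
y_train = g(x_train) + rng.normal(0.0, 0.05, 30)  # noisy observations
x_test = np.linspace(0.0, 1.0, 200)          # points outside the training set

errors = {}
for k in (1, 3, 9):                          # richness of the class Γ
    coeffs = np.polyfit(x_train, y_train, k) # best element of Γ (training)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - g(x_test)) ** 2))
    errors[k] = (train_mse, test_mse)
    print(f"k={k}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Fitting the training set well (small training error) does not by itself answer the question above; comparing the training error with the error on the held-out points is what reveals whether the selected element of Γ generalizes.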
DETERMINISTIC LEARNING AND AN APPLICATION
CERVELLERA AND MUSELLI
Since in many practical situations the input vectors x_l cannot be freely chosen and the training set (x^L, y^L) can be corrupted by noise, most results on learning theory are based on a statistical framework, which arose in the pattern recognition community (Vapnik and Chervonenkis, 1971; Valiant, 1984; Blumer et al., 1989; Devroye et al., 1997) and has been naturally extended to other inductive problems, such as regression estimation (Pollard, 1990; Vapnik, 1995; Alon et al., 1997) and probability density reconstruction (Vapnik, 1995). In this framework, called statistical learning (SL), the input vectors x_l, for l = 0, . . . , L − 1, are viewed as realizations of a random variable, generated according to an unknown (but fixed) probability density p. On the other hand, there are several cases where the position of the points x_l in the input space can be suitably selected for the problem at hand. If a deterministic algorithm is employed to choose the input vectors x_l, SL is no longer the most appropriate approach. In this case a new framework, called deterministic learning (DL), is able to capture the peculiarities of the situation at hand, thus providing precise conditions on the generalization ability of the function g ∈ Γ that best fits the observations of the training set (x^L, y^L). This chapter presents a survey of DL, comparing its results with those obtained by standard SL. In particular, basic quantities, such as variation and discrepancy, are introduced, pointing out their centrality in the derivation of upper bounds for the generalization error that decrease as 1/L (apart from logarithmic factors) with the size L of the training set. This behavior outperforms the equivalent result obtained by SL, where a convergence rate of 1/√L has been derived. An important application of DL concerns system control and, specifically, the solution of multistage stochastic optimization (MOS) problems, a particular kind of Markovian decision process.
In such problems, the aim is to minimize a cost that depends on the evolution of a system, affected by random disturbances, over a horizon composed of either a finite or an infinite number of stages. This classic framework is widely employed in many different contexts, such as economics, artificial intelligence, engineering, etc. Since in most practical situations optimal control and cost functions cannot be obtained in analytical form, a numerical approach is needed to solve MOS problems. The standard tool is dynamic programming (DP), introduced by Bellman (1957), as is documented by the large number of studies devoted to this method through the years. The basic idea underlying the DP procedure is to define, at each stage, a function, commonly named cost-to-go or value function, that quantifies the cost that has to be paid from that stage on to the end of the time horizon. In this way, it is possible to transform the MOS problem into a sequence of simpler static optimization subproblems, which can be solved recursively. The basics
of the recursive solution adopted in DP are introduced and discussed in several classic references (see, for example, Bellman, 1957; Bellman and Dreyfus, 1962; Larson, 1968). Among the most recent surveys on DP techniques and Markov decision processes in general are two excellent monographs (Puterman, 1994; Bertsekas, 2000). Although efficient variations of the DP procedure exist for the "deterministic" version of the MOS problem, such as differential dynamic programming (Jacobson and Mayne, 1970), the general approach followed to implement DP in a practical situation requires choosing for each stage a number of sampling points in the d-dimensional state space, then approximating the cost-to-go functions outside these points. In this way, solving the original MOS problem implies the reconstruction of several functional dependencies, one for each stage. The most common sampling technique used in the literature is the "full uniform" grid, that is, the uniform discretization of each component of the state space in a fixed number of values. This clearly leads to an exponential growth of the number of points, commonly known as the curse of dimensionality: if each of the d components of the state space is discretized by means of m equally spaced values, the number of points of the grid is equal to m^d. A nonexponential complexity can be obtained by adopting a finer sampling scheme, like those proposed by SL and DL. In the former case, a uniform probability density is employed to generate the training set x_l, l = 0, . . . , L − 1; in the latter approach the points x_l are selected by using a deterministic algorithm, which is able to guarantee an almost linear rate of convergence. Numerical simulations confirm the superiority of DL over SL when dealing with complex MOS problems, solved through the DP procedure.

Notation

    X ⊂ R^d, Y ⊂ R        input and output space
    x ∈ X, y ∈ Y          input vector and scalar output
    x_i ∈ R               ith component of the input vector x
    g(x)                  unknown function to be estimated
    (x^L, y^L)            training set for estimating the functional dependence
    L                     number of points in the training set
    (x_l, y_l)            lth example of the training set, with l = 0, . . . , L − 1
    x^L ∈ X^L             collection of all the input vectors x_l in the training set
    Γ                     family of models
    ψ(x, α)               generic model in Γ
    Λ ⊂ R^k               parameter space for the family Γ
    α ∈ Λ                 parameter vector of the model ψ(x, α)
    ℓ(·, ·)               loss function
    R_Q(α)                expected risk for the model ψ(x, α)
    Q(x)                  probability measure for the evaluation of the expected risk
    q(x)                  probability density function for the measure Q
    R(α)                  expected risk computed with the uniform distribution (risk functional)
    R_emp(α, x^L)         empirical risk computed on the sample x^L
    A_L(x^L, y^L)         training algorithm minimizing the empirical risk
    Ψ(L)                  deterministic function for the selection of x^L
    λ(B)                  Lebesgue measure of the set B ⊂ R^d
    c_B(x)                characteristic function of the set B ⊂ R^d
    D(x^L), D*(x^L)       discrepancy and star discrepancy of the sample x^L
    Δ(ϕ, B)               alternating sum of the function ϕ at the vertexes of the interval B
    V^(d)(ϕ)              variation in the sense of Vitali of the function ϕ
    V_HK(ϕ)               variation in the sense of Hardy and Krause of the function ϕ
    ∂_{i_1,...,i_k}ϕ      kth partial derivative of ϕ(x) with respect to the components x_{i_1}, . . . , x_{i_k}
    W_M(B)                class of functions ϕ such that ∂_{i_1,...,i_k}ϕ is continuous and bounded
    η ∈ R                 random noise with zero mean
    x_t ∈ X_t ⊂ R^d       state vector of a dynamic system at stage t
    u_t ∈ U_t ⊂ R^m       control vector of a dynamic system at stage t
    θ_t ∈ Θ_t ⊂ R^q       random vector acting on a dynamic system at stage t
    f(x_t, u_t, θ_t)      state equation of a dynamic system
    µ_t(x_t)              closed-loop control function (policy) at stage t
    h(x_t, u_t, θ_t)      cost function for the single stage t
    β ∈ R                 discount factor for infinite-horizon MOS problems
    J°_t(x)               cost-to-go function of MOS problems
    J̃°_t(x)               generic approximated value of J°_t(x)
    Ĵ°_t(x, α)            approximated value of J°_t(x) based on a parameterized model
II. A MATHEMATICAL FRAMEWORK FOR THE LEARNING PROBLEM

We want to estimate, inside a family of functions (models) Γ = {ψ(x, α): α ∈ Λ ⊂ R^k}, the parameter α* corresponding to the ψ that best approximates a functional dependence of the form y = g(x), where x ∈ X ⊂ R^d and
y ∈ Y ⊂ R, on the basis of a training set (x^L, y^L) ∈ (X^L × Y^L) containing L samples (x_l, y_l) with l = 0, . . . , L − 1. The quality of a model ψ ∈ Γ can be evaluated at any point of X by a loss function ℓ : Y² → R that measures the difference between the output of ψ and the function g. ℓ must be symmetric and nonnegative; furthermore, ℓ(y, y′) = 0 if and only if y = y′. The output y assigned to each observation point x is generally noisy; thus, we suppose that y is the realization of a random variable on Y described by a conditional probability density p̃(y|x). An overall evaluation of the model ψ(x, α) can then be obtained by averaging the value of the loss function ℓ[y, ψ(x, α)] over the whole input domain X. To this aim we assume the existence of a probability measure Q that determines the occurrence frequency of any input vector x. Again we suppose that Q admits a probability density q(x). With this notation we can define the expected risk R_Q(α)

    R_Q(α) = ∫_{X×Y} ℓ[y, ψ(x, α)] p̃(y|x) q(x) dy dx
which measures the mean error committed by the model ψ(x, α) over the whole space X. The learning problem can then be stated as follows:

Problem 1. Find α* ∈ Λ such that R_Q(α*) = min_{α∈Λ} R_Q(α).

If the minimum does not exist, the target of our problem can be to find α* ∈ Λ such that R_Q(α*) < inf_{α∈Λ} R_Q(α) + ε for some fixed ε > 0. The most common loss function is the squared error

    ℓ[y, ψ(x, α)] = [y − ψ(x, α)]².
With this choice, when Y is an interval of R, the solution to Problem 1 corresponds to the function ψ ∈ Γ that is closest to the regression function given by

    g*(x) = ∫_Y y p̃(y|x) dy.
For this reason, when y assumes continuous values, Problem 1 is usually referred to as a regression estimation problem or, simply, a regression problem. To verify the above assertion, we can write the risk as

    R_Q(α) = ∫_{X×Y} [y − g*(x) + g*(x) − ψ(x, α)]² p̃(y|x) q(x) dy dx

           = ∫_{X×Y} [y − g*(x)]² p̃(y|x) q(x) dy dx

           + ∫_{X×Y} [g*(x) − ψ(x, α)]² p̃(y|x) q(x) dy dx

           + 2 ∫_{X×Y} [y − g*(x)][g*(x) − ψ(x, α)] p̃(y|x) q(x) dy dx

           = R_Q* + ∫_X [g*(x) − ψ(x, α)]² q(x) dx                          (1)

where R_Q* = ∫_{X×Y} [y − g*(x)]² p̃(y|x) q(x) dy dx is the expected risk of the regression function g*(x). In the derivation of Eq. (1) for R_Q(α) we have used the following identity:

    ∫_{X×Y} [y − g*(x)][g*(x) − ψ(x, α)] p̃(y|x) q(x) dy dx

           = ∫_X { ∫_Y [y − g*(x)] p̃(y|x) dy } [g*(x) − ψ(x, α)] q(x) dx = 0

since by definition of the regression function

    ∫_Y [y − g*(x)] p̃(y|x) dy = 0.
However, the probability densities p̃ and q are unknown; hence, we cannot derive the behavior of g*(x). Consequently, the regression problem must be solved only by employing the knowledge inherent in the training set (x^L, y^L). A typical way of proceeding consists of minimizing the empirical risk R_emp(α, x^L), which evaluates a measure of R_Q(α) on the samples included in the training set. In general, the empirical risk is defined as

    R_emp(α, x^L) = (1/L) Σ_{l=0}^{L−1} ℓ[y_l, ψ(x_l, α)]

which becomes, in the case of the quadratic loss function,

    R_emp(α, x^L) = (1/L) Σ_{l=0}^{L−1} [y_l − ψ(x_l, α)]².
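As a concrete sketch of the empirical risk with quadratic loss, consider a hypothetical linear model ψ(x, α) = α_0 + α_1 x; for this simple class the empirical risk minimizer has a closed form (ordinary least squares). The data and model below are illustrative and not taken from the chapter:

```python
import numpy as np

# Sketch of R_emp(alpha, x^L) with quadratic loss for a hypothetical
# linear model psi(x, alpha) = alpha[0] + alpha[1] * x.
def empirical_risk(alpha, xs, ys):
    preds = alpha[0] + alpha[1] * xs
    return float(np.mean((ys - preds) ** 2))

xs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
ys = 2.0 * xs + 1.0            # noise-free observations of g(x) = 2x + 1

# Minimizing the empirical risk over this class is ordinary least
# squares, so alpha*_L is available in closed form.
A = np.column_stack([np.ones_like(xs), xs])
alpha_star, *_ = np.linalg.lstsq(A, ys, rcond=None)
print(alpha_star, empirical_risk(alpha_star, xs, ys))
```

With noise-free data and a model class containing g, the minimizer recovers g exactly and the empirical risk vanishes; with noisy data it would only approximate the regression function.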
Denote with α*_L ∈ Λ the point of minimum of the empirical risk R_emp(α, x^L); a nonlinear optimization method A_L : (X^L × Y^L) → Λ can
be adopted as a learning algorithm to determine a close approximation to the optimum α*_L. In particular, since optimization techniques are generally iterative, we can define A_L^(m) as the learning algorithm obtained by taking the first m iterations of A_L. Accordingly, let α_L^(m) = A_L^(m)(x^L, y^L) be the parameter vector produced by A_L^(m). A suboptimal solution to Problem 1 can then be retrieved from the training set (x^L, y^L) by performing a sufficiently high number m_L of iterations with the learning algorithm A_L and by using the resulting parameter vector α_L^(m_L) as an approximation for α*. This approach, called empirical risk minimization (ERM), is successful if the value R_Q(α_L^(m_L)) of the expected risk in α_L^(m_L) is close to the minimum R_Q(α*). Note that

    R_Q(α_L^(m_L)) − R_Q(α*) ≤ [R_Q(α_L^(m_L)) − R_Q(α*_L)] + [R_Q(α*_L) − R_Q(α*)].

Thus, the ERM approach is valid if the following two basic conditions are satisfied; when this is the case, Problem 1 is said to be learnable.
Condition 1. The sequence {α*_L}_{L=1}^∞ of minima of the empirical risk R_emp(α, x^L) converges to the desired minimum α* of the expected risk R_Q(α).

Condition 2. For every L the sequence {α_L^(m)}_{m=1}^∞ of optimal points found by the learning algorithm A_L at different iterations converges to the minimum α*_L of the empirical risk.

The former condition depends on the characteristics of the learning problem at hand, whereas the latter is related to the behavior of the optimization technique employed to search for the minimum of the empirical risk. In particular, if the learning algorithm belongs to the class of global optimization methods, which are always able to find the global minimum of a cost function when the number of iterations increases indefinitely, Condition 2 is surely verified, at least in probability. However, an analysis of the properties of an optimization technique that lead to the fulfillment of Condition 2 is a central topic in nonlinear programming theory and will not be included in the present chapter. The interested reader is referred to dedicated monographs, such as Törn and Žilinskas (1989). In the following sections the focus will be centered on Condition 1, examining the hypotheses on the learning problem that ensure its fulfillment. Two different situations will be considered:
Passive learning: when the generation of the points in x^L for the training set is not under our control; in this case they are viewed as realizations of a random variable with an unknown probability measure.

Active learning: when the points in x^L are produced by a generation algorithm, which can be freely chosen.

In particular, the behavior of the difference R_Q(α*_L) − R_Q(α*) as L increases is examined; this allows us to obtain lower bounds on the size L of the training set that guarantee the achievement of a desired generalization error. Since the context of passive learning is intrinsically probabilistic, the convergence involved in Condition 1 can be ensured only in a probabilistic sense. The analysis of this case forms the subject of SL, whose main results will be presented in the following section. On the other hand, active learning can be studied in a deterministic way, thus leading to hypotheses for standard convergence in the fulfillment of Condition 1. This is the subject of DL, whose treatment is contained in Section IV.
III. STATISTICAL LEARNING

If the generation of the input points x_l to be included in the training set (x^L, y^L) is not under our control, we can assume there is an external random source that generates them. Denote with P the probability measure that characterizes this external source and suppose that P admits a density p(x), whose behavior is unknown. However, the learning problem at hand can be solved only if the probability measure P is related to the probability Q adopted to evaluate the expected risk R_Q introduced in the previous section. Specifically, the following condition of absolute continuity must hold: if P(S) = 0 for some S ⊂ X, then Q(S) = 0 must also hold. If this condition fails for a subset S, that is, P(S) = 0 and Q(S) > 0, there is no hope of minimizing the contribution to the expected risk due to S by examining the points of the training set (x^L, y^L), which cannot belong to S. To rule out critical situations, the following two assumptions are normally supposed to hold in the SL framework:

Assumption 1. Points in x^L are generated by i.i.d. realizations of an unknown density p.

Assumption 2. The density p, used in the generation of the training set, is equal to the density q adopted in the evaluation of the expected risk.

The first requirement is rarely verified in real world situations; nevertheless, the removal of the i.i.d. hypothesis limits the applicability of typical theoretical results such as those reported in the following (Vidyasagar, 1997), which are heavily based on Hoeffding's inequality (Devroye et al., 1997; Hoeffding, 1961). An attempt in this direction is described in Najarian et al. (2001), but its validity is restricted to nonlinear FIR models. Assumption 2, regarding the equality between p and q, cannot be verified in practice; it can only be hoped that the mechanism involved in obtaining the samples for the training phase remains almost unchanged when new data are generated. On the other hand, if p and q are radically different from each other, the indirect minimization of the expected risk can lead to poor results. In the SL framework the empirical risk R_emp(α, x^L) is a random variable, since it depends on the training set (x^L, y^L). It follows that the point of minimum α*_L is also a random variable and therefore the convergence involved in Condition 1 must be formulated in a probabilistic way. For example, it can be rewritten as

    lim_{L→∞} P{ R_Q(α*_L) − R_Q(α*) > ε } = 0   for every ε > 0,           (2)
which amounts to considering the convergence in probability of R_Q(α*_L), or as

    P{ lim_{L→∞} R_Q(α*_L) = R_Q(α*) } = 1,                                 (3)
which corresponds to the almost sure convergence of the sequence {R_Q(α*_L)}. The probabilities involved in Eqs. (2) and (3) are defined on the product space of the possible training sets (x^L, y^L). If Eq. (2) holds and Condition 2 is verified, the learning problem is said to be probably approximately correct (PAC) learnable (Valiant, 1984; Angluin, 1987). The following theorem gives sufficient conditions for the regression problem to be PAC learnable.

Theorem 1. Condition 1 is verified if the following convergence holds:

    lim_{L→∞} P{ sup_{α∈Λ} |R_emp(α, x^L) − R_Q(α)| > ε } = 0   for every ε > 0.   (4)
The proof can be found, for example, in Vidyasagar (1997). Condition (4) is often referred to as uniform convergence of empirical means; sufficient conditions on the class Γ of functions that ensure its validity can be derived by using the notion of Pollard dimension (or pseudodimension, or P-dimension), a generalization of the VC-dimension ("Vapnik–Chervonenkis dimension"), which is at the core of all the relevant results in SL, as it provides a way of measuring the "richness" of a set of functions. A description of the VC-dimension, more suited to classification problems, can be found in several papers and books on statistical learning, such as Vapnik (1995) and the references therein. The P-dimension was first introduced by Pollard (1990). Without loss of generality, we will suppose henceforth that Y = [0, 1].

Definition 1. A set S = {x_0, . . . , x_{j−1}} is P-shattered by the family of functions Γ if there exists a vector c ∈ [0, 1]^j such that, for every binary vector e ∈ {0, 1}^j, there exists a corresponding function ψ(α_e, x) ∈ Γ such that ψ(α_e, x_i) > c_i when e_i = 1 and ψ(α_e, x_i) < c_i when e_i = 0, for i = 0, . . . , j − 1, where e_i is the ith component of e.

In other terms, if the set S is P-shattered, there must exist a vector [c_0, . . . , c_{j−1}] such that it is possible to find a function in Γ that can arbitrarily "pass" above or below the various c_i. Figure 1 illustrates graphically the concept of P-shattering.

Figure 1. P-shattering.

Definition 2. The P-dimension of Γ is the largest integer m such that there exists a set S of cardinality m that is P-shattered by Γ.

As an example, consider (for the one-dimensional case) the family Γ = {y = k_1 x + k_2, k_1, k_2 ∈ R}. Figure 2 depicts this situation. As we can see from the figure, it is not possible to find three points that the function can arbitrarily pass above or below, while all the combinations are possible if we consider two points. For example, in the situation represented in the right part of Figure 2, a function in Γ realizing e = [1, 0, 1] cannot be found. Therefore, we can conclude that the P-dimension of this particular family of functions is equal to 2.
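The two-point versus three-point behavior of the affine family can be checked by brute force. The sketch below (illustrative thresholds; the helper names are not from the chapter) constructs, for every pattern e on two points, an interpolating line realizing it, and then searches in vain for a line realizing an alternating pattern on three points with equal thresholds: by linearity, f(x_1) lies on the chord between f(x_0) and f(x_2), which rules that pattern out for this choice of c.

```python
import itertools

def line_through(p0, p1):
    """Affine function k1*x + k2 through two points (x, y)."""
    (x0, y0), (x1, y1) = p0, p1
    k1 = (y1 - y0) / (x1 - x0)
    return k1, y0 - k1 * x0

def realizes(k1, k2, xs, cs, e):
    """Does the line pass above (e_i = 1) / below (e_i = 0) each c_i?"""
    return all((k1 * x + k2 > c) == bool(ei) for x, c, ei in zip(xs, cs, e))

# Two points: for every pattern e, the line through (x_i, c_i +/- delta)
# realizes it, so {x0, x1} is P-shattered.
xs2, cs2, delta = [0.0, 1.0], [0.3, 0.7], 0.1
for e in itertools.product([0, 1], repeat=2):
    pts = [(x, c + delta if ei else c - delta) for x, c, ei in zip(xs2, cs2, e)]
    assert realizes(*line_through(*pts), xs2, cs2, e)

# Three points, equal thresholds: e = (0, 1, 0) needs f(x1) > c while
# f(x0) < c and f(x2) < c; linearity puts f(x1) on the chord below c,
# so no line works (a coarse grid search finds none).
xs3, cs3 = [0.0, 0.5, 1.0], [0.5, 0.5, 0.5]
found = any(realizes(k1 / 10, k2 / 10, xs3, cs3, (0, 1, 0))
            for k1 in range(-50, 51) for k2 in range(-50, 51))
print(found)  # -> False
```

Note that this only demonstrates failure for one particular threshold vector; the full claim that the P-dimension equals 2 requires the chord argument for every choice of c, as sketched in the text.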
Figure 2. P-dimension of one-dimensional linear functions.
The exact value of the P-dimension can be computed only for very simple classes of functions, such as hyperplanes or hyperspheres. For more realistic models, like neural networks or radial basis function networks, only upper bounds or asymptotic behaviors are available (Anthony and Bartlett, 1999). However, the definition of P-dimension allows us to obtain some results about uniform convergence of empirical means. The following notation will be used:

    ρ(L, ε, ℓ, Γ) = sup_{Q∈Q} P{ sup_{α∈Λ} |R_emp(α, x^L) − R_Q(α)| > ε }
where Q is the set of all the probability measures on X. It can be observed that if ρ(L, ε, ℓ, Γ) → 0 when L → ∞, for every ε > 0, the uniform convergence of the empirical means (4) occurs independently of the underlying probability. In this case the term distribution-free convergence is usually adopted. The following theorem gives sufficient conditions for the validity of condition (4) as well as an explicit upper bound for the number L of samples needed to achieve a desired accuracy for ρ(L, ε, ℓ, Γ). The proof of the theorem can be found in Vidyasagar (1997).

Theorem 2. Suppose the family Γ has finite P-dimension m and the loss function ℓ satisfies the following uniform Lipschitz condition

    |ℓ(y, u_1) − ℓ(y, u_2)| ≤ μ|u_1 − u_2|   for every y, u_1, u_2 ∈ [0, 1]   (5)

for some constant μ. Then, the property of distribution-free uniform convergence of empirical means holds:

    lim_{L→∞} ρ(L, ε, ℓ, Γ) = 0   for every ε > 0.
Moreover, the inequality ρ(L, ε, ℓ, Γ) ≤ δ is satisfied, provided at least

    L ≥ (32/ε²) [ ln(8/δ) + m ( ln(16eμ/ε) + ln ln(16eμ/ε) ) ]              (6)

samples are drawn.

This theorem states that if we want the difference between the empirical risk and the expected risk to be less than ε with probability 1 − δ for each ψ ∈ Γ, we must choose the number of samples L according to Eq. (6). This lower bound is independent of the probability measure adopted to generate the training set, provided that Assumptions 1 and 2 are valid. We can use this bound to relate the number L of samples to the error between the expected risk R_Q(α*_L) in the point of minimum of R_emp(α, x^L) and the best achievable risk R_Q(α*). As for the uniform Lipschitz condition (5) on ℓ, commonly used loss functions such as ℓ(y, u) = |y − u|^n satisfy this property. Equation (6) and Theorem 1 provide an explicit indication of the sample complexity of the learning problem, that is, how many samples we must draw to attain a given error between the best approximating function in Γ and the one obtained with our training algorithm. The first thing to be noted is that the bound is, apparently, independent of the dimension d of the input vector. In most situations this is not the case, since in general the P-dimension m depends on d. However, we have L = O(ln m); consequently, if we choose a class of approximating functions with a P-dimension that does not grow superexponentially with d, we can say that the curse of dimensionality is avoided, at least as far as the sample complexity is concerned. Inequality (6) seems to suggest that a certain accuracy can be achieved by using fewer training samples, provided that a class Γ with a small P-dimension is adopted. Nevertheless, if we reduce the complexity too much, the approximation capability becomes too limited.
In other words, there exists a trade-off between the estimation error R_Q(α*_L) − R_Q(α*) and the approximation error R_Q(α*) itself: by reducing the P-dimension m we can obtain a small estimation error at the expense of increasing the approximation error. On the other hand, if we increase m to retrieve a good approximation of the function we want to learn, we need to increase the number L of samples to obtain an acceptable estimate based on the ERM approach. If we use too few samples, the phenomenon generally known as overfitting (Bishop, 1995) occurs: too much freedom in the choice of the approximating function leads the trained model to exhibit spurious complex behavior in undersampled regions.
It is important to note that the bound in Eq. (6) is derived in a distribution-free context. This means that it is valid for every possible probability measure on X. This is useful when we do not have any prior knowledge about the underlying probability by which the samples are drawn. Including such knowledge may well lead to better bounds on L, especially if we choose a suitable training algorithm. For a discussion of this particular topic the reader is referred to Vidyasagar (1997). It is also important to point out that the bound on the number of samples corresponds to a quadratic rate for the sample complexity, since it depends on ε⁻². This is consistent with typical convergence results for random methods and Monte Carlo algorithms. Furthermore, such bounds are probabilistic in nature; that is, we must expect the results to hold true only within a certain confidence level δ. We will show in the next section how the possibility of choosing the points x_l of the training set can lead to a significant improvement in the rate of sample complexity, besides providing the possibility of retrieving deterministic results not involving any confidence value.
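To get a feeling for the ε⁻² growth of the sample complexity, one can evaluate a Vidyasagar-style bound of the shape L ≥ (32/ε²)[ln(8/δ) + m(ln(16eμ/ε) + ln ln(16eμ/ε))] numerically. The constants follow the reconstruction used here, and all numeric values of δ, m, and μ below are arbitrary illustrative choices:

```python
import math

# Right-hand side of a bound of the shape discussed in Theorem 2
# (reconstruction; illustrative only).
def sample_bound(eps, delta, m, mu):
    t = 16 * math.e * mu / eps
    return 32 / eps**2 * (math.log(8 / delta) + m * (math.log(t) + math.log(math.log(t))))

# Halving eps more than quadruples the required sample size.
for eps in (0.2, 0.1, 0.05):
    print(f"eps={eps}: L >= {sample_bound(eps, 0.05, 10, 1.0):,.0f}")
```

The quadratic blow-up in 1/ε is exactly the Monte Carlo-like rate that the deterministic approach of the next section improves upon.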
IV. DETERMINISTIC LEARNING

If the location of the input patterns x_l to be included in the training set (x^L, y^L) is not fixed beforehand, but is part of the learning process, the SL approach is no longer applicable, since Assumptions 1 and 2 do not hold anymore. In fact, the position of the points x_l in active learning is typically decided by a deterministic algorithm that does not make subsequent choices in an independent manner. Most existing active learning methods use an optimization procedure for the generation of the input sample x_{l+1} on the basis of the information contained in previous training points (MacKay, 1992; Cohn, 1994; Kindermann et al., 1995; Fukumizu, 1996), which possibly leads to a heavy computational burden. In addition, some strong assumptions on the observation noise y − g(x) [typically, noise with normal density (Cohn, 1994; Fukumizu, 2000)] or on the class of learning models are introduced. In this section, the validity of Condition 1 in a very general situation is examined. In particular, it will be shown that even if the location of the input patterns x_l is decided a priori, the learning problem can still be learnable, provided that some mild assumptions on the class Γ of models are verified. Three different cases will be examined in the following sections:

1. the distribution-free case, where the probability density q, employed to evaluate the expected risk, is unknown (Section IV.A), but the output y = g(x) is observed without noise;
2. the distribution-dependent case, where q is known and can be suitably taken into account (Section IV.D); again, the output is supposed to be noise free; and

3. the noisy case, where the observation noise can be described by any probability distribution (Section IV.E), provided that it does not depend on x and its mean is zero.

In the first two cases we suppose that the output y for a given input x is observed without noise, that is, y = g(x). With this assumption the expected risk R_Q(α) becomes

    R_Q(α) = ∫_X ℓ[g(x), ψ(x, α)] q(x) dx
since the output y is no longer a random variable. The generation of the points x_l to be included in the training set is performed by a deterministic algorithm Ψ : N → ∪_{l=1}^∞ X^l such that Ψ(L) is a collection of exactly L input patterns x_l, which can be written as x^L. Ψ_l(L) denotes the single point x_l of the sequence. We assume henceforth that X is the d-dimensional semiclosed unit cube [0, 1)^d. However, it is possible to extend the results to other intervals of R^d or more complex input spaces, such as spheres and other compact convex domains like simplexes, by suitable transformations (Fang and Wang, 1994). If the input space X is not compact, it is always possible to find a compact K ⊂ X such that the probability measure of the difference X \ K is smaller than any fixed positive value ε [see Ulam's theorem (Dudley, 1989)]. Now, the smallest interval I including K can be considered as the input space by simply assigning null probability to the measurable set I \ K and by defining g(x) = 0 for x ∈ I \ K.

A. The Distribution-Free Case

In this section we suppose that no information is available about the behavior of the probability Q. We only know that Q belongs to a subset Q of the complete collection of all the probability measures on X. With this assumption, privileging certain regions of the input space over others would be unreasonable; consequently, we can consider the uniform probability density instead of q in the computation of the expected risk, which reduces to the following risk functional:

    R(α) = ∫_X ℓ[g(x), ψ(x, α)] dx.                                         (7)
The use of this risk functional in place of R_Q(α) is theoretically motivated by the following result.

Theorem 3. Suppose that every Q ∈ Q admits a density q with ‖q‖_∞ ≤ M for some fixed M ∈ R and that the risk functional R(α) can be minimized up to any desired accuracy (zero-error hypothesis), that is, min_{α∈Λ} R(α) = 0. Then, R(α) = 0 implies R_Q(α) = 0 for every Q ∈ Q.

Proof. Consider α ∈ Λ such that R(α) = 0. For every Q ∈ Q, having density q, we can write

    R_Q(α) = ∫_X ℓ[g(x), ψ(x, α)] q(x) dx ≤ M ∫_X ℓ[g(x), ψ(x, α)] dx = M R(α) = 0.
Thus, R_Q(α) = 0, since it is a nonnegative quantity.

This result ensures that if the risk R(α) can be minimized up to any accuracy (as is the case for many approximators, including most neural network architectures), conditions for learnability obtained by considering the risk functional (7) hold also for the expected risk R_Q(α), provided that the probability measure Q is absolutely continuous with respect to the uniform one and its density is bounded. Under this assumption the validity of Condition 1 can be established by employing the following result, which can be viewed as the parallel of Theorem 1.

Theorem 4. Condition 1 is verified if the sequence {Ψ(L)}_{L=1}^∞ satisfies

    lim_{L→∞} sup_{α∈Λ} |R_emp(α, Ψ(L)) − R(α)| = 0.                        (8)
Proof. If condition (8) holds, for any ε > 0 we can choose an L̄ = L̄(ε) such that for every L ≥ L̄

    R(α*_L) ≤ R_emp(α*_L, Ψ(L)) + ε/2   and   R_emp(α*, Ψ(L)) ≤ R(α*) + ε/2.

Since, by definition of α*_L, we have R_emp(α*_L, Ψ(L)) ≤ R_emp(α*, Ψ(L)), the fulfillment of Condition 1 follows.
Condition (8) can be considered as the equivalent for DL of the uniform convergence of empirical means property analyzed in Section III for SL. Since we are using the uniform density to compute the risk functional R(α), a basic requirement for the fulfillment of such a condition is that the points of the deterministic sequence x^L = Ψ(L) are well spread over the input space X. If β is a collection of Lebesgue-measurable subsets of X and B ∈ β, denote with c_B the characteristic function

    c_B(x) = 1 if x ∈ B,  0 otherwise

and with C(B, x^L) the number of points of x^L that belong to B

    C(B, x^L) = Σ_{l=0}^{L−1} c_B(x_l).                                     (9)
Then, if λ(B) is the Lebesgue measure of the subset B, the spreading of the set of points x^L over B can be measured by the absolute difference between the ratio C(B, x^L)/L and λ(B). If we consider the whole collection β, this measure gives rise to the quantity

    D_β(x^L) = sup_{B∈β} | C(B, x^L)/L − λ(B) |.                            (10)

The following particular choices of β are commonly employed in numerical analysis (Fang and Wang, 1994; Niederreiter, 1992) and probability (Alon and Spencer, 2000).
Definition 3. If β is the collection of all the closed subintervals of X of the form ∏_{i=1}^d [a_i, b_i], then the quantity D_β(x^L) is called discrepancy and is denoted by D(x^L). If β is the collection of all the closed subintervals of X of the form ∏_{i=1}^d [0, b_i], then the quantity D_β(x^L) is called star discrepancy and is denoted by D*(x^L).

A classic result (Kuipers and Niederreiter, 1974) states that the following three properties are equivalent:

1. Ψ(L) is uniformly distributed in X, that is, lim_{L→∞} C[B, Ψ(L)]/L = λ(B) for all the subintervals B of X.
2. lim_{L→∞} D(Ψ(L)) = 0.
3. lim_{L→∞} D*(Ψ(L)) = 0.

Thus, a uniformly well-distributed sequence of points in the input domain has a small discrepancy or star discrepancy.
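In one dimension the star discrepancy can be computed exactly from the sorted sample via the classical formula D*(x^L) = 1/(2L) + max_l |x_(l) − (2l + 1)/(2L)| (0-indexed order statistics). The sketch below, with illustrative sample sizes, compares an i.i.d. uniform sample with the van der Corput sequence, a classical deterministic low-discrepancy construction used here purely as an illustration:

```python
import random

def star_discrepancy_1d(points):
    """Exact 1D star discrepancy via
    D*(x^L) = 1/(2L) + max_l |x_(l) - (2l + 1)/(2L)|, x sorted, 0-indexed."""
    xs, L = sorted(points), len(points)
    return 1 / (2 * L) + max(abs(x - (2 * l + 1) / (2 * L)) for l, x in enumerate(xs))

def van_der_corput(L, base=2):
    """First L terms of the van der Corput sequence (radical inverse in `base`)."""
    seq = []
    for n in range(L):
        q, denom, x = n, 1, 0.0
        while q:
            q, r = divmod(q, base)
            denom *= base
            x += r / denom
        seq.append(x)
    return seq

L = 256
random.seed(0)
d_random = star_discrepancy_1d([random.random() for _ in range(L)])
d_vdc = star_discrepancy_1d(van_der_corput(L))
print(f"i.i.d. uniform: D* = {d_random:.4f}   van der Corput: D* = {d_vdc:.4f}")
```

For L a power of the base, the first L van der Corput points are exactly the equispaced fractions k/L, so D* = 1/L, far below the typical value for a random sample of the same size.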
CERVELLERA AND MUSELLI
Now, with each vertex v of a given subinterval B = ∏_{i=1}^d [a_i, b_i] of X a binary string s can be associated, whose ith bit is 0 if the corresponding component v_i of the vertex is equal to a_i and 1 if v_i = b_i. Denote with EB (respectively OB) the set of vertexes whose associated strings contain an even (respectively odd) number of 1s. For every function ϕ : X → R we define Δ(ϕ, B) as the alternating sum of ϕ computed at the vertexes of B, that is,

Δ(ϕ, B) = Σ_{x∈EB} ϕ(x) − Σ_{x∈OB} ϕ(x).
Definition 4. The variation of ϕ on X in the sense of Vitali is defined by (Niederreiter, 1992)

V^(d)(ϕ) = sup_β Σ_{B∈β} |Δ(ϕ, B)|    (11)

where β is any partition of X into subintervals of the form ∏_{i=1}^d [a_i, b_i].

If the partial derivatives of ϕ are continuous on X, the variation V^(d)(ϕ) can be written as (Niederreiter, 1992)

V^(d)(ϕ) = ∫_0^1 ⋯ ∫_0^1 |∂^d ϕ/(∂x_1 ⋯ ∂x_d)| dx_1 ⋯ dx_d    (12)
where x_i is the ith component of x. The equivalence between Eq. (11) and Eq. (12) can be readily seen when the function ϕ is monotone increasing in the domain [0, 1]^d. In this case, the supremum in Eq. (11) is reached when the partition β contains only the whole interval [0, 1]^d. On the other hand, the dth derivative in Eq. (12) is always nonnegative and a direct integration shows that the alternating sum Δ(ϕ, [0, 1]^d) follows as result. Similar reasoning makes it possible to achieve the same conclusion when ϕ is monotone decreasing. For a general ϕ the equivalence between Eq. (11) and Eq. (12) can be verified by partitioning the domain [0, 1]^d into subintervals, such that the restriction of ϕ to each of them is again monotone.

For 1 ≤ k ≤ d and 1 ≤ i_1 < i_2 < ⋯ < i_k ≤ d, let V^(k)(ϕ, i_1, …, i_k) be the variation in the sense of Vitali of the restriction of ϕ to the k-dimensional face {(x_1, …, x_d) ∈ X: x_i = 1 for i ≠ i_1, …, i_k}.

Definition 5. The variation of ϕ on X in the sense of Hardy and Krause is defined by (Niederreiter, 1992)

VHK(ϕ) = Σ_{k=1}^{d} Σ_{1≤i_1<⋯<i_k≤d} V^(k)(ϕ, i_1, …, i_k).

o(p,q)k − y(p,q)k), or it can remain with the weights unchanged (for o(p,q)k − y(p,q)k = 0). An alternative solution to Eq. (40) can be obtained by solving the constrained problem in Eq. (38) as follows (Lukac et al., 2004b; Yin et al., 1993):

w(i,j) = P{w(i,j) + 2μ[(x(|ζ|)k − x(1)k − 2|o(p,q)k − x(i,j)k|) − Σ_{(g,h)∈ζ} w(g,h)(x(|ζ|)k − x(1)k − 2|x(i,j)k − x(g,h)k|)]}    (43)
where x(|ζ|)k and x(1)k represent the uppermost and the lowest component-wise order statistics in Eq. (19), respectively, and μ is the positive adaptation constant.

3. Vector Median Filters

Unlike the scalar filters described above, vector filtering schemes utilize the essential spectral characteristics of the noisy color image x. The most popular vector filter is the vector median filter (VMF) (Astola et al., 1990). The VMF is a vector processing operator that has been introduced as an extension of the scalar median filter. The VMF can be derived either as
LUKAC AND PLATANIOTIS
an MLE or by using vector order-statistic techniques (Lukac et al., 2005a). Using the reduced ordering in Eq. (21), the vector median of a population of the vectors inside the supporting window Ψ(p,q) is the lowest ranked vector x(1) ∈ Ψ(p,q). Since the ordering can be used to determine the positions of the different input vectors without any a priori information regarding the signal distributions, vector order-statistic filters, such as the VMF, are robust estimators. Similarly to the traditional MF in Eq. (34), the VMF output can be defined using the minimization concept as follows (Lukac et al., 2005a):

y(p,q) = arg min_{x(g,h)} Σ_{(i,j)∈ζ} ‖x(g,h) − x(i,j)‖L    (44)
where y(p,q) = x(g,h) ∈ Ψ(p,q) denotes the outputted vector belonging to the input set Ψ(p,q) . Such a concept has been used to develop the VMF modifications following the properties of color spaces (Regazoni and Teschioni, 1997). To speed up the calculation of the distances between the color vectors, the VMF based on the linear approximation of the Euclidean norm has been proposed in Barni et al. (1994). It is widely observed that the VMF excellently suppresses impulsive noise (Astola et al., 1990; Smolka et al., 2004). To improve its performance in the suppression of additive Gaussian noise, the VMF has been combined with linear filters (Astola et al., 1990). This so-called extended VMF is defined as
y(p,q) = yAMF(p,q) if Σ_{(i,j)∈ζ} ‖yAMF(p,q) − x(i,j)‖L ≤ Σ_{(i,j)∈ζ} ‖yVMF(p,q) − x(i,j)‖L, and y(p,q) = yVMF(p,q) otherwise,    (45)

where yVMF(p,q) is the VMF output obtained in Eq. (44) and yAMF(p,q) is the output of the arithmetic mean filter (AMF) defined over the vectors inside the neighborhood ζ:

yAMF(p,q) = (1/|ζ|) Σ_{(i,j)∈ζ} x(i,j).    (46)
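As a minimal illustrative sketch (the window contents and helper names are ours, not from the text), the VMF of Eq. (44) and the switching rule of Eq. (45) can be written as:

```python
import math

# Sketch of the VMF, Eq. (44), and the extended VMF switch, Eq. (45):
# the VMF outputs the input vector minimizing the aggregated Euclidean
# distance to the window; the extended VMF prefers the arithmetic mean
# (AMF) whenever its aggregated distance is no larger.
def vector_median(window):
    return min(window, key=lambda c: sum(math.dist(c, x) for x in window))

def amf(window):
    n = len(window)
    return tuple(sum(x[k] for x in window) / n for k in range(len(window[0])))

def extended_vmf(window):
    agg = lambda y: sum(math.dist(y, x) for x in window)
    y_amf, y_vmf = amf(window), vector_median(window)
    return y_amf if agg(y_amf) <= agg(y_vmf) else y_vmf

smooth = [(10, 10, 10), (11, 10, 11), (10, 11, 10), (11, 11, 11)]
edge = [(0, 0, 0), (0, 0, 0), (255, 255, 255), (0, 0, 0)]
print(extended_vmf(smooth))  # AMF chosen in the smooth area: (10.5, 10.5, 10.5)
print(extended_vmf(edge))    # VMF chosen near the edge: (0, 0, 0)
```

The two test windows illustrate the switching behavior described next: averaging in smooth areas, median selection where an edge dominates the window.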
Even though the sample outputted in Eq. (45) is not always one of the input samples, the filter adapts to the input signal by applying the VMF near a signal edge and using the AMF in the smooth areas. Thus, the extended VMF preserves the structural information of the image x while improving noise attenuation in the smooth areas. Another modification of the VMF filter uses the multistage filtering concept and the finite impulse response (FIR) filters to increase the design freedom of the VMF and to reduce the number of processing operations required to find the VMF output (Astola et al., 1990). The so-called vector FIR-median hybrid filters combine linear filtering with the VMF operation by dividing the
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
filter window ζ into an odd number of smaller subwindows where FIR filters are operating. The output of such a vector hybrid filter is the vector median of the FIR filter outputs. For example, using the three AMF filters with the subwindows ζ1 = {(p − 1, q − 1), (p − 1, q), (p − 1, q + 1), (p, q − 1)}, (p, q), and ζ2 = {(p, q + 1), (p + 1, q − 1), (p + 1, q), (p + 1, q + 1)}, respectively, the output of the vector FIR-median filter is defined as follows (Astola et al., 1990):

y(p,q) = fVMF[fAMF(x(i,j); (i, j) ∈ ζ1), x(p,q), fAMF(x(i,j); (i, j) ∈ ζ2)]    (47)

where ζ = {ζ1 ∪ (p, q) ∪ ζ2}. The functions fVMF(·) and fAMF(·) denote the VMF and AMF operations, respectively. Since the central sample x(p,q) is usually the most important sample in the supporting window Ψ(p,q), it remains unchanged by the corresponding FIR filter (identity filter). The main advantage of the vector FIR-median filters is that they significantly speed up the filtering process compared to the traditional VMF.

Both the VMF and the extended VMF utilize the aggregated Euclidean distance of the input vectors within a processing window. However, these measures take into account neither the importance of the specific samples in the filter window nor the structural contents of the image. Much better results can be obtained when the weighting coefficients w(i,j) are introduced into the filter structure to control the contribution of the associated input vectors x(i,j) to the aggregated distances (Lukac et al., 2003a; Viero et al., 1994):

D(i,j) = Σ_{(g,h)∈ζ} w(g,h) ‖x(i,j) − x(g,h)‖L    (48)
used as the ordering criterion in Eq. (21). Based on the aggregated weighted magnitude distances in Eq. (48), the so-called weighted VMF (WVMF) operators (Lucat et al., 2002; Lukac et al., 2004c; Viero et al., 1994) produce the lowest ranked vector x(1) in Eq. (21) as the filter output, that is, y(p,q) = x(1). Similarly to the traditional VMF operator, the output of the WVMF filters is equivalently determined using the minimization concept as follows (Lukac et al., 2004c):

y(p,q) = arg min_{x(g,h)} Σ_{(i,j)∈ζ} w(i,j) ‖x(g,h) − x(i,j)‖L    (49)
where y(p,q) = x(g,h) ∈ Ψ(p,q) represents the filter output. In the case of the unity weight vector w = [w(i,j ) = 1, (i, j ) ∈ ζ ], the WVMF definition [Eq. (49)] reduces to the earlier one [Eq. (44)] of the VMF. Note that both the VMF and the WVMF are generalized within a class of selection weighted vector filters (SWVF) (Lukac et al., 2004c) presented in Section IV.A.5. Thus,
to tune the performance of the WVMF operators (Lukac et al., 2003a), the SWVF optimization framework can be utilized (Lukac et al., 2004c).

In some situations, the choice of the outputted sample y(p,q) from the input set Ψ(p,q) may come as a limitation. Therefore, combined vector and component-wise filtering can be used to achieve better noise attenuation (Astola et al., 1990). Operating on the set of the filter weights w(i,j), for (i, j) ∈ ζ, associated with the input vectors x(i,j) ∈ Ψ(p,q), the so-called extended WVMF is defined as follows (Viero et al., 1994):

y(p,q) = yWAMF(p,q) if Σ_{(i,j)∈ζ} w(i,j) ‖yWAMF(p,q) − x(i,j)‖L ≤ Σ_{(i,j)∈ζ} w(i,j) ‖yWVMF(p,q) − x(i,j)‖L, and y(p,q) = yWVMF(p,q) otherwise,    (50)

where yWVMF(p,q) denotes the WVMF output calculated using Eq. (49), and yWAMF(p,q) is the output of a weighted averaging filter

yWAMF(p,q) = (1/Σ_{(g,h)∈ζ} w(g,h)) Σ_{(i,j)∈ζ} w(i,j) x(i,j).    (51)
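A sketch of the WVMF criterion of Eq. (49) (the window and weight values are illustrative assumptions): a large center weight biases the aggregated distances toward keeping the central sample, which is how the weights trade noise attenuation for detail preservation.

```python
import math

# Sketch of the WVMF, Eq. (49): each distance term is scaled by the weight of
# the corresponding window position before aggregation; the candidate with the
# smallest weighted aggregate becomes the output.
def weighted_vector_median(window, weights):
    def agg(cand):
        return sum(w * math.dist(cand, x) for w, x in zip(weights, window))
    return min(window, key=agg)

window = [(10, 10, 10), (12, 11, 10), (11, 12, 12),
          (10, 11, 10), (200, 60, 40), (11, 10, 11),
          (12, 12, 11), (10, 12, 10), (11, 11, 12)]
uniform = [1] * 9                          # reduces to the VMF of Eq. (44)
center_heavy = [1, 1, 1, 1, 20, 1, 1, 1, 1]
print(weighted_vector_median(window, uniform))       # a smooth-region vector
print(weighted_vector_median(window, center_heavy))  # the center sample is kept
```

With uniform weights the deviating center sample is rejected as an impulse; with a heavily weighted center it is preserved, as a fine detail would be.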
Similarly to Eq. (45), the extended WVMF operation in Eq. (50) chooses yWAMF(p,q) in smooth areas to produce the final output, whereas near edges it tends to choose the WVMF to preserve the structural information. Since the weighted averaging operation in Eq. (50) tends to smooth fine details and is prone to outliers, improved design characteristics can be obtained by replacing the weighted averaging filter yWAMF(p,q) in Eq. (50) with the alpha-trimmed filter:

yα(p,q) = (1/|ζα|) Σ_{(i,j)∈ζα} x(i,j)    (52)
where α is a design parameter that can have values α = 0, 1, …, |ζ| − 1. The set ζα = {(i, j): D(i,j) ≤ D(|ζ|−α)} ⊂ ζ consists of the spatial locations of the vectors x(i,j) ∈ Ψ(p,q) whose aggregated weighted distances in Eq. (48) are smaller than or equal to the (|ζ| − α)th smallest aggregated weighted distance D(|ζ|−α) ∈ {D(i,j); (i, j) ∈ ζ}. If α = |ζ| − 1, the filter in Eq. (52) is equivalent to a WVMF.

Vector rational filters (VRF) (Khriji and Gabbouj, 1999, 2002) operate on the input vectors x(i,j) of Ψ(p,q) using rational functions:

y(p,q) = P[x(i,j); (i, j) ∈ ζ] / Q[x(i,j); (i, j) ∈ ζ]    (53)
where P(·) = [P1(·), P2(·), P3(·)] is a vector-valued polynomial with the components

Pk[x(i,j); (i, j) ∈ ζ] = a0 + Σ_{(i,j)∈ζ} a_{i,j} x(i,j)k + Σ_{(i1,j1)∈ζ} Σ_{(i2,j2)∈ζ} a_{i1,j1,i2,j2} x(i1,j1)k x(i2,j2)k + ⋯    (54)

and a0, a_{i,j}, a_{i1,j1,i2,j2}, … are the functions f(x(i,j); (i, j) ∈ ζ) of the input set Ψ(p,q). The function Q(·) is a scalar polynomial

Q[x(i,j); (i, j) ∈ ζ] = b0 + Σ_{(i1,j1)∈ζ} Σ_{(i2,j2)∈ζ} b_{i1,j1,i2,j2} ‖x(i1,j1) − x(i2,j2)‖L    (55)
where b0 > 0 and b_{i1,j1,i2,j2} are constants. The kth component y(p,q)k of the VRF output vector y(p,q) is defined as y(p,q)k = [[Pk(·)]]/Q(·), where [[Pk(·)]] denotes the integer part of Pk(·). Thus, depending on the filter parameters, the VRF can remove different types of noise (additive Gaussian, impulsive, mixed) while retaining sharp edges. Similarly to the other VMF-like filters, the VRF reduces to rational scalar filters if the vector dimension is one. Finally, there also exist vector median rational hybrid filters that combine the VRF and linear low-pass filters to reduce the computational complexity of the standard VRF operators.

Multichannel L filters (Kotropoulos and Pitas, 2001; Nikolaidis and Pitas, 1996) use a linear combination of the ordered input samples to determine the filter output:

y(p,q) = Σ_{τ=1}^{|ζ|} wτ x(τ)    (56)
where wτ is the weight associated with the τth ordered vector x(τ) ∈ Ψ(p,q). These filters can be designed optimally under the mean-square error (MSE) criterion for a specific additive noise distribution. Assuming the weight vector w = [w1, w2, …, w|ζ|] and the unity vector e = [1, 1, …, 1] of the dimension identical to that of w, the optimal coefficients wτ, for τ = 1, 2, …, |ζ|, can be determined as follows:

w = R^{−1}e / (e^T R^{−1}e)    (57)
FIGURE 10. Directional processing concept on the Maxwell triangle.
where w^T e = 1 is the constraint imposed on the solution and R is a |ζ| × |ζ| correlation matrix of the ordered noise variables. Alternatively, the popular least mean square (LMS) formula w = w + 2μ e(p,q) Ψ^r(p,q), based on the ordered input set Ψ^r(p,q), can be used instead of Eq. (57) to speed up the optimization process.

4. Vector Directional Filters

It has been observed that the filtering techniques taking into account the vectors' magnitude may produce color outputs with chromaticity impairments. To alleviate such problems, a new type of multichannel filters has been proposed (Trahanias and Venetsanopoulos, 1993). The so-called vector directional filter (VDF) family operates on the direction of the image vectors, aiming to eliminate vectors with atypical directions in the vector space (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000). To achieve its objective, the VDF utilizes the angle [Eq. (27)] between the image vectors to order the vector inputs inside a processing window (Plataniotis et al., 1998b).

The output of the basic vector directional filter (BVDF) (Trahanias et al., 1996), defined within the VDF class, is the color vector x(g,h) ∈ Ψ(p,q) whose direction is the MLE of the directions of the input vectors (Nikolaidis and Pitas, 1998). Thus, the BVDF output minimizes the angular ordering criterion (Figure 10) (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000) to the other samples inside the sliding filtering window Ψ(p,q):

y(p,q) = arg min_{x(g,h)} Σ_{(i,j)∈ζ} A(x(g,h), x(i,j)).    (58)
The above definition can be used to express a spherical median (SM) (Trahanias et al., 1996), which minimizes the angular criterion in Eq. (58)
without the constraint that the filter output y(p,q) is one of the original samples within the filtering window Ψ(p,q).

It was argued in Tang et al. (2001) that the vector's direction defines its color chromaticity properties. Thus, minimizing the angular distances between vector inputs may produce better performance than the VMF-based approaches in terms of direction preservation (Lukac et al., 2005a, 2005g). On the other hand, the BVDF does not take into account the magnitude characteristics and thus ignores the brightness of color vectors. To utilize both features in color image filtering (Trahanias et al., 1996), the generalized vector directional filters (GVDF) first eliminate the color vectors with atypical directions in the vector space:

y(p,q) = fGVDF(x(1), x(2), …, x(τ))    (59)

where {x(1), x(2), …, x(τ)} is the set of the τ lowest vector order statistics obtained using the angular distances in Eq. (27). As a result of this process, a set of input vectors with approximately the same direction in the vector space is produced as the output set. Then, the GVDF operators process the vectors with the most similar orientation according to their magnitude. Thus, the GVDF splits the color image processing into directional processing and magnitude processing.

Another approach, the directional-distance filter (DDF) (Karakos and Trahanias, 1997), combines both ordering criteria used in the VMF and the BVDF schemes. Using equal amounts of the magnitude and directional information, the DDF makes use of a hybrid ordering criterion expressed through a product of the aggregated Euclidean distance and the aggregated angular distance as follows:

D(i,j) = [Σ_{(g,h)∈ζ} ‖x(i,j) − x(g,h)‖L] · [Σ_{(g,h)∈ζ} A(x(i,j), x(g,h))].    (60)
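The product criterion of Eq. (60) can be sketched as follows (the window values are illustrative; the angular distance A(·,·) is implemented as the arccos of the normalized inner product, as in Eq. (27)):

```python
import math

# Sketch of the DDF ordering criterion, Eq. (60): each sample's score is the
# product of its aggregated Euclidean distance and its aggregated angular
# distance to the rest of the window; the sample with the smallest product
# becomes the output.
def angle(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    # clamp guards against rounding slightly outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def ddf(window):
    def score(c):
        mag = sum(math.dist(c, x) for x in window)
        ang = sum(angle(c, x) for x in window)
        return mag * ang
    return min(window, key=score)

# Reddish vectors of varying brightness plus one chromatic outlier:
window = [(120, 40, 30), (240, 80, 60), (60, 20, 15),
          (30, 200, 25), (180, 60, 45)]
print(ddf(window))  # a reddish vector; the greenish outlier is never selected
```

Because both factors are large for the chromatic outlier, it is penalized in both the magnitude and the directional domain at once.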
The DDF output is the sample x(1) ∈ Ψ(p,q) in Eq. (21) associated with the smallest value D(i,j ) , for (i, j ) ∈ ζ . The introduction of the DDF inspired a new set of heuristic vector processing filters such as the hybrid vector filters (HVF) (Gabbouj and Cheickh, 1996; Plataniotis and Venetsanopoulos, 2000). These filters try to capitalize on the same appealing principle, namely the simultaneous minimization of the distance functions used in the VMF and the BVDF. The HVFs operate on the direction and the magnitude of the color vectors independently and then combine them to produce a unique final output. The HVF1 approach, viewed as a nonlinear combination of the VMF and BVDF filters, generates an output according to the following rule (Lukac et al., 2004b; Plataniotis and
Venetsanopoulos, 2000):

y(p,q) = yVMF(p,q) if yVMF(p,q) = yBVDF(p,q), and y(p,q) = ȳ1(p,q) otherwise,    (61)

where yVMF(p,q) is the VMF output obtained in Eq. (44), yBVDF(p,q) characterizes the BVDF output in Eq. (58), and ȳ1(p,q) is the vector calculated as

ȳ1(p,q) = (|yVMF(p,q)| / |yBVDF(p,q)|) yBVDF(p,q)    (62)

with |·| denoting the magnitude of the vector. A more refined, nonlinear combiner is the so-called HVF2 (Gabbouj and Cheickh, 1996), which combines the AMF, VMF, and BVDF as follows (Lukac et al., 2004b; Plataniotis and Venetsanopoulos, 2000):

y(p,q) = yVMF(p,q) if yVMF(p,q) = yBVDF(p,q); y(p,q) = ȳ1(p,q) if Σ_{(i,j)∈ζ} ‖x(i,j) − ȳ1(p,q)‖ < Σ_{(i,j)∈ζ} ‖x(i,j) − ȳ2(p,q)‖; and y(p,q) = ȳ2(p,q) otherwise,    (63)

where ȳ1(p,q) is the vector obtained in Eq. (62) and ȳ2(p,q) is the vector defined as

ȳ2(p,q) = (|yAMF(p,q)| / |yBVDF(p,q)|) yBVDF(p,q)    (64)
with yAMF(p,q) in Eq. (46) denoting the output of the AMF operating inside the same processing window Ψ(p,q) positioned at (p, q). Both hybrid vector filters are computationally demanding due to the required evaluation of both the VMF and BVDF outputs (Plataniotis and Venetsanopoulos, 2000), since two independent ordering schemes must be applied to the input samples to produce a unique final output.

The recently introduced weighted vector directional filters (WVDF) (Lukac, 2004a; Lukac et al., 2004b) utilize nonnegative real weighting coefficients w(i,j) associated with the input vectors x(i,j), for (i, j) ∈ ζ. These filters output the color vector y(p,q) = x(g,h) ∈ Ψ(p,q) which minimizes the aggregated weighted angular distance to the other samples inside the processing window Ψ(p,q):

y(p,q) = arg min_{x(g,h)} Σ_{(i,j)∈ζ} w(i,j) A(x(g,h), x(i,j)).    (65)
Equivalently, the WVDF output is determined using the lowest vector order statistic x(1) from the ordered set in Eq. (21) based on the aggregated weighted angular distance criterion (Lukac et al., 2004b):

D(i,j) = Σ_{(g,h)∈ζ} w(g,h) A(x(i,j), x(g,h))    (66)
where D(i,j) is associated with the input vector x(i,j). Since the WVDF output is by construction one of the original samples in the input set Ψ(p,q), the filter never introduces new outlying vectors. Based on the actual weight vector w in Eq. (65) or (66), the WVDF filtering class extends the flexibility of the VDF-based designs, improves the detail-preserving filtering characteristics of the conventional VDF schemes, and provides a powerful color image filtering tool capable of tracking varying signal and noise statistics. To obtain the desired performance of the WVDF operators, the least mean absolute (LMA) error-based multichannel adaptation algorithms operating in the directional domain of the processed vectors have been introduced in Lukac et al. (2004b). The WVDF and WVMF weight adaptation algorithms, along with the WVDF and WVMF operators, have been generalized in the unified framework of the selection weighted vector filters in Lukac et al. (2004c).

5. Selection Weighted Vector Filters

The structure of the selection weighted vector filters (SWVF) (Lukac et al., 2004c) is characterized by a design parameter ξ ranging from 0 to 1, and a set of nonnegative real weights w = {w(i,j); (i, j) ∈ ζ}. For each input sample x(i,j), (i, j) ∈ ζ, the weights w(i,j) are used to form a SWVF processing function fSWVF(Ψ(p,q), w, ξ) defined as follows (Lukac et al., 2004c; Lukac and Plataniotis, 2005b):

y(p,q) = arg min_{x(g,h)} [Σ_{(i,j)∈ζ} w(i,j) ‖x(g,h) − x(i,j)‖L]^{1−ξ} × [Σ_{(i,j)∈ζ} w(i,j) A(x(g,h), x(i,j))]^ξ    (67)
where y(p,q) = x(g,h) ∈ Ψ(p,q) represents the filter output. The selective nature of the SWVF operator and the use of the minimization concept ensure the outputting of the input color vector x(g,h) , which is the most similar, under the specific setting of w, to other samples in Ψ(p,q) . The weighting coefficient w(i,j ) signifies the importance of x(i,j ) in Ψ(p,q) . Through the weight vector w and the design parameter ξ , the SWVF scheme tunes the overall filter’s detail-preserving and noise-attenuating characteristics
FIGURE 11. Adaptive filtering concept with the parameter's adaptation obtained using (a) original signal, (b) noisy signal.
and uses both the spatial and spectral characteristics of the color image x during processing. Depending on the value of the parameter 0 ≤ ξ ≤ 1 in Eq. (67), color image processing can be performed in the magnitude (ξ = 0) or directional (ξ = 1) domain. By setting ξ = 0.5, the SWVF processes the input image using an equal amount of the magnitude and directional information. Any deviation from this value to a smaller or larger value of ξ places more emphasis on the magnitude or directional characteristics, respectively. Thus, each setting of the filter parameters w and ξ represents a specific filter that can be used for a specific task. This suggests that SWVF filters constitute a wide class of vector operators. For example, the use of the unity weight vector w = 1 in Eq. (67) with ξ = 0 reduces an SWVF operator to the VMF, while the use of w = 1 reduces the SWVF to the BVDF (for ξ = 1) and to the DDF (for ξ = 0.5). Similarly, the use of ξ = 0 and ξ = 1 reduces the SWVF to the WVMF and WVDF, respectively, for an arbitrary weight vector w.

The use of the SWVF scheme [Eq. (67)] requires the determination of the weight vector w by the end-user. Alternatively, if the original signal o(p,q) of Eq. (6) is available, the weights w(i,j) in Eq. (67) can be adapted as follows (Figure 11a) (Lukac et al., 2004c):

w(i,j) = P{w(i,j) + 2μ R(o(p,q), y(p,q)) sgn[R(x(i,j), y(p,q))]}    (68)
where μ is a regulation factor. Each weight w(i,j) is adjusted by adding the contributions of the corresponding input vector x(i,j) and the SWVF output y(p,q). These contributions are measured as the distances to the original signal o(p,q), which is used to guide the adaptation process (Lukac et al., 2004b). The initial weight vector can be set to arbitrary positive values, but equal weighting coefficients w(i,j) = 1, for (i, j) ∈ ζ, corresponding to robust smoothing functions, together with μ ≪ 0.5, are the values recommended in Lukac et al. (2004c) for conventional color image processing applications. To minimize the influence of the initial setting of the SWVF parameters, the adaptation formula should allow for the adjustment
of w(i,j) using both positive and negative contributions. Therefore, Eq. (68) is constructed using the sign sigmoidal function [Eq. (39)] and the vectorial sign function (Lukac et al., 2004c):

R(x(i,j), x(g,h)) = S(x(i,j), x(g,h)) ‖x(i,j) − x(g,h)‖L^{1−ξ} A(x(i,j), x(g,h))^ξ,    (69)

which considers contributions using both the magnitude and directional information with the polarity S(·) ∈ {−1, 1} defined as follows:

S(x(i,j), x(g,h)) = +1 for x(i,j) − x(g,h) ≥ 0, and −1 for x(i,j) − x(g,h) < 0.    (70)
The use of R(·) is essential in sgn(·) since the positive (or negative) values of R(x(i,j), y(p,q)) allow for the corresponding adjustment of w(i,j) in Eq. (68) by adding the negative (or positive) value of 2μR(o(p,q), y(p,q)) sgn[R(x(i,j), y(p,q))]. If the sample under consideration x(i,j) and the actual SWVF output y(p,q) are identical [i.e., R(x(i,j), y(p,q)) = 0], then sgn(·) = 1, which suggests that w(i,j) is adjusted based solely on the difference between the SWVF output y(p,q) and the original signal o(p,q). To keep the aggregated distances in Eq. (67) positive, and thus to ensure the unbiased low-pass characteristics of the SWVF filters, a projection function [Eq. (41)] is used to project the updated weight w(i,j) onto the constraint space of w during the adaptation process in Eq. (68).

If the original signal o(p,q) is not available (Figure 11b), the weights w(i,j) in Eq. (67) can be adapted by replacing the original signal o(p,q) in Eq. (68) with the feature signal y∗(p,q) = [y∗(p,q)1, y∗(p,q)2, y∗(p,q)3] as follows:

w(i,j) = P{w(i,j) + 2μ R(y∗(p,q), y(p,q)) sgn[R(x(i,j), y(p,q))]}.    (71)
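The SWVF criterion of Eq. (67), together with the role of ξ described above, can be sketched as follows (window and weights are illustrative; the angular function follows Eq. (27)). Setting ξ to 0, 1, or 0.5 with unity weights recovers the VMF, BVDF, and DDF orderings, respectively.

```python
import math

# Sketch of the SWVF of Eq. (67): the score of each candidate is the weighted
# aggregated magnitude distance raised to (1 - xi) times the weighted
# aggregated angular distance raised to xi.
def angle(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def swvf(window, weights, xi):
    def score(c):
        mag = sum(w * math.dist(c, x) for w, x in zip(weights, window))
        ang = sum(w * angle(c, x) for w, x in zip(weights, window))
        return mag ** (1 - xi) * ang ** xi
    return min(window, key=score)

window = [(100, 90, 80), (250, 40, 40), (95, 85, 85),
          (105, 95, 75), (90, 80, 90), (98, 88, 82)]
w1 = [1] * len(window)
print(swvf(window, w1, 0.0))  # magnitude-only ordering (VMF)
print(swvf(window, w1, 1.0))  # direction-only ordering (BVDF)
print(swvf(window, w1, 0.5))  # equal emphasis (DDF)
```

The single parameter ξ thus interpolates one selection filter family between the magnitude and directional extremes, which is what makes the framework a convenient unifying tool.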
The considered adaptation scheme leads to a number of SWVF filters with different design characteristics. For example, the feature signal y∗(p,q) can be obtained in one of the following ways (Lukac et al., 2004c; Lukac and Plataniotis, 2005b):
• The use of the acquired signal y∗(p,q) = x(p,q) is useful when the corrupting noise power is low and strong detail-preserving characteristics are expected from the SWVF operators.
• The robustness of the SWVF operator and its noise attenuation capability are ensured using a robust, easy-to-calculate estimate such as the component-wise median of Ψ(p,q), with the components y∗(p,q)k = median{x(i,j)k; (i, j) ∈ ζ} of y∗(p,q) obtained in Eq. (34).
• The trade-off between the noise-attenuating and detail-preserving characteristics can be obtained through the combination of the input signal and the obtained estimate.
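The second option, the component-wise (marginal) median of Eq. (34), is simple to sketch (window values are illustrative):

```python
# Sketch of the component-wise (marginal) median used as a robust feature
# signal y*: each channel is the scalar median of that channel over the
# window, so single-channel impulses are rejected channel by channel.
def marginal_median(window):
    def med(vals):
        s = sorted(vals)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return tuple(med([x[k] for x in window]) for k in range(len(window[0])))

window = [(10, 10, 10), (11, 12, 10), (250, 9, 11),
          (12, 11, 9), (10, 200, 12)]
print(marginal_median(window))  # (11, 11, 10): both impulses rejected
```

Note that, as discussed for the MF earlier, the marginal median output is generally not one of the input vectors, which is exactly why it serves only as a guiding feature signal here.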
Different design characteristics of the SWVF operators are obtained when the adaptation algorithm in Eq. (43) is extended to process the vector signals. In this case, the uppermost x(|ζ|) and the lowest x(1) ranked vectors in Eq. (21) are used to update the weights w(i,j), for (i, j) ∈ ζ, as follows (Lukac et al., 2004b):

w(i,j) = P{w(i,j) + 2μ[(R(x(|ζ|), x(1)) − 2R(o(p,q), x(i,j))) − Σ_{(g,h)∈ζ} w(g,h)(R(x(|ζ|), x(1)) − 2R(x(i,j), x(g,h)))]}    (72)
where μ is the positive adaptation stepsize. Similarly to Eq. (68), the negative weight coefficients are modified by the projection operation [Eq. (41)]. Following the rationale in Eq. (71), the adaptation formula in Eq. (72) can be modified by replacing the desired signal o(p,q) with the feature signal y∗(p,q). Finally, it should be mentioned that both multichannel adaptation algorithms in Eqs. (68) and (72) generalize their scalar versions in Eqs. (40) and (43), respectively. In addition, the framework generalizes simplified versions of the algorithms used to optimize the WVMF and WVDF operators in the magnitude (ξ = 0) or the directional (ξ = 1) domain of the processed vector signals, respectively. Therefore, it can be concluded that the SWVF framework constitutes a flexible tool for multichannel image processing.

6. Data-Adaptive Vector Filters

Since images are highly nonstationary due to the edges and fine details, and it is difficult to differentiate between noise and edge pixels, fuzzy sets are highly appropriate for image filtering tasks (Plataniotis and Venetsanopoulos, 2000). A number of fuzzy filters, such as the one proposed in Tsai and Yu (2000), adopt a window-based, rule-driven approach leading to a data-dependent fuzzy solution. To obtain the desired performance of such a filter, the fuzzy rules must be optimally set using an optimization procedure, which often requires the presence of the original signal. However, the original data are usually not available in practical applications. Therefore, the fuzzy vector filters in Lukac et al. (2005b) and Plataniotis et al. (1996, 1999) are designed to remove noise in multichannel images without the requirement of fuzzy rules. The most commonly used method to smooth high-frequency variations and transitions is averaging.
Therefore, the general form of the data-dependent filter is given as a fuzzy weighted average (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000) of the input vectors inside the supporting window Ψ(p,q):

y(p,q) = f[Σ_{(i,j)∈ζ} w(i,j) x(i,j)]    (73)

where f(·) is a nonlinear function that operates over the weighted average of the input set and

w(i,j) = μ(i,j) / Σ_{(g,h)∈ζ} μ(g,h)    (74)

is the normalized filter weight calculated using the weighting coefficient μ(i,j) equivalent to the fuzzy membership function associated with the input color vector x(i,j) ∈ Ψ(p,q). Note that the two constraints w(i,j) ≥ 0 and Σ_{(i,j)∈ζ} w(i,j) = 1 are necessary to ensure that the filter output is an unbiased estimator and produces the samples within the desired intensity range.

Operating on the vectorial inputs x(i,j), the weights w(i,j) in Eq. (74) are determined adaptively using functions of a distance criterion between the input vectors (Lukac et al., 2005a). Since the relationship between distances measured in physical units and perception is generally exponential (Plataniotis et al., 1999), an exponential type of function may be suitable for use in the weighting formulation (Lukac et al., 2005b; Plataniotis et al., 1999):

μ(i,j) = β(1 + exp{D(i,j)})^{−r}    (75)
where r is a parameter adjusting the weighting effect of the membership function, β is a normalizing constant, and D(i,j) is the aggregated distance or similarity measure defined in Eq. (20). The data-adaptive filters can be optimized for any noise model by appropriately tuning their membership function in Eq. (75).

The vector y(p,q) outputted in Eq. (73) is not part of the original input set Ψ(p,q). In some image processing applications (Lukac et al., 2005a), constrained solutions such as the VMF of Eq. (44) and the BVDF of Eq. (58), which can provide higher preservation of image details (see Figure 12) compared to the unconstrained solutions, are required. Therefore, a different design strategy should be used and the adaptive weights in Eq. (73) can be redefined as follows (Lukac et al., 2005b; Plataniotis et al., 1999):

w(i,j) = 1 if μ(i,j) = μmax, and w(i,j) = 0 if μ(i,j) ≠ μmax,    (76)

where μmax ∈ {μ(i,j); (i, j) ∈ ζ} is the maximum fuzzy membership value. If the maximum value occurs at a single point only, Eq. (73) reduces to a selection filtering operation

y(p,q) = x(i,j), for μ(i,j) = μmax,    (77)

which identifies one of the samples inside the processing window Ψ(p,q) as the filter output.

FIGURE 12. Filtering of additive Gaussian noise in Figure 5b: (a) MF output, (b) VMF output, (c) data-adaptive filter output, (d) DPAL filter output.

7. Adaptive Multichannel Filters Based on Digital Paths

Adaptive multichannel filters proposed in Szczepanski et al. (2003, 2004) exploit connections between image pixels using the concept of digital paths instead of using a fixed supporting window. Operating in a predefined search area, image pixels are grouped together forming paths that reveal the underlying structural image content and are used to determine the weighting coefficients of a data-adaptive filter in Eq. (73).
Assuming that ρ0 = (i, j) and ρη = (g, h), the two spatial locations inside the search area, are connected by a digital path P^η(ρ0, ρ1, …, ρη) of length η, the connection cost Λ^η(·) defined over the digital path linking the starting location ρ0 and the ending location ρη through η − 1 connecting locations ρc, for c = 1, 2, …, η − 1, is expressed as

Λ^η(ρ0, ρ1, …, ρη) = f(xρ0, xρ1, …, xρη) = Σ_{c=1}^{η} ‖xρc − xρc−1‖.    (78)
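The connection cost of Eq. (78) is simply the accumulated vector distance along a path, so a path that crosses an edge accumulates a large cost. A minimal sketch (image values and paths are illustrative):

```python
import math

# Sketch of Eq. (78): the connection cost of a digital path is the sum of the
# distances between the color vectors at consecutive path locations.
def connection_cost(image, path):
    return sum(math.dist(image[path[c]], image[path[c - 1]])
               for c in range(1, len(path)))

image = {(0, 0): (10, 10, 10), (0, 1): (12, 11, 10),
         (1, 0): (11, 10, 12), (1, 1): (200, 30, 30)}  # (1, 1): edge pixel
flat_path = [(0, 0), (0, 1)]
edge_path = [(0, 0), (1, 0), (1, 1)]
print(connection_cost(image, flat_path) < connection_cost(image, edge_path))  # True
```

A path of identical vectors has cost zero, matching the property stated next.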
The function Λ^η(·) can be seen as a measure of dissimilarity between the color image pixels xρ0, xρ1, …, xρη. If a path P^η(·) joining two distinct locations consists of identical vectors xρc, for c = 0, 1, …, η, then Λ^η(·) = 0; otherwise Λ^η(·) > 0. In general, two distinct pixel locations on the image lattice can be connected by many paths. Moreover, the number of possible geodesic paths of a certain length η connecting two distinct points depends on their locations, the length of the path, and the neighborhood system used (Smolka et al., 2004).

Similarly to the fuzzy membership function [Eq. (75)] used in the data-adaptive filter in Eq. (73), a similarity function is employed here to evaluate the appropriateness of the digital paths leading from (i, j) to (g, h) as follows:

μ^{η,Ψ}[(i, j), (g, h)] = Σ_{b=1}^{χ} f(Λ^{η,b}[(i, j), (g, h)]) = Σ_{b=1}^{χ} exp{−β · Λ^{η,b}[(i, j), (g, h)]}    (79)
where χ is the number of all paths connecting (i, j) and (g, h), Λη,b[(i, j), (g, h)] is the dissimilarity value along a specific path b from the set of all χ possible paths leading from (i, j) to (g, h) in the search area Ψ, f(·) is a smooth function of Λη,b, and β is a design parameter. Note that Ψ can be restricted by the dimension of the supporting window in conventional filtering. If x(p,q) ∈ Ψ(p,q) is the color vector under consideration and x(i,j) ∈ Ψ(p,q) represents a vector connected to x(p,q) via a digital path, the digital path approach (DPA) filter is defined as follows:

y(p,q) = Σ_{(i,j)⇔(p,q)} w(p,q)(i,j) x(i,j) = Σ_{(i,j)⇔(p,q)} μη,Ψ[(p, q), (i, j)] x(i,j) / Σ_{(g,h)⇔(p,q)} μη,Ψ[(p, q), (g, h)]  (80)
LUKAC AND PLATANIOTIS
where (i, j) ⇔ (p, q) denotes all points (i, j) connected by digital paths with (p, q) contained in Ψ(p,q). Adapting the concept in Eq. (73), the output vector y(p,q) is equivalent to the weighted average of all vectors x(i,j) connected by digital paths with the vector x(p,q). A more sophisticated solution is obtained by incorporating information on the local image features into the filter structure (Smolka et al., 2004). This can be done through the investigation of the connection costs Λη(·) of digital paths that originate at ρ0, cross ρ1, and then pass the successive locations ρc, for c = 2, 3, . . . , η, until the path reaches length η. Operating on the above assumptions, the similarity function [Eq. (79)] is modified as

μη,Ψ(ρ0, ρ1, η) = Σ_{b=1}^{χ} exp{−β · Λη,b(ρ0, ρ1, ρ2∗, . . . , ρη∗)}  (81)
where χ denotes the number of the paths Pη(ρ0, ρ1, ρ2∗, . . . , ρη∗) originating at ρ0, crossing ρ1, and ending at ρη∗, which are totally included in the search area Ψ. If the constraint of crossing the location ρ1 is omitted, then Λη,b(ρ0, ρ1, ρ2∗, . . . , ρη∗) can be replaced with Λη,b(ρ0, ρ1∗, ρ2∗, . . . , ρη∗). Using the normalized weights w(ρ0, ρ1∗) = μη,Ψ(ρ0, ρ1, η) / Σ μη,Ψ(·), the so-called DPAF filter replaces the color vector x(p,q) ∈ Ψ(p,q) under consideration as follows (Szczepanski et al., 2003):

y(p,q) = Σ_{ρ1∗∼ρ0} w(ρ0, ρ1∗) xρ1∗.  (82)
Thus, operating inside the supporting window Ψ(p,q), the weights are calculated by exploring all digital paths starting from the central pixel and crossing its neighbors. The output vector y(p,q) is obtained through a weighted average of the nearest neighbors of x(p,q). In a similar way, the so-called DPAL filter can be defined as (Szczepanski et al., 2003)

y(p,q) = Σ_{ρη∗} w(ρ0, ρη∗) xρη∗  (83)
where the weights w(ρ0, ρη∗) are obtained by exploring all digital paths leading from the central pixel x(p,q), for ρ0 = (p, q), to any of the pixels in the supporting window. A weighted average of all pixels contained in the supporting window is then calculated to determine y(p,q). Note that the DPAL filter involves all |ζ| pixels from Ψ(p,q) in the averaging process, whereas the DPAF filter determines the weighted output using only its nearest neighbors. Therefore, the DPAL filter offers stronger smoothing capability.
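The DPAL weighting can be sketched by brute-force path exploration inside a small window; a hedged illustration assuming 4-connectivity and path length η = 2 (all names and parameter values are choices of this sketch, not the published implementation):

```python
import numpy as np

def dpal_like_filter(window, eta=2, beta=0.05):
    """Sketch of a DPAL-style filter (Eqs. 81-83): explore every 4-connected
    digital path of length `eta` starting at the window center, give each
    path the weight exp(-beta * cost), accumulate the weights on the path's
    end pixel, and output the weighted average of the end pixels."""
    h, w, _ = window.shape
    start = (h // 2, w // 2)
    weights = {}                        # (row, col) -> accumulated weight
    def explore(pos, cost, steps):
        if steps == eta:
            weights[pos] = weights.get(pos, 0.0) + np.exp(-beta * cost)
            return
        r, c = pos
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                step = np.linalg.norm(window[nr, nc] - window[r, c])
                explore((nr, nc), cost + step, steps + 1)
    explore(start, 0.0, 0)
    total = sum(weights.values())
    return sum(wgt * window[pos] for pos, wgt in weights.items()) / total

window = np.full((3, 3, 3), 100.0)
window[1, 1] = [255.0, 0.0, 0.0]        # impulse at the center
print(dpal_like_filter(window))         # pulled toward the flat background
```

Paths that cross the impulse accumulate a large cost and therefore a negligible weight, so the output is dominated by the homogeneous background.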
F IGURE 13. Filtering of impulsive noise in Figure 5c: (a) MF output, (b) VMF output, (c) SWVF output, (d) output of the switching filter in Lukac et al. (2004e).
8. Switching Filtering Schemes

Besides adaptive filters such as the SWVF, fuzzy vector filters, and DPA-based filters, the essential trade-off between noise suppression and image-detail preservation (see Figure 13) can be achieved in the impulsive environment by switching within a range of predefined filtering operators (Figure 14) (Hore et al., 2003; Ma and Wu, 2006; Smolka, 2002). As explained in Lukac et al. (2005f), filters following the switching mode paradigm most often switch between a robust nonlinear smoothing mode and an identity processing mode that leaves input samples unchanged during
F IGURE 14. Switching filter concept based on (a) the fixed threshold ξ and (b) fully adaptive control using the signal statistics.
the filtering operation. Their decoupled, easily implemented structure and their computational simplicity have made such filters popular and a method of choice in a variety of applications where the desired signal is corrupted by impulsive noise. Recent developments have seen the introduction of switching schemes based on vector operations, which further increased the appeal of the switching framework in color processing applications (Lukac, 2003, 2004a; Lukac et al., 2005c). Operating on the color vectors inside the supporting window Ψ(p,q), the switching vector filter (SVF) output is defined as follows (Lukac, 2004a):

y(p,q) = { yNSF(p,q) if λ ≥ ξ;  x(p,q) otherwise }  (84)
where yNSF(p,q) denotes a robust nonlinear smoothing filter (NSF) output (e.g., VMF, BVDF, or DDF) and x(p,q) is the input color vector occupying the center of the supporting window Ψ(p,q). The switching mechanism is controlled by comparing the adaptive parameter λ(Ψ(p,q)) with the nonnegative threshold ξ, which can be defined either as the fixed value used in Figure 14a or as the function ξ(Ψ(p,q)) of the input set Ψ(p,q), as shown in Figure 14b. In the case of noise detection (λ ≥ ξ), the input vector x(p,q) ∈ Ψ(p,q) is replaced with the NSF output yNSF(p,q). If λ < ξ, then x(p,q) is considered noise-free and remains unchanged (i.e., the SVF performs the so-called identity operation). If ξ = 0, then the filter output is always the NSF output, while for large values of ξ the switching filter output is always the central pixel x(p,q):

y(p,q) = { yNSF(p,q) if ξ = 0;  x(p,q) if ξ → ∞ }.  (85)
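The switching rule of Eq. (84) separates three roles: a robust smoother, a noise detector λ, and a threshold ξ. A minimal sketch; the vector median smoother and the mean-distance detector below are plausible pairings suggested by the text, and all function names are illustrative:

```python
import numpy as np

def vmf(window):
    """Vector median filter: the window sample minimizing the aggregated
    Euclidean distance to all other samples."""
    pixels = window.reshape(-1, 3)
    agg = np.linalg.norm(pixels[:, None] - pixels[None, :], axis=2).sum(axis=1)
    return pixels[np.argmin(agg)]

def mean_distance_to_others(window):
    """A simple detector lambda: mean distance from the center to the others."""
    pixels = window.reshape(-1, 3)
    center = window[window.shape[0] // 2, window.shape[1] // 2]
    d = np.linalg.norm(pixels - center, axis=1)
    return d.sum() / (len(pixels) - 1)      # exclude the zero self-distance

def svf(window, smoother, detector, xi):
    """Switching vector filter (Eq. 84): smooth only when the detector fires."""
    center = window[window.shape[0] // 2, window.shape[1] // 2]
    return smoother(window) if detector(window) >= xi else center

noisy = np.full((3, 3, 3), 100.0)
noisy[1, 1] = [255.0, 0.0, 0.0]             # impulse at the center
print(svf(noisy, vmf, mean_distance_to_others, xi=60))   # smoothed by the VMF
print(svf(np.full((3, 3, 3), 100.0), vmf, mean_distance_to_others, xi=60))
```

Setting xi = 0 always smooths and a very large xi always performs the identity operation, reproducing the limiting behavior of Eq. (85).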
The SVF family in Lukac (2002a, 2003) uses the switching mechanism in Eq. (84) based on the function λ(Ψ(p,q) , τ ) of the window center x(p,q) and the
robust order statistics {x(c) ∈ Ψ(p,q), c = 1, 2, . . . , τ} obtained in Eq. (21):

λ = (1/τ) Σ_{c=1}^{τ} d(x(p,q), x(c)).  (86)
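The detector of Eq. (86) averages the distances between the window center and the τ most "central" samples; a hedged sketch, assuming the order statistics of Eq. (21) are ranked by aggregated distance as in the vector median (names are illustrative):

```python
import numpy as np

def lambda_order_stats(window, tau=5):
    """Lambda of Eq. (86): mean distance between the window center and the
    tau lowest-ranked samples x_(1), ..., x_(tau), with ranking by the
    aggregated-distance ordering used by the vector median (an assumption
    of this sketch, following Eq. (21) of the text)."""
    pixels = window.reshape(-1, 3).astype(float)
    center = window[window.shape[0] // 2, window.shape[1] // 2].astype(float)
    agg = np.linalg.norm(pixels[:, None] - pixels[None, :], axis=2).sum(axis=1)
    ranked = pixels[np.argsort(agg)]     # most central samples come first
    return float(np.linalg.norm(ranked[:tau] - center, axis=1).mean())

clean = np.full((3, 3, 3), 100.0)
noisy = clean.copy()
noisy[1, 1] = [255.0, 0.0, 0.0]
print(lambda_order_stats(clean))   # 0.0: the center would pass unchanged
print(lambda_order_stats(noisy))   # large: the center would be smoothed
```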
The achieved value of λ is compared with the fixed threshold ξ. It has been found in Lukac (2002a) that the utilization of the angular distance [Eq. (27)] in Eq. (86), along with the set of the five (τ = 5) lowest vector directional order statistics x(c), the BVDF-based yNSF(p,q), and the threshold ξ = 0.16, provides excellent color/structural preserving characteristics. A more robust solution in Lukac (2003) uses Eqs. (86) and (21) based on the Euclidean metric [Eq. (24)], the VMF-based yNSF(p,q), and the control parameters τ = 5 and ξ = 60. A sophisticated switching filtering scheme is obtained using the selection weighted filters (WM, WVMF, WVDF, SWVF) with the weight vector w = [w(i,j); (i, j) ∈ ζ] constituted as follows (Lukac, 2004a):

w(i,j) = { |ζ| − 2c + 2 if (i, j) = (p, q);  1 otherwise }.  (87)

By tuning the smoothing parameter c, for c = 1, 2, . . . , (|ζ| + 1)/2, in such a center-weighted filter, the mechanism regulates the amount of smoothing provided by the filter, ranging from no smoothing (the identity operation for c = 1) to the maximum amount of smoothing [c = (|ζ| + 1)/2 reduces, for example, the WVMF and WVDF operators to the VMF and BVDF, respectively] (Lukac, 2002b). Operating in this range, the switching parameter λ(Ψ(p,q), τ) is defined as follows:

λ = Σ_{c=τ}^{τ+2} d(yc(p,q), x(p,q))  (88)
where yc(p,q) denotes the vector obtained using the selection weighted filters based on Eq. (87) with the parameter c. The value of λ in Eq. (88) is compared in Eq. (84) with the fixed threshold ξ. Using different distance measures d(·) and selection weighted filters yc(p,q) in Eq. (88), and different smoothing filters in Eq. (84), a multitude of SVF filters varying in performance and complexity can be obtained. For example, using the directional-processing-based WVDF yc(p,q), the angular distance [Eq. (27)], the BVDF-based yNSF(p,q), and the parameters τ = 2 and ξ = 0.19, excellent color/structural preservation is obtained (Lukac, 2004a). Robust impulsive noise filtering characteristics are observed when the WVMF-based yc(p,q), the Euclidean distance [Eq. (24)], the VMF-based yNSF(p,q), and ξ = 80 are used
instead (Lukac, 2001). Very low complexity is obtained when the component-wise implementation with the WM-based yc(p,q), the MF-based yNSF(p,q), and ξ = 60 is employed (Lukac et al., 2004e). Finally, it should be noted that using center-weighted selection filters, the concept can be extended from the bilevel smoothing scheme in Eq. (84) to a multilevel smoothing scheme [up to (|ζ| + 1)/2 smoothing levels in Eq. (84)], which can allow for additional flexibility and performance improvement (Lukac and Marchevsky, 2001a; Lukac, 2004a).

The SVF filters in Lukac et al. (2005c) use an approximation of the multivariate dispersion. In this design, the value of λ is determined through the function λ(Ψ(p,q)) = D(p,q), defined as the aggregated distance D(p,q) in Eq. (20) between the window center x(p,q) and the other vectors x(i,j), for (i, j) ∈ ζ, inside the supporting window Ψ(p,q). The parameter ξ(Ψ(p,q), τ) in Eq. (84) is determined as (Lukac et al., 2005c)

ξ = D(1) + τψ = ((|ζ| − 1 + τ)/(|ζ| − 1)) D(1)  (89)

where ψ is the variance approximated using the smallest aggregated distance D(1) ∈ {D(i,j); (i, j) ∈ ζ} as ψ = D(1)/(|ζ| − 1), and τ (suboptimal value τ = 4) is the tuning parameter used to adjust the smoothing properties of the SVF filter. An alternative solution approximates the variance via ψx̄ = Dx̄/|ζ|, where Dx̄ = Σ_{(i,j)∈ζ} d(x̄(p,q), x(i,j)) is the aggregated distance between the multichannel input samples x(i,j) ∈ Ψ(p,q) and the sample mean x̄(p,q). In this case, the parameter ξ(Ψ(p,q), τ) in Eq. (84) is determined as follows (Lukac et al., 2005c):

ξ = Dx̄ + τψx̄ = ((|ζ| + τ)/|ζ|) Dx̄  (90)

with the suboptimal value τ = 12.

9. Similarity Based Vector Filters

Taking advantage of the switching vector filters, the class of similarity-based vector filters has been proposed in Smolka et al. (2003). These filters use the modified aggregated similarity measures

D′(p,q) = Σ_{(g,h)∈ζ, (g,h)≠(p,q)} μ(x(p,q), x(g,h))  (91)

D′(i,j) = Σ_{(g,h)∈ζ, (g,h)≠(i,j)} μ(x(i,j), x(g,h)),  for (i, j) ≠ (p, q),  (92)
associated with the input central vector x(p,q) and the neighboring vectors x(i,j), respectively, located inside the supporting window Ψ(p,q). The function μ(x(i,j), x(g,h)) = μ(‖x(i,j) − x(g,h)‖) denotes the similarity function μ : [0; ∞) → R, which is nonascending and convex in [0; ∞) and satisfies μ(0) = 1, μ(∞) = 0. The similarity between two identical vectors is equal to 1, and the similarity between the two maximally different color vectors [0, 0, 0] and [255, 255, 255] should be very close to 0. The above conditions are satisfied by the following similarity functions (Smolka et al., 2003):

μ(x(i,j), x(g,h)) = exp{−(‖x(i,j) − x(g,h)‖/h)²}  (93)

μ(x(i,j), x(g,h)) = exp{−‖x(i,j) − x(g,h)‖/h}  (94)

μ(x(i,j), x(g,h)) = 1 / (1 + ‖x(i,j) − x(g,h)‖/h),  h ∈ (0; ∞)  (95)

μ(x(i,j), x(g,h)) = 1 / (1 + ‖x(i,j) − x(g,h)‖)^h  (96)

μ(x(i,j), x(g,h)) = 1 − (2/π) arctan(‖x(i,j) − x(g,h)‖/h)  (97)

μ(x(i,j), x(g,h)) = 2 / (1 + exp{‖x(i,j) − x(g,h)‖/h}),  h ∈ (0; ∞)  (98)

μ(x(i,j), x(g,h)) = 1 / (1 + ‖x(i,j) − x(g,h)‖^h).  (99)
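These seven candidates translate directly into code over the scalar distance u = ‖x(i,j) − x(g,h)‖ and the tuning parameter h; the dictionary keys below are descriptive labels of this sketch, not names from the source:

```python
import numpy as np

# Eqs. (93)-(99) as functions of the distance u and the parameter h.
SIMILARITY = {
    "gaussian":   lambda u, h: np.exp(-(u / h) ** 2),                  # (93)
    "laplacian":  lambda u, h: np.exp(-u / h),                         # (94)
    "rational":   lambda u, h: 1.0 / (1.0 + u / h),                    # (95)
    "power":      lambda u, h: 1.0 / (1.0 + u) ** h,                   # (96)
    "arctan":     lambda u, h: 1.0 - (2.0 / np.pi) * np.arctan(u / h), # (97)
    "sigmoidal":  lambda u, h: 2.0 / (1.0 + np.exp(u / h)),            # (98)
    "polynomial": lambda u, h: 1.0 / (1.0 + u ** h),                   # (99)
}

# Every candidate satisfies mu(0) = 1 and decays toward 0 for large distances.
for name, mu in SIMILARITY.items():
    assert abs(mu(0.0, 2.0) - 1.0) < 1e-12, name
    assert mu(1e3, 2.0) < 0.01, name
```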
By comparing the values D′(p,q) and D′(i,j), for (i, j) ∈ ζ and (i, j) ≠ (p, q), the switching filtering function (Smolka et al., 2003) is formed:

y(p,q) = { y′(p,q) if D′(p,q) ≤ min{D′(i,j)};  x(p,q) otherwise }  (100)

where the satisfied condition D′(p,q) ≤ min{D′(i,j)} identifies the noisy window center x(p,q) (with the minimum similarity to the other vectors in Ψ(p,q)) to be replaced with

y′(p,q) = arg max_{x(i,j)} {D′(i,j)(x(i,j)); (i, j) ∈ ζ, (i, j) ≠ (p, q)}.  (101)

The vector y′(p,q) denotes the input vector x(i,j) that maximizes the aggregated similarity measure in Eq. (92) defined over the vectors neighboring the central vector x(p,q) ∈ Ψ(p,q). If D′(p,q) > min{D′(i,j)} in Eq. (100), then the window center x(p,q) is passed to the filter output unchanged.
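A compact sketch of Eqs. (91), (92), (100), and (101), using the exponential similarity of Eq. (94); the handling of ties and the exact neighbor sums follow one plausible reading of the definitions, and the names are illustrative:

```python
import numpy as np

def similarity_switching_filter(window, mu):
    """Similarity-based switching filter: replace the window center by the
    neighbor of maximum aggregated similarity (Eq. 101) only when the center
    has the smallest aggregated similarity in the window (Eq. 100)."""
    pixels = window.reshape(-1, 3).astype(float)
    ci = pixels.shape[0] // 2                     # index of the window center
    sim = mu(np.linalg.norm(pixels[:, None] - pixels[None, :], axis=2))
    np.fill_diagonal(sim, 0.0)                    # drop self-similarity terms
    d = sim.sum(axis=1)                           # aggregated similarities D'
    nb = np.arange(pixels.shape[0]) != ci
    if d[ci] <= d[nb].min():                      # center least similar: noisy
        return pixels[nb][np.argmax(d[nb])]       # Eq. (101)
    return pixels[ci]                             # identity operation

mu = lambda u: np.exp(-u / 50.0)                  # Eq. (94) with h = 50
noisy = np.full((3, 3, 3), 100.0)
noisy[1, 1] = [255.0, 0.0, 0.0]                   # impulse at the center
print(similarity_switching_filter(noisy, mu))     # [100. 100. 100.]
```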
Apart from the various similarity measures listed in Eqs. (93)–(99), the simplest function

μ(x(i,j), x(g,h)) = { 1 − ‖x(i,j) − x(g,h)‖/h if ‖x(i,j) − x(g,h)‖ ≤ h;  0 otherwise }  (102)

where h ∈ (0; ∞), can be used through the aggregated distance function

D(i,j) = { Σ_{(g,h)∈ζ, (g,h)≠(p,q)} (−h + ‖x(i,j) − x(g,h)‖) if (i, j) = (p, q);  Σ_{(g,h)∈ζ, (g,h)≠(p,q), (g,h)≠(i,j)} ‖x(i,j) − x(g,h)‖ otherwise }  (103)

as the basis of the fast VMF-like filter. Taking into consideration the quantities D(i,j) obtained in Eq. (103), the filter replaces the original vector x(p,q) in the supporting window Ψ(p,q) with x(i,j) ∈ Ψ(p,q) as follows:

y(p,q) = arg min_{x(i,j)} {D(i,j)(x(i,j)); (i, j) ∈ ζ}.  (104)
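The fast VMF-like filter can be sketched directly from the aggregated distances of Eq. (103): the −h bias lets a clean center win against its neighbors, while a noisy center loses. Names, parameter values, and test data below are assumptions of this sketch:

```python
import numpy as np

def fast_vmf_like(window, h=150.0):
    """Fast VMF-like filter (Eqs. 102-104): the center's aggregated distance
    is biased by -h per neighbor, and the neighbors' aggregated distances are
    computed without reference to the (possibly noisy) center."""
    pixels = window.reshape(-1, 3).astype(float)
    n, ci = pixels.shape[0], pixels.shape[0] // 2
    dist = np.linalg.norm(pixels[:, None] - pixels[None, :], axis=2)
    agg = np.empty(n)
    for i in range(n):
        if i == ci:                                # Eq. (103), center branch
            others = [j for j in range(n) if j != ci]
            agg[i] = np.sum(dist[ci, others] - h)
        else:                                      # Eq. (103), neighbor branch
            others = [j for j in range(n) if j not in (i, ci)]
            agg[i] = dist[i, others].sum()
    return pixels[np.argmin(agg)]                  # Eq. (104)

noisy = np.full((3, 3, 3), 100.0)
noisy[1, 1] = [255.0, 0.0, 0.0]
print(fast_vmf_like(noisy))        # impulse replaced: [100. 100. 100.]
print(fast_vmf_like(np.full((3, 3, 3), 100.0)))   # clean center retained
```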
The construction of the above vector filter is similar to that of the VMF, with the major difference being the omission of the central vector x(p,q) when calculating D(i,j) in Eq. (103), for (i, j) ≠ (p, q). Since the central vector x(p,q) is not used in calculating the aggregated distances associated with its neighbors, the filter replaces x(p,q) only when it is really noisy. Similar to other switching filters, this preserves the desired image information.

10. Adaptive Hybrid Vector Filters

A robust structure-adaptive hybrid vector filter (SAHVF) (Ma et al., 2005) classifies the central pixel x(p,q) into several different signal activity categories using noise-adaptive preprocessing and modified quadtree decomposition. The classification is performed during processing for each input set Ψ(p,q) determined by the sliding supporting window, and a window-adaptive hybrid filtering operation is then chosen according to the structure classification. Thus, the filter adapts itself to both local statistics, through an update of the filter parameters, and local structures, by modifying the window dimension. The SAHVF employs a presmoothing filter similar to the well-known adaptive filter in Lee and Fam (1987). To use a fast component-wise filter and avoid color artifacts in the output RGB vector, the presmoothing operation
is performed in the decorrelated YCbCr space as follows:

y′(p,q)k = x̄′(p,q)k + (σ²x′k/σ²k)(x′(p,q)k − x̄′(p,q)k)  (105)

where k = 1, 2, 3 indicates the Y, Cb, Cr channel, respectively; and σ²k = (1/|ζ|) Σ_{(i,j)∈ζ} (x′(i,j)k − x̄′(p,q)k)² is a component-wise variance calculated using the sample mean x̄′(p,q)k = (1/|ζ|) Σ_{(i,j)∈ζ} x′(i,j)k of the kth components x′(i,j)k ∈ Ψ′(p,q)k. The components x̄′(p,q)k, for k = 1, 2, 3, constitute the YCbCr versions x′(·,·) = [x′(·,·)1, x′(·,·)2, x′(·,·)3] of the original RGB vectors x(·,·). The local image signal variance σ²x′k = max{σ²k − σ²vk, 0} is obtained using the contaminated additive noise deviation σvk = max{ξ(π/18)σ̄vk − ε, 0}, where

σ̄vk = (√(π/2) / (6(K1 − 2)(K2 − 2))) Σ_{p=2}^{K1−1} Σ_{q=2}^{K2−1} |Ψ′(p,q)k ∗ u(p,q)|  (106)
is the global noise deviation calculated using all the YCbCr image values x̂′(·,·)k ∈ Ψ′(p,q)k corresponding to the output RGB VMF values in Eq. (44). A 3 × 3 Laplacian convolution mask u(p,q) = {1, −2, 1, −2, 4, −2, 1, −2, 1} is utilized to collect the additive noise energy from the input image (Ma et al., 2005). The empirically determined parameters ξ = 1.2 and ε = 4.4 have been found to compensate well for the bias from fine image structures. To achieve sufficient suppression of noise in background areas, the choice of the window dimension of the presmoothing filter in Eq. (106) should depend on the value of the noise deviation σvk; namely, a 3 × 3, 5 × 5, and 7 × 7 window is recommended for 0 ≤ σvk ≤ 15, 15 < σvk ≤ 30, and 30 < σvk, respectively. The structure activity classification is performed at the second SAHVF stage. By converting the preprocessed image in Eq. (105) into the luminance signal L′(p,q)([y′(p,q)1, y′(p,q)2, y′(p,q)3]), the modified quadtree decomposition (Ma et al., 2005) is used to decompose the luminance image {L′(·,·)} into nonoverlapping rectangular blocks represented by the deviation quantities λc, for c = 1, 2, 3, 4. Using the design parameter η (recommended values 0.7 ≤ η ≤ 2.5) and the median absolute deviation (MAD) of the ensemble deviations σ(p,q), the satisfied condition

max_{1≤c≤4}{λc} − min_{1≤c≤4}{λc} ≤ η · MAD{σ(p,q); p = 1, 2, . . . , K1, q = 1, 2, . . . , K2}  (107)
denotes a homogeneous block. If any block is classified as inhomogeneous, the splitting step is recursively applied until all the subblocks are either
marked as homogeneous or they reach the minimum 1 × 1 size. If the dimension of any homogeneous block is larger than 16 × 16 pixels, the recursive block-splitting procedure is applied until all the subblocks reach a 16 × 16 square shape. Since each luminance pixel L(p,q) is contained in an l × l block, for l = 1, 2, . . . , 16, an activity flag a(p,q)(L(p,q)) = l can be determined. Using a 3 × 3 mean filter, the structure activity index is determined as I(p,q) = (1/|ζ|) Σ_{(i,j)∈ζ} a(i,j), where |ζ| = 9. Based on the obtained value of I(p,q), each input RGB vector x(p,q) is finally classified as (1) a high-activity area (for 1.0 ≤ I(p,q) ≤ 2.5), (2) a medium-activity area (for 2.5 < I(p,q) ≤ 5.5), or (3) a low-activity area (for 5.5 < I(p,q) ≤ 16.0). Thus, small values of I(p,q) denote details or impulses requiring the use of a detail-preserving nonlinear filter with a small supporting window. Medium values of I(p,q) usually correspond to edges and textures, whereas large values of I(p,q) denote flat areas and allow for the utilization of a large supporting window. If I(p,q) denotes the high-activity area (1.0 ≤ I(p,q) ≤ 2.5), then the L-filtering SAHVF output y(p,q) is obtained via a 3 × 3 window-based structure in Eq. (56) with the coefficients wc = μc / Σ_{c=1}^{τ} μc, for μc = (d(τ) − d(c))/(d(τ) − d(1)) and c = 1, 2, . . . , τ. The parameter τ is determined using the central pixel peer group concept as

τ = arg min_α { [Σ_{c=1}^{α} (d(c) − m1)² + Σ_{c=α+1}^{|ζ|−1} (d(c) − m2)²] / (m1 − m2)² },
with m1 = (1/α) Σ_{c=1}^{α} d(c) and m2 = (1/(|ζ| − 1 − α)) Σ_{c=α+1}^{|ζ|−1} d(c),  (108)

where α = 1, 2, . . . , |ζ| − 1. The quantities {d(c); c = 1, 2, . . . , |ζ| − 1} ⊂ {d(x(p,q), x(i,j)); (i, j) ≠ (p, q), (i, j) ∈ ζ} are the ordered distances between the central vector x(p,q) and its neighbors x(i,j) inside the processing window Ψ(p,q). The values of d(x(p,q), x(i,j)) are calculated using the combined distance measure from Eq. (29) as follows (Plataniotis et al., 1996; Plataniotis and Venetsanopoulos, 2000):

d(x(p,q), x(i,j)) = 1 − [x(p,q) · xT(i,j) / (‖x(p,q)‖ ‖x(i,j)‖)] [1 − |‖x(p,q)‖ − ‖x(i,j)‖| / max{‖x(p,q)‖, ‖x(i,j)‖}].  (109)
In the case of a medium-activity area (for 2.5 < I(p,q) ≤ 5.5), the recommended window size is 3 × 3 or 5 × 5 pixels, and the SAHVF output y(p,q) is obtained using the data-adaptive concept in Eq. (73) with the weights w(i,j) in Eq. (74) defined using

μ(i,j) = [(D(|ζ|) − D(i,j)) + γ(D(i,j) − D(1))] / [(1 + γ)(D(|ζ|) − D(1))]  (110)
where γ is an empirical parameter used to control the nonlinearity of the weighting function. The values D(|ζ|) ∈ {D(i,j); (i, j) ∈ ζ} and D(1) ∈ {D(i,j); (i, j) ∈ ζ} denote the maximum and the minimum, respectively, of the aggregated distances in Eq. (20), which are calculated using the combined distance in Eq. (109). The extremes D(|ζ|) and D(1) can be used (Plataniotis and Venetsanopoulos, 2000) to define γ as γ = (D(|ζ|) − D(1))⁻¹. Finally, if the input vector x(p,q) corresponds to the 5.5 < I(p,q) ≤ 16.0 range, then the SAHVF output y(p,q) is determined using the data-adaptive concept in Eq. (73) with the weights w(i,j) equivalent to the normalized structure activity values w(i,j) = I(i,j) / Σ_{(g,h)∈ζ} I(g,h), for (i, j) ∈ ζ. Depending on the value of I(p,q), the supporting window can be chosen from 7 × 7 to 11 × 11 pixels in size.

B. Performance Evaluation of the Noise Reduction Filters

In many application areas, such as multimedia, visual communications, production of motion pictures, the printing industry, and graphic arts, greater emphasis is given to perceptual image quality. Consequently, the perceptual closeness of the filtered image to the uncorrupted original image is ultimately the best measure of the efficiency of any color image filtering method (see the images shown in Figures 5, 12, 13, and 15). There are basically two major approaches1 used for assessing the perceptual error between two color images, namely, the objective evaluation approach and the subjective evaluation approach.

1. Objective Evaluation

Following conventional practice, the difference between the original and noisy images, as well as the difference between the original and filtered images, is often evaluated using commonly employed objective measures (Lukac et al., 2004c), such as the mean absolute error (MAE) and the MSE, corresponding to signal-detail preservation and noise suppression, respectively.
Since the RGB space is the most popular color space conventionally used to store, process, display, and analyze color images, both the MAE and the MSE are expressed in the RGB color space as follows:

MAE = (1/(3K1K2)) Σ_{k=1}^{3} Σ_{p=1}^{K1} Σ_{q=1}^{K2} |o(p,q)k − y(p,q)k|  (111)
1 Comprehensive comparisons of various noise removal filters can be found in Lukac (2004a), Lukac et al. (2004c, 2005f), Plataniotis et al. (1999), Plataniotis and Venetsanopoulos (2000), and Smolka et al. (2004).
F IGURE 15. Filtering of mixed noise in Figure 5d: (a) MF output, (b) VMF output, (c) data-adaptive filter output, (d) DPAL filter output.
MSE = (1/(3K1K2)) Σ_{k=1}^{3} Σ_{p=1}^{K1} Σ_{q=1}^{K2} (o(p,q)k − y(p,q)k)²  (112)
where o(p,q) = [o(p,q)1, o(p,q)2, o(p,q)3] is the original RGB pixel and y(p,q) = [y(p,q)1, y(p,q)2, y(p,q)3] is the processed pixel, with (p, q) denoting a spatial position in a K1 × K2 color image and k characterizing the color channel. However, the above criteria do not measure the perceptual closeness between the two images because RGB is not a perceptually uniform color space. Therefore, an additional criterion expressed in the perceptually uniform CIE Lab or CIE Luv color space should be used in conjunction with the MAE
and MSE. The normalized color difference (NCD) criterion (Plataniotis et al., 1999) is defined in the CIE Luv color space, and its usage is essential in determining the color information preservation. Using the NCD criterion, the perceptual similarity between the original and the processed image is quantified as follows:

NCD = [Σ_{p=1}^{K1} Σ_{q=1}^{K2} √(Σ_{k=1}^{3} (ō(p,q)k − ȳ(p,q)k)²)] / [Σ_{p=1}^{K1} Σ_{q=1}^{K2} √(Σ_{k=1}^{3} (ō(p,q)k)²)]  (113)
where ō(p,q) = [ō(p,q)1, ō(p,q)2, ō(p,q)3] and ȳ(p,q) = [ȳ(p,q)1, ȳ(p,q)2, ȳ(p,q)3] are the vectors representing the RGB vectors o(p,q) and y(p,q), respectively, in the CIE Luv color space. Since the NCD, as well as the MAE and the MSE, evaluates the difference between two images, small error values denote enhanced performance. A properly designed color image filter should yield consistently good results with respect to all of the above measures. Alternative criteria to the MSE are the signal-to-noise ratio (SNR) and the peak signal-to-noise ratio (PSNR), defined as

SNR = 10 log10 [Σ_{p=1}^{K1} Σ_{q=1}^{K2} Σ_{k=1}^{3} (o(p,q)k)² / Σ_{p=1}^{K1} Σ_{q=1}^{K2} Σ_{k=1}^{3} (o(p,q)k − y(p,q)k)²]  (114)

PSNR = 10 log10 (255²/MSE)  (115)

whereas the NCD criterion can be replaced with the Lab criterion (Vrhel et al., 2005):

Lab = √[(1/(K1K2)) Σ_{p=1}^{K1} Σ_{q=1}^{K2} Σ_{k=1}^{3} (ō(p,q)k − ȳ(p,q)k)²]  (116)
where ō(p,q)k is the kth component of the CIE Lab vector ō(p,q) corresponding to the original RGB vector o(p,q). Similarly, ȳ(p,q)k is the kth component of the CIE Lab vector ȳ(p,q) corresponding to the RGB vector y(p,q) in the filtered image y.

2. Subjective Evaluation

Since most enhanced images are intended for human inspection, a subjective image quality evaluation approach is widely used. Subjective evaluation is also required in practical applications where the original, uncorrupted images are unavailable. In this case, standard objective measures (MAE, MSE, NCD) of performance evaluation, which are based on the difference in the statistical
TABLE 1
SUBJECTIVE IMAGE EVALUATION GUIDELINES

Score | Overall evaluation of the distortion | Noise removal evaluation
------|--------------------------------------|-------------------------
1     | Very disruptive                      | Poor
2     | Disruptive                           | Fair
3     | Destructive but not disruptive       | Good
4     | Perceivable but not destructive      | Very good
5     | Imperceivable                        | Excellent
distributions of the pixel values, cannot be utilized (Lukac et al., 2005f; Plataniotis and Venetsanopoulos, 2000). Using the subjective evaluation approach, the image quality is evaluated with respect to (Lukac et al., 2005f) (1) image detail preservation (DP), (2) the presence of residual noise, and (3) the introduction of color artifacts as a result of faulty, or excessive, processing. The choice of these criteria follows the well-known fact that the human visual system is sensitive to changes in color appearance. Furthermore, a good restoration method should maintain the edge information while it removes image noise. Edges are important features since they indicate the presence and the shape of various objects in the image. As shown in Table 1, performance (or lack of it) should be ranked subjectively in five categories (Plataniotis and Venetsanopoulos, 2000). In the subjective evaluation procedure (Lukac et al., 2005f), the methods under consideration are usually applied to the reference images and compared according to the criteria listed in Table 1. Input (reference) images and filtered outputs should be viewed simultaneously under identical viewing conditions, either on a panel or on a screen, by a set of observers. Although specific criteria can be applied in choosing the observers, to simulate a realistic situation where viewing of images is done by ordinary citizens and not image processing experts, the observers should be unaware of the specifics of the experiments (Lukac et al., 2005f). The subjective evaluation experiment should be performed in a controlled room with gray painted walls, where external light has no influence on image perception. If a screen is used to display the images during evaluation tests, it should be a calibrated, high-quality display device with controlled illumination. Pixel-resize zooming functionality over the images can be allowed to highlight the image details.
Finally, it should be mentioned that the images can be presented either in a specific or a random order.

C. Inpainting Techniques

Although many noise reduction filters can excellently enhance color images corrupted by point-wise acquisition and/or transmission noise, there are
situations when the damaged image areas are larger than the supporting window. Such a problem occurs in video data archiving, transmission over best-effort networks or wireless communication channels, and aggressive coding, where visual impairments in visual data are often observed (Criminisi et al., 2004; Park et al., 2005; Rane et al., 2003). For example, missing blocks are introduced by packet loss during wireless transmission. Archived photographs, films, and videos are exposed to chemical and physical elements as well as environmental conditions, which cause visual information loss and artifacts (e.g., cracks, scratches, and dirt) in the corresponding digital representation. Finally, undesired image objects such as logos, stamped dates, text, and persons can also be considered as damaged areas to be reconstructed using digital image processing. To restore the damaged image areas, digital inpainting techniques designed either for image or video restoration should be used. Image inpainting (Rane et al., 2003) refers to the process of filling in missing data in a designated region of the visual input by means of image interpolation. The objective of the process is to reconstruct missing or damaged parts of images in such a way that the inpainted region cannot be detected by a casual observer (Figure 16). To recover the color, structural, and textural content in a large damaged area, output pixels are calculated using the available data from the surrounding undamaged areas (Rane et al., 2003). The required input can be automatically determined by the inpainting technique or supplied by the user. Since different inpainting techniques focus on pure texture or pure structure restoration, both the quality and the cost of the inpainting process differ significantly. Boundaries between image regions constitute structural (edge) information, which is a complex, nonlinear phenomenon produced by blending together different textures.
It is therefore not surprising that state-of-the-art inpainting methods attempt to simultaneously perform texture and structure filling in (Rane et al., 2003).

D. Image Sharpening Techniques

Apart from the noise reduction techniques, image sharpening or high-pass filtering is often required to enhance the appearance of the edges and fine image details (Figure 17) (Hardie and Boncelet, 1995; Konstantinides et al., 1999; Tang et al., 1994). Images are usually blurred by image processing operations, such as low-pass filtering and compression, resulting in various edge artifacts. Since many practical applications suffer from noise, image sharpeners should be insensitive to both noise and compression artifacts (Fischer et al., 2002; Tang et al., 1994). It has been widely observed that linear sharpeners such as unsharp masking are inefficient and introduce new artifacts into the image (Fischer et al., 2002;
F IGURE 16. Color image inpainting: (a, c) damaged images, and (b, d) the corresponding images reconstructed using image inpainting.
Hardie and Boncelet, 1993). Therefore, nonlinear sharpening methods based on robust order statistics are used instead. Operating in the component-wise manner, the comparison and selection (CS) filter enhances the image by replacing the input components x(p,q)k with their enhanced values y(p,q)k as follows (Lee and Fam, 1987):

y(p,q)k = { x(τ)k if x̄k ≥ x((|ζ|+1)/2)k;  x(|ζ|−τ+1)k otherwise }  (117)

where x̄k = mean{x(i,j)k; (i, j) ∈ ζ} is the component-wise mean of Ψ(p,q)k, x((|ζ|+1)/2)k obtained in Eq. (19) is the component-wise median of Ψ(p,q)k, x(τ)k and x(|ζ|−τ+1)k are the component-wise order statistics defined in Eq. (19), and τ, for τ = 1, 2, . . . , (|ζ| + 1)/2, is the parameter used to control the level of enhancement. The smaller the value of τ, the more significant the sharpening effect.

F IGURE 17. Image sharpening demonstrated on (a) a 256 × 256 color image in Figure 5a blurred by a 3 × 3 mean filter. Images (b)–(d) were obtained by sharpening the blurred image in (a) using the (b) CS filter, (c) LUM sharpener, and (d) WM sharpener.

Although the CS filter offers good performance, it often smoothes fine details (Hardie and Boncelet, 1993). This is not the case when lower-upper-middle (LUM) sharpeners (Hardie and Boncelet, 1993, 1995) are used. The output of the LUM sharpener is obtained by comparing the window center (middle sample) x(p,q)k with the lower x(τ)k and the upper x(|ζ|−τ+1)k component-wise order statistics in Eq. (19) as follows:

y(p,q)k = { x(τ)k if x(τ)k < x(p,q)k ≤ x̄kτ;  x(|ζ|−τ+1)k if x̄kτ < x(p,q)k < x(|ζ|−τ+1)k;  x(p,q)k otherwise }  (118)

where x̄kτ = (x(τ)k + x(|ζ|−τ+1)k)/2, and τ, for τ = 1, 2, . . . , (|ζ| + 1)/2, is the parameter used to control the enhancement process. The level of enhancement varies from the maximum amount of sharpening (for τ = 1) to no sharpening [the identity operation obtained for τ = (|ζ| + 1)/2]. If x(τ)k < x(p,q)k < x(|ζ|−τ+1)k, then the input central sample x(p,q)k represents an edge transition. By shifting x(p,q)k to the extreme order statistics x(τ)k and x(|ζ|−τ+1)k in Eq. (118), the transition point is removed, resulting in a steeper edge.

More sophisticated sharpeners can be obtained using the WM framework, which admits negative weights (Arce, 1998). However, such an approach requires complex optimization procedures to obtain the desired performance. To avoid this drawback, computationally efficient approaches combine linear sharpening operators and robust order statistics (Fischer et al., 2002). By adding to the input image x a high-pass filtered version of x, the sharpening filter can be obtained. Following the derivation in Fischer et al. (2002), the Laplacian-based high-pass permutation WM (PWM) filter can be used to enhance the image x as follows:

y(p,q)k = x(p,q)k + { (x(τ+1)k − x(1)k)/2 if r(p,q)k ≤ τ;  (x(p,q)k − x(1)k)/2 if τ < r(p,q)k ≤ |ζ| − τ;  (x(|ζ|−τ)k − x(1)k)/2 otherwise }  (119)

where the addition of the kth component x(p,q)k of the central vector x(p,q) ∈ Ψ(p,q) normalizes the output of the high-pass filter to the desired intensity range. Another approach follows the unsharp masking concept (Polesel et al., 2000) and expresses the sharpening filter as y(p,q) = x(p,q) + λf(Ψ(p,q)). The scaling parameter λ is used to tune the amount of the sharpening operation, and f(·) is a high-pass filter defined over the input set Ψ(p,q), used to extract the high-frequency component in the image.
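The LUM rule of Eq. (118) can be sketched component-wise; the test window, the odd window size, and the function names are assumptions of this illustration:

```python
import numpy as np

def lum_sharpen(window_k, tau):
    """LUM sharpener (Eq. 118) for one color component: push the window
    center toward the lower or upper order statistic, depending on which
    side of their midpoint x̄kτ it falls; otherwise leave it unchanged."""
    flat = np.sort(window_k.ravel())
    center = window_k[window_k.shape[0] // 2, window_k.shape[1] // 2]
    lower, upper = flat[tau - 1], flat[len(flat) - tau]  # x_(tau), x_(|ζ|-tau+1)
    mid = (lower + upper) / 2.0
    if lower < center <= mid:
        return lower
    if mid < center < upper:
        return upper
    return center

ramp = np.array([[10, 10, 10], [10, 120, 200], [200, 200, 200]], dtype=float)
print(lum_sharpen(ramp, tau=1))   # 200.0: the edge transition is steepened
print(lum_sharpen(ramp, tau=5))   # 120.0: identity for tau = (|ζ|+1)/2
```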
By employing the WM high-pass filter in the processing pipeline, the component-wise sharpening procedure is performed as follows (Fischer et al., 2002):

y(p,q)k = (1 + λ)x(p,q)k − λx̄(p,q)k     (120)

where

x̄(p,q)k = (x(1)k + x(|ζ|)k)/2     (121)

denotes the mid-range of the component-wise input set Ψ(p,q)k.
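In code, the mid-range-based sharpener of Eqs. (120)–(121) for one channel might look like the following sketch; the clipping to [0, 255] and the untouched one-pixel border are added assumptions, not part of the original formulation.

```python
import numpy as np

def wm_unsharp(channel, lam=1.0):
    """Sharpen a 2D channel via y = (1 + lam)*x - lam*midrange, Eqs. (120)-(121),
    with the mid-range taken over each 3x3 neighborhood."""
    x = channel.astype(float)
    H, W = x.shape
    y = x.copy()
    for p in range(1, H - 1):
        for q in range(1, W - 1):
            win = x[p-1:p+2, q-1:q+2]
            midrange = (win.min() + win.max()) / 2.0        # Eq. (121)
            y[p, q] = (1 + lam) * x[p, q] - lam * midrange  # Eq. (120)
    return np.clip(y, 0, 255)  # keep a displayable range (assumption)

# A step edge gets steepened: values next to the transition are pushed apart.
step = np.tile([0, 0, 0, 100, 100, 100], (6, 1)).astype(float)
print(wm_unsharp(step, lam=1.0))
```

With λ = 0 the operator reduces to the identity, mirroring the role of the scaling parameter λ in the unsharp masking formulation above.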
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
The concept can be extended to derive the PWM sharpener (Fischer et al., 2002):

y(p,q)k = x(p,q)k + 0.5λx(τ+1)k − 0.5λx̄(p,q)k        if r(p,q)k ≤ τ
        = (1 + 0.5λ)x(p,q)k − 0.5λx̄(p,q)k            if τ < r(p,q)k ≤ |ζ| − τ
        = x(p,q)k + 0.5λx(|ζ|−τ)k − 0.5λx̄(p,q)k      otherwise                  (122)

where x̄(p,q)k is the mid-range obtained by Eq. (121). The sharpener obtained can be further modified as follows (Fischer et al., 2002):

y(p,q)k = yPWM(p,q)k      if x̂(p,q)k > ξ
        = yNSF(p,q)k      otherwise                   (123)

where x̂(p,q)k = x(|ζ|)k − x(1)k and the threshold ξ form the switching function. A large value of x̂(p,q)k indicates the presence of an edge, and in this case the sharpening filter yPWM(p,q)k should be used. Small values indicate no presence of an edge, allowing for the utilization of a different processing type, for example, the smoothing filter yNSF(p,q)k, in flat image areas. Note that the accuracy of the enhancement process depends highly on the value of ξ.

E. Image Zooming Techniques

Another application of multichannel image filters is image zooming, or spatial interpolation, of a digital color image, which is the process of increasing the number of pixels representing the natural scene (Figure 18) (Lukac et al., 2005a, 2005d). It is frequently used in high-resolution display devices (Herodotou and Venetsanopoulos, 1995) and consumer-grade digital cameras (Lukac et al., 2004a, 2005e). Spatial interpolation preserves the spectral representation of the input. Operating in the spatial domain of a digital image, spatial interpolation transforms a color image into an enlarged color image. It is well known that a typical natural image exhibits significant spectral correlation among its RGB color planes. Therefore, scalar techniques operating separately on the individual color channels are insufficient and produce various spectral artifacts and color shifts (Lukac et al., 2005d, 2005e). Moreover, many conventional methods, such as bilinear interpolation and spline-based techniques, often cause excessive blurring or geometric artifacts (Herodotou and Venetsanopoulos, 1995; Lukac et al., 2005e). Therefore, the development of more sophisticated, vector processing-based, nonlinear approaches is of paramount importance.

Zooming a K1 × K2 color image x with pixels x(p,q) by a factor of z results in a zK1 × zK2 zoomed color image y. The zooming factor z ∈ Z can be an arbitrary positive integer; however, the value z = 2 is selected
LUKAC AND PLATANIOTIS
FIGURE 18. Spatial interpolation of a 256 × 256 color image shown in Figure 5a with a zooming factor of z = 2: (a) MF output, (b) VMF output, (c) BVDF output, (d) data-adaptive filter output.
here to facilitate the discussion. Assuming the aforementioned setting, the zooming procedure maps the original color vectors x(p,q), with spatial coordinates p and q, into the enlarged image y as y(2p−1,2q−1) = x(p,q), where the remaining pixels, such as y(2p,2q), constitute the new rows and columns (e.g., of zeros) added to the original data (Lukac et al., 2005e). Using a 3 × 3 processing window Ψ(p,q), for p = 1, 2, …, 2K1 and q = 1, 2, …, 2K2, that slides over the up-sampled image y to calculate each new image pixel individually, three pixel configurations are obtained when the window is centered on an empty pixel position. In these configurations, the available pixels are described
using ζ = {(p, q − 1), (p, q + 1)}, ζ = {(p − 1, q), (p + 1, q)}, and ζ = {(p − 1, q − 1), (p − 1, q + 1), (p + 1, q − 1), (p + 1, q + 1)}. Since the first two configurations provide an insufficient number of original pixels for the estimation of the unknown vector y(p,q) at the center of Ψ(p,q), a two-iteration interpolation procedure is employed. In the first interpolation step, the unknown vector y(p,q) is estimated using the filtering function f(·) defined over the pixels y(i,j), for (i, j) ∈ ζ = {(p − 1, q − 1), (p − 1, q + 1), (p + 1, q − 1), (p + 1, q + 1)}. When this processing step is completed over all such locations, the second interpolation step is performed on all remaining positions with unknown pixels y(p,q) located in the center of Ψ(p,q), using an operator f(·) defined over the vectors y(i,j), for (i, j) ∈ ζ = {(p − 1, q), (p, q − 1), (p, q + 1), (p + 1, q)}, constituted of the two original and the two previously estimated pixels. This processing step completes the spatial interpolation process, resulting in the fully populated, enlarged color image y. Finally, it should be mentioned that the number of interpolation steps in the spatial interpolation process increases with the value of the zooming factor z.

F. Applications

1. Virtual Restoration of Artworks

Virtual restoration of artworks is an emerging digital image processing application (Barni et al., 2000; Li et al., 2000; Lukac et al., 2005f). Since original materials, such as murals, canvas, vellum, photographs, and paper media, are invariably exposed to various aggressive environmental factors that lead to the deterioration of the perceived image quality, digital image processing solutions are used to restore, interpret, and preserve collections of visual cultural heritage in digital form. Environmental conditions include sunshine, oxidation, temperature variations, humidity, and the presence of bacteria.
These undesirable effects result in significant variation in the color characteristics and pigmentation of the artwork, preventing proper recognition, classification, and dissemination of the corresponding digitized artwork images. It was argued in Lukac et al. (2005f) that modern color image filters can be used as a preprocessing tool to eliminate noise introduced during the digital acquisition of the original visual artworks. The most common sources of noise and visual impairments are the acquisition device limitations, the granulation of the artwork’s surfaces, as well as the encrusting and accumulation of dirt on protecting surfaces. Thus, color image enhancement methods utilized in virtual restoration of artwork should eliminate noise and impairments present
FIGURE 19. Artwork image enhancement: (a–c) artwork images, and (d–f) the corresponding enhanced images.
in the corresponding digital data, while at the same time preserving the original colors (pigment) and the fine details of the artwork (Figure 19). Apart from denoising, one of the most critical issues in digitized artwork restoration is the task of crack removal and faded color enhancement. Cracks are breaks in the medium, paint, or varnish of the original artwork, usually caused by aging, drying, or mechanical factors (Barni et al., 2000; Giakumis et al., 2006). With various degrees of user interaction, cracks are first localized by a sophisticated detection process. Subsequently, the damaged area is restored using image inpainting, which fills the corresponding spatial locations with interpolated image values (Figure 19c, f). Similar to crack removal, a region with faded colors and obscure shadows is first localized (Li et al., 2000). Then the user, by selecting target colors from a color template and applying an inpainting method, fills in the detected gaps and restores both intensity and color information.
FIGURE 20. Television image reconstruction: (a–c) television images, and (d–f) the corresponding reconstructed images.
2. Television Image Enhancement

Television image enhancement represents a typical application where filtering is used to remove strong transmission noise and other visual impairments (Hamid et al., 2003; Lukac et al., 2005a; Rantanen et al., 1992). In most cases, television signals transmitted over the air are heavily corrupted by impulsive or mixed noise due to atmospheric conditions. In this case, the noise can be removed (Figure 20) using robust estimators such as the techniques described in Section IV.A. In addition to transmission noise, the received images often contain large damaged areas, mostly present in the form of corrupted image rows or noise-like diagonal lines. Since the dimension of the damaged areas usually exceeds the size of the support ζ used in traditional filtering solutions, image quality is enhanced using image inpainting (Figure 20c, f) rather than filtering. Apart from processing still television images, motion video enhancement is often required in restoring archived films and videos (Kokaram et al., 1995). Motion video can be viewed as a 3D image signal or a time sequence of two-dimensional image frames (Arce, 1991; Lukac and Marchevsky, 2001b).
Such a visual input exhibits significant spatial and temporal correlation. Temporal restoration of motion video without spatial processing blurs the structural information in the reconstructed video, while ignoring the temporal correlation and processing each video frame as a still image produces strong motion artifacts; the development of spatiotemporal restoration techniques is therefore of paramount importance (Lukac et al., 2004e). The spatial position and intensity of visual impairments vary significantly in digital motion video. Examples include missing data patches caused by macroblocks dropped during transmission over best-effort networks, as well as speckle noise, random data patches, and sparkles caused by dirt, dust, and scratches in the original medium. Such impairments are therefore localized as temporal discontinuities (Kokaram et al., 1995). Through the employed motion-compensation algorithms, a discontinuity is viewed as a spatial area in the actual frame that cannot be matched to a similar area in reference frames. After localizing the artifacts in the target frame, image inpainting, implemented either as a spatial or a spatiotemporal solution, can be used to fill in the missing information.
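As a rough single-pixel illustration of the temporal-discontinuity idea, a plain frame-differencing stand-in (not the motion-compensated detector of Kokaram et al., which is considerably more involved; the threshold value here is an arbitrary assumption) might be written as:

```python
import numpy as np

def temporal_discontinuity(prev_frame, frame, next_frame, thresh=40.0):
    """Flag pixels whose value differs strongly from BOTH neighboring
    frames -- a crude stand-in for motion-compensated discontinuity
    detection of sparkles, dirt, and dropped-block artifacts."""
    d_prev = np.abs(frame - prev_frame)
    d_next = np.abs(frame - next_frame)
    return (d_prev > thresh) & (d_next > thresh)

prev_f = np.full((4, 4), 100.0)
next_f = np.full((4, 4), 100.0)
cur = np.full((4, 4), 100.0)
cur[1, 2] = 255.0            # a "sparkle" present in only one frame
print(temporal_discontinuity(prev_f, cur, next_f).sum())   # -> 1
```

In a real system the comparison would be made against motion-compensated reference frames, so that genuine object motion is not flagged as damage.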
V. EDGE DETECTION

Edges convey essential information about a scene. For gray-scale images, edges are commonly defined as physical, photometric, and geometric discontinuities of the image function; that is, edges can be viewed as the boundaries between distinct image regions that differ in intensity (Gonzalez and Woods, 1992; Plataniotis and Venetsanopoulos, 2000). In the case of color images, represented in a 3D color space, edges may be defined as discontinuities in the vector field representing the color image. In this way, edges separate image regions of different color or intensity. Determination of object boundaries is important in many areas, such as visual communication, medical imaging, dactyloscopy, quality control, photogrammetry, and intelligent robotic systems. Thus, edge detection—a process of transforming an input digital image into an edge map that can be viewed as a line-drawing image with a spatial resolution identical to that of the input—is a common component in image processing systems (Lukac et al., 2005a). It has been widely observed that color images carry much more information than monochrome images (Plataniotis and Venetsanopoulos, 2000). For example, a monochrome image may not contain enough information when two adjacent objects of quite different color have the same brightness and are therefore merged in gray-scale imaging. In this case, the utilization of the full information contained in the color channels of the
FIGURE 21. Scalar edge detection based on color to gray-scale image conversion: (a) color to gray-scale image conversion, (b) scalar edge detection.
input image enables better detection of color edges compared to the use of monochrome (gray-scale) edge techniques, which operate separately on the individual color channels (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000). Moreover, multichannel images such as color RGB images carry additional information in their various spectral channels. In this case, the boundary between two surfaces with different properties can be determined in more than one way (Lukac and Plataniotis, 2005a; Plataniotis and Venetsanopoulos, 2000). The development of an efficient edge detector, which properly detects object edges, is a rather demanding task. The most popular edge operators generate edge maps by processing information contained in a local image neighborhood, as determined by an element of support (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000). These operators (1) do not use any prior information about the image structure, (2) are image content agnostic, and (3) are localized, in the sense that the detector output is solely determined by the features obtained through the element of support. Such color edge detectors can be classified as scalar and vector techniques.

A. Scalar Operators

Since color images are arrays of three-component color vectors, the use of scalar edge detectors requires the conversion of the color image to its luminance-based (monochrome) equivalent (Lukac et al., 2003b; Lukac and Plataniotis, 2005a). Assuming the conventional RGB representation, the conversion to a luminance-based image can be performed via Eq. (1) or (2), with (p, q) denoting the spatial location in the image. Subsequently, a scalar edge detector is applied to the luminance image (Figure 21):

m(p,q) = f(L(i,j)), for (i, j) ∈ ζ     (124)
where f (·) denotes the edge operator defined over the local luminance quantities L(i,j ) , with (i, j ) ∈ ζ denoting the area of support (e.g., a 3 × 3 filtering window Ψ(p,q) ). Alternatively, the edge map of the color image can be achieved using component-wise processing (Figure 22) (Lukac et al., 2003b; Plataniotis and
FIGURE 22. Component-wise edge-detection concept: (a) color channel decomposition, (b) scalar edge detection in the separated color channels, (c) combination of the separate edge maps to form the output edge map.
Venetsanopoulos, 2000). In this way, each of the three color channels is processed separately. The operator then combines the three distinct edge maps to form the output map as follows (Lukac and Plataniotis, 2005a):

m(p,q) = max(m(p,q)1, m(p,q)2, m(p,q)3)     (125)

where

m(p,q)k = f(x(i,j)k), for (i, j) ∈ ζ     (126)

denotes the edge map obtained by applying the scalar edge detector to the R (k = 1), G (k = 2), or B (k = 3) channel of the color image, respectively. The output edge description corresponds to the dominant indicator of edge activity among the different color bands.

The edge operator's output [Eq. (124) or (125)] is compared with a predefined threshold to obtain the edge map; in other words, the purpose of the thresholding operation is to determine whether a given pixel belongs to an edge. The resulting edge map E with pixels E(p,q) is determined as follows (Lukac et al., 2003b; Lukac and Plataniotis, 2005a):

E(p,q) = m(p,q)   if m(p,q) ≥ ξ
       = 0        otherwise                  (127)

where ξ is a nonnegative threshold value. Edge operators are known to be sensitive to noise and small variations in intensity, phenomena often encountered in localized image processing areas (Lukac and Plataniotis, 2005a), so the edge map usually contains noise. With an appropriate setting of ξ, the thresholding operation [Eq. (127)] can extract the structural information that corresponds to the edge discontinuities (Figure 23). Note that an overly large value of ξ results in the exclusion of edge information, whereas
FIGURE 23. Scalar edge detection: (a) 512 × 512 color image butterfly, and (b)–(d) the corresponding edge maps obtained by the (b) Sobel detector, (c) Canny detector, and (d) LwG detector.
the use of a too small value of ξ usually produces edge maps that include noise-like pixels and redundant details.

In practice, the most popular edge detectors are approximated through the use of convolution masks. Assuming a 3 × 3 supporting window Ψ(p,q), where each spatial location (i, j), for (i, j) ∈ ζ, is associated with the mask coefficient w(i,j), the edge map pixel m(p,q) is obtained as follows (Gonzalez and Woods, 1992):

m(p,q) = w ∗ U(p, q) = Σ(i,j)∈ζ w(i,j) u(i,j)     (128)
with ζ denoting the spatial locations within the area of support. The quantities u(i,j) ∈ U(p, q) and w(i,j) ∈ w denote the image inputs and the mask coefficients, respectively. The set U(p, q) denotes the signal values used as the input to an edge operator, that is, U(p, q) = {L(i,j); (i, j) ∈ ζ} for Eq. (124) and U(p, q) = {x(i,j)k; (i, j) ∈ ζ} for Eq. (125). The set w = {w(i,j); (i, j) ∈ ζ} forms the so-called set of mask coefficients. Both component-wise and dimensionality reduction-based edge detection operators can be further grouped into two main classes (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000):

• gradient methods, which use the first-order directional derivatives of the image to determine the edge contrast used in edge map formation, and
• zero-crossing-based methods, which use the second-order directional derivatives to identify locations with zero crossings.

1. Gradient Operators

The gradient methods use the so-called gradient (Gonzalez and Woods, 1992):

∇U(p, q) = (∂U(p, q)/∂p, ∂U(p, q)/∂q)     (129)

of the function U(p, q). Based on the definition of an edge as an abrupt change in image intensity, with U(p, q) denoting the luminance values of Eq. (124) or the channel intensity values of Eq. (125), derivative operators are used for the detection of intensity discontinuities. The first derivative provides information on the rate of change of the image intensity; using this information, it is possible to localize points where large changes of intensity occur. Of particular interest are the gradient magnitude

‖∇U(p, q)‖ = [(∂U(p, q)/∂p)² + (∂U(p, q)/∂q)²]^(1/2)     (130)

denoting the rate of change of the image intensity, and the gradient direction

θ = arctan[(∂U(p, q)/∂p) / (∂U(p, q)/∂q)]     (131)
denoting the orientation of an edge. The mask w of a particular operator in Eq. (128) should be considered a digital approximation of the gradient in a given direction. Typically, two masks are defined enabling the determination of the gradient magnitude in two orthogonal directions. For the most commonly used scalar gradient operators such as the Prewitt, Sobel, isotropic, and Canny operator, the convolution masks are defined as follows (Gonzalez and Woods, 1992; Lukac et al., 2003b):
• Prewitt operator:

w = | −1 0 1 |   | −1 −1 −1 |
    | −1 0 1 | , |  0  0  0 |          (132)
    | −1 0 1 |   |  1  1  1 |

or

w = |  0  1 1 |   | −1 −1 0 |
    | −1  0 1 | , | −1  0 1 |          (133)
    | −1 −1 0 |   |  0  1 1 |

• Sobel operator:

w = | −1 0 1 |   | −1 −2 −1 |
    | −2 0 2 | , |  0  0  0 |          (134)
    | −1 0 1 |   |  1  2  1 |

or

w = |  0  1 2 |   | −2 −1 0 |
    | −1  0 1 | , | −1  0 1 |          (135)
    | −2 −1 0 |   |  0  1 2 |

• Canny operator:

w = | 0  1 0 |   | 0 0  1 |
    | 0 −1 0 | , | 0 0 −1 |          (136)
    | 0  0 0 |   | 0 0  0 |
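Combining the Sobel masks of Eq. (134) with the convolution of Eq. (128), the magnitude of Eq. (130), and the thresholding of Eq. (127), a minimal scalar edge map can be sketched as follows (an illustrative sketch; border pixels are simply left as non-edges):

```python
import numpy as np

# Sobel masks of Eq. (134): derivatives along p (rows) and q (columns).
SOBEL_P = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)
SOBEL_Q = SOBEL_P.T

def sobel_edge_map(img, xi):
    """Gradient-magnitude edge map [Eq. (130)] computed via the mask
    convolution of Eq. (128) and thresholded as in Eq. (127)."""
    H, W = img.shape
    m = np.zeros((H, W))
    for p in range(1, H - 1):
        for q in range(1, W - 1):
            win = img[p-1:p+2, q-1:q+2]
            gp = np.sum(SOBEL_P * win)      # Eq. (128), direction p
            gq = np.sum(SOBEL_Q * win)      # Eq. (128), direction q
            m[p, q] = np.hypot(gp, gq)      # Eq. (130)
    return np.where(m >= xi, m, 0.0)        # Eq. (127)

img = np.zeros((5, 6))
img[:, 3:] = 100.0                # vertical step edge between columns 2 and 3
e = sobel_edge_map(img, xi=50.0)
print(e[2, 2])                    # -> 400.0 (strong response on the edge)
```

Flat regions give a zero response, so the threshold ξ only has to separate genuine edge responses from noise-induced ones.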
It should be mentioned that the isotropic operator can be obtained by using the value √2 instead of 2 in Eqs. (134)–(135). Other members of the family of gradient edge detectors are the Roberts operator and the Kirsch compass operator (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000). Among gradient operators, the Canny operator (Canny, 1986) is considered by many the most advanced. This operator does not rely solely on intensity variations; it limits the effect of noise through Gaussian presmoothing, improving the quality of the edge maps, and improves the appearance of the edge maps through hysteresis-based thinning of the thresholded edge maps.

2. Zero-Crossing-Based Operators

It is well known that when the first derivative achieves a maximum, the second derivative is zero (Lukac et al., 2003b; Ziou and Tabbone, 1998). Therefore, operators may localize edges by evaluating the zeros of the second derivatives of U(p, q). The most commonly used operator of this type is the Laplacian operator

∆U(p, q) = ∇²U(p, q) = ∂²U(p, q)/∂p² + ∂²U(p, q)/∂q²     (137)
FIGURE 24. Vector edge detection.
which is approximated in practice using the convolution masks

w = | 0  1 0 |        | 1  1 1 |
    | 1 −4 1 | ,  w = | 1 −8 1 |          (138)
    | 0  1 0 |        | 1  1 1 |

defined for a four- and eight-neighborhood, respectively. Other zero-crossing-based methods are the so-called LwG and LoG edge detectors, which combine Laplacian and Gaussian operators, and the so-called DoG operator defined through the difference of Gaussians (Gomes and Velho, 1997; Lukac et al., 2003b). Both gradient and zero-crossing scalar edge operators fail to use the full potential of the spectral image content and, thus, can miss edges in multichannel images.

B. Vector Operators

Psychological research on the characteristics of the human visual system reveals that color plays a significant role in the perception of edges or boundaries between two surfaces. Since the ability to distinguish between different objects is crucial for applications such as object recognition, image segmentation, image coding, and robot vision, the additional boundary information provided by color is of paramount importance (Plataniotis and Venetsanopoulos, 2000; Scharcanski and Venetsanopoulos, 1997). Given the major performance issues in color edge detection, such as the ability to extract edges accurately, robustness to noise, and computational efficiency, the most popular color edge detectors are vector edge detectors (Figure 24) based on vector order statistics (Lukac et al., 2003b, 2005a; Plataniotis and Venetsanopoulos, 2000). Edge detectors based on order statistics operate by detecting local minima and maxima in the color image function and combining them in an appropriate way to produce the corresponding edge map (Figure 25). Since there is no unique way to define ranks for multichannel signals, the reduced ordering scheme of Eq. (21) is commonly used to obtain the ranked sequence of the color vectors inside the processing window.
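The reduced (R-) ordering just mentioned can be sketched as follows: each color vector in the window is assigned the aggregated distance to all other vectors, and the vectors are then sorted by that scalar (an illustrative sketch; the function name and toy window are not from the original):

```python
import numpy as np

def reduced_order(vectors):
    """Return the window vectors sorted by aggregated L2 distance
    (R-ordering): x_(1) is the most central vector, x_(N) the most
    outlying one, together with the sorted aggregated distances."""
    v = np.asarray(vectors, dtype=float)
    # D(i) = sum over j of ||x_i - x_j||
    diffs = v[:, None, :] - v[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=2)).sum(axis=1)
    order = np.argsort(D, kind="stable")
    return v[order], D[order]

window = [(10, 10, 10), (12, 10, 9), (11, 11, 10), (200, 0, 0), (10, 12, 11)]
ranked, dists = reduced_order(window)
print(ranked[-1])   # the outlier (200, 0, 0) occupies the highest rank
```

This scalar ranking is exactly what the VR, MVD, and NNVR detectors below consume: the lowest ranks are the most representative vectors, the highest ranks the outliers.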
Based on the two extreme vector order statistics x(1) and x(|ζ|), the vector range (VR) detector is defined as follows (Lukac et al., 2005a; Plataniotis and
FIGURE 25. Vector edge detection using the image shown in Figure 23a: (a) VR detector, (b) MVD detector, (c) NNVR detector, (d) NNMVD detector.
Venetsanopoulos, 2000):

m(p,q) = ‖x(|ζ|) − x(1)‖     (139)
where (p, q) corresponds to the center spatial location of Ψ(p,q). The output of Eq. (139) quantitatively expresses the deviation of the vector outlier occupying the highest rank from the most representative vector in the lowest rank within Ψ(p,q). It is not difficult to see that in a uniform area, where all vectors x(i,j), for (i, j) ∈ ζ, are characterized by a similar magnitude Mx(i,j) and/or direction Ox(i,j), the output of Eq. (139) will be small. However, this is not the case in high-frequency regions, where x(|ζ|) is usually located at one side of an
edge, whereas x(1) is included in the set of vectors occupying spatial positions on the other side of the edge. Thus, the response of Eq. (139) is a large value. Due to the utilization of the distance between the lowest and uppermost ranked vector, the VR operator is rather sensitive to noise. More robust color edge detectors are obtained using linear combinations of the lowest ranked vector samples. This is mainly due to the fact that the lowest ranks are associated with the most similar vectors in the population of the color vectors, and upper ranks usually correspond to the outlying samples. The so-called vector dispersion edge detector (VDED) is obtained as follows (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000):
m(p,q) = ‖ Σr=1…|ζ| wr x(r) ‖     (140)

where ‖·‖ denotes the magnitude of the vector operand and wr is the weight coefficient associated with the ranked vector x(r). Different coefficients wr in the linear combination result in a multitude of edge detectors that vary significantly in terms of performance and/or complexity. For example, the VR operator [Eq. (139)] is a special case of the VDED operator, obtained using w1 = −1, w|ζ| = 1, and wr = 0 for r = 2, 3, …, |ζ| − 1. To design robust edge detectors, the operators should utilize a linear combination of the lowest ranked vectors x(r), for r = 1, 2, …, c and c < |ζ|. This is mainly because (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000) (1) the lowest ranks are associated with the most similar vectors in the population of the vectorial inputs, whereas the upper ranks usually correspond to outlying samples, and (2) the lowest ranked vector is commonly used to attenuate noise in vectorial data sets. Therefore, employing the set of the c lowest-ranked vectors x(1), x(2), …, x(c), for c < |ζ|, makes the output of Eq. (140) immune to noise. The minimum over the magnitudes of these linear combinations defines the output of the so-called minimum vector dispersion (MVD) operator (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000):
m(p,q) = min_j ‖ x(|ζ|−j+1) − (1/c) Σr=1…c x(r) ‖,   for j = 1, 2, …, b and b, c < |ζ|     (141)
where the parameters b and c control the trade-off between complexity and noise attenuation. Such an operator exhibits significant robustness against image noise. Since the response of the MVD operator is much larger at true edges, highly precise edge maps can be obtained through subsequent thresholding [Eq. (127)]. An alternative design of vector edge operators utilizes the adaptive nearest-neighbor filter. The coefficients are chosen to adapt to local image
characteristics. Instead of constants, the coefficients are determined by an adaptive weight function for each window Ψ(p,q) . The so-called nearestneighbor VR (NNVR) operator is defined as the distance between the outlier and the weighted sum of all the ranked vectors (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000):
m(p,q) = ‖ x(|ζ|) − Σr=1…|ζ| wr x(r) ‖.     (142)
The weight function wr is determined adaptively using transformations of a distance criterion at each image location, and it is not uniquely defined. Similar to the adaptive design in Eq. (73), each weight coefficient is nonnegative (wr ≥ 0) and the weight function is normalized (Σr=1…|ζ| wr = 1). The MVD concept can also be incorporated into the NNVR operator to further improve its performance in the presence of impulse noise. The resulting NNMVD operator is defined as follows (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000):
m(p,q) = min_j ‖ x(|ζ|−j+1) − Σr=1…|ζ| wr x(r) ‖,   for j = 1, 2, …, b and b < |ζ|     (143)

where wr denotes the normalized weighting coefficient. A possible weight function to be used in Eqs. (142)–(143) can be defined as follows:

wr = [D(|ζ|) − D(r)] / [|ζ| · D(|ζ|) − Σb=1…|ζ| D(b)]     (144)
where D(r) is the aggregated distance associated with the vector x(r) ∈ Ψ(p,q). In a highly uniform area (no edge), all pixels have the same aggregated distance, the denominator is zero, and the weight function cannot be used; in this case the NNVR and NNMVD outputs should be set to zero.

C. Evaluation Criteria

The performance of edge detectors, in terms of accuracy in edge detection and robustness to noise, is usually evaluated using both quantitative and qualitative measures (Avcibas et al., 2002; Lukac et al., 2003b). The quantitative performance measures can be grouped into two types: (1) probabilistic measures, which are based on the statistics of correct edge detection and false edge rejection, and (2) distance measures, which are based on edge deviation from the true edges.
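Before turning to the evaluation measures, the VR and MVD responses of Eqs. (139) and (141) for a single window can be sketched as follows (an illustrative sketch; the parameter values b and c and the toy windows are arbitrary assumptions):

```python
import numpy as np

def r_order(v):
    """R-ordering of window vectors by aggregated L2 distance."""
    v = np.asarray(v, dtype=float)
    D = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=2).sum(axis=1)
    return v[np.argsort(D, kind="stable")]

def vector_range(window):
    """VR response, Eq. (139): distance between extreme order statistics."""
    s = r_order(window)
    return np.linalg.norm(s[-1] - s[0])

def mvd(window, b=2, c=3):
    """MVD response, Eq. (141): minimum over the b uppermost-ranked
    vectors of their distance to the mean of the c lowest-ranked ones."""
    s = r_order(window)
    mean_low = s[:c].mean(axis=0)
    return min(np.linalg.norm(s[-(j + 1)] - mean_low) for j in range(b))

flat = [(50, 50, 50)] * 8 + [(52, 50, 49)]
edge = [(0, 0, 0)] * 5 + [(200, 200, 200)] * 4
print(vector_range(flat) < vector_range(edge))   # responses are larger at edges
```

Note how the averaging over the c lowest ranks and the minimum over b candidates make the MVD response of the nearly flat window collapse toward zero, illustrating its robustness to single-sample noise.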
The probabilistic measures can be adopted to evaluate the accuracy of edge detection by measuring the percentage of correctly and falsely detected edges. Since a predefined edge map (ground truth) is needed, synthetic images are preferred for this experiment. The distance measures can be adopted to evaluate the noise performance by measuring the deviation of edges caused by noise from the true edges. Since numerical measures, such as the various percentage criteria, are not sufficient to model the complexity of the human visual system, and evaluation based on synthetic images has limited value, qualitative evaluation using subjective (visual) tests is often used. Such an approach allows for the utilization of real RGB color images in the evaluation process.

1. Objective Evaluation Approach

The use of the percentage criteria requires the presence of a reference (ground-truth) edge map with known locations of edges in the ideal test image. The so-called coefficient of found edges CF and coefficient of lost edges CL are then determined as follows:

CF = (|ξF| / |χ|) 100%,   ξF = {m(p,q) : m(p,q) ∈ χ ∧ m(p,q) ∈ ξA}     (145)

CL = (|ξL| / |χ|) 100% = 100% − CF,   ξL = {m(p,q) : m(p,q) ∈ χ ∧ m(p,q) ∉ ξA}     (146)
where ξF denotes the set of edge pixels of the output edge map that coincide with the edges derived from the artificial image, ξL denotes the set of lost edge pixels, χ is the set of ideal edge pixels, and ξA denotes the set of edge pixels found by the tested edge detector. The performance of the edge operators can be further evaluated using the quotient of the pixels falsely detected as edges and the total number of pixels comprising the found edges (Lukac et al., 2003b):

CFD = (|TF| / |ξA|) 100%,   TF = {m(p,q) : m(p,q) ∉ χ ∧ m(p,q) ∈ ξA}     (147)
where TF denotes the set of false edge pixels. The so-called fault ratio (Plataniotis and Venetsanopoulos, 2000) is calculated as the quotient of the pixels wrongly classified as edges to the correctly identified edge pixels:

CFR = |TF| / |TH|,   TF = {m(p,q) : m(p,q) ∉ χ ∧ m(p,q) ∈ ξA},   TH = {m(p,q) : m(p,q) ∈ χ ∧ m(p,q) ∈ ξA}     (148)

where TF and TH denote the false and real edge pixels, respectively.
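Given boolean ground-truth and detected edge maps, the measures of Eqs. (145)–(148) reduce to a few set operations, as the following sketch shows (the toy maps and function name are illustrative, not from the original):

```python
import numpy as np

def edge_percentages(truth, detected):
    """CF, CL, CFD, CFR from boolean edge maps, per Eqs. (145)-(148)."""
    truth = np.asarray(truth, bool)
    detected = np.asarray(detected, bool)
    found = truth & detected          # xi_F: detections on true edges
    lost = truth & ~detected          # xi_L: true edge pixels missed
    false_pos = ~truth & detected     # T_F: detections off the true edges
    CF = 100.0 * found.sum() / truth.sum()        # Eq. (145)
    CL = 100.0 * lost.sum() / truth.sum()         # Eq. (146), = 100% - CF
    CFD = 100.0 * false_pos.sum() / detected.sum()  # Eq. (147)
    CFR = false_pos.sum() / found.sum()           # Eq. (148)
    return CF, CL, CFD, CFR

truth = np.zeros((5, 5), bool); truth[2, :] = True           # 5 true edge pixels
det = np.zeros((5, 5), bool); det[2, :4] = True; det[0, 0] = True
print(edge_percentages(truth, det))   # -> (80.0, 20.0, 20.0, 0.25)
```

Here four of the five true edge pixels are found (CF = 80%), one is lost, and one of the five detections is a false positive.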
Finally, the measure in Avcibas et al. (2002) is based on knowledge of the ideal reference edge map, where the reference edges should preferably have a width of one pixel. The measure considers both the accuracy of the edge location and the false (or missing) edge elements as follows:

CR = (1 / max{|ξA|, |χ|}) Σi=1…|ξA| 1 / (1 + α di²)     (149)
where |ξA| and |χ| are the numbers of detected and ground-truth edge points, respectively, and di denotes the distance from the ith detected edge pixel to the closest real edge pixel. Parameter α is a scale factor (e.g., 1/9 for the Pratt edge operator), which provides the relative weighting between smeared edges and thin but offset (shifted, dislocated) edges (Lukac et al., 2003b).

Unlike the percentage measures, distance measures do not require the reference image, and they enable the analysis of edge maps acquired from real scenes. Since smoothing filters usually produce an error by processing the edge pixels, criteria such as the SNR and PSNR can be used to measure the difference between the image edge maps produced before and after image smoothing. Alternatively, the reference edge map required in Eqs. (145)–(148) can be obtained as the absolute difference between the input image and the image produced by a robust smoothing filter.

2. Subjective Evaluation Approach

The subjective evaluation allows for further investigation of the characteristics of the obtained edge maps through the involvement of human factors (Plataniotis and Venetsanopoulos, 2000). The edge operators are usually rated in terms of several criteria, such as (1) ease of organizing objects, (2) continuity of edges, (3) thinness of edges, and (4) performance in suppressing noise. Visual inspection of the edge maps shown in Figures 23 and 25 reveals that the performance of the scalar and vector edge operators is very similar for noiseless images. The more sophisticated MVD and NNMVD operators produce thinner edges and, due to the employed averaging operation, are less sensitive to small texture variations.
If the edge operators are used to localize the edges in noisy images such as the cDNA microarray images (Figure 26),2 then, due to the utilization of the robust order-statistic concept, the vector edge operators usually outperform the scalar edge detectors in terms of both the accuracy of edge localization and the robustness against noise. Since cDNA microarray image formation is affected by a number of impairments that can be attributed to (Lukac et al., 2005b; Lukac and Plataniotis, 2005b) (1) variations in the image background, (2) variations in the spot sizes and positions, (3) artifacts caused by laser light reflection and dust on the glass slide, and (4) photon and electronic noise introduced during scanning, microarray spot localization necessitates the use of robust vector operators that are able to follow the spectral correlation that exists between the R and G channels of the vectorial cDNA microarray image (Lukac and Plataniotis, 2005a).

FIGURE 26. Edge detection-based cDNA microarray spot localization: (a) 200 × 200 cDNA microarray image, and (b)–(d) the corresponding edge maps obtained by the (b) Sobel detector, (c) MVD detector, and (d) NNMVD edge detector.

2 Complementary deoxyribonucleic acid (cDNA) microarray imaging is considered one of the most important and powerful technologies used to extract and interpret genomic information (Lukac et al., 2004d; Lukac and Plataniotis, 2005b). The image formation process produces two monochromatic images that are further registered into a two-channel, red-green image that contains thousands of spots carrying the genetic information. The generated cDNA microarray image is a multichannel vector signal that can be represented, for storing or visualization purposes, as the RGB color image with a zero blue component (Lukac et al., 2004d, 2005a).
VI. CONCLUSION

This chapter provided a taxonomy of modern color image filtering and enhancement solutions. Since image signals are nonlinear in nature, due to the presence of edges and fine details, and are often processed by the highly nonlinear human visual system, nonlinear color image processing solutions were the main focus of the chapter. Moreover, given the vectorial nature of the color image, particular emphasis was given to nonlinear vector operators that constitute a rich and expanding class of tools for color image filtering, enhancement, and analysis. By utilizing robust order statistics calculated through a supporting processing window, nonlinear filters can preserve important structural elements, such as color edges, and eliminate degradations occurring during signal formation and transmission.

As shown in this work, vector processing operators constitute a basis for noise detection and removal in color images. The same color vector tools can be used to inpaint missing color information, to enhance color input in acquired images and videos, to increase the spatial resolution of the visual data, and to localize color image edges and fine details. The utilization of spatial, structural, and spectral characteristics of the visual input is essential in modern imaging systems that attempt to mimic the human perception of the visual environment. Therefore, it is not difficult to see that color image filtering techniques have an extremely valuable position in modern color image science, communication, multimedia, and biomedical applications.
REFERENCES

Arce, G.R. (1991). Multistage order statistic filters for image sequence processing. IEEE Trans. Signal Process. 39 (5), 1146–1163.
Arce, G.R. (1998). A general weighted median filter structure admitting negative weights. IEEE Trans. Signal Process. 46 (12), 3195–3205.
Astola, J., Kuosmanen, P. (1997). Fundamentals of Nonlinear Digital Filtering. CRC Press, Boca Raton, FL.
Astola, J., Haavisto, P., Neuvo, Y. (1990). Vector median filters. Proc. IEEE 78 (4), 678–689.
Avcibas, I., Sankur, B., Sayood, K. (2002). Statistical evaluation of image quality measures. Journal of Electronic Imaging 11 (2), 206–223.
Barnett, V. (1976). The ordering of multivariate data. Journal of the Royal Statistical Society A 139, 318–354.
Barni, M., Cappellini, V., Mecocci, A. (1994). Fast vector median filter based on Euclidean norm approximation. IEEE Signal Processing Letters 1 (6), 92–94.
Barni, M., Bartolini, F., Cappellini, V. (2000). Image processing for virtual restoration of artworks. IEEE Multimedia 7 (2), 34–37.
Canny, J.F. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 679–698.
Coyle, E.J., Lin, J.H., Gabbouj, M. (1989). Optimal stack filtering and the estimation and structural approaches to image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing 37 (12), 2037–2066.
Criminisi, A., Perez, P., Toyama, K. (2004). Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 13 (9), 1200–1212.
Duda, R.O., Hart, P.E., Stork, D.G. (2000). Pattern Classification and Scene Analysis, 2nd ed. John Wiley, New York.
Faugeras, O. (1979). Digital color image processing within the framework of a human visual model. IEEE Transactions on Acoustics, Speech, and Signal Processing 27 (4), 380–393.
Fischer, M., Paredes, J.L., Arce, G.R. (2002). Weighted median image sharpeners for the World Wide Web. IEEE Trans. Image Process. 11 (7), 717–727.
Gabbouj, M., Cheikh, A. (1996). Vector median–vector directional hybrid filter for color image restoration. In: Proceedings of the European Signal Processing Conference EUSIPCO'96, pp. 879–881.
Gabbouj, M., Coyle, E.J., Gallagher, N.C. (1992). An overview of median and stack filtering. Circuit Systems Signal Processing 11 (1), 7–45.
Giakoumis, I., Nikolaidis, N., Pitas, I. (2006). Digital image processing techniques for the detection and removal of cracks in digitized paintings. IEEE Transactions on Image Processing 15 (1), 178–188.
Gomes, J., Velho, L. (1997). Image Processing for Computer Graphics. Springer-Verlag, Berlin.
Gonzalez, R., Woods, R.E. (1992). Digital Image Processing. Addison-Wesley, Reading, MA.
Gunturk, B., Altunbasak, Y., Mersereau, R. (2002). Color plane interpolation using alternating projections. IEEE Trans. Image Process. 11 (9), 997–1013.
Hamid, M.S., Harvey, N.L., Marshall, S. (2003). Genetic algorithm optimization of multidimensional grayscale soft morphological filters with applications in film archive restoration. IEEE Trans. Circuits Systems Video Tech. 13 (5), 406–416.
Hardie, R.C., Arce, G.R. (1991). Ranking in R^p and its use in multivariate image estimation. IEEE Trans. Circuits Systems Video Tech. 1 (2), 197–208.
Hardie, R.C., Boncelet, C.G. (1993). LUM filters: A class of rank-order-based filters for smoothing and sharpening. IEEE Trans. Signal Process. 41 (3), 1061–1076.
Hardie, R.C., Boncelet, C.G. (1995). Gradient-based edge detection using nonlinear edge enhancing prefilters. IEEE Trans. Image Process. 4 (11), 1572–1578.
Henkel, W., Kessler, T., Chung, H.Y. (1995). Coded 64-CAP ADSL in an impulse-noise environment—modeling of impulse noise and first simulation results. IEEE J. Selected Areas Communications 13 (9), 1611–1621.
Herodotou, N., Venetsanopoulos, A.N. (1995). Colour image interpolation for high resolution acquisitions and display devices. IEEE Trans. Consumer Electron. 41 (4), 1118–1126.
Holst, G.C. (1998). CCD Arrays, Cameras, and Displays, 2nd ed. JCD Publishing and SPIE Optical Engineering Press.
Hore, E.S., Qiu, B., Wu, H.R. (2003). Improved vector filtering for color images using fuzzy noise detection. Opt. Eng. 42 (6), 1656–1664.
Karakos, D.G., Trahanias, P.E. (1997). Generalized multichannel image-filtering structure. IEEE Trans. Image Process. 6 (7), 1038–1045.
Kayargadde, V., Martens, J.B. (1996). An objective measure for perceived noise. Signal Process. 49 (3), 187–206.
Khriji, L., Gabbouj, M. (1999). Vector median-rational hybrid filters for multichannel image processing. IEEE Signal Process. Lett. 6 (7), 186–190.
Khriji, L., Gabbouj, M. (2002). Adaptive fuzzy order statistics-rational hybrid filters for color image processing. Fuzzy Sets and Systems 128 (1), 35–46.
Kokaram, A.C., Morris, R.D., Fitzgerald, W.J., Rayner, P.J.W. (1995). Detection of missing data in image sequences. IEEE Trans. Image Process. 4 (11), 1496–1508.
Konstantinides, K., Bhaskaran, V., Beretta, G. (1999). Image sharpening in the JPEG domain. IEEE Trans. Image Process. 8 (6), 874–878.
Kotropoulos, C., Pitas, I. (2001). Nonlinear Model-Based Image/Video Processing and Analysis. John Wiley, New York.
Lee, Y.H., Fam, A.T. (1987). An edge gradient enhancing adaptive order statistic filter. IEEE Trans. Acoust. 35 (5), 680–695.
Li, X., Lu, D., Pan, Y. (2000). Color restoration and image retrieval for Dunhuang fresco preservation. IEEE Multimedia 7 (2), 38–42.
Lucat, L., Siohan, P., Barba, D. (2002). Adaptive and global optimization methods for weighted vector median filters. Signal Processing: Image Communications 17 (7), 509–524.
Lukac, R. (2001). Vector LUM smoothers as impulse detector for color images. In: Proc. European Conference on Circuit Theory and Design ECCTD'01, vol. III, pp. 137–140.
Lukac, R. (2002a). Color image filtering by vector directional order-statistics. Patt. Recognition Image Anal. 12 (3), 279–285.
Lukac, R. (2002b). Optimised directional distance filter. Machine Graphics and Vision: Special Issue on Colour Image Processing and Its Applications 11 (2–3), 311–326.
Lukac, R. (2003). Adaptive vector median filtering. Patt. Recognition Lett. 24 (12), 1889–1899.
Lukac, R. (2004a). Adaptive color image filtering based on center-weighted vector directional filters. Multidimensional Systems and Signal Processing 15 (2), 169–196.
Lukac, R. (2004b). Performance boundaries of optimal weighted median filters. Intern. J. Image Graphics 4 (2), 157–182.
Lukac, R., Marchevsky, S. (2001a). Adaptive vector LUM smoother. In: Proc. 2001 IEEE International Conference on Image Processing ICIP'01, vol. 1, pp. 878–881.
Lukac, R., Marchevsky, S. (2001b). LUM smoother with smooth control for noisy image sequences. EURASIP Journal of Applied Signal Processing 2001 (2), 110–120.
Lukac, R., Plataniotis, K.N. (2005a). Vector edge operators for cDNA microarray spot localization. Image Vision Comput., Submitted for publication.
Lukac, R., Plataniotis, K.N. (2005b). cDNA microarray image segmentation using root signals. Intern. J. Imaging Systems Tech., Submitted for publication.
Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2003a). Weighted vector median optimization. In: Proc. 4th EURASIP Conference focused on Video/Image Processing and Multimedia Communications ECVIP-MC'03, vol. 1, pp. 227–232.
Lukac, R., Plataniotis, K.N., Venetsanopoulos, A.N., Bieda, R., Smolka, B. (2003b). Color edge detection techniques. In: Signaltheorie und Signalverarbeitung, Akustik und Sprachakustik, Informationstechnik, vol. 29. W.E.B. Universität Verlag, Dresden, pp. 21–47.
Lukac, R., Martin, K., Plataniotis, K.N. (2004a). Digital camera zooming based on unified CFA image processing steps. IEEE Trans. Consumer Electron. 50 (1), 15–24.
Lukac, R., Smolka, B., Plataniotis, K.N., Venetsanopoulos, A.N. (2004b). Selection weighted vector directional filters. Computer Vision and Image Understanding, Special Issue on Colour for Image Indexing and Retrieval 94 (1–3), 140–167.
Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2004c). Generalized selection weighted vector filters. EURASIP Journal on Applied Signal Processing: Special Issue on Nonlinear Signal and Image Processing 2004 (12), 1870–1885.
Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2004d). A multichannel order-statistic technique for cDNA microarray image processing. IEEE Trans. Nanobioscience 3 (4), 272–285.
Lukac, R., Fischer, V., Motyl, G., Drutarovsky, M. (2004e). Adaptive video filtering framework. Intern. J. Imaging Systems Tech. 14 (6), 223–237.
Lukac, R., Smolka, B., Martin, K., Plataniotis, K.N., Venetsanopoulos, A.N. (2005a). Vector filtering for color imaging. IEEE Signal Processing Magazine: Special Issue on Color Image Processing 22 (1), 74–86.
Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2005b). cDNA microarray image processing using fuzzy vector filtering framework. Journal of Fuzzy Sets and Systems: Special Issue on Fuzzy Sets and Systems in Bioinformatics 152 (1), 17–35.
Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2005c). A statistically-switched adaptive vector median filter. J. Intell. Robot. Syst. 42 (4), 361–391.
Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2005d). Vector operators for color image zooming. In: Proc. IEEE International Symposium on Industrial Electronics ISIE'05, vol. 3, pp. 1273–1277.
Lukac, R., Plataniotis, K.N., Hatzinakos, D. (2005e). Color image zooming on the Bayer pattern. IEEE Trans. Circuit Syst. Video Tech. 15 (11), 1475–1492.
Lukac, R., Plataniotis, K.N., Smolka, B. (2005f). Adaptive color image filter for application in virtual restoration of artworks. Image Vision Comput., in preparation.
Lukac, R., Plataniotis, K.N., Venetsanopoulos, A.N. (2005g). Color image denoising using evolutionary computation. Intern. J. Imaging Syst. Tech. 15, Submitted for publication.
Ma, Z., Wu, H.R. (2006). Partition-based vector filtering technique for suppression of noise in digital color images. IEEE Trans. Image Process., in preparation.
Ma, Z., Wu, H.R., Qiu, B. (2005). A robust structure-adaptive hybrid vector filter for color image restoration. IEEE Trans. Image Process. 14 (12), 1990–2001.
Mitra, S., Sicuranza, G. (2001). Nonlinear Image Processing. Academic Press, San Diego.
Neuvo, Y., Ku, W. (1975). Analysis and digital realization of a pseudorandom Gaussian and impulsive noise source. IEEE Trans. Commun. 23, 849–858.
Nikolaidis, N., Pitas, I. (1996). Multichannel L filters based on reduced ordering. IEEE Trans. Circuits Syst. Video Tech. 6 (5), 470–482.
Nikolaidis, N., Pitas, I. (1998). Nonlinear processing and analysis of angular signals. IEEE Trans. Signal Process. 46 (12), 3181–3194.
Nosofsky, R.M. (1984). Choice, similarity and the context theory of classification. J. Exp. Psychol. Learn. Mem. Cog. 10 (1), 104–114.
Park, J., Park, D.C., Marks, R.J., El-Sharkawi, M.A. (2005). Recovery of image blocks using the method of alternating projections. IEEE Trans. Image Process. 14 (4), 461–474.
Peltonen, S., Gabbouj, M., Astola, J. (2001). Nonlinear filter design: Methodologies and challenges. In: Proc. International Symposium on Image and Signal Processing and Analysis ISPA'01, pp. 102–107.
Pitas, I., Tsakalides, P. (1991). Multivariate ordering in color image filtering. IEEE Trans. Circuits Syst. Video Tech. 1 (3), 247–259.
Pitas, I., Venetsanopoulos, A.N. (1990). Nonlinear Digital Filters, Principles and Applications. Kluwer Academic Publishers, Dordrecht.
Pitas, I., Venetsanopoulos, A.N. (1992). Order statistics in digital image processing. Proc. IEEE 80 (12), 1892–1919.
Plataniotis, K.N., Venetsanopoulos, A.N. (1998). Vector processing. In: Sangwine, S.J. (Ed.), Colour Image Processing. Chapman & Hall, London, U.K., pp. 188–209.
Plataniotis, K.N., Venetsanopoulos, A.N. (2000). Color Image Processing and Applications. Springer-Verlag, Berlin.
Plataniotis, K.N., Androutsos, D., Venetsanopoulos, A.N. (1996). Fuzzy adaptive filters for multichannel image processing. Signal Process. 55 (1), 93–106.
Plataniotis, K.N., Androutsos, D., Vinayagamoorthy, S., Venetsanopoulos, A.N. (1997). Color image processing using adaptive multichannel filters. IEEE Trans. Image Process. 6 (7), 933–950.
Plataniotis, K.N., Androutsos, D., Venetsanopoulos, A.N. (1998a). Adaptive multichannel filters for colour image processing. Signal Process. Image Commun. 11 (3), 171–177.
Plataniotis, K.N., Androutsos, D., Venetsanopoulos, A.N. (1998b). Color image processing using adaptive vector directional filters. IEEE Trans. Circuits Syst. 45 (10), 1414–1419.
Plataniotis, K.N., Androutsos, D., Venetsanopoulos, A.N. (1999). Adaptive fuzzy systems for multichannel signal processing. Proc. IEEE 87 (9), 1601–1622.
Polesel, A., Ramponi, G., Mathews, V.J. (2000). Image enhancement via adaptive unsharp masking. IEEE Trans. Image Process. 9 (3), 505–510.
Rane, S.D., Sapiro, G., Bertalmio, M. (2003). Structure and texture filling-in of missing image blocks in wireless transmission and compression applications. IEEE Trans. Image Process. 12 (3), 296–303.
Rantanen, H., Karlsson, M., Pohjala, P., Kalli, S. (1992). Color video signal processing with median filters. IEEE Trans. Consumer Electron. 38 (3), 157–161.
Regazzoni, C.S., Teschioni, A. (1997). A new approach to vector median filtering based on space filling curves. IEEE Trans. Image Process. 6 (7), 990–1001.
Scharcanski, J., Venetsanopoulos, A.N. (1997). Edge detection of color images using directional operators. IEEE Trans. Circuits Syst. Video Tech. 7 (2), 397–401.
Sharma, G., Trussell, H.J. (1997). Digital color imaging. IEEE Trans. Image Process. 6 (7), 901–932.
Smolka, B. (2002). Adaptive modification of the vector median filter. Machine Graphics and Vision 11 (2–3), 327–350.
Smolka, B., Chydzinski, A., Wojciechowski, K., Plataniotis, K.N., Venetsanopoulos, A.N. (2001). On the reduction of impulsive noise in multichannel image processing. Opt. Eng. 40 (6), 902–908.
Smolka, B., Lukac, R., Plataniotis, K.N., Wojciechowski, K., Chydzinski, A. (2003). Fast adaptive similarity based impulsive noise reduction filter. Real-Time Imaging, Special Issue on Spectral Imaging 9 (4), 261–276.
Smolka, B., Plataniotis, K.N., Venetsanopoulos, A.N. (2004). Nonlinear techniques for color image processing. In: Barner, K.E., Arce, G.R. (Eds.), Nonlinear Signal and Image Processing: Theory, Methods, and Applications. CRC Press, Boca Raton, FL, pp. 445–505.
Stokes, M., Anderson, M., Chandrasekar, S., Motta, R. (1996). A standard default color space for the internet—sRGB. Technical Report, www.w3.org/Graphics/Color/sRGB.html.
Sung, K.K. (1992). A Vector Signal Processing Approach to Color. M.S. Thesis, Massachusetts Institute of Technology.
Szczepanski, M., Smolka, B., Plataniotis, K.N., Venetsanopoulos, A.N. (2003). On the geodesic paths approach to color image filtering. Signal Process. 83 (6), 1309–1342.
Szczepanski, M., Smolka, B., Plataniotis, K.N., Venetsanopoulos, A.N. (2004). On the distance function approach to color image enhancement. Discrete Appl. Math. 139 (1–3), 283–305.
Tang, K., Astola, J., Neuvo, Y. (1994). Multichannel edge enhancement in color image processing. IEEE Trans. Circuits Syst. Video Tech. 4 (5), 468–479.
Tang, K., Astola, J., Neuvo, Y. (1995). Nonlinear multivariate image filtering techniques. IEEE Trans. Image Process. 4 (6), 788–798.
Tang, B., Sapiro, G., Caselles, V. (2001). Color image enhancement via chromaticity diffusion. IEEE Trans. Image Process. 10 (5), 701–707.
Trahanias, P.E., Venetsanopoulos, A.N. (1993). Vector directional filters: A new class of multichannel image processing filters. IEEE Trans. Image Process. 2 (4), 528–534.
Trahanias, P.E., Karakos, D., Venetsanopoulos, A.N. (1996). Directional processing of color images: Theory and experimental results. IEEE Trans. Image Process. 5 (6), 868–881.
Tsai, H.H., Yu, P.T. (2000). Genetic-based fuzzy hybrid multichannel filters for color image restoration. Fuzzy Sets and Systems 114 (2), 203–224.
Viero, T., Oistamo, K., Neuvo, Y. (1994). Three-dimensional median related filters for color image sequence filtering. IEEE Trans. Circuits Syst. Video Tech. 4 (2), 129–142.
Vrhel, M.J., Saber, E., Trussell, H.J. (2005). Color image generation and display technologies. IEEE Signal Process. Mag. 22 (1), 22–33.
Wyszecki, G., Stiles, W.S. (1982). Color Science, Concepts and Methods, Quantitative Data and Formulas, 2nd ed. John Wiley, New York.
Yang, R., Yin, L., Gabbouj, M., Astola, J., Neuvo, Y. (1995). Optimal weighted median filtering under structural constraints. IEEE Trans. Signal Process. 43 (3), 591–604.
Yin, L., Neuvo, Y. (1994). Fast adaptation and performance characteristics of FIR-WOS hybrid filters. IEEE Trans. Signal Process. 41 (7), 1610–1628.
Yin, L., Astola, J., Neuvo, Y. (1993). Adaptive stack filtering with application to image processing. IEEE Trans. Signal Process. 41 (1), 162–184.
Yin, L., Yang, R., Gabbouj, M., Neuvo, Y. (1996). Weighted median filters: A tutorial. IEEE Trans. Circuits Syst. 43 (3), 157–192.
Yu, P.T., Liao, W.H. (1994). Weighted order statistics filters—their classification, some properties, and conversion algorithm. IEEE Trans. Signal Process. 42 (10), 2678–2691.
Zheng, J., Valavanis, K.P., Gauch, J.M. (1993). Noise removal from color images. J. Intell. Robot. Syst. 7 (3), 257–285.
Ziou, D., Tabbone, S. (1998). Edge detection techniques: An overview. Patt. Recognition Image Anal. 8 (4), 537–559.
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 140
General Sweep Mathematical Morphology FRANK Y. SHIH Computer Vision Laboratory, College of Computing Sciences, New Jersey Institute of Technology, Newark, New Jersey 07102, USA
I. Introduction
II. Theoretical Development of General Sweep Mathematical Morphology
   A. Computation of Traditional Morphology
   B. General Sweep Mathematical Morphology
   C. Properties of Sweep Morphological Operations
III. Blending of Swept Surfaces with Deformations
IV. Image Enhancement
V. Edge Linking
   A. Edge Linking Using Sweep Morphology
VI. Shortest Path Planning for Mobile Robot
VII. Geometric Modeling and Sweep Mathematical Morphology
   A. Tolerance Expression
   B. Sweep Surface Modeling
VIII. Formal Language and Sweep Morphology
IX. Representation Scheme
   A. Two-Dimensional Attributes
   B. Three-Dimensional Attributes
X. Grammars
   A. Two-Dimensional Attributes
   B. Three-Dimensional Attributes
XI. Parsing Algorithm
XII. Conclusions
References
Further Reading
I. INTRODUCTION

ISSN 1076-5670/05 DOI: 10.1016/S1076-5670(05)40005-1

The sweep operation, which generates a new object by sweeping an object along a space-curve trajectory, provides a natural design tool in solid modeling. The simplest sweep is linear extrusion, defined by a two-dimensional (2D) area swept along a linear path normal to the plane of the area to create a volume. Another simple sweep is rotational sweep, defined by rotating a 2D object about an axis. Though simple, these two sweeps are often seen in real applications. Sweeps that generate area or volume changes in size, shape,
or orientation during the sweeping process, and follow an arbitrarily curved trajectory, are called general sweeps (Requicha, 1980). General sweeps of solids are useful in modeling the region swept out by a machine-tool cutting head or a robot following a path. General sweeps of 2D cross sections are known as generalized cylinders in computer vision, and are usually modeled as parameterized 2D cross sections swept at right angles along an arbitrary curve. Being the simplest of general sweeps, generalized cylinders are relatively easy to compute. However, general sweeps of solids are difficult to compute, since the trajectory and object shape may make the swept object self-intersect (Foley et al., 1995).

Mathematical morphology involves the geometric analysis of shapes and textures in images. Appropriately used, mathematical morphological operations tend to simplify image data, preserving their essential shape characteristics and eliminating irrelevancies (Haralick et al., 1987; Serra, 1982; Shih and Mitchell, 1989, 1992). Since object recognition, feature extraction, and defect detection correlate directly with shape, mathematical morphology is a natural processing approach for the machine vision recognition process and the visually guided robot problem.

Mathematical morphological operations can be thought of as working with two images. Conceptually, the image being processed is referred to as the active image, and the other image, acting as a kernel, is referred to as the structuring element. Each structuring element has a designed shape, which can be thought of as a probe or filter of the active image. We can modify the active image by probing it with various structuring elements. The two fundamental mathematical morphological operations are dilation and erosion. Dilation combines two sets using vector addition of set elements.
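Both fundamental operations can be sketched on binary images represented as sets of pixel coordinates (a minimal pure-Python illustration; the function names are mine, not the chapter's):

```python
def dilate(A, B):
    """Dilation A ⊕ B: vector addition of set elements."""
    return {(a0 + b0, a1 + b1) for (a0, a1) in A for (b0, b1) in B}

def erode(A, B):
    """Erosion A ⊖ B (the dual): keep x when B translated to x fits inside A."""
    return {(a0, a1) for (a0, a1) in A
            if all((a0 + b0, a1 + b1) in A for (b0, b1) in B)}

# 3 x 3 square structuring element centered at the origin.
B = {(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)}
A = {(i, j) for i in range(2, 5) for j in range(2, 5)}   # 3 x 3 object

grown = dilate(A, B)    # "fill"/"expand"/"grow": a 5 x 5 block
shrunk = erode(A, B)    # "shrink"/"reduce": only the center pixel survives
```

Dilation by the 3 × 3 square is exactly the eight-neighborhood "grow" operation described in the text, and erosion by the same element is its "shrink" dual.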
Dilation by a disk structuring element corresponds to the isotropic expansion algorithm popular in binary image processing. Dilation by a small (3 × 3) square is an eight-neighborhood operation that can be easily implemented by adjacently connected array architectures and is known by the names "fill," "expand," or "grow." Erosion is the morphological dual of dilation. It combines two sets using vector subtraction of set elements. Equivalent terms for erosion are "shrink" and "reduce."

The traditional morphological operations perform vector additions or subtractions by a translation of the structuring element to the object pixel. They are far from capable of modeling the swept volumes of structuring elements moving with complex, simultaneous translation, scaling, and rotation in Euclidean space. In this chapter, we develop an approach that adopts sweep morphological operations to study the properties of swept volumes. We present the theoretical framework for the representation, computation, and analysis of a new class of general sweep mathematical morphology and its practical applications.
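To make the contrast concrete, a sweep dilation can be sketched as the union of transformed copies of the structuring element placed along a curve. This is an illustrative discretization under my own simplifying assumptions (per-point rotation angles and scale factors, rounding to the pixel grid), not the chapter's formal definition:

```python
import math

def sweep_dilate(curve, B, angles, scales):
    """Union of copies of B rotated by angles[k] and scaled by scales[k],
    translated to each curve point. Reduces to ordinary dilation when
    every angle is 0 and every scale is 1."""
    out = set()
    for (p0, p1), theta, s in zip(curve, angles, scales):
        for (b0, b1) in B:
            r0 = s * (b0 * math.cos(theta) - b1 * math.sin(theta))
            r1 = s * (b0 * math.sin(theta) + b1 * math.cos(theta))
            out.add((p0 + round(r0), p1 + round(r1)))
    return out

curve = [(0, 0), (0, 1), (0, 2)]
B = {(0, 0), (1, 0)}                                   # 2-pixel segment
plain = sweep_dilate(curve, B, [0.0] * 3, [1.0] * 3)   # ordinary dilation
turned = sweep_dilate(curve, B, [math.pi / 2] * 3, [1.0] * 3)
```

With a 90-degree rotation at every curve point, the same structuring element sweeps out a thin line instead of a thick band, which is exactly the orientation dependence that translation-only morphology cannot express.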
Geometric modeling is the foundation for CAD/CAM integration (Pennington et al., 1983). The goal of automated manufacturing inspection and robotic assembly is to generate a complete process automatically. The representation must not only capture the nominal geometric shapes, but also incorporate the geometric inaccuracies (or tolerances) into the locations and shapes of solid objects. Boundary representation and constructive solid geometry (CSG) representation are popularly used as internal databases (Requicha and Voelcker, 1982; Rossignac, 2002) for geometric modeling.

Boundary representation consists of two kinds of information, topological and geometric, including vertex coordinates, surface equations, and connectivity among faces, edges, and vertices. Boundary representation has several advantages: large domain, unambiguity, uniqueness, and explicit representation of faces, edges, and vertices. It also has several disadvantages: a verbose data structure, difficulty of creation, difficulty of validity checking, and unavailability of variational information.

The CSG representation constructs a complex part by hierarchically combining simple primitives using Boolean set operations (Mott-Smith and Baer, 1972). CSG representation has several advantages: large domain, unambiguity, easy validity checking, and ease of creation. It also has several disadvantages: nonuniqueness, difficulty of graphical editing, input data redundancy, and unavailability of variational information (Voelcker and Hunt, 1981).

The framework we propose for geometric modeling and representation is sweep mathematical morphology. The sweep operation, which generates a volume by sweeping a primitive object along a space-curve trajectory, provides a natural design tool. The simplest sweep is linear extrusion, defined by a 2D area swept along a linear path normal to the plane of the area to create a volume (Chen et al., 1999).
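The Boolean composition underlying CSG can be illustrated with a deliberately minimal sketch (the primitives and combinators below are my own illustration, not an interface from the chapter): each solid is a membership predicate, and complex parts are Boolean combinations of primitives.

```python
# CSG as membership predicates: a solid is a function point -> bool,
# and complex parts are built by Boolean combination of primitives.

def disk(cx, cy, r):
    return lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r * r

def rectangle(x0, y0, x1, y1):
    return lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def union(f, g):        return lambda p: f(p) or g(p)
def intersection(f, g): return lambda p: f(p) and g(p)
def difference(f, g):   return lambda p: f(p) and not g(p)

# A plate with a circular hole: rectangle minus disk.
part = difference(rectangle(0, 0, 10, 4), disk(5, 2, 1))
```

The hierarchy of combinators mirrors the CSG tree: the same `part` can be produced by different trees, which is the nonuniqueness disadvantage noted above.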
Another sweep is rotational sweep, defined by rotating a 2D object about an axis. General sweep is useful in modeling the region swept out by a machine-tool cutting head or a robot following a path (Blackmore et al., 1994). General sweeps of 2D cross sections are known as generalized cylinders in computer vision and are usually modeled as parameterized 2D cross sections swept at right angles along an arbitrary curve. Being the simplest of general sweeps, generalized cylinders are relatively easy to compute. However, general sweeps of solids are difficult to compute, since the trajectory and object shape may make the swept object self-intersect (Foley et al., 1995). A generalized sweeping method for CSG modeling was developed by Shiroma et al. (1982, 1991) to generate swept volumes. It is shown that complex solid shapes can be generated with a blending surface to join two disconnected
solids, fillet volumes for rounding corners, and swept volumes formed by the movement of numerical control (NC) tools. Ragothama and Shapiro (1998) presented a B-rep method for deformation in parametric solid modeling.

This chapter is organized as follows. Section II presents the theoretical development of general sweep mathematical morphology along with its properties. Section III describes an application of sweep morphology to the blending of swept surfaces with deformations. Section IV presents the use of sweep morphology for image enhancement, Section V for edge linking, and Section VI for shortest path planning. Section VII describes geometric modeling based on sweep mathematical morphology. Section VIII describes the formal languages. Section IX proposes the representation scheme for 2D and three-dimensional (3D) objects. Section X introduces the adopted grammars. Section XI applies the parsing algorithm to determine whether a given object belongs to the language. The conclusions are drawn in Section XII.
II. THEORETICAL DEVELOPMENT OF GENERAL SWEEP MATHEMATICAL MORPHOLOGY

Traditional morphological dilation and erosion perform vector additions or subtractions by translating a structuring element along an object. These operations obviously have the limitation of orientation dependence and can represent only the sweep motion that involves translation. By including not only translation but also rotation and scaling, the entire theoretical framework and its practical applications become extremely fruitful. Sweep morphological dilation and erosion describe a motion of a structuring element that sweeps along the boundary of an object, or an arbitrary curve, under geometric transformations. The rotation angles and scaling factors are defined with respect to the boundary or the curve.

A. Computation of Traditional Morphology

Because rotation and scaling are inherently defined on each pixel of the curve, the traditional morphological operations of an object by a structuring element need to be converted to the sweep morphological operations of a boundary by the structuring element. We assume throughout this chapter that the sets considered are connected and bounded.

Definition 1. A set S is said to be connected if each pair of points p, q ∈ S can be joined by a path that consists of pixels entirely located in S.

Definition 2. Given a set S, the boundary ∂S is defined as the set of points all of whose neighborhoods intersect both S and its complement S^c.
GENERAL SWEEP MATHEMATICAL MORPHOLOGY
Definition 3. If a set S is connected and has no holes, it is called simply connected; if it is connected but has holes, it is called multiply connected.

Definition 4. Given a set S, the outer boundary ∂+S of the set is defined as the closed loop of points in S that contains every other closed loop consisting of points of S; the inner boundary ∂−S is defined as a closed loop of points in S that does not contain any other closed loop in S.

Proposition 1. If a set S is simply connected, then ∂S = ∂+S; if it is multiply connected, then ∂S = ∂+S ∪ ∂−S.

Definition 5. The positive filling of a set S, denoted [S]+, is defined as the set of all points inside the outer boundary of S; the negative filling, denoted [S]−, is defined as the set of all points outside the inner boundary. Note that if S is simply connected, then [S]− is the universal set. Therefore, whether S is simply or multiply connected, S = [S]+ ∩ [S]−.

Proposition 2. Let A and B be simply connected sets. The dilation of A by B equals the positive filling of ∂A ⊕ B; that is, A ⊕ B = [∂A ⊕ B]+. The significance is that, when A and B are simply connected, the dilation can be computed from the boundary ∂A and the set B, which leads to a substantial reduction of computation.

Proposition 3. If A and B are simply connected sets, the dilation of A by B equals the positive filling of the dilation of their boundaries; that is, A ⊕ B = [∂A ⊕ ∂B]+. This proposition further reduces the computation: the dilation of A by B can be computed from the dilation of the boundary of A by the boundary of B.

Proposition 4. If A is multiply connected and B is simply connected, A ⊕ B = [∂+A ⊕ ∂B]+ ∩ [∂−A ⊕ ∂B]−. Since dilation is commutative in A and B, the following proposition is easily obtained.

Proposition 5.
If A is simply connected and B is multiply connected, A ⊕ B = [∂A ⊕ ∂+B]+ ∩ [∂A ⊕ ∂−B]−.
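Proposition 2 can be checked numerically. The following sketch (grid size, structuring element, and the helper names `dilate`, `boundary`, and `positive_filling` are all illustrative, assuming NumPy is available) computes a binary dilation directly and via the boundary-plus-positive-filling route:

```python
import numpy as np
from collections import deque

def dilate(A, offsets):
    """Binary dilation: union of A translated by every offset in the SE."""
    out = np.zeros_like(A)
    for dy, dx in offsets:
        out |= np.roll(np.roll(A, dy, axis=0), dx, axis=1)
    return out

def boundary(A):
    """Pixels of A that touch the complement (4-neighborhood)."""
    p = np.pad(A, 1)
    interior = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                & p[1:-1, :-2] & p[1:-1, 2:])
    return A & ~interior

def positive_filling(A):
    """[A]+ : A plus every hole not reachable from the image border."""
    h, w = A.shape
    outside = np.zeros_like(A)
    dq = deque((y, x) for y in range(h) for x in range(w)
               if (y in (0, h - 1) or x in (0, w - 1)) and not A[y, x])
    for y, x in dq:
        outside[y, x] = True
    while dq:                       # flood fill of the exterior background
        y, x = dq.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not A[ny, nx] and not outside[ny, nx]:
                outside[ny, nx] = True
                dq.append((ny, nx))
    return ~outside

A = np.zeros((15, 15), dtype=bool)
A[5:10, 5:10] = True                                      # a filled 5x5 square
B = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # 3x3 square SE

direct = dilate(A, B)                                     # A ⊕ B
via_boundary = positive_filling(dilate(boundary(A), B))   # [∂A ⊕ B]+
assert (direct == via_boundary).all()
```

Swapping `dilate(boundary(A), B)` for a dilation of `boundary(A)` by the boundary offsets of B would illustrate Proposition 3 the same way.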
SHIH
B. General Sweep Mathematical Morphology

The sweep morphology can be represented as a four-tuple Ψ(B, A, S, Θ), where B is a structuring element set, indicating a primitive object; A is either a curve path or a closed object whose boundary represents the sweep trajectory, parameterized by t, along which the structuring element B is swept; S(t) is a vector of scaling factors; and Θ(t) is a vector of rotation angles. Note that both the scaling factors and the rotation angles are defined with respect to the sweep trajectory.

Definition 6. If A is a simply connected object and ∂A denotes its boundary, the sweep morphological dilation of A by B in Euclidean space is denoted by A ⊞ B and is defined as

A ⊞ B = { c | c = a + b̂ for some a ∈ A and b̂ ∈ S(t) × Θ(t) × B }.

This is equivalent to operating on the boundary of A (i.e., ∂A) and taking the positive filling:

A ⊞ B = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A(t) + b × S(t) × Θ(t) } ]+ .

If A is a curve path, that is, ∂A = A, the sweep morphological dilation of A by B is defined as

A ⊞ B = ⋃_{0≤t≤1} ⋃_{b∈B} { A(t) + b × S(t) × Θ(t) }.

Note that if B involves no rotation (or B is rotation invariant, like a circle) and no scaling, the sweep dilation is equivalent to the traditional morphological dilation.
Example 1. Figure 1a shows a curve and Figure 1b shows an elliptical structuring element. The rotation angle along the curve, with parameter t in the range [0, 1], is defined as θ(t) = tan⁻¹[(dy/dt)/(dx/dt)]. The traditional morphological dilation is shown in Figure 1c and the sweep dilation using the defined rotation in Figure 1d.

A geometric transformation of the structuring element specifies the new coordinates of each point as functions of the old coordinates. Note that the new coordinates are not necessarily integers after a transformation of a digital image is applied. To turn the results of the transformation into a digital image, they must be resampled or interpolated. Since we are transforming a two-valued (black-and-white) image, zero-order interpolation is adopted.
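A minimal numerical sketch of this construction (the parabolic curve, the digitized ellipse, and the sampling density are illustrative; the final rounding is the zero-order interpolation mentioned above):

```python
import numpy as np

def sweep_dilate_curve(curve, se_pts):
    """Sweep dilation of a sampled curve: at each sample the structuring
    element is rotated to the local tangent and translated to the point."""
    out = set()
    n = len(curve)
    for i in range(n):
        # Central-difference tangent (one-sided at the endpoints).
        x0, y0 = curve[max(i - 1, 0)]
        x1, y1 = curve[min(i + 1, n - 1)]
        theta = np.arctan2(y1 - y0, x1 - x0)
        c, s = np.cos(theta), np.sin(theta)
        cx, cy = curve[i]
        for bx, by in se_pts:
            # Rotate the SE point, translate it, round it (zero-order interpolation).
            out.add((round(cx + c * bx - s * by), round(cy + s * bx + c * by)))
    return out

# Curve: y = x^2/20 on integer x; SE: a digitized ellipse with semi-axes 4 and 1.
curve = [(x, x * x / 20.0) for x in range(0, 21)]
ellipse = [(bx, by) for bx in range(-4, 5) for by in range(-1, 2)
           if bx * bx / 16.0 + by * by <= 1.0]
swept = sweep_dilate_curve(curve, ellipse)
```

Replacing the rotated offsets with untransformed ones would give the ordinary (orientation-independent) dilation of Figure 1c.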
Figure 1. (a) An open curve path, (b) a structuring element, (c) result of a traditional morphological dilation, and (d) result of a sweep morphological dilation.
Unlike dilation, the sweep morphological erosion is defined only for a closed object, whose boundary represents the sweep trajectory.

Definition 7. Let ∂A be the boundary of an object A and B be a structuring element. The sweep morphological erosion of A by B in Euclidean space, denoted by A ⊟ B, is defined as

A ⊟ B = { c | c + b̂ ∈ A for every b̂ ∈ S(t) × Θ(t) × B }.
An example of a sweep erosion by an elliptical structuring element whose semimajor axis is tangent to the boundary is shown in Figure 2. As in traditional morphology, the general sweep morphological opening can be defined as a general sweep erosion of A by B followed by a general sweep dilation, where A must be a closed object. The sweep morphological closing can be defined in the opposite sequence, that is, a general sweep dilation of
Figure 2. (a) Traditional erosion and (b) sweep erosion.
A by B followed by a general sweep erosion, where A can be either a closed object or a curve path. The propositions of traditional morphological operations can be extended to sweep morphological operations. Proposition 6. If the structuring element B is simply connected, the sweep dilation of A by B equals the positive filling of the sweep dilation by the boundary of B, that is, A ⊞ B = [A ⊞ ∂B]+ . Extending this proposition to multiply connected objects, we get the following three cases.
Case 6a. If A is multiply connected, that is, ∂A = ∂+A ∪ ∂−A, then A ⊞ B = [∂+A ⊞ B]+ ∩ [∂−A ⊞ B]−.

Case 6b. If B is multiply connected, that is, ∂B = ∂+B ∪ ∂−B, then A ⊞ B = [A ⊞ ∂+B]+ ∩ [A ⊞ ∂−B]−.

Case 6c. Finally, if both A and B are multiply connected, that is, ∂A = ∂+A ∪ ∂−A and ∂B = ∂+B ∪ ∂−B, then A ⊞ B = [(∂+A ∪ ∂−A) ⊞ ∂+B]+ ∩ [(∂+A ∪ ∂−A) ⊞ ∂−B]−.

This leads to a substantial reduction of computation. An analogous development can be made for the sweep erosion.

Proposition 7.
If A and B are simply connected sets, then A⊟B = A⊟∂B.
With the aforementioned propositions and by considering the boundary of the structuring element, we can further reduce the computation of sweep morphological operations.

C. Properties of Sweep Morphological Operations

Property 1 (Noncommutativity). Because of the rotational factor in the operation, commutativity does not hold; that is, A ⊞ B ≠ B ⊞ A.

Property 2 (Nonassociativity). Because the rotation and scaling factors depend on the characteristics of the object's boundary, associativity does not hold; hence, A ⊞ (B ⊞ C) ≠ (A ⊞ B) ⊞ C. However, associativity of regular dilation with sweep dilation holds, that is, A ⊕ (B ⊞ C) = (A ⊕ B) ⊞ C, because the structuring element is rotated based on the boundary properties of B, and after A ⊕ B the boundary properties remain similar to those of B.

Property 3 (Translational invariance).

A_x ⊞ B = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + x + b × S(t) × θ(t) } ]+
        = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + b × S(t) × θ(t) + x } ]+
        = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + b × S(t) × θ(t) } ]+ + x
        = (A ⊞ B)_x.
Sweep erosion can be derived similarly.

Property 4 (Increasing). The increasing property does not hold in general. If the boundary is smooth, that is, the derivative exists everywhere, then the increasing property holds.

Property 5 (Distributivity).

a. Distributivity over union of structuring elements. Dilation is distributive over a union of structuring elements; that is, the dilation of A with the union of two structuring elements B and C is the same as the union of the dilation of A with B and the dilation of A with C:

A ⊞ (B ∪ C) = [ ⋃_{0≤t≤1} ⋃_{b∈B∪C} { ∂A + b × S(t) × θ(t) } ]+
            = [ ⋃_{0≤t≤1} ( ⋃_{b∈B} { ∂A + b × S(t) × θ(t) } ∪ ⋃_{b∈C} { ∂A + b × S(t) × θ(t) } ) ]+
            = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + b × S(t) × θ(t) } ]+ ∪ [ ⋃_{0≤t≤1} ⋃_{b∈C} { ∂A + b × S(t) × θ(t) } ]+
            = (A ⊞ B) ∪ (A ⊞ C).
b. Dilation is not distributive over the union of sets; that is, the dilation of (A ∪ C) with a structuring element B is not the same as the union of the dilation of A with B and the dilation of C with B:

(A ∪ C) ⊞ B ≠ (A ⊞ B) ∪ (C ⊞ B).
c. Erosion is antidistributive over the union of structuring elements. That is, the erosion of A with the union of two structuring elements B and C is the same as the intersection of the erosion of A with B and the erosion of A with C.

d. Distributivity over intersection.

∂A ⊞ (B ∩ C) = ⋃_{0≤t≤1} ⋃_{b∈B∩C} { ∂A + b × S(t) × θ(t) }

⇒ ∂A ⊞ (B ∩ C) ⊆ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + b × S(t) × θ(t) }

and also

∂A ⊞ (B ∩ C) ⊆ ⋃_{0≤t≤1} ⋃_{b∈C} { ∂A + b × S(t) × θ(t) }.

Therefore,

∂A ⊞ (B ∩ C) ⊆ (∂A ⊞ B) ∩ (∂A ⊞ C),

which implies

[∂A ⊞ (B ∩ C)]+ ⊆ [∂A ⊞ B]+ ∩ [∂A ⊞ C]+,

that is,

A ⊞ (B ∩ C) ⊆ (A ⊞ B) ∩ (A ⊞ C).
III. Blending of Swept Surfaces with Deformations

Using general sweep mathematical morphology, a smooth sculptured surface can be described as the trajectory of a cross-section curve swept along a profile curve, where the cross-section curve is the structuring element B and the profile curve is the open or closed curve C. It is easy to describe the sculptured surface by specifying the 2D cross sections, and the resulting surface is aesthetically appealing. The designer can envision the surface as a blended trajectory of cross-section curves swept along a profile curve.

Let ∂B denote the boundary of a structuring element B. A swept surface Sw(∂B, C) is produced by moving ∂B along a given trajectory curve C. The plane of B must be perpendicular to C at every time instance. The contour curve is represented as a B-spline curve, and ∂B is represented as the polygon net of the actual curve. This polygon net is swept along the trajectory to obtain the intermediate polygon nets, which are later interpolated by a B-spline
surface. The curve can be deformed by twisting or by scaling, either uniformly or by applying the deformations to selected points of ∂B. The curve can also be deformed by varying the weights at each of the points. When a uniform variation is desired, it is applied to all the points; otherwise, to some selected points. These deformations are applied to ∂B before it is moved along the trajectory C.

Let ∂B denote a planar polygon with n points, each point ∂Bi = (xi, yi, zi, hi), where i = 1, 2, ..., n. Let C denote any 3D curve with m points, each point Cj = (xj, yj, zj), where j = 1, 2, ..., m. The scaling factors, weight, and twisting factor for point j of C are denoted as sxj, syj, szj, wj, and θj, respectively. The deformation matrix is obtained as [Sd] = [Ssw][Rθ], where

[Ssw] = [ sxj  0    0    0
          0    syj  0    0
          0    0    szj  0
          0    0    0    wj ]

and

[Rθ] = [  cos θj  sin θj  0  0
         −sin θj  cos θj  0  0
          0       0       1  0
          0       0       0  1 ].

The deformed ∂B must be rotated in 3D with respect to the tangent vector at each point of the trajectory curve C. To calculate the tangent vector, we add two points to C, C0 and Cm+1, where C0 = C1 and Cm+1 = Cm. The rotation matrix Rx about the x-axis is given by

[Rx] = [ 1  0        0        0
         0  cos αj   sin αj   0
         0  −sin αj  cos αj   0
         0  0        0        1 ],

where

cos αj = (cy,j−1 − cy,j+1) / hx,   sin αj = (cz,j+1 − cz,j−1) / hx,

hx = √( (cy,j−1 − cy,j+1)² + (cz,j+1 − cz,j−1)² ).
The rotation matrices about the y- and z-axes can similarly be derived. Finally, ∂B must be translated to each point of C and the translation matrix Cxyz is
given by

[Cxyz] = [ 1          0          0          0
           0          1          0          0
           0          0          1          0
           Cxj − Cx1  Cyj − Cy1  Czj − Cz1  1 ].

Figure 3. Sweeping of a square along a trajectory with deformation to a circle.
The polygon net of the sweep surface is obtained by [Bi,j] = [∂Bi][Sd][SwC], where [SwC] = [Rx][Ry][Rz][Cxyz]. The B-spline surface can be obtained from the polygon net by finding the B-spline curve at each point of C. To obtain the whole swept surface, the B-spline curves at each point of the trajectory C have to be calculated. This computation can be reduced by selecting a few polygon nets and calculating the B-spline surface.

Example 2. Sweeping of a square along a trajectory with deformation to a circle. Here the deformation is only the variation of the weights. The circle is represented as a rational B-spline curve. The polygon net is a square with nine points, the first and last being the same, and the weights of the corners vary from 5 to √2/2 as the profile is swept along the trajectory C, given in parametric form as x = 10s and y = cos(πs) − 1. The sweep transformation is given by

[SwT] = [  cos ψ       sin ψ  0  0
          −sin ψ       cos ψ  0  0
           0           0      1  0
           10s  cos(πs) − 1   0  1 ],   where ψ = tan⁻¹( −π sin(πs) / 10 ).

Figure 3 shows the sweeping of a square along a trajectory with deformation to a circle.
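The deformation pipeline [Sd] = [Ssw][Rθ] followed by a translation can be sketched with NumPy homogeneous matrices in the chapter's row-vector convention (the helper names and the sample point are illustrative):

```python
import numpy as np

def scale_weight(sx, sy, sz, w):
    """[Ssw]: per-point scaling, with the weight in the homogeneous slot."""
    return np.diag([sx, sy, sz, w]).astype(float)

def rot_z(theta):
    """[Rtheta]: twist about the z-axis, row-vector convention as in the text."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ c,  s, 0, 0],
                     [-s,  c, 0, 0],
                     [ 0,  0, 1, 0],
                     [ 0,  0, 0, 1]], dtype=float)

def translate(tx, ty, tz):
    """[Cxyz]-style translation for row vectors: offsets in the last row."""
    m = np.eye(4)
    m[3, :3] = [tx, ty, tz]
    return m

# Deform one profile point (x, y, z, h), then move it to a trajectory point:
# scale by 2, twist by 90 degrees, translate by 5 along x.
p = np.array([1.0, 0.0, 0.0, 1.0])
Sd = scale_weight(2.0, 2.0, 1.0, 1.0) @ rot_z(np.pi / 2)   # [Sd] = [Ssw][Rtheta]
q = p @ Sd @ translate(5.0, 0.0, 0.0)
```

Because row vectors multiply matrices on the right, the matrices compose left to right in the same order the text applies them.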
IV. Image Enhancement

Because they adapt to local properties of the image, general sweep morphological operations can provide varying degrees of smoothing for noise removal while preserving the object features. Statistical analyses of traditional morphological operations have been reported. Stevenson and Arce (1987) developed the output distribution function of opening with flat structuring elements by threshold decomposition. Morales and Acharya (1993) presented general solutions for the statistical analysis of morphological openings with compact, convex, and homothetic structuring elements. Traditional opening removes noise as well as object features whose sizes are smaller than the structuring element. With the general sweep morphological opening, object features of similar shape and greater size compared to the structuring element are preserved while noise is removed. In general, smaller structuring elements are assigned to the highly varying parts of the image and larger ones to the slowly varying parts. The structuring elements can be assigned based on the contour gradient variation. An example is illustrated in Figure 4.
A step edge is an important feature in an image. Assume a noisy step edge is defined as

f(x) = Nx if x < 0;   f(x) = h + Nx if x ≥ 0,
where h is the strength of the edge and Nx is i.i.d. Gaussian random noise with mean value 0 and variance 1. For image filtering with a general sweep morphological opening, we essentially adopt a smaller sized structuring element for the important feature points and a larger sized one for other
Figure 4. Structuring element assignment using general sweep morphology.
locations. Therefore, the noise in an image is removed while the features are preserved. For instance, in a one-dimensional image, this can easily be achieved by computing the gradient f(x) − f(x − 1) and assigning smaller structuring elements to those points where the gradient exceeds a predefined threshold. In Chen et al. (1993), the results of noisy step-edge filtering by both the traditional morphological opening and the so-called space-varying opening (involving both scaling and translation in our general sweep morphology model) were compared by computing the mean and variance of the output signals. The mean value of the output distribution follows the main shape of the filtering result well, which gives evidence of the shape-preserving ability of the proposed operation. Meanwhile, the variance of the output distribution coincides with the noise variance, which shows the corresponding noise-removing ability. It is observed that the general sweep opening possesses approximately the same noise-removing ability as the traditional one. Moreover, the relative edge strength with respect to the variation over the transition interval, say [−2, 2], is larger for the general sweep opening than for the traditional one. This explains why the edge is degraded in the traditional morphology case but enhanced in the general sweep case. Although a step-edge model was tested successfully, other more complicated cases need further elaboration. Statistical analysis providing a quantitative approach to general sweep morphological operations will be investigated further. Chen et al. (1999) have shown image filtering using adaptive signal processing, which is simply sweep morphology with only scaling and translation. The method uses space-varying structuring elements by
assigning different filtering scales to the feature parts and other parts. To adaptively assign structuring elements, they have developed the progressive umbra-filling (PUF) procedure. This is an iterative process. The experimental results have shown that this approach can successfully eliminate noise without oversmoothing the important features of a signal.
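The idea of a space-varying opening — small windows at strong gradients, large windows elsewhere — can be sketched for a 1D signal as follows. This is a loose illustration of the principle, not the PUF procedure; the window sizes and threshold are arbitrary choices:

```python
import numpy as np

def flat_erode(f, k):
    """Flat erosion of a 1D signal with a centered window of half-width k."""
    return np.array([f[max(i - k, 0):i + k + 1].min() for i in range(len(f))])

def flat_dilate(f, k):
    """Flat dilation with the same window."""
    return np.array([f[max(i - k, 0):i + k + 1].max() for i in range(len(f))])

def adaptive_opening(f, grad_thresh, k_small=1, k_large=4):
    """Space-varying opening: small windows near strong gradients (edges),
    large windows elsewhere, so noise is removed without eroding the step."""
    grad = np.abs(np.diff(f, prepend=f[0]))
    k = np.where(grad > grad_thresh, k_small, k_large)
    eroded = np.array([flat_erode(f, k[i])[i] for i in range(len(f))])
    return np.array([flat_dilate(eroded, k[i])[i] for i in range(len(f))])

# A noisy step edge: strength h = 5, noise sigma = 0.3.
rng = np.random.default_rng(0)
f = np.where(np.arange(40) < 20, 0.0, 5.0) + 0.3 * rng.standard_normal(40)
g = adaptive_opening(f, grad_thresh=1.0)
```

A fixed large window would round off the step itself; the per-pixel window keeps the transition sharp while flattening the noise on both plateaus.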
V. Edge Linking

An edge is a local property of a pixel and its immediate neighborhood. An edge detector is a local process that locates sharp changes in the intensity function. An ideal edge has a step-like cross section, as gray levels change abruptly across the border. In practice, edges in digital images are generally slightly blurred by the effects of sampling and noise. There are many edge detection algorithms, and the basic idea underlying most of them is the computation of a local derivative operator (Gonzalez and Woods, 2002). Some algorithms, like the LoG filter, produce closed edges; however, false edges are generated when blur and noise appear in an image. Other algorithms, like the Sobel operator, produce noisy boundaries that do not actually lie on the borders, and broken gaps where border pixels should reside. That is because noise and breaks are present in the boundary from nonuniform illumination and other effects that introduce spurious intensity discontinuities. Thus, edge detection algorithms are typically followed by linking and other boundary detection procedures designed to assemble edge pixels into meaningful boundaries.

Edge linking by the tree search technique was proposed by Martelli (1976) to link the edge sequentially along the boundary between pixels. The cost of each boundary element is defined by the step size between the pixels on both of its sides. A larger intensity difference corresponds to a larger step size, which is assigned a lower cost. The path of boundary elements with the lowest cost is linked up as an edge. The cost function was later redefined by Cooper et al. (1980), where the edge is extended through the path having a maximal local likelihood. Similar efforts were made by Eichel et al. (1988) and by Farag and Delp (1995). Basically, the tree search method is time consuming and requires a suitable assignment of root points.
Another method locates all of the end points of the broken edges and uses a relaxation method to pair them up, so that line direction is maintained, lines are not allowed to cross, and closer points are matched first. However, this causes problems if unmatched end points or noise are present. A simple approach to edge linking is a morphological dilation of points by circles of some arbitrarily selected radius, followed by the OR operation of
Figure 5. The elliptic structuring element.
the boundary image with the resulting dilated circles; the result is finally skeletonized (Russ, 1992). This method, however, has a problem in that some of the points may be too far apart for the circles to touch, while the circles may obscure details by touching several existing lines. To overcome this, sweep mathematical morphology is used to allow the structuring element to vary according to local properties of the input pixels.

A. Edge Linking Using Sweep Morphology

Let B denote the elliptic structuring element shown in Figure 5, where p and q denote, respectively, the semimajor and semiminor axes. That is,

∂B ≡ { [x, y]ᵀ | x²/p² + y²/q² = 1 }.
An edge-linking algorithm was proposed by Shih and Cheng (2004) based on the sweep dilation, thinning, and pruning. This is a three-step process as explained below. Step 1: Sweep Dilation. The broken line segments can be linked up by using the sweep morphology provided that the structuring element is suitably adjusted. Considering the input signal plotted in Figure 6a, the concept of using the sweep morphological dilation is illustrated in Figure 6b. Extending the line segments in the direction of the local slope performs the linking. The basic shape of the structuring element is an ellipse, where the major axis is always aligned with the tangent of the signal. The elliptical structuring element reduces noisy edge points and small insignificant branches. The width of the ellipse is selected to accomplish this purpose. The major axis of the ellipse should be adapted to the local curvature of the input signal to protect it from overstretch at high curvature points. At high curvature points, a short major axis is selected and vice versa. Step 2: Thinning. After performing the sweep dilation by directional ellipses, the edge segments are extended in the direction of the local slope.
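The curvature-adaptive choice of the semimajor axis in Step 1 can be sketched as follows (the turning angle is used as a crude curvature proxy, and the axis bounds are illustrative):

```python
import numpy as np

def turn_angle(p0, p1, p2):
    """Absolute turning angle at p1 — a crude local-curvature proxy."""
    a = np.arctan2(p1[1] - p0[1], p1[0] - p0[0])
    b = np.arctan2(p2[1] - p1[1], p2[0] - p1[0])
    d = abs(b - a)
    return min(d, 2 * np.pi - d)

def major_axis_length(p0, p1, p2, p_max=8, p_min=2):
    """Shrink the ellipse's semimajor axis as the local curvature grows:
    a long axis on straight runs, a short axis at sharp turns."""
    t = turn_angle(p0, p1, p2) / np.pi      # 0 (straight) .. 1 (reversal)
    return max(p_min, round(p_max * (1.0 - t)))
```

On a straight segment the full axis `p_max` is used; at a right-angle corner the axis is halved, which is the "short major axis at high curvature" rule of Step 1.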
Figure 6. (a) Input signal and (b) sweep dilation with elliptical structuring elements.
Figure 7. (a) Original elliptical edge and (b) its randomly discontinuous edge.
Because the tolerance (or the minor axis of the ellipse) is added, the edge segments grow a little thick. To suppress this effect, morphological thinning is adopted. An algorithm of thinning using mathematical morphology was proposed by Jang and Chin (1990). The skeletons generated by their algorithm are connected, one pixel wide, and closely follow the medial axes. The algorithm is an iterative process based on the hit/miss operation. Four structuring elements
Figure 8. Using circular structuring elements in five iterations with (a) r = 3, (b) r = 5, and (c) r = 10.
are constructed to remove boundary pixels from four directions, and another four are constructed to remove the extra pixels at skeleton junctions. There are four passes in each iteration. Three of the eight predefined structuring element
Figure 9. Using the sweep morphological edge-linking algorithm.
templates are applied simultaneously in each pass. The iterative process is performed until the result converges. The thinning algorithm does not shorten the skeletal legs; therefore, it is applied to the sweep-dilated edges.

Step 3: Pruning. The dilated edge segments after thinning may still produce a small number of short skeletal branches, which should be pruned. In a skeleton, any pixel that has three or more neighbors is called a root. Starting from each neighbor of the root pixel, the skeleton is traced outward. Those paths whose lengths are shorter than a given threshold k are treated as branches and are pruned away.

Figure 7a shows an original elliptical edge and Figure 7b shows its randomly discontinuous edge. The sweep morphological edge-linking algorithm is tested on the edge in Figure 7b. Figure 8 shows the results of using circular structuring elements with radius r = 3, r = 5, and r = 10, respectively, in five iterations. Compared with the original ellipse in Figure 7a, we see that if the gap is larger than the radius of the structuring element, it is difficult to link the gap smoothly. However, if a very large circular structuring element is used, the edge will look hollow and protuberant, and a big circle can obscure the details of the edge. Figure 9 shows the result of using the sweep morphological edge-linking algorithm. Figure 10a shows the edge of an industrial part and Figure 10b shows its randomly discontinuous edge. Figure 10c shows the result of using the sweep morphological edge-linking algorithm. Figure 11a shows the edge
Figure 10. (a) The edge of an industrial part, (b) its randomly discontinuous edge, and (c) using the sweep morphological edge-linking algorithm.
with added uniform noise and Figure 11b shows the edge after removing noise. Figure 11c shows the result of using the sweep morphological edgelinking algorithm. Figure 12a shows a face image with the originally detected broken edge. Figure 12b shows the face image with the edge linked by the sweep morphological edge-linking algorithm.
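The pruning of Step 3 can be sketched on a skeleton stored as a set of pixels (the root and branch definitions follow the text; the tracing logic is an illustrative implementation):

```python
def prune(skeleton, k):
    """Prune skeleton branches shorter than k pixels.  `skeleton` is a set of
    (y, x) pixels; a root is a pixel with three or more 8-neighbors."""
    def neighbors(p):
        y, x = p
        return [(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy or dx) and (y + dy, x + dx) in skeleton]

    roots = [p for p in skeleton if len(neighbors(p)) >= 3]
    keep = set(skeleton)
    for r in roots:
        for start in neighbors(r):
            # Trace outward from the root along this branch.
            path, prev, cur = [start], r, start
            while True:
                nxt = [q for q in neighbors(cur) if q != prev]
                if len(nxt) != 1:            # endpoint, junction, or loop
                    break
                prev, cur = cur, nxt[0]
                path.append(cur)
            if len(neighbors(cur)) == 1 and len(path) < k:   # short dead end
                keep -= set(path)
    return keep

# A horizontal line with a short stub; the stub's free tip is pruned away.
skel = {(0, x) for x in range(9)} | {(1, 4), (2, 4)}
pruned = prune(skel, 3)
```

Paths that end at another junction (rather than a free endpoint) are left alone, so the main closed contour survives the pruning.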
Figure 11. (a) Part edge with added uniform noise, (b) part edge after removing noise, and (c) using the sweep morphological edge-linking algorithm.
VI. Shortest Path Planning for Mobile Robot
The recent advances in the fields of robotics and artificial intelligence have stimulated considerable interest in robot motion planning and the shortest path-finding problem (Latombe, 1991). The path planning is in general concerned with finding paths connecting different locations in an environment (e.g., a network, a graph, or a geometric space). Depending on the specific
Figure 12. (a) Face image with the originally detected broken edge and (b) face image with the edge linked by the sweep morphological edge-linking algorithm.
Figure 13. Shortest path of an H-shaped car by using the sweep (rotational) morphology.
applications, the desired paths often need to satisfy some constraints (e.g., obstacle avoiding) and optimize certain criteria (e.g., variant distance metrics and cost functions). The problems of planning shortest paths arise in many disciplines, and in fact these constitute one of the most powerful tools for modeling combinatorial optimization problems.
In the path planning problem, a mobile robot of arbitrary shape moves from a starting position to a destination in a finite space containing arbitrarily shaped obstacles. When traditional mathematical morphology is applied to solve the problem, its drawback is the fixed orientation of the structuring element (i.e., the robot), so the resulting path is no longer optimal in real-world applications (Lin and Chang, 1993). Incorporating rotation into the motion of the moving object gives more realistic solutions to the shortest path finding problem. The shortest path finding problem is equivalent to applying a sweep (rotational) morphological erosion to the free space, followed by a distance transformation (Shih and Wu, 2004) on the domain with the grown obstacles excluded, and then tracing back the distance map from the destination point to the neighbors with the minimum distance until the starting point is reached (Pei et al., 1998). An example showing the shortest path of an H-shaped car obtained by the sweep (rotational) morphology is given in Figure 13.
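The distance-transform-and-backtrace portion of this procedure can be sketched on a grid (here the robot is reduced to a single cell, i.e., the sweep erosion that grows the obstacles is assumed to have been applied already; the map is illustrative):

```python
from collections import deque

def shortest_path(free, start, goal):
    """Distance-transform path planning on a set of free grid cells (y, x).
    In the full method the free space is first eroded by the robot shape at
    each allowed orientation; here the robot occupies a single cell."""
    dist = {start: 0}
    dq = deque([start])
    while dq:                                   # BFS = distance transform
        y, x = dq.popleft()
        for q in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if q in free and q not in dist:
                dist[q] = dist[(y, x)] + 1
                dq.append(q)
    if goal not in dist:
        return None
    path = [goal]                     # backtrace toward minimum distance
    while path[-1] != start:
        y, x = path[-1]
        path.append(min((q for q in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                         if q in dist), key=dist.get))
    return path[::-1]

# A 5x5 grid with a wall in column 2, open only at the bottom row.
free = {(y, x) for y in range(5) for x in range(5)} - {(y, 2) for y in range(4)}
path = shortest_path(free, (0, 0), (0, 4))
```

Each cell's BFS label is its distance to the start, so stepping from the goal to any minimum-labeled neighbor always reaches the start along a shortest route.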
VII. Geometric Modeling and Sweep Mathematical Morphology

The dilation can be represented in matrix form as follows. Let A(t) be represented by the matrix [ax(t), ay(t), az(t)], where 0 ≤ t ≤ 1. For every t, let the scaling factors be sx(t), sy(t), sz(t) and the rotation factors be θx(t), θy(t), θz(t). Using homogeneous coordinates, the scaling transformation matrix can be represented as

[S(t)] = [ sx(t)  0      0      0
           0      sy(t)  0      0
           0      0      sz(t)  0
           0      0      0      1 ].

The rotation matrix about the x-axis is represented as

[Rx(t)] = [ 1  0           0          0
            0  cos θx(t)   sin θx(t)  0
            0  −sin θx(t)  cos θx(t)  0
            0  0           0          1 ],

where

cos θx(t) = (ay(t−1) − ay(t+1)) / hx,   sin θx(t) = (az(t+1) − az(t−1)) / hx,

hx = √( (ay(t−1) − ay(t+1))² + (az(t+1) − az(t−1))² ).
The rotation matrices about the y- and z-axes can be derived similarly. Finally, the structuring element is translated by using

[A(t)] = [ 1      0      0      0
           0      1      0      0
           0      0      1      0
           ax(t)  ay(t)  az(t)  1 ].

Therefore, the sweep dilation is equivalent to the concatenated transformation matrices

(A ⊞ B)(t) = [B][S(t)][Rx(t)][Ry(t)][Rz(t)][A(t)],   where 0 ≤ t ≤ 1.
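The tangent-dependent rotation Rx(t) can be built directly from the finite differences given above (row-vector convention as in the text; the function name and sample trajectory are illustrative, and the degenerate case hy = hz = 0 is not handled):

```python
import numpy as np

def x_rotation_from_tangent(a_prev, a_next):
    """Rotation about the x-axis from the trajectory's finite differences:
    cos = (a_y(t-1) - a_y(t+1)) / h_x,  sin = (a_z(t+1) - a_z(t-1)) / h_x."""
    hy = a_prev[1] - a_next[1]
    hz = a_next[2] - a_prev[2]
    h = np.hypot(hy, hz)                 # degenerate h == 0 is not handled
    c, s = hy / h, hz / h
    return np.array([[1.0, 0,  0, 0],
                     [0,   c,  s, 0],
                     [0,  -s,  c, 0],
                     [0,   0,  0, 1]])

# A trajectory step along +z rotates the profile's y-axis onto the z-axis.
R = x_rotation_from_tangent((0.0, 0.0, 0.0), (0.0, 0.0, 2.0))
q = np.array([0.0, 1.0, 0.0, 1.0]) @ R
```

Composing this with the analogous y- and z-rotations and the translation matrix gives the concatenated transformation stated above.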
Schemes based on sweep representation are useful in creating solid models of two-and-a-half-dimensional objects, which include both solids of uniform thickness in a given direction and axis-symmetric solids. Computer representation of the swept volume of a planar surface has been used as a primary modeling scheme in solid modeling systems (Shih and Cheng, 2004). Representation of the swept volume of a 3D object (Stevenson and Arce, 1987), however, has received limited attention. Leu et al. (1986) presented a method for representing the swept volumes of translating objects using boundary representation and ray in–out classification; their method is restricted to translation only. Representing the swept volumes of moving objects under a general motion is a more complex problem. A number of researchers have examined the problem of computing swept volumes, including Korein (1985) for rotating polyhedra, Kaul (1993) using Minkowski sums for translation, Wang and Wang (1986) using envelope theory, and Martin and Stephenson (1990) using envelope theory and computer algebraic techniques. In this chapter, geometric modeling based on sweep morphology is proposed. Because of the morphological operators' geometric nature and nonlinear property, some modeling problems become simple and intuitive. This framework can be used not only for modeling swept surfaces and volumes but also for tolerance modeling in manufacturing.

A. Tolerance Expression

Tolerances constrain an object's features to lie within regions of space called tolerance zones (Requicha, 1984). Tolerance zones in Rossignac and Requicha (1985) were constructed by expanding the nominal feature to obtain the region bounded by the outer closed curve, shrinking the nominal feature to obtain the region bounded by the inner curve, and then subtracting the two resulting regions. This procedure is equivalent to the morphological dilation of the offset inner contour with a tolerance-radius disk structuring element.
Figure 14. Tolerance zones. (a) An annular tolerance zone that corresponds to a circular hole. (b) A tolerance zone for an elongated slot.
Figure 15. (a, b) An example of adding tolerance by a morphological dilation.
Figure 14a shows an annular tolerance zone that corresponds to a circular hole, and Figure 14b shows a tolerance zone for an elongated slot. Both can be constructed by dilating the nominal contour with a tolerance-radius disk structuring element, as shown in Figure 15. The tolerance zone for testing the size of a round hole is an annular region lying between two circles with the specified maximal and minimal diameters; the zone corresponding to a form constraint for the hole is also an annulus, defined by two concentric circles whose diameters must differ by a specified amount but are otherwise arbitrary (Shih et al., 1994). Sweep mathematical morphology supports the conventional limit (±) tolerances on dimensions that appear in engineering drawings: the positive deviation is equivalent to the dilated result and the negative deviation to the eroded result. Industrial parts with added tolerance information can thus be expressed using a dilation with a circle.
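For a circular hole, dilating the nominal circle of radius R by a disk of radius equal to the tolerance yields exactly the annulus R − tol ≤ |p| ≤ R + tol, which gives a one-line membership test (names and values are illustrative):

```python
import math

def in_tolerance_zone(p, nominal_radius, tol):
    """Membership in the annular tolerance zone of a circular hole centered
    at the origin: the dilation of the nominal circle by a disk of radius
    `tol` is exactly the annulus  R - tol <= |p| <= R + tol."""
    return abs(math.hypot(p[0], p[1]) - nominal_radius) <= tol
```

The same test split into its two halves recovers the limit tolerances: |p| ≤ R + tol is the dilated (positive-deviation) bound, |p| ≥ R − tol the eroded (negative-deviation) bound.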
Figure 16. Modeling of sweep surface.
B. Sweep Surface Modeling

The simplest sweep surface is generated by a profile sweeping along a spine, with or without deformation. This is simply the sweep morphological dilation of the two curves. Let P(u) be the profile curve, B(w) be the spine, and S(u, w) be the sweep surface. The sweep surface can be expressed as

S(u, w) = P(u) ⊞ B(w).

A sweep surface with initial and final profiles P1(u) and P2(u) at relative locations O1 and O2, respectively, and with the sweeping rule R(w), is shown in Figure 16 and can be expressed as

S(u, w) = [1 − R(w)][P1(u) ⊞ B(w) − O1] + R(w)[P2(u) ⊞ B(w) − O2].
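The blending rule can be sketched numerically for sampled profiles along a straight spine (the profiles, spine, and R(w) = w are illustrative, and the sweep dilation is reduced to a translation to each spine point):

```python
import numpy as np

def blended_sweep(P1, P2, spine, R):
    """At spine parameter w the cross section is (1 - R(w)) P1 + R(w) P2,
    translated to the spine point (the sweep reduced to pure translation)."""
    sections = []
    for j, s in enumerate(spine):
        w = j / (len(spine) - 1)
        sections.append((1 - R(w)) * P1 + R(w) * P2 + np.asarray(s, dtype=float))
    return np.array(sections)

# Eight samples of a square profile morphing into eight samples of a circle.
t = np.linspace(0, 2 * np.pi, 8, endpoint=False)
P2 = np.c_[np.cos(t), np.sin(t), np.zeros(8)]                            # circle
P1 = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [-1, 1, 0],
               [-1, 0, 0], [-1, -1, 0], [0, -1, 0], [1, -1, 0]], float)  # square
spine = [(0, 0, z) for z in range(5)]
surf = blended_sweep(P1, P2, spine, R=lambda w: w)
```

The first cross section is the initial profile and the last is the final profile; any monotone R(w) between 0 and 1 controls how quickly the blend happens along the spine.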
VIII. Formal Language and Sweep Morphology

Our representation framework is formulated as follows. Let E^N denote the set of points in the N-dimensional Euclidean space and p = (x1, x2, ..., xN) represent a point in E^N. In this way, any object is a subset of E^N. The formal model for geometric modeling is a context-free grammar G, consisting of a four-tuple (Fu, 1982; Ghosh, 1988; Shih, 1991): G = (VN, VT, P, S), where VN is a set of nonterminal symbols, such as complicated shapes; VT is a set of terminal symbols that contains two subsets: the decomposed primitive shapes, such as lines and circles, and the shape operators; P is a finite set of rewriting rules or productions denoted by A → β, where A ∈ VN and β is a string over VN ∪ VT; and S is the start symbol, which represents
SHIH
the solid object. The operators used include sweep morphological dilation, set union, and set subtraction. Note that such a production allows the nonterminal A to be replaced by the string β independent of the context in which A appears. A context-free grammar is defined by derivation rules of the form A → β, where A is a single nonterminal and β is a nonempty string of terminals and nonterminals. The languages generated by context-free grammars are called context-free languages. Object representation can be viewed as the task of converting a solid shape into a sentence in the language, whereas object classification is the task of "parsing" a sentence. The criteria for primitive selection are influenced by the nature of the data, the specific application in question, and the technology available for implementing the system. The following serves as a general guideline for primitive selection.

1. The primitives should be the basic shape elements that provide a compact but adequate description of the object shape in terms of the specified structural relations (e.g., the concatenation relation).

2. The primitives should be easily extractable by existing nonsyntactic (e.g., decision-theoretic) methods, since they are considered to be simple and compact shapes whose internal structural information is not important.
IX. REPRESENTATION SCHEME

In this section, we describe the 2D and 3D attributes in our representation scheme.

A. Two-Dimensional Attributes

Commonly used 2D attributes are the rectangle, parallelogram, triangle, rhombus, circle, and trapezoid. They can be represented easily using the sweep morphological operators. The expressions are not unique; the preferred expression is the simplest combination with the least computational complexity. The common method is to decompose the attributes into smaller components and apply morphological dilations to grow these components. Let a and b represent unit vectors along the x- and y-axes, respectively. The unit vector could represent 1 m, 0.1 m, 0.01 m, and so on, as needed. Note that when the sweep dilation is not associated with rotation and scaling, it is equivalent to the traditional dilation.

a. Rectangle: It is represented as a unit x-axis vector a swept along a unit y-axis vector b, that is, b ⊞ a, with no rotation or scaling.
Figure 17. (a, b) The decomposition of two-dimensional attributes.
b. Parallelogram: Let k denote the vector sum of a and b as defined in item a. It is represented as k ⊞ a, with no rotation or scaling.

c. Circle: Using a sweep rotation, a circle can be represented as a unit vector a swept about a point p through an angle of 2π, that is, p ⊞ a.

d. Trapezoid: b ⊞ a, with a linear scaling factor that changes the magnitude of a into c as it is swept along b, as shown in Figure 17a. For 0 ≤ t ≤ 1, the scaling factor along the trajectory b is S(t) = (c/a)t + (1 − t).

e. Triangle: b ⊞ a, similar to a trapezoid but with a linear scaling factor that changes the magnitude of a to zero as it is swept along b, as shown in Figure 17b. Note that the shape of the triangle (e.g., equilateral or right) is determined by the fixed location of the reference point (i.e., the origin) of the primitive line a.
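Items d and e can be illustrated numerically. The sketch below (function name and sampling density invented for illustration) sweeps a segment of length a along b while applying the scaling rule S(t) = (c/a)t + (1 − t):

```python
import numpy as np

# Sketch of the scaled sweep of item (d): a horizontal segment of length
# a_len is swept along the vertical direction while its length is scaled
# by S(t) = (c/a) t + (1 - t). Sampling the swept points fills a
# trapezoid; c = 0 degenerates it to a triangle (item (e)).

def swept_region(a_len, c_len, b_len, n=50):
    """Return sample points (x, y) of b ⊞ a with scaling factor S(t)."""
    pts = []
    for t in np.linspace(0.0, 1.0, n):
        scale = (c_len / a_len) * t + (1.0 - t)      # S(t)
        for s in np.linspace(0.0, 1.0, n):
            pts.append((s * a_len * scale, t * b_len))
    return np.array(pts)

trap = swept_region(a_len=2.0, c_len=1.0, b_len=1.0)
tri = swept_region(a_len=2.0, c_len=0.0, b_len=1.0)  # segment scales to 0
print(trap[:, 0].max())   # widest row spans 2.0; the apex collapses to x = 0
```

Fixing the segment's reference point at its left end, as here, yields a right trapezoid or right triangle; other choices of the reference point give the other shapes noted in item e.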
B. Three-Dimensional Attributes

The 3D attributes can be represented by a similar method. Let a, b, c denote unit vectors along the x-, y-, and z-axes, respectively. The formal expressions are presented below.

a. Parallelepiped: A unit vector a is swept along a unit vector b to obtain a rectangle, which is then swept along a unit vector c to obtain the parallelepiped, that is, c ⊞ (b ⊞ a).

b. Cylinder: A unit vector a is swept about a point p through an angle of 2π to obtain a circle, which is then swept along a unit vector c to obtain the cylinder, that is, c ⊞ (p ⊞ a).
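A small numeric sketch of items a and b follows (helper names invented): the rotational sweep generates the circle, and a translational sweep extrudes it into the cylinder:

```python
import numpy as np

# Sketch of c ⊞ (p ⊞ a): a point set for the circle p ⊞ a is generated
# by rotating the vector a about p through 2*pi, and the cylinder is
# obtained by sweeping (translating) that circle along the vector c.

def circle(p, r, n=64):
    th = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.c_[p[0] + r * np.cos(th), p[1] + r * np.sin(th),
                 np.full(n, p[2])]

def sweep_along(profile, direction, steps=10):
    """Translate the profile along `direction` (traditional dilation)."""
    d = np.asarray(direction, float)
    return np.vstack([profile + t * d for t in np.linspace(0.0, 1.0, steps)])

cyl = sweep_along(circle(p=(0, 0, 0), r=1.0), direction=(0, 0, 3.0))
print(cyl.shape)   # (640, 3): 64 circle samples at each of 10 heights
```

Nesting the two sweeps in the other order, or replacing the rotational sweep by a second translation, gives the parallelepiped c ⊞ (b ⊞ a) of item a.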
Figure 18. Sweep dilation of a rectangle with a corner truncated by a circle.
c. Parallelepiped with a corner truncated by a sphere: A unit vector a is swept along a unit vector b to obtain a rectangle. A vector r is swept about a point p through an angle of 2π to obtain a circle, and then it is subtracted
Figure 19. Sweeping of a square along a trajectory with deformation to a circle.
from the rectangle. The result is swept along a unit vector c, that is, c ⊞ [(b ⊞ a) − (p ⊞ r)], as shown in Figure 18.

d. Sweep dilation of a square along a trajectory with deformation to a circle: The square is represented as a rational B-spline curve. The polygon net is specified by a square with nine points, the first and the last being the same, and the weights of the corner points vary from 5 to √2/2 as the square is swept along the trajectory C, defined in parametric form as x = 10s and y = cos(πs) − 1. The sweep transformation is given by

[SwT] = ⎡  cos ψ        sin ψ        0   0 ⎤
        ⎢ −sin ψ        cos ψ        0   0 ⎥
        ⎢  0            0            1   0 ⎥
        ⎣  10s          cos(πs) − 1  0   1 ⎦ ,

where ψ = tan⁻¹[−π sin(πs)/10].
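The transformation can be evaluated numerically; the sketch below (function name invented for illustration) builds [SwT] for the trajectory x = 10s, y = cos(πs) − 1 and applies it to a homogeneous row vector:

```python
import numpy as np

# Sketch of the sweep transformation for the trajectory
# x = 10 s, y = cos(pi s) - 1: the profile is rotated by the angle
# psi = atan(-pi sin(pi s) / 10) of the trajectory tangent and
# translated to the trajectory point (homogeneous row-vector form).

def sweep_transform(s):
    psi = np.arctan2(-np.pi * np.sin(np.pi * s), 10.0)  # tangent angle
    c, si = np.cos(psi), np.sin(psi)
    return np.array([
        [c,        si,                      0.0, 0.0],
        [-si,      c,                       0.0, 0.0],
        [0.0,      0.0,                     1.0, 0.0],
        [10.0 * s, np.cos(np.pi * s) - 1.0, 0.0, 1.0],
    ])

p = np.array([0.0, 0.0, 0.0, 1.0])   # profile origin, homogeneous form
q = p @ sweep_transform(0.5)         # lands on the trajectory point (5, -1)
print(np.round(q, 6))
```

Each parameter value s thus rotates and translates one cross-section; sampling s over [0, 1] traces out the deformed sweep of Figure 19.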
The formal expression is C ⊞ B. The sweeping of a square along a trajectory with deformation to a circle is shown in Figure 19.

e. Parallelepiped with a cylindrical hole: A unit vector a is swept along a unit vector b to obtain a rectangle. A vector r is swept about a point p through an angle of 2π to obtain a circle, and it is subtracted from the rectangle. The
Figure 20. Sweep dilation of a rectangle with a circular hole.
result is swept along a unit vector c, that is, c ⊞ [(b ⊞ a) − (p ⊞ r)], as shown in Figure 20.

f. U-shape block: A unit vector a is swept along a unit vector b to obtain a rectangle. A vector r is swept about a point p through an angle of π to obtain a half circle, which is dilated along the rectangle to obtain a rectangle with two rounded corners; this is then subtracted from another rectangle to obtain a U-shaped 2D object. The result is swept along a unit vector c to obtain the final U-shaped object, that is, c ⊞ {(b′ ⊞ a′) − [(b ⊞ a) ⊞ (p ⊞ r)]}, as shown in Figure 21.

Note that the proposed sweep mathematical morphology model can be applied to the NC machining process. For example, the ball-end milling cutter can be viewed as the structuring element, and it can be moved along a predefined path to cut a workpiece. During the movement, the cutter can
Figure 21. Machining with a round-bottom tool.
be rotated to remain perpendicular to the sweep path. If the swept volume is subtracted from the workpiece, the remaining part is obtained.
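To make the machining analogy concrete, the following 2D sketch (invented for illustration; grid size, tool radius, and path are arbitrary, and this is not an actual NC simulator) subtracts a disk-swept tool path from a workpiece mask:

```python
import numpy as np

# Illustrative 2D sketch of the NC-machining view above: the cutter is a
# disk structuring element swept along a tool path; the swept area (a
# sweep dilation of the path) is subtracted from the workpiece mask to
# leave the machined part.

H, W, r = 40, 80, 4
yy, xx = np.mgrid[0:H, 0:W]
work = np.ones((H, W), bool)                 # solid workpiece

path = [(20, x) for x in range(10, 70)]      # straight tool path
swept = np.zeros_like(work)
for cy, cx in path:                          # dilate the path with the disk
    swept |= (yy - cy) ** 2 + (xx - cx) ** 2 <= r * r

part = work & ~swept                         # remaining material
print(part[20, 40], part[5, 40])             # False True: slot cut, rest kept
```

The same set subtraction of a swept volume from the stock, carried out in 3D with a ball-end structuring element, is what the text describes.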
X. GRAMMARS

In this section, we describe grammars for the 2D and 3D attributes. We have experimented with many geometric objects; the results show that our model works successfully.
A. Two-Dimensional Attributes

All the primitive 2D objects can be represented by the following grammar using the sweep mathematical morphology model: G = (VN , VT , P , S), where VN = {S, A, B, K},
VT = {a, b, k, p, ⊞},
P : S → B ⊞ A | B ⊞ K | p ⊞ A, A → aA | a,
B → bB | b,
K → kK | k.
The sweep dilation ⊞ can be ⊕ (S = 0, θ = 0), ⊕ [S = (c/a)t + (1 − t), θ = 0], or ⊕ (S = 0, θ = 2π). Note that the repetition of a unit vector in a generated string is the usual method of grammatical representation. We can shorten a string by adopting a repetition symbol "*"; for example, "*5a" denotes "aaaaa."

a. Rectangle can be represented by the string bb ⊞ aa, with the a's and b's repeated any number of times depending on the required size.

b. Parallelogram can be represented by the string kk ⊞ aaa, with the a's and k's repeated any number of times depending on the required size.

c. Circle can be represented by the string p ⊞ aaa, with the a's repeated any number of times depending on the required size and with ⊞ as ⊕ (S = 0, θ = 2π).

d. Trapezoid can be represented by the string bb ⊞ aa, with the a's and b's repeated any number of times depending on the required size and with ⊞ as ⊕ [S = (c/a)t + (1 − t), θ = 0].

e. Triangle can be represented by the string bb ⊞ aa, with the a's and b's repeated any number of times depending on the required size and with ⊞ as ⊕ [S = (1 − t), θ = 0].

B. Three-Dimensional Attributes

All the primitive 3D objects can be categorized into the following grammar: G = (VN , VT , P , S),
where VN = {S, A, B, C},
VT = {a, b, c, p, (, ), ⊞},
P : S → C ⊞ (B ⊞ A) | C ⊞ (p ⊞ A), A → aA | a,
B → bB | b,
C → cC | c.
The sweep dilation ⊞ can be either ⊕ (S = 0, θ = 0) or ⊕ (S = 0, θ = 2π ).
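The "*" repetition shorthand introduced in Section X.A can be expanded mechanically before a string is handed to a parser; a minimal sketch (function name invented for illustration):

```python
import re

# Sketch of the repetition shorthand: "*5a" abbreviates "aaaaa", so a
# rectangle string such as "bb⊞aa" can be written compactly as "*2b⊞*2a".

def expand(s):
    """Expand every '*<count><symbol>' run into the repeated symbol."""
    return re.sub(r"\*(\d+)(\w)", lambda m: m.group(2) * int(m.group(1)), s)

print(expand("*5a"))        # aaaaa
print(expand("*2b⊞*3a"))    # bb⊞aaa
```

After expansion, the string contains only the terminals of VT and can be parsed directly by the grammars of this section.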
a. Parallelepiped can be represented by the string ccc ⊞ (bb ⊞ aaa), with the a's, b's, and c's repeated any number of times depending on the required size.

b. Cylinder can be represented by the string cccc ⊞ (p ⊞ aaa), with the a's and c's repeated any number of times depending on the required size, with the first dilation operator ⊞ as ⊕ (S = 0, θ = 2π) and the second dilation as the traditional dilation.

c. Consider the grammar G = (VN , VT , P , S), where VN = {S, A, B, C, N},
VT = {a, b, c, p, ⊞, −, (, )},
P : S → C ⊞ N,
N → rectangle − circle, C → cC | c.
The productions for the rectangle and circle are given in Section X.A.

c.1. The sweep dilation of a rectangle with a corner truncated by a circle can be represented by the string cc ⊞ [(bb ⊞ aaa) − (p ⊞ aa)], with the a's, b's, and c's repeated any number of times depending on the required size and with the second dilation operator ⊞ as ⊕ (S = 0, θ = 2π).

c.2. The sweep dilation of a rectangle with a circular hole can be represented by the string cc ⊞ [(bb ⊞ aaa) − (p ⊞ a)], with the a's, b's, and c's repeated any number of times depending on the required size and with the second dilation operator ⊞ as ⊕ (S = 0, θ = 2π). The difference from the previous case is that the circle lies completely within the rectangle, so we obtain a hole instead of a truncated corner.
d. The grammar for the U-shape block can be represented as follows: G = (VN , VT , P , S), where VN = {S, A, B, C, N, M, half_circle},
VT = {a, b, c, p, ⊞, −},
P : S → C ⊞ N,
N → rectangle − M, C → cC | c,
M → rectangle ⊞ half_circle,
half_circle → p ⊞ A.
The U-shape block can be represented by the string ccc ⊞ {(bbb ⊞ aaaa) − [(bb ⊞ aa) ⊞ (p ⊞ a)]}, with the a's, b's, and c's repeated any number of times depending on the required size and with the fourth dilation operator ⊞ as ⊕ (S = 0, θ = π).
XI. PARSING ALGORITHM

Given a grammar G and an object representation as a string, the string can be parsed to determine whether it belongs to the language of the given grammar. Among the various parsing algorithms, Earley's algorithm for context-free grammars is very popular. Let V* denote the set of all sentences composed of elements from V. The algorithm is described as follows:

Input: A context-free grammar G = (VN , VT , P , S) and an input string w = a1 a2 . . . an in VT*.

Output: The parse lists I0 , I1 , . . . , In .

Method: First construct I0 as follows:

1. If S → α is a production in P , add [S → .α, 0] to I0 . Now, perform steps 2 and 3 until no new item can be added to I0 .

2. If [B → γ ., 0] is on I0 , add [A → αB.β, 0] for all [A → α.Bβ, 0] on I0 .

3. Suppose that [A → α.Bβ, 0] is an item in I0 . Add to I0 , for all productions in P of the form B → γ , the item [B → .γ , 0] (provided this item is not already in I0 ).

Now, we construct Ij , having constructed I0 , I1 , . . . , Ij−1 :

4. For each [B → α.aβ, i] in Ij−1 such that a = aj , add [B → αa.β, i] to Ij .
Now, perform steps 5 and 6 until no new items can be added.

5. Let [A → γ ., i] be an item in Ij . Examine Ii for items of the form [B → α.Aβ, k]. For each one found, add [B → αA.β, k] to Ij .

6. Let [A → α.Bβ, i] be an item in Ij . For all B → γ in P , add [B → .γ , j ] to Ij .

The algorithm, then, constructs Ij for 0 < j ≤ n. Some examples of the parser are shown below.

Example 1. Let a rectangle be represented by the string b ⊞ aa. The given grammar is G = (VN , VT , P , S),
where
VN = {S, A, B},
VT = {a, b, ⊞},
P : S → B ⊞ A, A → aA | a,
B → bB | b.
The parsing lists obtained are as follows:

I0: [S → .B ⊞ A, 0], [B → .bB, 0], [B → .b, 0]

I1: [B → b.B, 0], [B → b., 0], [S → B. ⊞ A, 0], [B → .bB, 1], [B → .b, 1]

I2: [S → B ⊞ .A, 0], [A → .aA, 2], [A → .a, 2]

I3: [A → a.A, 2], [A → a., 2], [S → B ⊞ A., 0], [A → .aA, 3], [A → .a, 3]

I4: [A → a.A, 3], [A → a., 3], [A → aA., 2], [A → .aA, 4], [A → .a, 4], [S → B ⊞ A., 0]

Since [S → B ⊞ A., 0] is on the last list, the input belongs to the language L(G) generated by G.

Example 2. Consider the input string b ⊞ ba. The given grammar is G = (VN , VT , P , S),
where VN = {S, A, B},
VT = {a, b, ⊞},
P : S → B ⊞ A, A → aA | a,
B → bB | b.

The parsing lists obtained are as follows:

I0: [S → .B ⊞ A, 0], [B → .bB, 0], [B → .b, 0]

I1: [B → b.B, 0], [B → b., 0], [S → B. ⊞ A, 0], [B → .bB, 1], [B → .b, 1]

I2: [S → B ⊞ .A, 0], [A → .aA, 2], [A → .a, 2]

I3: Nil.
Since there is no completed production with S on the last list, the input does not belong to the language L(G) generated by G.

A question that arises is how to construct a grammar that generates a language describing any kind of solid object. Ideally, it would be nice to have a grammatical inference machine that infers a grammar from a set of given strings describing the objects under study. Unfortunately, such a machine is not available except in some very special cases. In most cases so far, the designer constructs the grammar based on the available a priori knowledge and experience. In general, the increased descriptive power of a language is paid for in terms of the increased complexity of the analysis system. The trade-off between the descriptive power and the analysis
efficiency of a grammar for a given application is a judgment left almost entirely to the designer.

Consider the swept surface shown in Figure 22. Its string representation is the same as that of the swept surface shown in Figure 19; the only difference is that it is not deformed as it is being swept.

Figure 22. An example of a swept surface.
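The algorithm and the two examples above can be checked with a compact sketch of Earley's parser. The code below is invented for illustration; its item tuples (head, body, dot, origin) correspond to the bracketed items [A → α.β, i], and the grammar is that of Examples 1 and 2:

```python
# Sketch of Earley's algorithm for the grammar
#   S -> B ⊞ A,   A -> aA | a,   B -> bB | b.
# I[j] holds the items valid after reading j input symbols.

GRAMMAR = {
    "S": [("B", "⊞", "A")],
    "A": [("a", "A"), ("a",)],
    "B": [("b", "B"), ("b",)],
}

def earley(tokens, start="S"):
    I = [set() for _ in range(len(tokens) + 1)]
    I[0] = {(start, body, 0, 0) for body in GRAMMAR[start]}  # step 1
    for j in range(len(tokens) + 1):
        changed = True
        while changed:                                       # steps 2-3 / 5-6
            changed = False
            for head, body, dot, org in list(I[j]):
                if dot == len(body):                         # completer
                    for h2, b2, d2, o2 in list(I[org]):
                        if d2 < len(b2) and b2[d2] == head:
                            changed |= (h2, b2, d2 + 1, o2) not in I[j]
                            I[j].add((h2, b2, d2 + 1, o2))
                elif body[dot] in GRAMMAR:                   # predictor
                    for b2 in GRAMMAR[body[dot]]:
                        changed |= (body[dot], b2, 0, j) not in I[j]
                        I[j].add((body[dot], b2, 0, j))
        if j < len(tokens):                                  # scanner (step 4)
            I[j + 1] = {(h, b, d + 1, o) for h, b, d, o in I[j]
                        if d < len(b) and b[d] == tokens[j]}
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in I[-1])

print(earley(list("b⊞aa")), earley(list("b⊞ba")))   # True False
```

As in the worked examples, b ⊞ aa produces a completed item [S → B ⊞ A., 0] on the last list, while b ⊞ ba leaves an empty list and is rejected.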
XII. CONCLUSIONS

We have described the limitations of traditional morphological operations and defined new morphological operations, called general sweep morphology. It was shown that traditional morphology is a subset of general sweep morphology. The properties of the sweep morphological operations were studied, and several examples demonstrated the advantages of sweep morphological operations over traditional ones. The properties of opening and closing will be studied in the future.

We have also presented a method of geometric modeling and representation based on sweep mathematical morphology. Since the shape and the dimension of a 2D structuring element can be varied during the process, not only simple rotational and extruded solids but also more complicated objects with blending surfaces can be generated by sweep morphology. We have developed grammars for solid objects and have applied Earley's parsing algorithm to determine whether a given string belongs to a group of similar objects.

We compared our model with two popular solid models: boundary representation and the CSG model. Our mathematical framework for modeling solid objects is sweep morphology, which provides a natural tool for shape representation. Its advantages include simplicity, a large domain, lack of ambiguity, and ease of graphical creation and editing. Furthermore, it supports the conventional limit (±) tolerances on dimensions that appear in many engineering drawings: the positive deviation corresponds to the dilated result, and the negative deviation corresponds to the eroded result. It has been demonstrated that sweep mathematical morphology is an efficient and intuitive tool for geometric modeling and representation.
REFERENCES

Blackmore, D., Leu, M.C., Shih, F.Y. (1994). Analysis and modeling of deformed swept volumes. Comput. Aided Design 26, 315–326.
Chen, C., Hung, Y., Wu, J. (1993). Space-varying mathematical morphology for adaptive smoothing of 3D range data. In: Asia Conference on Computer Vision, Osaka, Japan, pp. 23–25.
Chen, C.S., Wu, J.L., Hung, Y.P. (1999). Theoretical aspects of vertically invariant gray-level morphological operators and their application on adaptive signal and image filtering. IEEE Trans. Signal Process. 47, 1049–1060.
Cooper, D., Elliott, H., Cohen, F., Symosek, P. (1980). Stochastic boundary estimation and object recognition. Comput. Graphics Image Process. 12, 326–356.
Eichel, P.H., Delp, E.J., Koral, K., Buda, A.J. (1988). A method for a fully automatic definition of coronary arterial edges from cineangiograms. IEEE Trans. Med. Imag. 7, 313–320.
Farag, A.A., Delp, E.J. (1995). Edge linking by sequential search. Pattern Recognit. 28, 611–633.
Foley, J., van Dam, A., Feiner, S., Hughes, J. (1995). Computer Graphics: Principles and Practice, 2nd ed. Addison-Wesley, Reading, MA.
Fu, K.S. (1982). Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ.
Gonzalez, R.C., Woods, R.E. (2002). Digital Image Processing. Addison-Wesley, New York.
Gosh, P.K. (1988). A mathematical model for shape description using Minkowski operators. Comput. Vision Graphics Image Process. 44, 239–269.
Haralick, R.M., Sternberg, S.R., Zhuang, X. (1987). Image analysis using mathematical morphology. IEEE Trans. Pattern Anal. Mach. Intell. 9, 532–550.
Jang, B.K., Chin, R.T. (1990). Analysis of thinning algorithms using mathematical morphology. IEEE Trans. Pattern Anal. Mach. Intell. 12, 541–551.
Kaul, A. (1993). Computing Minkowski Sums. Ph.D. thesis, Department of Mechanical Engineering, Columbia University.
Korein, J. (1985). A Geometric Investigation of Reach. MIT Press, Cambridge, MA.
Latombe, J. (1991). Robot Motion Planning. Kluwer Academic, New York.
Leu, M.C., Park, S.H., Wang, K.K. (1986). Geometric representation of translational swept volumes and its applications. ASME J. Eng. Industry 108, 113–119.
Lin, P.L., Chang, S. (1993). A shortest path algorithm for nonrotating objects among obstacles of arbitrary shapes. IEEE Trans. Syst. Man Cybern. 23, 825–832.
Martelli, A. (1976). An application of heuristic search methods to edge and contour detection. Communications ACM 19, 73–83.
Martin, R.R., Stephenson, P.C. (1990). Sweeping of three dimensional objects. Comput. Aided Design 22, 223–234.
Morales, A., Acharya, R. (1993). Statistical analysis of morphological openings. IEEE Trans. Signal Process. 41, 3052–3056.
Mott-Smith, J.C., Baer, T. (1972). Area and volume coding of pictures. In: Huang, T.S., Tretiak, O.J. (Eds.), Picture Bandwidth Compression. Gordon and Breach, New York.
Pei, S.-C., Lai, C.-L., Shih, F.Y. (1998). A morphological approach to shortest path planning for rotating objects. Pattern Recognit. 31, 1127–1138.
Pennington, A., Bloor, M.S., Balila, M. (1983). Geometric modeling: A contribution toward intelligent robots. In: Proc. 13th Inter. Symposium on Industrial Robots, pp. 35–54.
Ragothama, S., Shapiro, V. (1998). Boundary representation deformation in parametric solid modeling. ACM Trans. Graphics 17, 259–286.
Requicha, A.A.G. (1980). Representations for rigid solids: Theory, methods, and systems. ACM Computing Surveys 12, 437–464.
Requicha, A.A.G. (1984). Representation of tolerances in solid modeling: Issues and alternative approaches. In: Boyse, J.W., Pickett, M.S. (Eds.), Solid Modeling by Computers. Plenum, New York, pp. 3–12.
Requicha, A.A.G., Voelcker, H.B. (1982). Solid modeling: A historical summary and contemporary assessment. IEEE Comput. Graph. Appl. 2, 9–24.
Rossignac, J. (2002). CSG-Brep duality and compression. In: Proc. ACM Symposium on Solid Modeling and Applications, Saarbrucken, Germany, pp. 59–66.
Rossignac, J., Requicha, A.A.G. (1985). Offsetting operations in solid modeling. Production Automation Project, University of Rochester, NY, Tech. Memo 53.
Russ, J.C. (1992). The Image Processing Handbook. CRC Press, Boca Raton, FL.
Serra, J. (1982). Image Analysis and Mathematical Morphology. Academic Press, London.
Shih, F.Y. (1991). Object representation and recognition using mathematical morphology model. J. Syst. Integr. 1, 235–256.
Shih, F.Y., Cheng, S. (2004). Adaptive mathematical morphology for edge linking. Inf. Sci. 167, 9–21.
Shih, F.Y., Mitchell, O.R. (1989). Threshold decomposition of gray-scale morphology into binary morphology. IEEE Trans. Pattern Anal. Mach. Intell. 11, 31–42.
Shih, F.Y., Mitchell, O.R. (1992). A mathematical morphology approach to Euclidean distance transformation. IEEE Trans. Image Process. 1, 197–204.
Shih, F.Y., Wu, Y. (2004). The efficient algorithms for achieving Euclidean distance transformation. IEEE Trans. Image Process. 13, 1078–1091.
Shih, F.Y., Gaddipati, V., Blackmore, D. (1994). Error analysis of surface fitting for swept volumes. In: Proc. Japan–USA Symp. Flexible Automation, Kobe, Japan, pp. 733–737.
Shiroma, Y., Okino, N., Kakazu, Y. (1982). Research on 3-D geometric modeling by sweep primitives. In: Proc. of CAD, Brighton, United Kingdom, pp. 671–680.
Shiroma, Y., Kakazu, Y., Okino, N. (1991). A generalized sweeping method for CSG modeling. In: Proc. of the First ACM Symposium on Solid Modeling Foundations and CAD/CAM Applications, Austin, Texas, pp. 149–157.
Stevenson, R.L., Arce, G.R. (1987). Morphological filters: Statistics and further syntactic properties. IEEE Trans. Circuits Syst. 34, 1292–1305.
Voelcker, H.B., Hunt, W.A. (1981). The role of solid modeling in machining process modeling and NC verification. SAE Tech. Paper #810195.
Wang, W.P., Wang, K.K. (1986). Geometric modeling for swept volume of moving solids. IEEE Comp. Graph. Appl. 6, 8–17.
FURTHER READING

Brooks, R.A. (1981). Symbolic reasoning among 3-D models and 2-D images. Artif. Intell. 17, 285–348.
Requicha, A.A.G., Voelcker, H.B. (1983). Solid modeling: Current status and research direction. IEEE Comput. Graph. Appl. 3, 25–37.
Index
A
B
Absolute continuity, 69 Active learning, 69 Adaptive filtering, 216f Adaptive hybrid vector filters, 228–231 Adaptive multichannel filters, based on digital paths, 220–222 Adjacency, 36 ADP. See Approximate dynamic programming algorithms Algorithms, distance-based, 3 AMF. See Arithmetic mean filter Angular noise margins, 197 Anomalous X-ray scattering (AXS), 174 APD. See Avalanche photodiode Approximate dynamic programming algorithms (ADP), 94–99, 144 performance issues, 98–99 T-SO problems and, 94–96 Approximate policy iteration, 97–98 Approximate value iteration, 96–97 Approximation error, 73 Arithmetic mean filter (AMF), 208 Artworks, restoration of, 241–242 Atomic arrangements, around Fe atoms, 160f Atomic images data processing for obtaining, 145–149 three-dimensional, from single-energy holograms, 173f Atomic resolution holography, 178 Atom-resolved holography, 122 Au clusters, 153 Au crystals atomic images of, 143f dimer model of, 151–152 holograms of, 142f Avalanche photodiode (APD), 140 energy resolution of, 144 Averaging, 218–219 AXS. See Anomalous X-ray scattering
Backpropagation through structure (BPTS), 9 RNNs and, 18–22 Backward pass, 20 Barton algorithm, 142, 147–148, 154–155 Base b elementary intervals in, 85 sequence in, 85 Basic vector directional filter (BVDF), 212 Batch mode, 20 Bellman’s equation, 96 Bioinformatics, RNNs in, 8 BL47XU, 165 Block mode, 20 Boltmann constant, 150 Boundaries inner, 269 outer, 269 representation of, 267 Bounded variations, ensuring, 80–84 BPTS. See Backpropagation through structure Bragg condition, 128 Bragg peaks, 179 Bragg reflection, 126 BVDF. See Basic vector directional filter
C Cameras, indoor, images acquired by, 50–51 Canberra distance, 204 Cascade-correlated approach, 6 Cd atoms, 179–180 Charge-coupled devices, 195 Circles, 298 Circular structuring elements, 283f City-block distance, 203 Closed-loop forms, 91 Clustering color space, 34–35 K-means, 36
307
308 COIL-100, datasets generated from, 52–54 images from, 53f, 54f Collision phenomenon, 7, 30 conditions for avoidance of, 31–33 Color, perception of, 188 Color image filtering, 199–244 applications of, 241–244 television image enhancement, 243–244 virtual restoration of artworks, 241–242 basics, 190–193 component-wise, 200f edge detection, 244–257 scalar operators, 245–250 vector operators, 250–253 image sharpening, 235–239 image zooming techniques, 239–241 introduction to, 188–190 noise-reduction techniques, 202–231 adaptive hybrid vector filters, 228–231 adaptive multichannel filters based on digital paths, 220–222 component-wise median filtering, 205– 207 data-adaptive filters, 218–220 order-statistic theory, 202–205 selection weighted vector filters, 215– 218 similarity based vector filters, 226–228 switching filtering schemes, 223–226 vector directional filters, 212–215 vector median filtering, 207–212 sliding, 201f vector, 201f Color space clustering, 34–35 Company logos, RNNs and, 8 Comparison and selection (CS) filter, 236 Complementary DNA microarray imaging, 255, 256f Complex holography, 133–134, 170–173 atomic images from, 135f Complexity computational, 93 model, 93 sample, 93 Component-wise filtering, 200f median, 205–207 Computation trees, 27 Computational complexity, 93 Connection costs, 222
INDEX Constructed solid geometry (CSG) representation, 267 Continuity, absolute, 69 Control vectors, 98 Convergence distribution-free, 72 ERM approach and, 85–87 uniform, of empirical means, 70, 77 Conversion electrons, from nuclei, 177–178 Cooling effect, sample, 149–150 Cost-to-go, 63, 94, 100, 101 Cryostream coolers, 150 CS filter. See Comparison and selection filter CSG representation. See Constructed solid geometry representation CuI clusters atomic images of, 126f, 131f complex hologram of, 134f theoretical holograms of, 125f CuI dimers, 136 holograms calculated from, 124f Curse of dimensionality, 64 Curve fittings, k-range for, 159t Cyclic graphs encoding and output networks for, 23f processing, 5–6, 22–30 recursive equivalent trees and, 28–30 recursive-equivalent transforms and, 25–28 Cylinder, 293
D DAG-LE. See Directed acyclic graphs with labeled edges Data formats, 2 Data processing, 138–150 Data-adaptive vector filters, 218–220 Debye–Waller factor, 149–150 Decision-theoretic approach, 2 Deformations, swept surfaces and, 275–280 Descendants, 11 Deterministic learning (DL), 63, 74–90 distribution-dependent case in, 87–90 distribution-free case, 75–80 dynamic programming and, 99–104 T-SO problems, 99–104 experimental results of, 104–114 unknown function approximation, 104– 107 mathematical framework for, 65–69
309
INDEX multistage optimization tests and, 107–114 inventory forecasting model, 108–109 water reservoir network model, 109–114 noisy cases in, 88–89 for optimal control problems, 90–94 Digital path approach (DPA) filter, 221–222 Digital paths, adaptive multichannel filters based on, 220–222 Dilation, 266 Dimensionality, 64 Dimers, calculated holograms of, 152f Directed acyclic graphs with labeled edges (DAG-LE), 5, 7, 11 RNNs and, 11–18 Directed graphs with labeled edges, 10 Directed graphs with unique labels (DUGs), 22 Directed labeled graphs, 10 Directed positional acyclic graphs (DPAGs), 4, 5, 7, 11, 44, 50 RNN processing of, 14–17 Directed unlabeled graphs, 9 Directional processing concept, on Maxwell triangle, 212f Directional-distance filters (DDF), 213 Discounted problems, 91 Discrepancy, 77 star, 77 Discretizations, RMS errors for, 105t, 106t, 107t Distance-based algorithms, 3 Distribution-dependent cases, 75 in deterministic learning, 87–90 Distribution-free cases, 74 in deterministic learning, 75–80 Distribution-free convergence, 72 DL. See Deterministic learning Dopants, 162–168 GaAs:Zn, 162–165 quasicrystal, 168–170 Si:Ge, 165–168 DP. See Dynamic programming DPA filter. See Digital path approach DPAF filter, 222 DPAGs. See Directed positional acyclic graphs DPAL filter, 222 DUGs. See Directed graphs with unique labels Dynamic programming (DP), 63
deterministic learning and, 99–104 T-SO problems, 99–104
E Edge detection, 35, 244–257 component-wise, 246f evaluation criteria, 253–257 objective evaluation approach, 254–255 subjective evaluation approach, 255– 257 scalar operators, 245–250 gradient, 248–249 zero-crossing-based operators, 249–250 vector dispersion, 252 vector operators, 250–253 Edge linking, 280–285 sweep morphological algorithm, 284f using sweep morphology, 281–285 Edge-weighting functions, 18 Electron scattering, 121 Elementary intervals, 85 Elliptic structuring elements, 281f, 282f Emitter-scatterer dimer, 129f Empirical risk, 67 minimization of, 68 Encoding networks, 13 for cyclic graphs, 23f Energy dispersive solid state detectors, 139 Energy resolution, of APD, 144 Energy spectra, of scattered X-rays, 145f Enhancement techniques, 188–190 ERM approach, 90 convergence rates of, 85–87 Erosion sweep, 272f traditional, 272f ESRF. See European Synchrotron Radiation Facility Estimation error, 73 Euclidean distance, 203 European Synchrotron Radiation Facility (ESRF), 121, 168 EXAFS. See Extended X-ray absorption fine structure Expected risks, 66 Experiment processing, 138–150 Experimental holograms, demonstration by, 156–159
310 Extended X-ray absorption fine structure (EXAFS), 151, 156, 158. See also XAFS
F Faces, appearance of, 51f Fe atoms atomic arrangements around, 160f holograms of, 161f Feature encoding, 2 Feedforward neural networks, 83–84 FePt film, 160–161 Filtered holographic signals, 158f Filtering. See Specific types Finite horizon cases, 98 Finite impulse response (FIR), 208–209 FIR. See Finite impulse response Fitted holographic signals, 155f Fittings curve, 159t Fourier, 159t Flat pattern recognition, 1–7 Fluorescence photons, 139 Formal language, sweep morphology and, 291–292 Forward pass, 20 Fourier fittings, r-range for, 159t Fourier transforms, 125, 127, 146, 153–154 of holograms, 148f inverse, theoretical proof of, 150–155 reconstructed intensity and, 154f of π XAFS, 176f Frontier states, 15 Full width at half maximum (FWHM), 140 Functions, 2 Fuzzy filters, 218 Fuzzy membership function, 219 Fuzzy theory, 35 FWHM. See Full width at half maximum
G G color band, 190–191 Ga emitters, 173 GaAs:Zn, 162–165 XAFS spectrum of, 172f Gabor, Dennis, 120–121 Gaussian noise, filtering of, 220f Ge crystals atomic images of, 167f
INDEX holograms, 166f holograms of, 147f, 148f Ge X-ray fluorescence, 145–146 General sweeps, 266 in computation of traditional morphology, 268–269 deformations and, 275–280 edge linking and, 281–285 formal language and, 291–292 geometric modeling and, 288–291 tolerance expression, 289–291 grammars, 297–300 three-dimensional attributes, 298–300 two-dimensional attributes, 298 image enhancement and, 278–280 mathematical morphology, 270–273 parsing algorithm, 300–303 representation scheme, 292–297 three-dimensional attributes, 293–297 two-dimensional attributes, 292–293 theoretical development of, 268–275 Generalized cylinders, 266, 267 Generalized errors, 21 Generalized similarity measure model, 204 Generalized vector directional filters (GVDF), 213 Geometric modeling, sweep mathematical morphology and, 288–291 tolerance expression, 289–291 Geometries, experimental, for normal and inverse modes, 138–140 Gradient operators, 248–249 Grammars, 297–300 Graph(s) cyclic, 22–30 encoding and output networks for, 23f recursive equivalent trees and, 28–30 recursive-equivalent transforms and, 25–28 directed labeled, 10 directed unlabeled, 9 directed, with labeled edges, 10 encoding networks associated with, 13f output networks associated with, 13f regional adjacency, 24, 36–38 RNNs and, 9–11 structures, 2 topology of, 10–11 Graph-based representation, 33–39 introduction to, 33
311
INDEX multiresolution trees, 36–38 region adjacency graphs, 36–38 segmentation of images in, 33–36 region based approaches, 33 γ -ray holography, 176–178 Gray-scale imaging, 200 GVDF. See Generalized vector directional filters
H Halton sequences, 112 Helmholtz–Kirchhoff formulas, 125, 169 Histogram thresholding, 34 Holograms, 120 Au crystal, 142f calculated from CuI dimers, 124f calculated, of dimer, 152f experimental, 156–159 Fe, 161f filtered signals, 158 fitted signals, 155 Fourier transformation of, 148f Ge crystal, 147f, 148f horizontal polarization of, 137f in inverse mode, 123–124, 123f of ions in chemical environments, 177 in k space, 153 multiple energy, 166f one-dimensional, 157f oscillations, 138f reconstructions from, 149f single-energy, 173f theoretical, of CuI clusters, 125f vertical polarization of, 137f Zn, 163f Holography. See also XFH atomic resolution, 178 complex, 133–134 complex X-ray, 170–173 γ -ray, 176–178 history of, 120 neutron, 178–180 Horizontal polarization, 137f Human luminance frequency response, 190 HVF. See Hybrid vector filters Hybrid vector filters (HVF), 213, 214 Hydrogen atom, reconstruction of planes from, 179f Hyperbolic tangents, 104
I

Image analysis
  graph-based representation, 33–39
  indoor camera, 50–51
  RNNs in, 8
Image enhancement, general sweeps and, 278–280
Image filtering. See Color image filtering
Image noise, 193–199
  Gaussian, 220f
  impulsive, 222f
  mixed, 232f
  natural, 193–194
  noise modeling, 194–199
    sensor noise, 195–197
    transmission noise, 197–199
  real color, 194f
  simulated color, 196f
Image orientation, 40
Image processing chain, 189f
Image sharpening, 235–239
Image zooming techniques, 239–241
Imaging conditions, 40
Impulsive noise, filtering of, 223f
Indegrees, 10
Industrial parts, 286f
  edges of, 285f
Infinite-horizon stochastic optimization problems, 91, 96–98
  approximate policy iteration and, 97–98
Inpainting techniques, 234–235, 236f
Interatomic distances, 159t
Inventory forecasting model, 108–109, 113t
  bounds for, 111t
Inverse Fourier analysis, theoretical proof of, 150–155
Inverse mode, holograms in, 123–124, 123f
K

Kernel trick, 4
KL. See Kossel lines
K-means clustering, 36
Kohonen map, 6
Koksma–Hlawka inequality, 79
Kossel lines (KL), 128–129
k-range, for curve fittings, 159t
L

Laboratory XFH apparatus, 140–143
  illustration of, 141f
Learnable problems, 68
  probably approximately correct, 70
Learning
  active, 69
  passive, 69
Learning algorithms, 68
Learning environment setup
  MRTs in, 46–47
  RAGs in, 42–46
Learning rates, 19
Learning theory, 62
Least mean absolutes (LMA), 215
Leaves, 10
Lemmas, 82, 101, 102
Levenberg–Marquardt algorithm, 105
LiF crystals, 166
Linear extrusion, 265, 267
Linear functions, Pollard dimension of one-dimensional, 72f
LMA. See Least mean absolutes
Localized faces, 49f
Loss functions, 66
Low-discrepancy sequences, 85, 105
  RMS errors for, 106t
Lower–upper–middle (LUM) sharpeners, 237
LQ hypotheses, 92
LUM sharpeners. See Lower–upper–middle sharpeners
M

Magnetite, 177
  iron arrangements in, 178f
Marginal ordering, 202
Markov models, 3
Mathematical morphology, 266
  general sweep, 270–273
Maximum likelihood estimates (MLE), 205
Maxwell triangle
  directional processing concept on, 212f
  RGB color cube with, 192f
Mean square error (MSE) criterion, 95, 211
Median absolute deviation, 229
Median filters (MF), 205
MF. See Median filters
Minkowski metric family, 203–204
Mixed noise, 232
MLE. See Maximum likelihood estimates
Mn sites, in quasicrystal, 169f
Mobile robot, shortest path planning for, 286–288
Model complexity, 93
Model selection, 62
Monochromators, 175
Monte Carlo methods, 87, 92
  randomized, 87
Morphology, 266
MOS problem. See Multistage stochastic optimization
Mössbauer effect, 124, 133, 177–178
MRT. See Multiresolution trees
MSE criterion. See Mean square error criterion
Multilayer perceptrons, transition functions realized with, 16f
Multiple energy method, 130
  atomic images constructed using, 131f
Multiply connections, 269
Multiresolution trees (MRT), 38–39
  generation of, 39f
  in learning environment setup, 46–47
  targets associated with nodes of, 47f
Multistage stochastic optimization (MOS) problem, 63
  deterministic, 92
N

Natural image noise, 193–194
NCD. See Normalized color difference
Near field effect, 136–138
Nearest neighbor vector range (NNVR), 253
Negative filling, 269
Neural networks, 1, 3
  recursive, 4
  supervised, 4
Neurodynamic programming, 93
Neutron holography, 178–180
Niederreiter sequences, 105, 112
NNVR. See Nearest neighbor vector range
Noise. See Image noise
Noise margins, 197f
Noise modeling, 194–199
  sensor noise, 195–197
  transmission noise, 197–199
Noise-reduction techniques, 202–231
  adaptive hybrid vector filters, 228–231
  adaptive multichannel filters based on digital paths, 220–222
  component-wise median filtering, 205–207
  data-adaptive filters, 218–220
  order-statistic theory, 202–205
  performance evaluation of, 231–235
    inpainting techniques, 234–235
    objective, 231–233
    subjective, 233–234
  selection weighted vector filters, 215–218
  similarity based vector filters, 226–228
  switching filtering schemes, 223–226
  vector directional filters, 212–215
  vector median filtering, 207–212
Noisy cases, 75
  in deterministic learning, 88–89
Nonlinear smoothing filter (NSF), 224
Nonlinear vector processing, 189
Normalized color difference (NCD), 233
NSF. See Nonlinear smoothing filter
Nuclei, conversion electrons from, 177–178
Null pointers, 11
O

Object deformation, 40
Object detection, 39–54
  methods, 39–42
    appearance-based, 41
    challenges in, 40
    feature invariant, 40–41
    knowledge-based, 40–41
    template matching, 41
  RNNs in, 42–54
Observation noise, 74
Occlusions, 40
One-dimensional holograms, 157f
Open curve paths, 271f
Optical reciprocity, 123
Optimal control problems, deterministic learning for, 90–94
Optimal management problems, 108
Optimization problems
  discounted infinite-horizon stochastic, 91
  T-stage, 91
Optimization tests, multistage, 107–114
  inventory forecasting model, 108–109
  water reservoir network model, 109–114
Ordering
  marginal, 202
  reduced, 202
Order-statistic theory, 202–205
Outdegrees, 10
Output networks, 13f, 14, 31f
  for cyclic graphs, 23f
Overfitting, 73
P

PAC. See Probably approximately correct
Parallelograms, 298
Parallelpiping, 293–297, 299–300
Parsing algorithm, 300–303
Passive learning, 69
Path planning, for mobile robot, 286–288
Paths, 10–11
Pattern mode, 20
Pattern recognition
  flat, 1–7
  structural, 1–7
Pb atoms, 179–180
Pb crystals, reconstructions of planes for, 151f
Performance evaluation, 189
  of noise reduction techniques, 231–235
    inpainting techniques, 234–235
    objective evaluation, 231–233
    subjective evaluation, 233–234
Performance issues, ADP, 98–99
Permutation weighted medians (PWM), high pass, 238
Photons, fluorescence, 139
Planck constant, 150
Pointer matrices, 15
Polarization effect
  horizontal, 137f
  of incident X-ray, 134–136
  vertical, 137f
Policy evaluation, 97
Pollard dimension, 70–71
  of one-dimensional linear functions, 72f
Pose, 40
Probably approximately correct (PAC) learnability, 70
Protein topologies, RNNs in, 8
Pruning, 284
P-shattering, 71
Pt foils, π XAFS of, 175f
PWM. See Permutation weighted medians
Q

QSAR. See Quantitative structure-activity relationships
Quadratic rates, 74
Quantitative structure-activity relationships (QSAR), 8
Quasicrystal, 168–170
  reconstructed real-space image around Mn sites in, 169f
Quasirandom integration methods, 85
R

Radial basis functions, 84–87
RAG. See Region adjacency graph
Random variables, 89
Random walk (RW) techniques, 3–4
Randomized quasi-Monte Carlo methods, 87
γ-ray holography, 176–178
Real color image noise, 194f
Reconstructed intensity, 153, 156–157
  Fourier transforms and, 154f
  of single first neighbor atoms, 168f
Rectangles, 298
Recursive equivalent trees
  cyclic graphs and, 28–30
  RAG transformation to, 45f
Recursive neural networks (RNN), 4
  BPTS and, 18–22
  cyclic graph processing with, 22–30
  DAG-LE processing and, 11–18
  DPAG processing and, 14–17
  graphs and, 9–11
  limitations of, 30–33
  in object detection, 42–54
    detecting objects, 47–54
    learning environment setup in, 42–46
  properties and applications of, 7–9
    in bioinformatics, 8
    in company logo classification, 8
    in image analysis, 8
Recursive-equivalent transforms, 25–28
Reduced ordering, 202
Region adjacency graph (RAG), 24, 36–38
  extracted, 42f
  features stored in, 37f
  in learning environment setup, 42–46
  transformation to recursive-equivalent trees, 45f
Region based approaches, 35
Regression estimation problem, 66
Regression functions, 66
Regularization factors, 207
Reoptimization, 95
Reservoir network, 110f
Restoration, of artworks, 241–242
RGB color images, 190
  with Maxwell triangle, 192f
Risk functionals, 75–76
RMS. See Root of mean square
RNN. See Recursive neural networks
Root of mean square (RMS) errors
  for discretizations, 105t, 106t, 107t
  low-discrepancy sequences, 106t, 108t
  for random sequences, 106t, 108t
Rotational sweep, 265, 267
Round bottom tools, 297f
r-range, for Fourier fittings, 159t
Rule learners, 3
RW. See Random walk
S

Sample complexity, 73, 93
Sample cooling effect, 149–150
Samples, 95
  orientation of, 136f
Scalar operators, 245–250
  gradient, 248–249
  zero-crossing-based operators, 249–250
Scalar techniques, 239
Scanning probe microscopy (SPM), 168
Scattering, 169–170
  electron, 121
  Thomson, 135
  X-ray, 121
Segmentation, 33–36
  color space clustering, 34
  edge detection, 35
  fuzzy theory, 35
  histogram thresholding, 34
  neural network approaches to, 35
  physics approaches to, 35
  region-based, 35
Selection weighted vector filters (SWVF), 209–210, 215–218
Self-organizing maps (SOM), 6
Sensor noise, 195–197
Sequences, 2
Sets, 2
Si, holograms, 166f
Si clusters, 167
Si:Ge, 165–168
Similarity based vector filters, 226–228
Similarity functions, 221, 227
Single-energy reconstruction, 173
Skeletons, 10–11
SL. See Statistical learning
Sliding filtering, 201f
Sobol sequences, 112
Solid state detector (SSD), 141, 144
  energy dispersive, 139
SOM. See Self-organizing maps
Sources, 10
Space-varying, 279
Spatial interpolation, 240f
Spherical median (SM), 212f
SPM. See Scanning probe microscopy
SPring-8, 165
Spurious regions, 43
SR. See Synchrotron radiation
SSD. See Solid state detector
Stack filter design, 206
Star discrepancy, 77
State space, 11
State transition functions, 12
State variables, 11
Statistical learning (SL), 63, 69–74
Step edge, 278
Structural methods, 2
Structural pattern recognition, 1–7
Structure-adaptive hybrid vector filter (SAHVF), 228–231
Structuring element
  assignment, 279f
  circular, 283f
  elliptic, 281f
Subjective image evaluation, 233–234
  guidelines for, 234t
Superexponential growth, 73, 86
Supersources, 11, 28f
Supervised learning paradigm, 6
Supervised neural networks, 4
Support vector machines (SVM), 1
SVF. See Switching vector filter
SVM. See Support vector machines
Sweep, 265
  general, 266
  rotational, 265, 267
Sweep dilation, 281, 294f, 299–300
Sweep erosion, 272f
Sweep morphological dilation, 270
Sweep morphological erosion, 271
Sweep morphological operations, properties of, 273–275
Sweep surface modeling, 291
Switching filtering schemes, 223–225
  based on fixed threshold, 224f
  based on fully adaptive control, 224f
Switching vector filter (SVF), 224, 226
SWVF. See Selection weighted vector filters
Symbolic output-equivalence, 31
Symmetry, XSWs and, 161
Synchrotron radiation (SR), fast X-ray fluorescence detection systems at, 143–149
Syntactical methods, 2
T

Tangents, hyperbolic, 104
(t, d)-sequences, 85
Television image enhancement, 243–244
TEM. See Transmission electron microscopy
Temporal sequences, 17f
Ternary trees, 32
Thinning, 281–282
Thomson scattering factors, 135
Three-dimensional attributes, 293–297
Tolerance
  expression, 289–291
  zones, 290f
Traditional morphology, computation of, 268–269
Training phase, 62
Training set, 62
Transition functions, multilayer perceptrons and, 16f
Transmission electron microscopy (TEM), 168
Transmission noise, 197–199
Tree search method, 280
Tree structures, 2
Triangles, 298
Trichromatic theory, 188
Tristimulus theory, 192
T-SO. See T-stage stochastic optimization
T-stage stochastic optimization (T-SO) problems, 91
  ADP algorithms and, 94–96
  deterministic learning for dynamic programming, 99–102
TV video sequences, 49–50
Twin images
  concept of, 129f
  removal of, 129–134
    complex holography, 133–134
    multiple energy method, 130
    two energy method, 130–133
Two energy method, 130–133
  atomic images constructed using, 132f
Two-dimensional attributes, decomposition of, 293f
U

Ultrathin film, 159–161
Undirected structures, 6
Unfolded parameters, 20, 103
Unfoldings, 28f
Uniform convergence of empirical means, 70, 77
Uniform distribution, 77
Uniform probability (URS), 105
Universal approximators, 62
Unknown functions, approximation of, 104–107
URS. See Uniform probability
U-shape blocks, 296
V

Value function, 63, 94
Variations, bounded, 80–84
VC dimension, 70
VDED. See Vector dispersion edge detector
VDF. See Vector directional filters
Vector directional filters (VDF), 212–215
  basic, 212
  directional-distance, 213
  generalized, 213
  hybrid, 213
  weighted, 214
Vector dispersion edge detector (VDED), 252
Vector median filtering (VMF), 201f, 207–212
  selection weighted, 209–210
  weighted, 209
Vector operators, 250–253
Vector processing, nonlinear, 189
Vector range (VR) detectors, 250–251
Vector rational filters (VRF), 210–211
Vertical polarization, 137f
VMF. See Vector median filtering
VR. See Vector range
VRF. See Vector rational filters
W

Water levels, 110t
Water reservoir network model, 109–114, 113t
  bounds for, 112t
Weighted filters, 225
Weighted medians (WM), 206
  permutation, 238
Weighted vector directional filters (WVDF), 214–215
Weighted vector median filters (WVMF), 209
WVDF. See Weighted vector directional filters
WVMF. See Weighted vector median filters
X

XAFS. See X-ray absorption fine structure
XFH. See X-ray fluorescence holography
X-ray absorption fine structure (XAFS), 172. See also EXAFS
  π, 174–176
    Fourier transforms of, 176f
    Pt foils of, 175f
  at As K edge, 172f
X-ray fluorescence holography (XFH)
  applications of, 159–173
    complex X-ray holography, 170–173
    dopants, 162–168
    ultrathin film, 159–161
  experiment and data processing, 138–150
    experimental geometries, 138–140
    experimental holograms, 156–159
    interatomic distances estimated by, 159t
    inverse Fourier analysis, 150–155
    laboratory XFH apparatus, 140–143
    for obtaining atomic images, 145–149
    sample cooling effect, 149–150
    at synchrotron radiation, 143–149
  experimental setup for, 139f
  introduction to, 120–122
  inverse method, 122f
  Kossel lines and, 128–129
  near field effect, 136–138
  normal method, 122f
  outlook, 180–181
  polarization effect of incident X-ray, 134–136
  related methods, 174–180
    γ-ray holography, 176–178
    neutron holography, 178–180
    π XAFS, 174–176
  simulation using realistic models, 124–127
  theory using simple models, 122–124
  twin image removal and, 129–134
    complex holography, 133–134
    multiple energy method, 130
    two energy method, 130–133
  X-ray detector performance, 145f
  XSWs and, 128–129
X-ray scattering, 121
X-ray standing wave lines (XSW), 128–129
  four-fold symmetry and, 161
XSW. See X-ray standing wave lines
Z

Zeeman sextets, 177
Zero-crossing-based operators, 249–250
Zn atom, holographic reconstruction of environment around, 164f
Zn holograms, 163f