Studies in Computational Intelligence, Volume 386

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw, Poland
E-mail: [email protected]
Marek R. Ogiela and Lakhmi C. Jain (Eds.)
Computational Intelligence Paradigms in Advanced Pattern Classification
Editors

Prof. Marek R. Ogiela
AGH University of Science and Technology
30 Mickiewicza Ave
30-059 Krakow, Poland
E-mail: [email protected]

Prof. Lakhmi C. Jain
University of South Australia
Adelaide, Mawson Lakes Campus
South Australia, Australia
E-mail: [email protected]

ISBN 978-3-642-24048-5
e-ISBN 978-3-642-24049-2
DOI 10.1007/978-3-642-24049-2

Studies in Computational Intelligence
ISSN 1860-949X
Library of Congress Control Number: 2011936648

© 2012 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

springer.com
Preface
Recent advances in computational intelligence paradigms have contributed tremendously to modern pattern classification techniques. This book aims to provide a sample of state-of-the-art techniques in advanced pattern classification and their possible applications. In particular, it includes nine chapters on the use of various computational intelligence paradigms in pattern classification, together with a number of applications and case studies.

Chapter one presents an introduction to pattern classification techniques, including current trends in intelligent image analysis and semantic content description. Chapter two is on handwriting recognition using neural networks. The authors propose a novel neural network and demonstrate that it offers a higher recognition rate than the other techniques reported in the literature. Chapter three is on moving object detection from mobile platforms. The authors demonstrate the applicability of their approach to detecting moving objects such as vehicles or pedestrians in different urban scenarios. Chapter four is on pattern classification in cognitive environments. The author demonstrates experimentally that the semantic technique can be used for cognitive data analysis problems in cognitive informatics. Chapter five is on optimal differential filters on the hexagonal lattice. The filters are compared with existing optimised filters to demonstrate the superiority of the technique. Chapter six is on graph image language techniques supporting advanced classification and computer interpretation of 3D CT coronary vessel visualizations. Chapter seven is on a graph matching approach to symmetry detection and analysis. The authors validate their approach using extensive experiments on two- and three-dimensional synthetic and real-life images. Chapter eight is on pattern classification methods for the analysis and visualization of brain perfusion CT maps and their use in computer-aided diagnosis. The final chapter is on the main methods of multi-class and multi-label classification, which can be applied to a large variety of applications and research fields relating to human knowledge, cognition and behaviour.

We believe that scientists, application engineers, university professors, students, and all readers interested in this subject will find this book useful and interesting.
This book would not have existed without the excellent contributions by the authors. We remain grateful to the reviewers for their constructive comments. The excellent editorial assistance by Springer-Verlag is acknowledged.

Marek R. Ogiela, Poland
Lakhmi C. Jain, Australia
Contents

Chapter 1: Recent Advances in Pattern Classification .... 1
Marek R. Ogiela, Lakhmi C. Jain
  1 New Directions in Pattern Classification .... 1
  References .... 4

Chapter 2: Neural Networks for Handwriting Recognition .... 5
Marcus Liwicki, Alex Graves, Horst Bunke
  1 Introduction .... 5
    1.1 State-of-the-Art .... 6
    1.2 Contribution .... 7
  2 Data Processing .... 8
    2.1 General Processing Steps .... 9
    2.2 Our Online System .... 10
    2.3 Our Offline System .... 12
  3 Neural Network Based Recognition .... 12
    3.1 Recurrent Neural Networks (RNNs) .... 12
    3.2 Long Short-Term Memory (LSTM) .... 13
    3.3 Bidirectional Recurrent Neural Networks .... 16
    3.4 Connectionist Temporal Classification (CTC) .... 16
    3.5 Multidimensional Recurrent Neural Networks .... 17
    3.6 Hierarchical Subsampling Recurrent Neural Networks .... 18
  4 Experiments .... 18
    4.1 Comparison with HMMs on the IAM Databases .... 18
    4.2 Recognition Performance of MDLSTM on Contest Data .... 20
  5 Conclusion .... 21
  References .... 21

Chapter 3: Moving Object Detection from Mobile Platforms Using Stereo Data Registration .... 25
Angel D. Sappa, David Gerónimo, Fadi Dornaika, Mohammad Rouhani, Antonio M. López
  1 Introduction .... 25
  2 Related Work .... 26
  3 Proposed Approach .... 28
    3.1 System Setup .... 29
    3.2 Feature Detection and Tracking .... 29
    3.3 Robust Registration .... 31
    3.4 Frame Subtraction .... 31
  4 Experimental Results .... 34
  5 Conclusions .... 35
  References .... 35

Chapter 4: Pattern Classifications in Cognitive Informatics .... 39
Lidia Ogiela
  1 Introduction .... 39
  2 Semantic Analysis Stages .... 40
  3 Semantic Analysis vs. Cognitive Informatics .... 43
  4 Example of a Cognitive UBIAS System .... 45
  5 Conclusions .... 52
  References .... 52

Chapter 5: Optimal Differential Filter on Hexagonal Lattice .... 59
Suguru Saito, Masayuki Nakajima, Tetsuo Shima
  1 Introduction .... 59
  2 Preliminaries .... 60
  3 Least Inconsistent Image .... 60
  4 Point Spread Function .... 64
  5 Condition for Gradient Filter .... 67
  6 Numerical Optimization .... 68
  7 Theoretical Evaluation .... 70
    7.1 Signal-to-Noise Ratio .... 70
    7.2 Localization .... 73
  8 Experimental Evaluation .... 74
    8.1 Construction of Artificial Images .... 74
    8.2 Detection of Gradient Intensity and Orientation .... 76
    8.3 Overington's Method of Orientation Detection .... 76
    8.4 Relationship between Derived Filter and Staunton Filter .... 78
    8.5 Experiment and Results .... 80
  9 Discussion .... 81
  10 Summary .... 83
  References .... 86

Chapter 6: Graph Image Language Techniques Supporting Advanced Classification and Cognitive Interpretation of CT Coronary Vessel Visualizations .... 89
Mirosław Trzupek
  1 Introduction .... 89
  2 The Classification Problem .... 92
  3 Stages in the Analysis of CT Images under a Structural Approach Utilising Graph Techniques .... 93
  4 Parsing Languages Generated by Graph Grammars .... 95
  5 Picture Grammars in Classification and Semantic Interpretation of 3D Coronary Vessels Visualisations .... 96
    5.1 Characteristics of the Image Data .... 96
    5.2 Preliminary Analysis of 3D Coronary Vascularisation Reconstructions .... 96
    5.3 Graph-Based Linguistic Formalisms in Spatial Modelling of Coronary Vessels .... 98
    5.4 Detecting Lesions and Constructing the Syntactic Analyser .... 103
    5.5 Selected Results .... 104
  6 Conclusions Concerning the Advanced Classification and Cognitive Interpretation of CT Coronary Vessel Visualizations .... 108
  References .... 110

Chapter 7: A Graph Matching Approach to Symmetry Detection and Analysis .... 113
Michael Chertok, Yosi Keller
  1 Introduction .... 113
  2 Symmetries and Their Properties .... 115
    2.1 Rotational Symmetry .... 115
    2.2 Reflectional Symmetry .... 116
    2.3 Interrelations between Rotational and Reflectional Symmetries .... 117
    2.4 Discussion .... 117
  3 Previous Work .... 117
    3.1 Previous Work in Symmetry Detection and Analysis .... 118
    3.2 Local Features .... 121
    3.3 Spectral Matching of Sets of Points in R^n .... 122
  4 Spectral Symmetry Analysis .... 123
    4.1 Spectral Symmetry Analysis of Sets in R^n .... 123
      4.1.1 Perfect Symmetry and Spectral Degeneracy .... 124
    4.2 Spectral Symmetry Analysis of Images .... 124
      4.2.1 Image Representation by Local Features .... 125
      4.2.2 Symmetry Categorization and Pruning .... 125
      4.2.3 Computing the Geometrical Properties of the Symmetry .... 126
  5 Experimental Results .... 127
    5.1 Symmetry Analysis of Images .... 128
    5.2 Statistical Accuracy Analysis .... 134
    5.3 Analysis of Three-Dimensional Symmetry .... 136
    5.4 Implementation Issues .... 137
    5.5 Additional Results .... 140
  6 Conclusions .... 140
  References .... 142

Chapter 8: Pattern Classification Methods for Analysis and Visualization of Brain Perfusion CT Maps .... 145
Tomasz Hachaj
  1 Introduction .... 145
  2 Interpretation of Perfusion Maps – Long and Short Time Prognosis .... 148
  3 Image Processing and Abnormality Detection .... 150
  4 Image Registration .... 153
    4.1 Affine Registration .... 154
    4.2 FFD Registration .... 154
    4.3 Thirion's Demons Algorithm .... 154
    4.4 Comparison of Registration Algorithms .... 155
  5 Classification of Detected Abnormalities .... 158
  6 System Validation and Results .... 160
  7 Data Visualization – Augmented Reality Environment .... 162
    7.1 Augmented Reality Environment .... 163
    7.2 Real Time Rendering of 3D Data .... 164
    7.3 Augmented Desktop – System Performance Test .... 164
  8 Summary .... 167
  References .... 168

Chapter 9: Inference of Co-occurring Classes: Multi-class and Multi-label Classification .... 171
Tal Sobol-Shikler
  1 Introduction .... 171
  2 Applications .... 172
  3 The Classification Process .... 173
  4 Data and Annotation .... 175
  5 Classification Approaches .... 177
    5.1 Binary Classification .... 177
    5.2 Multi-class Classification .... 177
    5.3 Multi-label Classification .... 178
  6 Multi-class Classification .... 179
    6.1 Multiple Binary Classifiers .... 180
      6.1.1 One-Against-All Classification .... 180
      6.1.2 One-Against-One (Pair-Wise) Classification .... 180
      6.1.3 Combining Binary Classifiers .... 181
    6.2 Direct Multi-class Classification .... 181
    6.3 Associative Classification .... 182
  7 Multi-label Classification .... 182
    7.1 Semi-supervised (Annotation) Methods .... 186
  8 Inference of Co-occurring Affective States from Non-verbal Speech .... 186
  9 Summary .... 193
  References .... 193

Author Index .... 199
Chapter 1

Recent Advances in Pattern Classification

Marek R. Ogiela¹ and Lakhmi C. Jain²

¹ AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland
  e-mail: [email protected]
² University of South Australia, School of Electrical and Information Engineering, Adelaide, Mawson Lakes Campus, South Australia SA 5095, Australia
  e-mail: [email protected]

Abstract. This chapter describes some advances in modern pattern classification techniques and new classes of information systems dedicated to image analysis, interpretation and semantic classification. In this book we present new solutions for the development of modern pattern recognition techniques for processing and analysing several classes of visual patterns, as well as theoretical foundations for modern pattern interpretation approaches. In particular, this monograph presents selected areas of application of pattern recognition and classification approaches, including handwriting recognition, medical image analysis and interpretation, the development of cognitive systems for computer image understanding, moving object detection, advanced image filtering, and intelligent multi-object labeling and classification.
1 New Directions in Pattern Classification

In the field of advanced pattern recognition and computational intelligence methods, a new direction has recently emerged that concerns advanced visual pattern analysis, recognition and interpretation, and is strongly connected with computational cognitive science, or cognitive informatics. Computational cognitive science is a new branch of computer science and pattern classification originating mainly from neurobiology and psychology, but it is currently also developed by the sciences (e.g. descriptive mathematics) and technical disciplines (informatics). In this science, models of the cognitive processes taking place in the human brain [2], which are studied by neurophysiologists (at the level of biological mechanisms), psychologists (at the level of analysing specific human behaviours) and philosophers (at the level of a general reflection on the nature of cognitive processes and their conditions), have become the basis for designing various types of intelligent computer systems.
The requirements of an advanced user are not limited to just collecting, processing and analysing information in computer systems. Today, users expect IT systems to offer capabilities of automatically penetrating the semantic layer as well, as this is the source for developing knowledge, and not just for collecting messages. This is particularly true of information systems and decision support systems. Consequently, IT systems based on cognition will certainly be developed intensively, as they meet the growing demands of the Information Society, in which the ability to reach the contents of information collected in computer systems will gain increasing importance. In particular, due to the development of systems which, apart from numerical data and text, also collect multimedia information, particularly images or movies, there is a growing need to develop scientific cornerstones for designing IT systems that allow one to easily find the requisite multimedia information. Such information conveys a specific meaning in its structure, but this requires the semantic contents of an image to be understood, and not just the visible objects to be analysed and possibly classified according to their form. Such systems, capable of not only analysing but also interpreting the meaning of the data they process (scenes, real-life contexts, movies, etc.), can also play the role of advisory systems supporting human decision-making, and the effectiveness of this support can be significantly enhanced by the system automatically acquiring knowledge adequate for the problem in question.
Fig. 1 Taxonomy of issues explored by cognitive science
It is thus obvious that contemporary solutions should aim at the development of new classes of information systems which can be assigned the new name of Cognitive Information Systems. We are talking about systems which can process data at a very high level of abstraction and make semantic evaluations of such data. Such systems should also have autonomous learning capabilities, which will allow them to improve along with the extension of the knowledge available to them, presented in the form of various patterns and data. Such systems are significantly more complex in terms of the functions they perform than solutions currently employed in practice, so they have to be designed with the use of advanced
achievements of computer technologies. What is more, such systems do not fit the theoretical frameworks of today's information collection and search systems, so when undertaking the development and practical implementation of Cognitive Information Systems, the first task is to find, develop and research new theoretical formalisms adequate for the jobs given to these systems. They will use the theoretical basis and conceptual formalisms developed for cognitive science by physiology, psychology and philosophy (see Fig. 1), but these have to be adjusted to the new situation, namely the intentional initiation of cognitive processes in technological systems. Informatics has already attempted to create formalisms for simpler information systems on this basis [2, 5]. In addition, elements of a cognitive approach are increasingly frequently cropping up in the structure of new-generation pattern classification systems [3, 6], although the adequate terminology is not always used. On the other hand, some researchers believe that the cognitive domain can be conquered by IT systems just as the researchers of simple perception and classification mechanisms have managed to transplant selected biological observations into the technological domain, namely into artificial neural networks [4]. However, the authors have major doubts whether this route will be productive and efficient, as there is a huge difference in scale between the neurobiological processes which are mapped by neural networks and the mental processes which should be deployed in cognitive information systems or cognitive pattern recognition approaches. The reason is that whereas neural networks are based on the action of neurons numbering from several to several thousand (at the most), mental processes involve hundreds of millions of neurons in the brain, which is a significant hindrance to any attempt to imitate them with computers. This is why it seems appropriate and right to try to base the design of future Cognitive Information Systems on attempts at the behavioural modelling of psychological phenomena and not on the structural imitation of neurophysiological processes. The general foundations for the design of such systems have been the subject of earlier publications [6, 7, 8]. However, it must be said that the methodology of designing universal systems of cognitive interpretation has yet to be developed fully. This applies in particular to systems oriented towards the cognitive analysis of multimedia information. Overcoming the barrier between the form of multimedia information (e.g. the shape of objects in a picture or the tones of sounds) and the sense implicitly contained in this information requires more research initially oriented towards detailed goals. Possibly, after some time, it will be possible to aggregate the experience gained while executing these individual, detailed jobs into a comprehensive, consistent methodology. However, for the time being, we have to satisfy ourselves with achieving individual goals one after another. These goals are mainly about moving away from the analysis of data describing single objects to a more general and semantically deepened analysis of data presenting or describing various components of images, or different images from the same video sequence. Some good examples of such visual data analysis will be presented in the following chapters.
References

1. Bichindaritz, I., Vaidya, S., Jain, A., Jain, L.C. (eds.): Computational Intelligence in Healthcare 4. SCI, vol. 309, pp. 347–369. Springer, Heidelberg (2010)
2. Branquinho, J. (ed.): The Foundations of Cognitive Science. Clarendon Press, Oxford (2001)
3. Davis, L.S. (ed.): Foundations of Image Understanding. Kluwer Academic Publishers (2001)
4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. A Wiley-Interscience Publication, John Wiley & Sons, Inc. (2001)
5. Meystel, A.M., Albus, J.S.: Intelligent Systems – Architecture, Design, and Control. A Wiley-Interscience Publication, John Wiley & Sons, Inc., Canada (2002)
6. Ogiela, L., Ogiela, M.R.: Cognitive Techniques in Visual Data Interpretation. Springer, Heidelberg (2009)
7. Ogiela, M.R., Tadeusiewicz, R.: Modern Computational Intelligence Methods for the Interpretation of Medical Images. Springer, Heidelberg (2008)
8. Ogiela, M.R., Ogiela, L.: Cognitive Informatics in Medical Image Semantic Content Understanding. In: Kim, T.-H., Stoica, A., Chang, R.-S. (eds.) Security-Enriched Urban Computing and Smart Grid. CCIS, vol. 78, pp. 131–138. Springer, Heidelberg (2010)
9. Tolk, A., Jain, L.C.: Intelligence-Based Systems Engineering. Intelligent Systems Reference Library, vol. 10. Springer (2011)
10. Vernon, D., Metta, G., Sandini, G.: A survey of artificial cognitive systems: Implications for the autonomous development of mental capabilities in computational agents. IEEE Transactions on Evolutionary Computation 11(2), 151–180 (2007)
Chapter 2

Neural Networks for Handwriting Recognition

Marcus Liwicki¹, Alex Graves², and Horst Bunke³

¹ German Research Center for Artificial Intelligence, Trippstadter Str. 122, 67663 Kaiserslautern, Germany
  e-mail: [email protected]
² Institute for Informatics 6, Technical University of Munich, Boltzmannstr. 3, 85748 Garching bei München, Germany
  e-mail: [email protected]
³ Institute for Computer Science and Applied Mathematics, Neubrückstr. 10, 3012 Bern, Switzerland
  e-mail: [email protected]

Abstract. In this chapter a novel kind of Recurrent Neural Network (RNN) is described. Bi- and multidimensional RNNs combined with Connectionist Temporal Classification allow for the direct recognition of raw stroke data or raw pixel data. In general, recognizing lines of unconstrained handwritten text is a challenging task. The difficulty of segmenting cursive or overlapping characters, combined with the need to assimilate context information, has led to low recognition rates for even the best current recognizers. Most recent progress in the field has been made either through improved preprocessing or through advances in language modeling. Relatively little work has been done on the basic recognition algorithms. Indeed, most systems rely on the same hidden Markov models that have been used for decades in speech and handwriting recognition, despite their well-known shortcomings. This chapter describes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labeling tasks where the data is hard to segment and contains long-range, bidirectional or multidirectional interdependencies. In experiments on two unconstrained handwriting databases, the new approach achieves word recognition accuracies of 79.7% on on-line data and 74.1% on off-line data, significantly outperforming a state-of-the-art HMM-based system. Promising experimental results on various other datasets from different countries are also presented. A toolkit implementing the networks is freely available to the public.
1 Introduction

Handwriting recognition is traditionally divided into on-line and off-line recognition. In on-line recognition a time ordered sequence of coordinates, representing
the movement of the pen-tip, is captured, while in the off-line case only an image of the text is available. Because of the greater ease of extracting relevant features, online recognition generally yields better results [1]. Another crucial division is that between the recognition of isolated characters or words, and the recognition of whole lines of text. Unsurprisingly, the latter is substantially harder, and the excellent results that have been obtained for e.g. digit and character recognition [2], [3] have never been matched for complete lines. Lastly, handwriting recognition can be split into cases where the writing style is constrained in some way—for example, only hand printed characters are allowed—and the more challenging scenario where it is unconstrained. Despite more than 40 years of handwriting recognition research [2], [3], [4], [5], developing a reliable, general-purpose system for unconstrained text line recognition remains an open problem.
1.1 State-of-the-Art

A well-known testbed for isolated handwritten character recognition is the UNIPEN database [6]. Systems that have been found to perform well on UNIPEN include: a writer-independent approach based on hidden Markov models [7]; a hybrid technique called cluster generative statistical dynamic time warping (CSDTW) [8], which combines dynamic time warping with HMMs and embeds clustering and statistical sequence modeling in a single feature space; and a support vector machine with a novel Gaussian dynamic time warping kernel [9]. Typical error rates on UNIPEN range from 3% for digit recognition to about 10% for lower case character recognition.

Similar techniques can be used to classify isolated words, and this has given good results for small vocabularies (e.g., a writer-dependent word error rate of about 4.5% for 32 words [10]). However, an obvious drawback of whole word classification is that it does not scale to large vocabularies. For large vocabulary recognition tasks, such as those considered in this chapter, the usual approach is to recognize individual characters and map them onto complete words using a dictionary. Naively, we could do this by presegmenting words into characters and classifying each segment. However, segmentation is difficult for cursive or unconstrained text, unless the words have already been recognized. This creates a circular dependency between segmentation and recognition that is often referred to as Sayre's paradox [11].

Nonetheless, approaches have been proposed where segmentation is carried out before recognition. Some techniques for character segmentation, based on unsupervised learning and data-driven methods, are given in [3]. Other strategies first segment the text into basic strokes, rather than characters. The stroke boundaries may be defined in various ways, such as the minima of the velocity, the minima of the y-coordinates, or the points of maximum curvature. For example, one online approach first segments the data at the minima of the y-coordinates and then applies self-organizing maps [12]. Another, offline, approach [13] uses the minima of the vertical histogram for an initial estimation of the character boundaries and then applies various heuristics to improve the segmentation.
A more satisfactory solution to Sayre's paradox would be to segment and recognize at the same time. Hidden Markov models (HMMs) are able to do this, which is one reason for their popularity for unconstrained handwriting [14], [15], [16], [17], [18], [19]. The idea of applying HMMs to handwriting recognition was originally motivated by their success in speech recognition [20], where a similar conflict exists between recognition and segmentation. Over the years, numerous refinements of the basic HMM approach have been proposed, such as the writer independent system considered in [7], which combines point oriented and stroke oriented input features. However, HMMs have several well-known drawbacks. One of these is that they assume the probability of each observation depends only on the current state, which makes contextual effects difficult to model. Another is that HMMs are generative, while discriminative models generally give better performance in labeling and classification tasks. Recurrent neural networks (RNNs) do not suffer from these limitations, and would therefore seem a promising alternative to HMMs. However, the application of RNNs alone to handwriting recognition has so far been limited to isolated character recognition (e.g. [21]). One reason for this is that the traditional neural network objective functions require a separate training signal for every point in the input sequence, which in turn requires presegmented data. A more successful use of neural networks for handwriting recognition has been to combine them with HMMs in the so-called hybrid approach [22], [23]. A variety of network architectures have been tried for hybrid handwriting recognition, including multilayer perceptrons [24], [25], time delay neural networks (TDNNs) [18], [26], [27], and RNNs [28], [29], [30]. However, although hybrid models alleviate the difficulty of introducing context to HMMs, they still suffer from many of the drawbacks of HMMs, and they do not realize the full potential of RNNs for sequence modeling.
1.2 Contribution

This chapter describes a recently introduced alternative approach, in which a single RNN is trained directly for sequence labeling. The network uses connectionist temporal classification (CTC) combined with the bidirectional Long Short-Term Memory (BLSTM) architecture, which provides access to long-range input context in both directions. A further enhancement which allows the network to work in multiple dimensions is also presented in this chapter. The so-called Multidimensional LSTM (MDLSTM) is very successful even on raw pixel data.

The rest of this chapter is organized as follows. Section 2 presents the handwritten data and the feature extraction techniques. Subsequently, Section 3 describes the novel neural network classifier. Experimental results are presented in Section 4. Finally, Section 5 concludes this chapter.
Fig. 1 Processing steps of the handwriting recognition system
2 Data Processing

As stated above, handwritten data can be acquired in two formats, online and offline. In this section typical preprocessing and feature extraction techniques are presented. These techniques have been applied in our experiments.
The online and offline databases used are the IAM-OnDB [31] and the IAM-DB [32], respectively. Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard, while the IAM-DB consists of scanned images of handwritten forms. (The databases and benchmark tasks are available at http://www.iam.unibe.ch/fki/databases.)
2.1 General Processing Steps

A recognition system for unconstrained Roman script is usually divided into consecutive units which iteratively process the handwritten input data to finally obtain the transcription. The main units are illustrated in Fig. 1 and summarized in this section. Certainly, there are differences between offline and online processing, but the principles are the same; only the methodology for performing the individual steps differs.

First, preprocessing steps are applied to reduce noise in the raw data. The input is raw handwritten data and the output usually consists of extracted text lines. The amount of effort that needs to be invested in preprocessing depends on the given data. If the data have been acquired from a system that does not produce any noise and only single words have been recorded, there is nothing to do in this step. But usually the data contain noise which needs to be removed to improve the quality of the handwriting, e.g., by means of image enhancement. The offline images are furthermore binarized, and the online data, which usually contain noisy points and gaps within strokes, are processed with some heuristics to recover from these artifacts. These operations are described in Ref. [33]. The cleaned text data is then automatically divided into lines using some simple heuristics.

Next, the data is normalized, i.e., writer-specific characteristics of the handwriting are removed to make writings from different authors look more similar to each other. This is a very important step in any handwriting recognition system, because the writing styles of the writers differ with respect to skew, slant, height, and width of the characters. In the literature there is no standard way of normalizing the data, but many systems use similar techniques. First, the text line is corrected with regard to its skew, i.e., it is rotated so that the baseline is parallel to the x-axis. Then, slant correction is performed so that the writing becomes upright. The next important step is the computation of the baseline and the corpus line. These two lines divide the text into three areas: the upper area, which mainly contains the ascenders of the letters; the middle area, where the corpus of the letters is present; and the lower area with the descenders of some letters. These three areas are normalized to predefined heights. Often, some additional normalization steps are performed, depending on the domain. In offline recognition, thinning and binarization may be applied. In online recognition the delayed strokes, e.g., the crossing of a "t" or the dot of an "i", are usually removed, and equidistant resampling is applied.
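To make the skew-correction step concrete, here is a minimal Python sketch (our own illustration, not the system's actual code) that rotates a line of online points so that an estimated baseline becomes horizontal; approximating the baseline by a least-squares line fit is a simplifying assumption:

```python
import numpy as np

def correct_skew(points):
    """Rotate a text line so that its estimated baseline is horizontal.

    points: (N, 2) array of (x, y) pen coordinates of one text line.
    The baseline is approximated by a least-squares line fit, which is
    a simplification of the estimators used in real systems.
    """
    x, y = points[:, 0], points[:, 1]
    slope = np.polyfit(x, y, 1)[0]        # fitted baseline slope
    angle = np.arctan(slope)              # skew angle of the line
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s], [s, c]])     # rotation undoing the skew
    return points @ rot.T

# A slightly rising synthetic "text line"
t = np.linspace(0, 20, 50)
pts = np.column_stack([t, 0.1 * t + np.sin(t)])
flat = correct_skew(pts)
print(np.polyfit(flat[:, 0], flat[:, 1], 1)[0])  # slope now close to 0
```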
Subsequently, features are extracted from the normalized data. This particular step is needed because the recognizers need numerical data as their input. However, no standard method for computing the features exists in the literature. One common method in offline recognition of handwritten text lines is the use of a sliding window moving in the writing direction over the text. Features are extracted at every window position, resulting in a sequence of feature vectors. In the case of online recognition the points are already available in a time-ordered sequence, which makes it easier to get a sequence of feature vectors in writing order. If there is a fixed size of the input pattern, such as in character or word recognition, one feature vector of a constant size can be extracted for each pattern.
Fig. 2 Features of the vicinity
2.2 Our Online System

In the system described in this chapter, state-of-the-art feature extraction methods are applied to extract the features from the preprocessed data. The feature set input to the online recognizer consists of 25 features which utilize information from both the real online data, stored in XML format, and pseudo offline information automatically generated from the online data. For each (x, y)-coordinate recorded by the acquisition device a set of 25 features is extracted, resulting in a sequence of 25-dimensional vectors for each given text line. These features can be divided into two classes. The first class consists of features extracted for each point by considering the neighbors with respect to time. The second class takes the offline matrix representation into account, i.e., it is based on spatial information. The features of the first class are the following:

• pen-up/pen-down: a boolean variable indicating whether the pen-tip touches the board or not. Consecutive strokes are connected with straight lines for which this feature has the value false.
• hat-feature: this binary feature indicates whether a delayed stroke has been removed at the same horizontal position as the considered point.
• speed: the velocity is computed before resampling and then interpolated.
• x-coordinate: the x-position is taken after high-pass filtering, i.e., after subtracting a moving average from the real horizontal position.
• y-coordinate: this feature represents the vertical position of the point after normalization.
• writing direction: here we have a pair of features, given by the cosine and sine of the angle between the line segment starting at the point and the x-axis.
• curvature: similarly to the writing direction, this is a pair of features, given by the cosine and sine of the angle between the lines to the previous and the next point.
• vicinity aspect: this feature is equal to the aspect of the trajectory (see Fig. 2).
• vicinity slope: this pair of features is given by the cosine and sine of the angle of the straight line from the first to the last vicinity point (see Fig. 2).
• vicinity curliness: this feature is defined as the length of the trajectory in the vicinity divided by max(Δx(t), Δy(t)) (see Fig. 2).
• vicinity linearity: here we use the average squared distance d² of each point in the vicinity to the straight line from the first to the last vicinity point (see Fig. 2).
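As an illustration of two of the feature pairs above (writing direction and curvature), the following sketch computes them from a resampled point sequence; this is our own minimal formulation, and the actual system may differ in details such as smoothing:

```python
import numpy as np

def direction_and_curvature(points):
    """cos/sin of the writing direction per segment, and of the turning
    angle between consecutive segments (cf. the feature list above).

    points: (N, 2) array of equidistantly resampled (x, y) coordinates.
    """
    deltas = np.diff(points, axis=0)                  # segment vectors
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])   # segment directions
    direction = np.column_stack([np.cos(angles), np.sin(angles)])
    turning = np.diff(angles)                         # angle between segments
    curvature = np.column_stack([np.cos(turning), np.sin(turning)])
    return direction, curvature
```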
Fig. 3 Pseudo offline features
The features of the second class are all computed using a two-dimensional matrix B representing the offline version of the data. For each position the number of points on the trajectory of the strokes is stored. This can be seen as a low-resolution image of the handwritten data. The following features are used:

• ascenders/descenders: these two features count the number of points above the corpus line (ascenders) and below the baseline (descenders). Only points which
have an x-coordinate in the vicinity of the current point are considered. Additionally the points must have a minimal distance to the lines to be considered as part of an ascender or descender. The corresponding distances are set to a predefined fraction of the corpus height. • context map: the two-dimensional vicinity of the current point is transformed to a 3×3 map. The number of black points in each region is taken as a feature value. So we obtain altogether nine features of this type.
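A minimal sketch of the context-map feature just described, assuming the offline matrix B is available as a NumPy array; the vicinity radius is a free parameter of this illustration:

```python
import numpy as np

def context_map(B, cx, cy, radius):
    """Sum the trajectory counts of matrix B over a 3x3 grid of regions
    around point (cx, cy), yielding the nine context-map features."""
    window = B[max(0, cy - radius):cy + radius + 1,
               max(0, cx - radius):cx + radius + 1]
    h, w = window.shape
    return [window[i * h // 3:(i + 1) * h // 3,
                   j * w // 3:(j + 1) * w // 3].sum()
            for i in range(3) for j in range(3)]
```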
2.3 Our Offline System

To extract the feature vectors from the offline images, a sliding window approach is used. The width of the window is one pixel, and nine geometrical features are computed at each window position. Each text line image is therefore converted to a sequence of 9-dimensional vectors. The nine features are as follows:

• The mean gray value of the pixels
• The center of gravity of the pixels
• The second order vertical moment of the center of gravity
• The positions of the uppermost and lowermost black pixels
• The rate of change of these positions (with respect to the neighboring windows)
• The number of black-white transitions between the uppermost and lowermost pixels
• The proportion of black pixels between the uppermost and lowermost pixels

For a more detailed description of the offline features, see [17]. A minimal sketch of how a few of these window features can be computed is given at the end of this section.

In the next phase indicated in Fig. 1, a classification system is applied which generates a list of candidates or even a recognition lattice. This step and the last step, the postprocessing, are described in the next section.
3 Neural Network Based Recognition

The main focus of this chapter is the recently introduced neural network classifier based on CTC combined with bidirectional or multidimensional LSTM. This section describes the different aspects of the architecture and gives brief insights into the algorithms behind them.
3.1 Recurrent Neural Networks (RNNs)

Recurrent neural networks (RNNs) are a connectionist model containing a self-connected hidden layer. One benefit of the recurrent connection is that a 'memory' of previous inputs remains in the network's internal state, allowing it to make use
of past context. Context plays an important role in handwriting recognition, as illustrated in Figure 4. Another important advantage of recurrency is that the rate of change of the internal state can be finely modulated by the recurrent weights, which builds in robustness to localized distortions of the input data.
Fig. 4 Importance of context. The characters “ur” would be hard to recognize without the context of the word “entourage”.
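To make the recurrent update of Section 3.1 concrete, here is a generic textbook-style RNN step in NumPy (not the specific architecture used later): the hidden state depends on the current input and, through the recurrent weights, on all previous inputs.

```python
import numpy as np

def rnn_step(x_t, b_prev, W_in, W_rec, f=np.tanh):
    # New hidden state from current input and previous hidden state.
    return f(W_in @ x_t + W_rec @ b_prev)

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.5, size=(4, 3))
W_rec = rng.normal(scale=0.5, size=(4, 4))
b = np.zeros(4)
for x in rng.normal(size=(10, 3)):    # a sequence of 10 input vectors
    b = rnn_step(x, b, W_in, W_rec)   # internal state carries past context
```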
3.2 Long Short-Term Memory (LSTM)

Unfortunately, the range of contextual information that standard RNNs can access is quite limited. The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network's recurrent connections, and is repeatedly scaled by the connection weights. In practice this shortcoming (referred to in the literature as the vanishing gradient problem) makes it hard for an RNN to bridge gaps of more than about 10 time steps between relevant input and target events.

Long Short-Term Memory (LSTM) is an RNN architecture specifically designed to address the vanishing gradient problem. An LSTM hidden layer consists of multiple recurrently connected subnets, known as memory blocks. Each block contains a set of internal units, or cells, whose activation is controlled by three multiplicative units: the input gate, forget gate and output gate. Figure 5 provides a detailed illustration of an LSTM memory block with a single cell. The effect of the gates is to allow the cells to store and access information over long periods of time. For example, as long as the input gate remains closed (i.e., has an activation close to 0), the activation of the cell will not be overwritten by the new inputs arriving in the network. Similarly, the cell activation is only available to the rest of the network when the output gate is open, and the cell's recurrent connection is switched on and off by the forget gate.
Fig. 5 LSTM memory block with one cell
The mathematical background of LSTM is described in depth in [40, 41, 34]. A short description follows. A conventional recurrent multi-layer perceptron network (MLP) contains a hidden layer where all neurons of the hidden layer are fully connected with all neurons of the same layer (the recurrent connections). The activation of a single cell at timestamp $t$ is a weighted sum of the inputs $x_i^t$ plus the weighted sum of the outputs of the previous timestamp, $b_h^{t-1}$. This can be expressed as follows (or in matrix form):

$$a^t = \sum_i w_i x_i^t + \sum_h w_h b_h^{t-1} \quad\Longleftrightarrow\quad A^t = W_i \cdot X^t + W_h \cdot B^{t-1}$$

$$b^t = f(a^t) \quad\Longleftrightarrow\quad B^t = f(A^t)$$

Since the outputs of the previous timestamp are just calculated by the squashing function of the corresponding cell activations, the influence of the network input at the previous timestamp can be considered as smaller, since it has already been weighted a second time. Thus the overall network activation can be roughly rewritten as:

$$A^t = g\left(X^t,\, W_h X^{t-1},\, W_h^2 X^{t-2},\, \ldots,\, W_h^{t-1} X^1\right)$$
where $X^t$ is the overall net input at timestamp $t$ and $W_h$ is the weight matrix of the hidden layer. Note that for clarity reasons we use this abbreviated form of the complex formula, where the input weights do not directly appear (everything is hidden in the function $g(\cdot)$). This formula reveals that the influence of earlier timestamps $t-n$ vanishes rapidly, as the time difference $n$ appears in the exponent of the weight matrix. Since the values of the weight matrix $W_h$ are smaller than 1, the $n$-th power of $W_h$ is close to zero.
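The decay can be checked numerically: with recurrent weights of magnitude well below one, the powers of the weight matrix, and hence the influence of inputs $n$ steps back, shrink rapidly (an illustration added here, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(1)
W_h = rng.uniform(-0.4, 0.4, size=(5, 5))   # entries well below 1
for n in (1, 5, 10, 20):
    # Largest entry of W_h^n: the factor scaling an input n steps back
    print(n, np.abs(np.linalg.matrix_power(W_h, n)).max())
```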
Introducing the LSTM cell brings in three new cells, which all get the weighted sum of the outputs of the hidden layer at the previous timestamp as an input, i.e., for the input gate:

$$a_\iota^t = W_{i,\iota} \cdot X^t + W_{h,\iota} \cdot B^{t-1} + w_{c,\iota}\, s_c^{t-1}$$

where $s_c^{t-1}$ is the cell state of the previous timestamp and $W_{i,\iota}$ and $W_{h,\iota}$ are the weights for the current net input and the hidden layer output of the previous timestamp, respectively. The activation of the forget gate is:

$$a_\theta^t = W_{i,\theta} \cdot X^t + W_{h,\theta} \cdot B^{t-1} + w_{c,\theta}\, s_c^{t-1}$$

which is the same formula, just with other weights (those trained for the forget gate). The cell activation is usually calculated by:

$$a_c^t = W_{i,c} \cdot X^t + W_{h,c} \cdot B^{t-1}$$

However, the cell state is then weighted with the outputs of the two gate cells:

$$s_c^t = \sigma(a_\iota^t)\, g(a_c^t) + \sigma(a_\theta^t)\, s_c^{t-1}$$

where $\sigma$ indicates that the sigmoid function is used as the squashing function for the gates and $g(\cdot)$ is the cell's activation function. As the sigmoid function often returns a value close to zero or one, the formula can be interpreted as:

$$s_c^t = [0 \text{ or } 1]\, g(a_c^t) + [0 \text{ or } 1]\, s_c^{t-1}$$

or in words: the cell state either depends on the input activation (if the input gate opens, i.e., the first weight is close to 1) or on the previous cell state (if the forget gate opens, i.e., the second weight is close to 1). This particular property enables the LSTM cell to bridge over long time periods. The value of the output gate is calculated similarly to the other gates, i.e.:

$$a_\omega^t = W_{i,\omega} \cdot X^t + W_{h,\omega} \cdot B^{t-1} + w_{c,\omega}\, s_c^t$$

and the final cell output is:

$$b_c^t = \sigma(a_\omega^t)\, h(s_c^t)$$

which again is either close to zero or the usual output of the cell, $h(s_c^t)$.
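The gating equations above translate directly into code. The following NumPy sketch performs one LSTM step; the choice of tanh for the cell functions $g$ and $h$ and the weight shapes are our own illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(X_t, B_prev, s_prev, p):
    """One LSTM step following the equations above.

    p maps names to weights: 'Wi_*' act on the net input X_t, 'Wh_*' on
    the previous hidden output B_prev, and 'wc_*' are the peephole
    weights on the cell state (iota/theta/omega = input/forget/output).
    """
    a_iota  = p['Wi_iota']  @ X_t + p['Wh_iota']  @ B_prev + p['wc_iota']  * s_prev
    a_theta = p['Wi_theta'] @ X_t + p['Wh_theta'] @ B_prev + p['wc_theta'] * s_prev
    a_c     = p['Wi_c']     @ X_t + p['Wh_c']     @ B_prev
    s_t = sigmoid(a_iota) * np.tanh(a_c) + sigmoid(a_theta) * s_prev  # cell state
    a_omega = p['Wi_omega'] @ X_t + p['Wh_omega'] @ B_prev + p['wc_omega'] * s_t
    return sigmoid(a_omega) * np.tanh(s_t), s_t   # cell output b_c^t, new state

n_in, n_cell = 3, 4
rng = np.random.default_rng(0)
shapes = {'Wi_iota': (n_cell, n_in), 'Wh_iota': (n_cell, n_cell), 'wc_iota': (n_cell,),
          'Wi_theta': (n_cell, n_in), 'Wh_theta': (n_cell, n_cell), 'wc_theta': (n_cell,),
          'Wi_c': (n_cell, n_in), 'Wh_c': (n_cell, n_cell),
          'Wi_omega': (n_cell, n_in), 'Wh_omega': (n_cell, n_cell), 'wc_omega': (n_cell,)}
p = {k: rng.normal(scale=0.3, size=s) for k, s in shapes.items()}
b, s = np.zeros(n_cell), np.zeros(n_cell)
for x in rng.normal(size=(5, n_in)):
    b, s = lstm_step(x, b, s, p)
```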
3.3 Bidirectional Recurrent Neural Networks

For many tasks it is useful to have access to future as well as past context. In handwriting recognition, for example, the identification of a given letter is helped by knowing the letters both to the right and left of it. Bidirectional Recurrent Neural Networks (BRNNs) [35] are able to access context in both directions along the input sequence. BRNNs contain two separate hidden layers, one of which processes the inputs forwards, while the other processes them backwards. Both hidden layers are connected to the output layer, which therefore has access to all past and future context of every point in the sequence. Combining BRNNs and LSTM gives bidirectional LSTM (BLSTM) [42].
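A minimal sketch of the bidirectional idea, using simple tanh layers in place of LSTM; the layer sizes and the concatenation of the two states are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
W_in_f, W_rec_f = rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=(4, 4))
W_in_b, W_rec_b = rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=(4, 4))

def bidirectional_pass(xs):
    """Run a forward and a backward hidden layer over the sequence xs and
    concatenate their states per timestep, so a subsequent output layer
    sees both past and future context (the BRNN idea described above)."""
    fwd, b = [], np.zeros(4)
    for x in xs:                       # forward layer: past context
        b = np.tanh(W_in_f @ x + W_rec_f @ b)
        fwd.append(b)
    bwd, b = [], np.zeros(4)
    for x in xs[::-1]:                 # backward layer: future context
        b = np.tanh(W_in_b @ x + W_rec_b @ b)
        bwd.append(b)
    return [np.concatenate(pair) for pair in zip(fwd, bwd[::-1])]

outs = bidirectional_pass(list(rng.normal(size=(6, 3))))
print(outs[0].shape)   # (8,) = 4 forward + 4 backward units
```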
3.4 Connectionist Temporal Classification (CTC) Standard RNN objective functions require a presegmented input sequence with a separate target for every segment. This has limited the applicability of RNNs in domains such as cursive handwriting recognition, where segmentation is difficult to determine. Moreover, because the outputs of a standard RNN are a series of independent, local classifications, some form of post-processing is required to transform them into the desired label sequence. Connectionist Temporal Classification (CTC) [36,34] is an RNN output layer specifically designed for sequence labeling tasks. It does not require the data to be presegmented, and it directly outputs a probability distribution over label sequences. CTC has been shown to outperform RNN-HMM hybrids in a speech recognition task [36]. A CTC output layer contains as many units as there are labels in the task, plus an additional 'blank' or 'no label' unit. The output activations are normalized using the softmax function, so that they sum to 1 and each lies in the range (0, 1):

$$y_k^t = \frac{e^{a_k^t}}{\sum_{k'} e^{a_{k'}^t}}$$

where $a_k^t$ is the unsquashed activation of output unit k at time t, and $y_k^t$ is the activation of the same unit after the softmax function is applied. The above activations are used to estimate the conditional probabilities $p(k, t \mid x)$ of observing the label (or blank) with index k at time t in the input sequence x:

$$y_k^t = p(k, t \mid x)$$
The conditional probability $p(\pi \mid x)$ of observing a particular path π through the lattice of label observations is then found by multiplying together the label and blank probabilities at every time step:

$$p(\pi \mid x) = \prod_{t=1}^{T} p(\pi_t, t \mid x) = \prod_{t=1}^{T} y_{\pi_t}^t$$

where $\pi_t$ is the label observed at time t along path π.
Paths are mapped onto label sequences $l \in L^{\leq T}$, where $L^{\leq T}$ denotes the set of all strings on the alphabet L of length at most T, by an operator B that first removes the repeated labels, then the blanks. For example, both $B(a, -, a, b, -)$ and $B(-, a, a, -, -, a, b, b)$ yield the labelling $(a, a, b)$. Since the paths are mutually exclusive, the conditional probability of a given labelling $l \in L^{\leq T}$ is the sum of the probabilities of all the paths corresponding to it:

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$$
The above step is what allows the network to be trained with unsegmented data. The intuition is that, because we do not know where the labels within a particular transcription will occur, we sum over all the places where they could occur. In general, a large number of paths will correspond to the same label sequence, so a naïve calculation of the equation above is infeasible. However, it can be efficiently evaluated using a graph-based algorithm, similar to the forward-backward algorithm for HMMs. More details about the CTC forward-backward algorithm appear in [39].
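The collapsing operator B and the path sum above can be illustrated with a tiny brute-force sketch (the enumeration is exponential in T, so it is for illustration only; the efficient forward-backward algorithm of [39] replaces it in practice). All names are illustrative.

from itertools import product, groupby

BLANK = '-'

def collapse(path):
    """Operator B: remove repeated labels, then remove blanks."""
    deduped = [k for k, _ in groupby(path)]
    return tuple(k for k in deduped if k != BLANK)

def label_probability(l, y, alphabet):
    """Brute-force p(l|x): sum p(pi|x) over all paths pi with B(pi) = l.

    y[t][k] is the softmax output y_k^t (e.g., a dict per timestep);
    alphabet must include the blank symbol.
    """
    T = len(y)
    total = 0.0
    for path in product(alphabet, repeat=T):
        if collapse(path) == tuple(l):
            p = 1.0
            for t, k in enumerate(path):
                p *= y[t][k]   # p(pi|x) = prod_t y^t_{pi_t}
            total += p
    return total

# Example from the text: both paths collapse to the labelling (a, a, b)
assert collapse(('a', '-', 'a', 'b', '-')) == ('a', 'a', 'b')
assert collapse(('-', 'a', 'a', '-', '-', 'a', 'b', 'b')) == ('a', 'a', 'b')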
3.5 Multidimensional Recurrent Neural Networks Ordinary RNNs are designed for time series and other data with a single spatio-temporal dimension. However, the benefits of RNNs (such as robustness to input distortion and flexible use of surrounding context) are also advantageous for multidimensional data, such as images and video sequences. Multidimensional recurrent neural networks (MDRNNs) [43, 34], a special case of Directed Acyclic Graph RNNs [44], generalize the basic structure of RNNs to multidimensional data. Rather than having a single recurrent connection, MDRNNs have as many recurrent connections as there are spatio-temporal dimensions in the data. This allows them to access previous context information along all input directions. Multidirectional MDRNNs are the generalization of bidirectional RNNs to multiple dimensions. For an n-dimensional data sequence, $2^n$ different hidden layers are used to scan through the data in all directions. As with bidirectional RNNs, all
the layers are connected to a single output layer, which therefore has access to context information in both directions along all dimensions. Multidimensional LSTM (MDLSTM) is the generalization of bidirectional LSTM to multidimensional data.
3.6 Hierarchical Subsampling Recurrent Neural Networks Hierarchical subsampling is a common technique in computer vision [45] and other domains with large input spaces. The basic principle is to iteratively re-represent the data at progressively lower resolutions, using a hierarchy of feature extractors. The features extracted at each level are subsampled and used as input to the next level. The number and complexity of the features typically increase as one climbs the hierarchy. This is much more efficient for high-resolution data than a single 'flat' feature extractor, since most of the computations are carried out on low-resolution feature maps rather than, for example, raw pixels. A well-known connectionist hierarchical subsampling architecture is the Convolutional Neural Network [46]. Hierarchical subsampling is also possible with RNNs, and hierarchies of MDLSTM layers have been applied to offline handwriting recognition [47]. Hierarchical subsampling with LSTM is equally useful for long 1D sequences, such as raw speech data or online handwriting trajectories with a high sampling rate. From the point of view of handwriting recognition, the most interesting aspect of hierarchical subsampling RNNs is that they can be applied directly to the raw input data (offline images or online point sequences) without any normalization or feature extraction.
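A minimal sketch of this principle for 1D sequences, assuming the lstm_step and bidirectional_pass helpers from the previous listings: each level runs a bidirectional LSTM layer over the sequence and then subsamples the resulting feature sequence by averaging non-overlapping windows before passing it to the next level. The window size and level configuration are illustrative choices, not values from the chapter.

def subsample(seq, window):
    # Average non-overlapping windows, shortening the sequence ~window-fold
    return [np.mean(seq[i:i + window], axis=0)
            for i in range(0, len(seq), window)]

def hierarchical_pass(xs, levels):
    # levels: list of (W_fwd, W_bwd, n_hidden, window) tuples, one per level
    seq = xs
    for W_fwd, W_bwd, n_hidden, window in levels:
        seq = bidirectional_pass(seq, W_fwd, W_bwd, n_hidden)
        seq = subsample(seq, window)   # higher levels see lower resolutions
    return seq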
4 Experiments The experiments have been performed with the freely available RNNLIB tool by Alex Graves.2 This tool implements the network architecture and furthermore provides examples for the recognition of several scripts.
4.1 Comparison with HMMs on the IAM Databases The aim of the first experiments was to evaluate the performance of the complete RNN handwriting recognition system, illustrated in Figure 6, for both online and offline handwriting. In particular we wanted to see how it compared to an HMM-based system. The online and offline databases used were the IAM-OnDB and the IAM-DB respectively (see above). Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard, while the IAM-DB consists of scanned images of handwritten forms.
2 http://sourceforge.net/projects/rnnl/
Fig. 6 Complete RNN handwriting recognition system (here applied to offline Arabic data)
To make the comparisons fair, the same online and offline preprocessing was used for both the HMM and RNN systems. In addition, the same dictionaries and language models were used for the two systems. For all the experiments, the task was to transcribe the text lines in the test set, using the words in the dictionary. The basic performance measure was the word accuracy:
$$100 \cdot \left(1 - \frac{\text{insertions} + \text{substitutions} + \text{deletions}}{\text{number of words in transcription}}\right)$$

where the number of word insertions, substitutions and deletions is summed over the whole test set. For the RNN system, we also recorded the character accuracy, defined as above except with characters instead of words.
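A sketch of this measure, assuming that insertions, substitutions and deletions are obtained from a standard word-level Levenshtein alignment between the reference transcription and the recognizer output (function names are illustrative):

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (insertions + substitutions + deletions)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)]

def word_accuracy(pairs):
    """pairs: iterable of (reference_words, recognized_words) per text line."""
    errors = sum(edit_distance(r, h) for r, h in pairs)
    n_words = sum(len(r) for r, _ in pairs)
    return 100.0 * (1.0 - errors / n_words)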
Table 1 Main results for online data

System        Word Accuracy   Character Accuracy
HMM           65.0%           –
CTC (BLSTM)   79.7%           88.5%

Table 2 Main results for offline data

System        Word Accuracy   Character Accuracy
HMM           64.5%           –
CTC (BLSTM)   74.1%           81.8%
As can be seen from Tables 1 and 2, the RNN substantially outperformed the HMM on both databases. To put these results in perspective, the Microsoft tablet PC handwriting recognizer [37] gave a word accuracy score of 71.32% on the online test set. This result is not directly comparable to our own, since the Microsoft system was trained on a different training set, and uses considerably more sophisticated language modeling than the HMM and RNN systems we implemented. However, it indicates that the RNN-based recognizer is competitive with the best commercial systems for unconstrained handwriting.
4.2 Recognition Performance of MDLSTM on Contest Data The MDLSTM system participated in three handwriting recognition contests at ICDAR 2009 (see the proceedings in [38]). The recognition tasks were based on different scripts. In all cases, the systems had to recognize handwriting from unknown writers.

Table 3 Summarized results from the online Arabic handwriting recognition competition

System           Word Accuracy   Time/Image
REGIM HMM        52.67%          6402.24 ms
Vision Objects   98.99%          69.41 ms
CTC (BLSTM)      95.70%          1377.22 ms

Table 4 Summarized results from the offline Arabic handwriting recognition competition

System             Word Accuracy   Time/Image
Arab-Reader HMM    76.66%          2583.64 ms
Multi-Stream HMM   74.51%          143269.81 ms
CTC (MDLSTM)       81.06%          371.61 ms
Table 5 Summarized results from the offline French handwriting recognition competition

System                  Word Accuracy
HMM+MLP Combination     83.17%
Non-Symmetric HMM       83.17%
CTC (MDLSTM)            93.17%
A summary of the results appears in Tables 3-5. As can be seen, the approach described in this chapter always outperformed the other systems in the offline case. This observation is very promising, because the system uses just the two-dimensional raw pixel data as input. For the online competition (Table 3) a commercial recognizer performed better than the CTC approach. However, if the CTC system were combined with state-of-the-art preprocessing and feature extraction methods, it would probably reach a higher performance. This observation has been made in [39], where experiments extending those in Section 4.1 were performed. A look at the computation time (milliseconds per text line) also reveals very promising results. The MDLSTM combined with CTC was among the fastest recognizers in the competitions. Using pruning strategies could further increase the recognition speed.
5 Conclusion This chapter described a novel approach for recognizing unconstrained handwritten text, using a recurrent neural network. The key features of the network are the bidirectional Long Short-Term Memory architecture, which provides access to long range, bidirectional contextual information, and the Connectionist Temporal Classification output layer, which allows the network to be trained on unsegmented sequence data. In experiments on online and offline handwriting data, the new approach outperformed state-of-the-art HMM-based classifiers and several other recognizers. We conclude that this system represents a significant advance in the field of unconstrained handwriting recognition, and merits further research. A toolkit implementing the presented architecture is freely available to the public.
References [1] Seiler, R., Schenkel, M., Eggimann, F.: Off-line cursive handwriting recognition compared with on-line recognition. In: ICPR 1996: Proceedings of the International Conference on Pattern Recognition (ICPR 1996), vol. IV-7472, p. 505. IEEE Computer Society, Washington, DC, USA (1996) [2] Tappert, C., Suen, C., Wakahara, T.: The state of the art in online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(8), 787–808 (1990)
[3] Plamondon, R., Srihari, S.N.: On-line and off-line handwriting recognition: A comprehensive survey. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 63–84 (2000) [4] Vinciarelli, A.: A survey on off-line cursive script recognition. Pattern Recognition 35(7), 1433–1446 (2002) [5] Bunke, H.: Recognition of cursive roman handwriting - past present and future. In: Proc. 7th Int. Conf. on Document Analysis and Recognition, vol. 1, pp. 448–459 (2003) [6] Guyon, I., Schomaker, L., Plamondon, R., Liberman, M., Janet, S.: Unipen project of on-line data exchange and recognizer benchmarks. In: Proc. 12th Int. Conf. on Pattern Recognition, pp. 29–33 (1994) [7] Hu, J., Lim, S., Brown, M.: Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition 33(1), 133–147 (2000) [8] Bahlmann, C., Burkhardt, H.: The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping. IEEE Trans. Pattern Anal. and Mach. Intell. 26(3), 299–310 (2004) [9] Bahlmann, C., Haasdonk, B., Burkhardt, H.: Online handwriting recognition with support vector machines - a kernel approach. In: Proc. 8th Int. Workshop on Frontiers in Handwriting Recognition, pp. 49–54 (2002) [10] Wilfong, G., Sinden, F., Ruedisueli, L.: On-line recognition of handwritten symbols. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(9), 935–940 (1996) [11] Sayre, K.M.: Machine recognition of handwritten words: A project report. Pattern Recognition 5(3), 213–228 (1973) [12] Schomaker, L.: Using stroke- or character-based self-organizing maps in the recognition of on-line, connected cursive script. Pattern Recognition 26(3), 443–450 (1993) [13] Kavallieratou, E., Fakotakis, N., Kokkinakis, G.: An unconstrained handwriting recognition system. Int. Journal on Document Analysis and Recognition 4(4), 226–242 (2002) [14] Bercu, S., Lorette, G.: On-line handwritten word recognition: An approach based on hidden Markov models. In: Proc. 3rd Int. Workshop on Frontiers in Handwriting Recognition, pp. 385–390 (1993) [15] Starner, T., Makhoul, J., Schwartz, R., Chou, G.: Online cursive handwriting recognition using speech recognition techniques. In: Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 125–128 (1994) [16] Hu, J., Brown, M., Turin, W.: HMM based online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(10), 1039–1045 (1996) [17] Marti, U.-V., Bunke, H.: Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. Int. Journal of Pattern Recognition and Artificial Intelligence 15, 65–90 (2001) [18] Schenkel, M., Guyon, I., Henderson, D.: On-line cursive script recognition using time delay neural networks and hidden Markov models. Machine Vision and Applications 8, 215–223 (1995) [19] El-Yacoubi, A., Gilloux, M., Sabourin, R., Suen, C.: An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(8), 752–760 (1999) [20] Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE 77(2), 257–286 (1989)
[21] Bourbakis, N.G.: Handwriting recognition using a reduced character method and neural nets. In: Proc. SPIE Nonlinear Image Processing VI, vol. 2424, pp. 592–601 (1995) [22] Bourlard, H., Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers (1994) [23] Bengio, Y.: Markovian models for sequential data. Neural Computing Surveys 2, 129–162 (1999) [24] Brakensiek, A., Kosmala, A., Willett, D., Wang, W., Rigoll, G.: Performance evaluation of a new hybrid modeling technique for handwriting recognition using identical on-line and off-line data. In: Proc. 5th Int. Conf. on Document Analysis and Recognition, Bangalore, India, pp. 446–449 (1999) [25] Marukatat, S., Artires, T., Dorizzi, B., Gallinari, P.: Sentence recognition through hybrid neuro-markovian modelling. In: Proc. 6th Int. Conf. on Document Analysis and Recognition, pp. 731–735 (2001) [26] Jaeger, S., Manke, S., Reichert, J., Waibel, A.: Online handwriting recognition: the NPen++ recognizer. Int. Journal on Document Analysis and Recognition 3(3), 169–180 (2001) [27] Caillault, E., Viard-Gaudin, C., Ahmad, A.R.: MS-TDNN with global discriminant trainings. In: Proc. 8th Int. Conf. on Document Analysis and Recognition, pp. 856–861 (2005) [28] Senior, A.W., Fallside, F.: An off-line cursive script recognition system using recurrent error propagation networks. In: International Workshop on Frontiers in Handwriting Recognition, Buffalo, NY, USA, pp. 132–141 (1993) [29] Senior, A.W., Robinson, A.J.: An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 309–321 (1998) [30] Schenk, J., Rigoll, G.: Novel hybrid NN/HMM modelling techniques for on-line handwriting recognition. In: Proc. 10th Int. Workshop on Frontiers in Handwriting Recognition, pp. 619–623 (2006) [31] Liwicki, M., Bunke, H.: IAM-OnDB – an on-line English sentence database acquired from handwritten text on a whiteboard. In: Proc. 8th Int. Conf. on Document Analysis and Recognition, pp. 956–961 (2005) [32] Marti, U.-V., Bunke, H.: The IAM-database: an English sentence database for offline handwriting recognition. Int. Journal on Document Analysis and Recognition 5, 39–46 (2002) [33] Liwicki, M., Bunke, H.: Handwriting recognition of whiteboard notes – studying the influence of training set size and type. Int. Journal of Pattern Recognition and Artificial Intelligence 21(1), 83–98 (2007) [34] Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. Ph.D. thesis, Technical University of Munich (2008) [35] Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Processing 45, 2673–2681 (1997) [36] Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labeling unsegmented sequence data with recurrent neural networks. In: Proc. Int. Conf. on Machine Learning, pp. 369–376 (2006) [37] Pittman, J.A.: Handwriting recognition: Tablet PC text input. Computer 40(9), 49–54 (2007) [38] Proc. 10th Int. Conf. on Document Analysis and Recognition (2009)
[39] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), 855–868 (2009) [40] Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (1997) [41] Gers, F.: Long Short-Term Memory in Recurrent Neural Networks. Ph.D. thesis, EPFL (2001) [42] Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6), 602–610 (2005) [43] Graves, A., Fernández, S., Schmidhuber, J.: Multidimensional recurrent neural networks. In: Proc. Int. Conf. on Artificial Neural Networks (2007) [44] Baldi, P., Pollastri, G.: The principled design of large-scale recursive neural network architectures – DAG-RNNs and the protein structure prediction problem. J. Mach. Learn. Res. 4, 575–602 (2003) [45] Riesenhuber, M., Poggio, T.: Hierarchical models of object recognition in cortex. Nature Neuroscience 2(11), 1019–1025 (1999) [46] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998) [47] Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. Advances in Neural Information Processing Systems 21, 545–552 (2009)
Chapter 3
Moving Object Detection from Mobile Platforms Using Stereo Data Registration

Angel D. Sappa¹, David Gerónimo¹,², Fadi Dornaika³,⁴, Mohammad Rouhani¹, and Antonio M. López¹,²

¹ Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain
² Computer Science Department, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain
³ University of the Basque Country, San Sebastian, Spain
⁴ IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
{asappa,dgeronimo,rouhani,antonio}@cvc.uab.es, fadi [email protected]

Abstract. This chapter describes a robust approach for detecting moving objects from on-board stereo vision systems. It relies on a feature point quaternion-based registration, which avoids common problems that appear when computationally expensive iterative-based algorithms are used in dynamic environments. The proposed approach consists of three main stages. Initially, feature points are extracted and tracked through consecutive 2D frames. Then, a RANSAC-based approach is used for registering two point sets with known correspondences in the 3D space. The computed 3D rigid displacement is used to map two consecutive 3D point clouds into the same coordinate system by means of the quaternion method. Finally, moving objects correspond to those areas with large 3D registration errors. Experimental results show the viability of the proposed approach to detect moving objects such as vehicles or pedestrians in different urban scenarios.
1 Introduction The detection of moving objects in dynamic environments is generally tackled by first modelling the background. Then, foreground objects are directly obtained by performing an image subtraction (e.g., [14], [15], [32]). An extensive survey on motion detection algorithms can be found in [21]. In general, most of the approaches assume stationary cameras, which means all frames are registered in the same coordinate system. However, when the camera moves, the problem becomes intricate since it is unfeasible to have a unique background model. In such a case, moving object detection is generally tackled by compensating the camera motion so that all frames from a given video sequence, obtained from a moving camera/platform, are referred to the same reference system (e.g., [7], [27]). Moving object detection from a moving camera is a challenging problem in computer vision, having a number of applications in different domains: mobile robots
[26]; aerial surveillance [35], [34]; video segmentation [1]; vehicles and driver assistance [15], [24]; just to mention a few. As mentioned above, the underlying strategy in the solutions proposed in the literature essentially relies on the compensation of the camera motion. The differences between them lie in the sensor used (i.e., monocular/stereoscopic) or in the use of prior knowledge of the scene together with visual cues. For instance, [26] uses a stereo system and predicts the depth image for the current time by using ego-motion information and the depth image obtained at the previous time. Then, moving objects are easily detected by comparing the predicted depth image with the one obtained at the current time. Prior knowledge of the scene is also used in [35] and [34]. In these cases the authors assume that the scene is far from the camera (monocular) and that the depth variation of the objects of interest is small compared to that distance (e.g., airborne image sequences). In this context camera motion can be approximately compensated by a 2D parametric transformation (a 3×3 homography). Hence, motion compensation is achieved by warping a sequence of frames to a reference frame, where moving objects are easily detected by image subtraction as in the stationary camera case. A more general approach has been proposed in [1] for segmenting videos captured with a freely moving camera, coping with complex backgrounds and large moving non-rigid foreground objects. The authors propose a region-based motion compensation, which estimates the motion of the camera by finding the correspondence of a set of salient regions obtained by segmenting successive frames. In the fields of vehicle on-board vision systems and driver assistance, the compensation of camera motion has also attracted researchers' attention in recent years. For instance, in [15] the authors present a simple but effective approach based on the use of GPS information to roughly align frames from video sequences. A local appearance comparison between the aligned frames is used to detect objects. In the driver assistance context, but using an on-board stereo rig, [24] introduces a 3D data registration based approach to compensate for camera motion between two consecutive frames. In that work, consecutive stereo frames are aligned into the same coordinate system; then moving objects are obtained from a 3D frame subtraction, similarly to [26]. The current chapter proposes an extension of [24], detecting misregistration regions according to an adaptive threshold derived from the depth information. The remainder of this chapter is organized as follows. Section 2 introduces related work on the 3D data registration problem. Then, Section 3 presents the proposed approach for moving object detection. It consists of three stages: i) 2D feature point detection and tracking; ii) robust 3D data registration; and iii) moving object detection through consecutive stereo frame subtraction. Experimental results in real environments are presented in Section 4. Finally, conclusions and future work are given in Section 5.
2 Related Work A large number of approaches have been proposed in the computer vision community for 3D point registration during the last two decades (e.g., [3], [4], [22]). 3D
data point registration aims at finding the best transformation that places both the given data set and the corresponding model set into the same reference system. The different approaches proposed in the literature can be broadly classified into two categories, depending on whether initial information is required (fine registration) or not (coarse registration); a comprehensive survey of registration methods can be found in [23]. The approach followed in the current work for moving object detection lies within the fine rigid registration category. Typically, the fine registration process consists in iterating the following two stages. Firstly, the correspondence between every point from the current data set and the model set must be found. These correspondences are used to define the residual of the registration. Secondly, the best set of parameters that minimizes the accumulated residual must be computed. These two stages are iteratively applied until convergence is reached. The Iterative Closest Point (ICP)—originally introduced by [3] and [4]—is one of the most widely used registration techniques using this two-stage scheme. Since then, several variations and improvements have been proposed in order to increase efficiency and robustness (e.g., [25], [8], [5]). In order to avoid the point-wise nature of ICP, which makes the problem discrete and non-smooth, different techniques have been proposed: i) probabilistic representations are used to describe both the data and model sets (e.g., [31], [13]); ii) in [8] the point-wise problem is avoided by using a distance field of the model set; iii) an implicit polynomial (IP) is used in [36] to fit the distance field, which later defines a gradient field leading the data points towards the model set; iv) implicit polynomials have also been used in [28] to represent both the data set and the model set. In this case, an accurate pose estimate is computed based on the information from the polynomial coefficients. Probabilistic approaches avoid the point-wise correspondence problem by representing each set by a mixture of Gaussians (e.g., [13], [6]); hence, registration becomes a problem of aligning two mixtures. In [13] a closed-form expression for the L2 distance between two Gaussian mixtures is proposed. Instead of Gaussian mixture models, [31] proposes an approach based on multivariate t-distributions, which is robust to a large number of missing values. Both approaches, like all mixture models, are highly dependent on the number of mixture components used for modelling the sets. This problem is generally solved by assuming a user-defined number of components, or as many as the number of points. The former requires the points to be clustered, while the latter results in a very expensive optimization problem that cannot handle large data sets or can get trapped in a local minimum when complex sets are considered. The non-differentiable nature of ICP is overcome by using a differentiable distance transform—the Chamfer distance—in [8]. A non-linear minimization (the Levenberg–Marquardt algorithm) of the error function, based on that distance transform, is used for finding the optimal registration parameters. The main disadvantage of [8] is the precision dependency on the grid resolution at which the Chamfer distance transform and discrete derivatives are evaluated. Hence, this technique cannot be directly applied when the point set is sparse or unorganized.
In contrast to the previous approaches, [36] proposes a fast registration method based on solving an energy minimization problem derived from an implicit polynomial fitted to the given model set [37]. This IP is used to define a gradient flow that drives the data set to the model set without using point-wise correspondences. The energy functional is minimized by means of a heuristic two-step process. Firstly, every point in the given data set moves freely along the gradient vectors defined by the IP. Secondly, the outcome of the first step is used to define a single transformation that represents this movement in a rigid way. These two steps are repeated alternately until convergence is reached. The weak point of this approach is the first step of the minimization, which lets the points move independently in the proposed gradient flow. Furthermore, the proposed gradient flow is not smooth, especially close to the boundaries. Most of the algorithms presented above were originally proposed for registering overlapped sets of points corresponding to the 3D surface of a single rigid object. Extensions to a more general framework, where the 3D surfaces to be registered correspond to different views of a given scene, have been presented in the robotics field (e.g., [30, 18]). Actually, in all these extensions the registration is used for the simultaneous localization and mapping (SLAM) of the mobile platform (i.e., the robot). Although some approaches differentiate static and dynamic parts of the environment before registration (e.g., [30], [33]), most of them assume that the environment is static, containing only rigid, non-moving objects. Therefore, if moving objects are present in the scene, the least squares formulation of the problem will provide a rigid transformation biased by the motion in the scene. Independently of the kind of scenario to be tackled (partial view of a single object or whole scene), 3D registration algorithms are computationally expensive, which prevents their use in real-time applications. In the current work a robust strategy that reduces the CPU time by focusing only on feature points is proposed. It is intended to be used in ADAS (Advanced Driver Assistance Systems) applications, in which an on-board camera explores the current scene in real time. Usually, an exhaustive window scanning approach is adopted to extract the regions of interest (ROIs) needed in pedestrian or vehicle detection systems. The concept of consecutive frame registration for moving object detection has been explored in [11], in which an active frame subtraction for pedestrian detection from images of moving cameras is proposed. In that work, consecutive frames were registered not by a vision-based approach but by estimating the relative camera motion using the vehicle speed and a gyrosensor. A similar solution has been proposed in [15], but using GPS information.
3 Proposed Approach The proposed approach combines 2D detection of key points with 3D registration. The first stage consists in extracting a set of 2D feature points at a given frame and tracking them through the next frame; the 3D coordinates corresponding to each of these 2D feature points are later used during the registration process, where the rigid displacement (six degrees of freedom) that maps the 3D scene associated with frame
(n) into the 3D scene associated with frame (n + 1) is computed (see Figure 1). This rigid transform represents the 3D motion of the camera between frame (n) and frame (n + 1). Finally, moving objects are detected by computing the difference between the 3D coordinates of points represented in the same coordinate system. Before going into the details of the stages of the proposed approach, a brief description of the stereo vision system used is given.
3.1 System Setup A commercial stereo vision system (Bumblebee from Point Grey¹) is used to acquire the 3D information of the scene in front of the host vehicle. It consists of two Sony ICX084 Bayer pattern CCDs with 6 mm focal length lenses. Bumblebee is a precalibrated system that does not require in-field calibration. The baseline of the stereo head is 12 cm and it is connected to the computer by an IEEE-1394 interface. Right and left color images (Bayer pattern) were captured at a resolution of 640×480 pixels. After capturing each right-left pair of images, a dense cloud of 3D data points $P^n$ is computed at each frame n using 3D reconstruction software. The right intensity image $I_n$ is used during the feature point detection and tracking stage.
3.2 Feature Detection and Tracking As previously mentioned, the proposed approach is intended to be used in on-board vision systems for driver assistance applications. Hence, due to the real-time constraint, it is clear that the whole cloud of points cannot be used to find the rigid transformation that maps two consecutive frames into the same reference system. In order to tackle this problem, an efficient approach that relies only on the use of a reduced set of points from the given image $I_n$ is proposed. Feature points $f^n_{i(u,v)} \subset I_n$ far away from the camera position ($P^n_{i(x,y,z)} > \delta$) are discarded in order to increase registration accuracy² (δ = 15 m in the current implementation). The proposed approach does not depend on the technique used for detecting feature points; actually, two different approaches have been tested: one based on Harris corner points [10] and another on SIFT features [16]. In the first case, once feature points have been selected, a tracking window $W_T$ of 9×9 pixels is set. Feature points are tracked by minimizing the sum of squared differences between two consecutive frames using an iterative approach [17]. In the second case, SIFT features [16] are detected at the extrema of differences of Gaussians in a scale-space representation and described as histograms of gradient orientations. In this case, following [16], a function based on the distance between the corresponding histograms is used to match the features in consecutive frames (the public implementation of SIFT in [29] has been used).
¹ www.ptgrey.com
² Stereo head data uncertainty grows quadratically with depth [19].
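As a sketch of this stage, the following OpenCV snippet detects Harris-style corners in frame n and tracks them into frame n+1 with pyramidal Lucas–Kanade, which minimizes a windowed sum of squared differences much like the iterative tracking described above. Parameter values are illustrative, and the depth-based filtering of far-away points (δ = 15 m) is omitted for brevity.

import cv2
import numpy as np

def detect_and_track(gray_n, gray_n1, max_points=300):
    # Corner detection in frame (n)
    pts_n = cv2.goodFeaturesToTrack(gray_n, maxCorners=max_points,
                                    qualityLevel=0.01, minDistance=10,
                                    useHarrisDetector=True)
    # Pyramidal Lucas-Kanade tracking into frame (n+1), 9x9 window
    pts_n1, status, _ = cv2.calcOpticalFlowPyrLK(gray_n, gray_n1, pts_n, None,
                                                 winSize=(9, 9), maxLevel=3)
    ok = status.ravel() == 1
    return pts_n[ok].reshape(-1, 2), pts_n1[ok].reshape(-1, 2)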
Fig. 1 Feature points detected and tracked through consecutive frames: (top) using Harris corner detector; (bottom) using SIFT detector and descriptor
Fig. 2 Illustration of feature points represented in the 3D space, together with three couples of points used for computing the 3D rigid displacement: [R|t]—RANSAC-like technique
3.3 Robust Registration The set of 2D-to-2D point correspondences obtained in the previous stage is easily converted into a set of 3D-to-3D points, since for every frame we have a quasi-dense 3D reconstruction that is rapidly provided by Bumblebee. In the current approach, contrary to Iterative Closest Point (ICP) based algorithms, the correspondences between the two point sets are known; hence, the main challenge to be faced during this stage is that feature points can belong to static or to moving objects in the scene. Since the camera is moving, there are no additional clues to differentiate them easily. Hence, the use of a robust RANSAC-like technique is proposed to find the best rigid transformation that maps the 3D points of frame (n) into their correspondences in frame (n + 1). The closed-form solution provided by unit quaternions [12] is chosen to compute this 3D rigid displacement, with rotation matrix R and translation vector t, between the two sets of vertices. The proposed approach works as follows:

Random sampling. Repeat the following three steps K times (in our experiments K was set to 100):
1. Draw a random subsample of 3 different pairs of feature points $(P^n_{i(x,y,z)}, P^{n+1}_{i(x,y,z)})_k$, where $P^n_{i(x,y,z)} \in P^n$, $P^{n+1}_{i(x,y,z)} \in P^{n+1}$ and $i = \{1, 2, 3\}$.
2. For this subsample, indexed by k (k = 1, …, K), compute the 3D rigid displacement $D_k = [R_k | t_k]$ that minimizes the residual error $\sum_{i=1}^{3} |P^{n+1}_{i(x,y,z)} - R_k P^n_{i(x,y,z)} - t_k|^2$. This minimization is carried out using the closed-form solution provided by the unit quaternion method [12].
3. For this solution $D_k$, compute the number of inliers among the entire set of pairs of feature points, according to a user-defined threshold value.

Solution.
1. Choose the best solution, i.e., the solution that has the highest number of inliers. Let $D_q$ be this solution.
2. Refine the 3D rigid displacement $[R_q | t_q]$ by using the whole set of couples considered as inliers, instead of only the corresponding 3 pairs of feature points. A similar unit quaternion representation [2] is used to minimize $\sum_{i=1}^{\#\text{inliers}} |P^{n+1}_{i(x,y,z)} - R_q P^n_{i(x,y,z)} - t_q|^2$.
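The following sketch condenses this procedure, assuming P and Q are N×3 NumPy arrays of corresponding 3D points from frames (n) and (n+1); the inlier threshold is a simple placeholder for the user-defined, depth-related value mentioned above, and the quaternion solver follows Horn's closed-form construction [12].

import numpy as np

def horn_quaternion(P, Q):
    """Closed-form rigid displacement [R|t] with Q ~ R P + t (Horn, 1987)."""
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    S = (P - p_bar).T @ (Q - q_bar)          # 3x3 cross-covariance matrix
    Sxx, Sxy, Sxz = S[0]; Syx, Syy, Syz = S[1]; Szx, Szy, Szz = S[2]
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
    # The unit quaternion is the eigenvector of the largest eigenvalue of N
    w, x, y, z = np.linalg.eigh(N)[1][:, -1]
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
    t = q_bar - R @ p_bar
    return R, t

def ransac_registration(P, Q, K=100, thresh=0.1):
    """RANSAC over 3-point samples; refine on all inliers of the best sample."""
    rng = np.random.default_rng()
    best_inliers = np.zeros(len(P), dtype=bool)
    for _ in range(K):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = horn_quaternion(P[idx], Q[idx])
        residuals = np.linalg.norm(Q - (P @ R.T + t), axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refinement step: recompute [R|t] from the whole inlier set
    return horn_quaternion(P[best_inliers], Q[best_inliers])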
3.4 Frame Subtraction The best 3D rigid displacement $[R_q | t_q]$ computed above from the inlier 3D feature points represents the camera motion. Thus, it will be used for detecting moving regions after motion compensation. First, the whole set of 3D data points at frame (n) is mapped by:

$$\tilde{P}^{n+1}_{i(x,y,z)} = R_q P^n_{i(x,y,z)} + t_q \qquad (1)$$
where $\tilde{P}^{n+1}_{i(x,y,z)}$ denotes the mapping of a given point from frame n into the next frame. Note that for static 3D points we ideally have $\tilde{P}^{n+1}_{i(x,y,z)} = P^{n+1}_{i(x,y,z)}$. Once the whole set of points $P^n$ has been mapped, we can also synthesize the corresponding 2D view $\tilde{I}^{n+1}_{(u,v)}$ as follows:

$$\tilde{u}^{n+1}_i = \operatorname{round}\left(u_0 + f\,\frac{\tilde{x}^{n+1}_i}{\tilde{z}^{n+1}_i}\right), \qquad \tilde{v}^{n+1}_i = \operatorname{round}\left(v_0 + f\,\frac{\tilde{y}^{n+1}_i}{\tilde{z}^{n+1}_i}\right) \qquad (2)$$

where f denotes the focal length in pixels, $(u_0, v_0)$ represents the coordinates of the camera principal point, and $(\tilde{x}^{n+1}_i, \tilde{y}^{n+1}_i, \tilde{z}^{n+1}_i)$ correspond to the 3D coordinates of the mapped point (1).

Fig. 3 Synthesized views representing frames (n) (from Fig. 1 (left)) in the coordinate systems of frames (n + 1), by using their corresponding rigid displacements $[R_q | t_q]$

Figure 3 shows two synthesized views obtained after mapping frames (n) (Fig. 1 (left)) with their corresponding $[R_q | t_q]$. A moving region map, $D_{(u,v)}$, is then computed using the difference between the synthesized scene and the actual scene as follows:

$$D_{(u,v)} = \begin{cases} 0, & \text{if } |\tilde{P}^{n+1}_{i(x,y,z)} - P^{n+1}_{i(x,y,z)}| < \tau_i \\ (\tilde{I}^{n+1}_{(u,v)} + I^{n+1}_{(u,v)})/2, & \text{otherwise} \end{cases} \qquad (3)$$

where $\tau_i$ is a threshold directly related to the depth from the camera (since the accuracy of the stereo rig decreases with depth, the value of τ increases to compensate for that loss of accuracy). Image differences are used in the above map just to show the correlation between intensity differences and 3D coordinate differences of mapped points (i.e., a given point in frame (n) with its corresponding one in frame (n + 1)). Figure 4 (left) presents the map of moving regions, $D_{(u,v)}$, resulting from frame (n + 1) (Fig. 1 (right)) and the synthesized view corresponding to frame (n) (see Figure 3). Additionally, Fig. 4 (right) illustrates the raw image difference between the two consecutive frames $(|I^{(n)} - I^{(n+1)}|)$.

Fig. 4 (left) $D_{(u,v)}$ map of moving regions, from frames (n) and (n + 1) presented in Fig. 1 (top). (right) Image difference between these consecutive frames $(|I^{(n)} - I^{(n+1)}|)$ to illustrate their relative displacement.
Fig. 5 Feature points detected and tracked through consecutive frames

Fig. 6 (left) Synthesized view of frame (n) (Fig. 5 (left)). (right) Difference between consecutive frames $(|I^{(n)} - I^{(n+1)}|)$ to illustrate their relative displacement (pay special attention to the traffic lights and stop signposts)
4 Experimental Results Experimental results in real environments and at different vehicle speeds are presented in this section. In all cases, large-error regions correspond to both moving objects and misregistered areas. Several video sequences were processed on a 3.2 GHz Pentium IV PC. The experimental results presented in this chapter correspond to video sequences recorded at 10 fps; in other words, the elapsed time between two consecutive frames is about 100 ms. The proposed algorithm takes, on average, 31 ms to register consecutive frames using about 300 feature points. Fig. 1 (top) shows two frames of a crowded urban scene. This scene is particularly interesting since a large set of feature points over surfaces moving at different speeds has been extracted. In this case, the use of classical ICP-based approaches (e.g., [18]) would provide a wrong scene registration, since points from static and moving objects are considered together. The synthesized view obtained from frame (n) is presented in Fig. 3 (left). The quality of the registration result can be appreciated in the map of moving regions presented in Fig. 4 (left). Particularly interesting is the lamp post region, where there is a perfect registration between the 3D coordinates of these pixels. Large errors at the tops of trees or in farther away regions are mainly due to depth uncertainty, which, as mentioned before, grows quadratically with depth [19]. Wrong moving regions mainly correspond to hidden areas in frame (n) that are unveiled in frame (n + 1). Figure 4 (right) presents the difference between consecutive frames $(|I^{(n)} - I^{(n+1)}|)$
Fig. 7 Map of moving regions $(D_{(u,v)})$ obtained from the synthesized view $(\tilde{I}^{n+1})$ (Fig. 6 (left)) and the corresponding frame $(I^{n+1})$ (Fig. 5 (right)); bounding boxes are only illustrative and have been placed using the information of the horizon line position as in [9]
to highlight that, although these frames (Fig. 1 (top)) look quite similar, there is a considerable relative displacement between them. A different scenario is shown in the two consecutive frames presented in Fig. 5. In that scene, the car is reducing its speed to stop at a red light while three pedestrians are crossing the street. Although the vehicle is slowing down, there is a relative displacement between these consecutive frames (see Fig. 6 (right)). The synthesized view of frame (n), using the computed 3D rigid displacement, is presented in Fig. 6 (left). Finally, the corresponding moving regions map is depicted in Fig. 7. Bounding boxes enclosing moving objects can provide reliable information to select candidate windows to be used by a classification process (e.g., a pedestrian classifier). In this case, the number of windows would decrease greatly compared to other approaches in the literature, such as the $10^8$ windows of an exhaustive scan [20] or the 2,000 windows of a road uniform sampling [9].
5 Conclusions This chapter presents a novel and robust approach for moving object detection based on registering consecutive clouds of 3D points obtained by an on-board stereo camera. The registration process is applied only over two small sets of 3D points with known correspondences, using key-point feature extraction and a RANSAC-like technique based on the closed-form solution provided by the unit quaternion method. Then, a synthesized 3D scene is obtained after mapping the whole set of points from the previous frame to the current one. Finally, a map of moving regions is generated by considering the difference between the current 3D scene and the synthesized one. As future work, more evolved approaches for combining registered frames will be studied. For instance, instead of using only consecutive frames, temporal windows including more frames are likely to help filter out noisy areas. Furthermore, the color information of each pixel could be used during the estimation of the moving region map. Acknowledgment. This work was supported in part by the Spanish Ministry of Science and Innovation under Projects TRA2010-21371-C03-01, TIN2010-18856 and Research Program Consolider Ingenio 2010: MIPRCV (CSD2007-00018).
References 1. Amir, S., Barhoumi, W., Zagrouba, E.: A robust framework for joint background/foreground segmentation of complex video scenes filmed with freely moving camera. Pattern Analysis and Applications 46(2), 175–205 2. Benjemaa, R., Schmitt, F.: A solution for the registration of multiple 3D point sets using unit quaternions. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 34–50. Springer, Heidelberg (1998) 3. Besl, P., McKay, N.: A method for registration of 3D shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence 14(2), 239–256 (1992)
4. Chen, Y., Medioni, G.: Object modelling by registration of multiple range images. Image Vision Comput. 10(3), 145–155 (1992) 5. Chetverikov, D., Stepanov, D., Krsek, P.: Robust Euclidean alignment of 3D point sets: the trimmed iterative closest point algorithm. Image and Vision Computing 23(1), 299–309 (2005) 6. Chui, H., Rangarajan, A.: A feature registration framework using mixture models. In: MMBIA 2000: Proceedings of the IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, pp. 190–197 (2000) 7. Dahyot, R.: Unsupervised camera motion estimation and moving object detection in videos. In: Proc. of the Irish Machine Vision and Image Processing, Dublin, Ireland (August 2006) 8. Fitzgibbon, A.: Robust registration of 2D and 3D point sets. Image and Vision Computing 21(13), 1145–1153 (2003) 9. Gerónimo, D., Sappa, A.D., López, A., Ponsa, D.: Adaptive image sampling and windows classification for on-board pedestrian detection. In: Proc. Int. Conf. on Computer Vision Systems, Bielefeld, Germany (2007) 10. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. of The Fourth Alvey Vision Conference, Manchester, UK, pp. 147–151 (1988) 11. Hashiyama, T., Mochizuki, D., Yano, Y., Okuma, S.: Active frame subtraction for pedestrian detection from images of moving camera. In: Proc. IEEE Int. Conf. on Systems, Man and Cybernetics, Washington, USA, pp. 480–485 (October 2003) 12. Horn, B.: Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A 4, 629–642 (1987) 13. Jian, B., Vemuri, B.: A robust algorithm for point set registration using mixture of Gaussians. In: 10th IEEE International Conference on Computer Vision, Beijing, China, October 17-20, pp. 1246–1251 (2005) 14. Kastrinaki, V., Zervakis, M., Kalaitzakis, K.: A survey of video processing techniques for traffic applications. Image and Vision Computing 21(4), 359–381 (2003) 15. Kong, H., Audibert, J., Ponce, J.: Detecting abandoned objects with a moving camera. IEEE Transactions on Image Processing 19(8), 2201–2210 (2010) 16. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 2(60), 91–110 (2004) 17. Ma, Y., Soatto, S., Kosecká, J., Sastry, S.: An Invitation to 3D Vision: From Images to Geometric Models. Springer, New York (2004) 18. Milella, A., Siegwart, R.: Stereo-based ego-motion estimation using pixel tracking and iterative closest point. In: Proc. IEEE Int. Conf. on Mechatronics and Automation, USA (January 2006) 19. Oniga, F., Nedevschi, S., Meinecke, M., To, T.: Road surface and obstacle detection based on elevation maps from dense stereo. In: Proc. IEEE Int. Conf. on Intelligent Transportation Systems, Seattle, USA, pp. 859–865 (September 2007) 20. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Poggio, T.: Pedestrian detection using wavelet templates. In: Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, Puerto Rico (June 1997) 21. Radke, R., Andra, S., Al-Kofahi, O., Roysam, B.: Image change detection algorithms: A systematic survey. IEEE Trans. on Image Processing 14(3), 294–307 (2005) 22. Restrepo-Specht, A., Sappa, A.D., Devy, M.: Edge registration versus triangular mesh registration, a comparative study. Signal Processing: Image Communication 20(9-10), 853–868 (2005) 23. Salvi, J., Matabosch, C., Fofi, D., Forest, J.: A review of recent range image registration methods with accuracy evaluation. Image Vision Computing 25(5), 578–596 (2007)
24. Sappa, A.D., Dornaika, F., Gerónimo, D., López, A.: Registration-based moving object detection from a moving camera. In: Proc. on Workshop on Perception, Planning and Navigation for Intelligent Vehicles, Nice, France (September 2008) 25. Sharp, G., Lee, S., Wehe, D.: ICP registration using invariant features. IEEE Trans. Pattern Anal. Mach. Intell. 24(1), 90–102 (2002) 26. Shimizu, S., Yamamoto, K., Wang, C., Satoh, Y., Tanahashi, H., Niwa, Y.: Moving object detection by mobile stereo omni-directional system (SOS) using spherical depth image. Pattern Analysis and Applications (2), 113–126 27. Taleghani, S., Aslani, S., Saeed, S.: Robust moving object detection from a moving video camera using neural network and Kalman filter. In: Iocchi, L., Matsubara, H., Weitzenfeld, A., Zhou, C. (eds.) RoboCup 2008. LNCS, vol. 5399, pp. 638–648. Springer, Heidelberg (2009) 28. Tarel, J.-P., Civi, H., Cooper, D.B.: Pose estimation of free-form 3D objects without point matching using algebraic surface models. In: Proceedings of IEEE Workshop Model Based 3D Image Analysis, Mumbai, India, pp. 13–21 (1998) 29. Vedaldi, A.: An open implementation of the SIFT detector and descriptor. Technical Report 070012, UCLA CSD (2007) 30. Wang, C., Thorpe, C., Thrun, S.: Online simultaneous localization and mapping with detection and tracking of moving objects: theory and results from a ground vehicle in crowded urban areas. In: Proc. IEEE Int. Conf. on Robotics and Automation, Taipei, Taiwan, pp. 842–849 (September 2003) 31. Wang, H., Zhang, Q., Luo, B., Wei, S.: Robust mixture modelling using multivariate t-distribution with missing information. Pattern Recogn. Lett. 25(6), 701–710 (2004) 32. Wang, L., Yung, N.: Extraction of moving objects from their background based on multiple adaptive thresholds and boundary evaluation. IEEE Transactions on Intelligent Transportation Systems 11(1), 40–51 (2010) 33. Wolf, D., Sukhatme, G.: Mobile robot simultaneous localization and mapping in dynamic environments. Autonomous Robots 19(1), 53–65 (2005) 34. Yu, Q., Medioni, G.: Map-enhanced detection and tracking from a moving platform with local and global data association. In: Proc. IEEE Workshops on Motion and Video Computing, Austin, Texas (February 2007) 35. Yu, Q., Medioni, G.: A GPU-based implementation of motion detection from a moving platform. In: Proc. IEEE Workshops on Computer Vision and Pattern Recognition, Anchorage, Alaska (June 2008) 36. Zheng, B., Ishikawa, R., Oishi, T., Takamatsu, J., Ikeuchi, K.: A fast registration method using IP and its application to ultrasound image registration. IPSJ Transactions on Computer Vision and Applications 1, 209–219 (2009) 37. Zheng, B., Takamatsu, J., Ikeuchi, K.: An adaptive and stable method for fitting implicit polynomial curves and surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(3), 561–568 (2010)
Chapter 4
Pattern Classifications in Cognitive Informatics

Lidia Ogiela
AGH University of Science and Technology, Al. Mickiewicza 30, PL-30-059 Krakow, Poland
[email protected]

Abstract. This chapter presents problems to which cognitive data analysis algorithms are applied. The cognitive data analysis approach presented here will be used to build cognitive systems for classifying patterns, which in turn will form the basis for discussing and characterising Understanding Based Image Analysis Systems (UBIAS). The purpose of these systems is image analysis, while their design and operation follow cognitive/reasoning processes characteristic of human cognition. Such processes represent not just simple data analysis; their main function is to interpret, understand and reason about analysed data sets, employing the automatic data understanding process carried out by the system. A characteristic feature of cognitive analysis processes is the use of linguistic algorithms – which extract the meaning from data sets – to describe the analysed sets. It is these algorithms that serve to describe the data, interpret it and reason about it.

Keywords: Cognitive informatics, pattern classifications, pattern understanding, cognitive analysis, computational intelligence, natural intelligence, intelligent systems.
1 Introduction Data understanding through interpreting the meaning of analysed data is based on a semantic analysis of patterns defined in the system. Such analysis and reasoning processes are conducted by cognitive systems, understood as systems that carry out semantic analysis processes. Semantic analysis processes make use of cognitive resonance, described in the author's publications [17]-[42], [44], [45], [48]-[51], which forms the main stage in the correct operation of the system. The development of cognitive information systems was sparked by moves to combine systems which recognised data (most often data in the form of images) with automatic data understanding processes (an implementation of semantic analysis solutions understood as cognitive analysis). These processes have been described in publications [17]-[45], [48]-[57] and are still being developed successfully.
So far, in the publications cited above, the author analysed image data from medical imaging, and more specifically lesions occurring within the central nervous system (i.e. the spinal cord) and in the bones of feet, palms and wrists. The second, independent type of data analysed consisted of economic and financial figures of companies, where the aim was to assess whether the financial, strategic and marketing decisions suggested by the system were reasonable. Here, the author will concentrate on cognitive systems analysing the meaning of image data, with particular attention to UBIAS (Understanding Based Image Analysis Systems) analysing lesions within foot bones. The novelty presented here is that the UBIAS system will analyse lesions within the bones of the whole foot, whereas previous efforts analysed foot bones excluding the phalanx bones. Including the phalanx bones in the analysis of foot bone lesions makes the process of understanding the analysed medical situation much richer, so the process can identify the right lesion more unambiguously, consequently yielding a deeper cognitive and semantic analysis of the observed lesion.
2 Semantic Analysis Stages Semantic analysis is the key to the correct operation of cognitive data analysis systems. When this analysis is conducted, several different (but equally important for the analysis) processes occur: interpretation, description, analysis and reasoning. The main stages of semantic analysis are as follows [26], [43], [46]:
• data pre-processing: filtration and amplification; approximation; coding;
• data presentation: segmentation; recognising picture primitives; identifying relations between picture primitives;
• linguistic perception;
• syntactic analysis;
• pattern classification;
• data classification;
• feedback;
• cognitive resonance;
• data understanding.
The main stages of semantic analysis are shown in Figure 1. A clear majority of the above semantic analysis stages deals with the data understanding process, as from the beginning of the syntactic analysis conducted using the formal grammar
defined in the system, there are stages aimed at identifying the analysed data with particular attention to its semantics (the meaning it contains). The stages of recognition itself become the starting point for further stages, referred to as the cognitive analysis. This is why the understanding process as such requires the application of feedback, during which the features of the analysed data are compared with the expectations which the system has generated from its expert knowledge base. This feedback is called cognitive resonance. It identifies those comparisons which turn out to be material for the analysis conducted, i.e. those in which features are consistent with expectations. The next necessary element is data understanding as such, during which the significance of the analysed changes for their further growth or atrophy (as in lesions) is determined.
Fig. 1 Processes of semantic data analysis
The data is analysed by identifying the characteristic features of the given data set, which then determine the decision. This decision is the result of the completed data analysis (Fig. 2).
42
L.Ogiela
Fig. 2 Process of data analysis
The data analysis process is supplemented with the cognitive analysis, which consists in selecting consistent pairs and non-consistent pairs of elements from the generated set of features characteristic for the analysed set and from the set of expectations as to the analysed data generated using the expert knowledge base kept in the system. The comparison leads to cognitive resonance, which identifies consistent pairs and non-consistent pairs, where the latter not material in the further analysis process. In cognitive analysis, the consistent pairs are used to understand the meaning (semantics) of the analysed data sets (Fig. 3).
Fig. 3 Data understanding in cognitive data analysis
Pattern Classifications in Cognitive Informatics
43
Because of the method of conducting the semantic analysis and the linguistic perception algorithms – grammar formalisms – used in its course, semantic analysis has become the core of the operation of cognitive data analysis, interpretation and reasoning systems.
3 Semantic Analysis vs. Cognitive Informatics Semantic analysis processes which form the cornerstone of cognitive information systems also underpin a new branch of science, which is now undergoing very fast development: cognitive informatics. The notion of cognitive informatics has been proposed in publications [59], [60] and has become the starting point for a formal approach to interdisciplinary considerations of running semantic analyses in cognitive areas. Cognitive informatics is understood as the combination of cognitive science and information science with the goal of researching mechanisms by which information processes run in human minds. These processes are treated as elements of natural intelligence, and they are mainly applied to engineering and technical problems in an interdisciplinary approach. Semantic analysis in the sense of cognitive analysis plays a significant role, as it identifies the meaning in areas analysed. The meaning as such is identified using the formal grammar defined in the system and its related set of productions, within which productions are defined, which elements the system utilises to analyse the meaning. Features such as those below are analysed: • • • • • • •
lesion occurrence, its size, length, width, lesion vastness, number of lesions observed, repetition frequency, lesion structure, lesion location.
These features can be identified correctly using the set of productions of the linguistic reasoning algorithm. For this reason, the definition of linguistic algorithms of perception and reasoning forms the key stage in building a cognitive system. In line with the discussed cognitive approach, the entire process of linguistic data perception and understanding hinges on a grammatical analysis aimed at answering the question whether the data set is semantically correct from the perspective of the grammar defined in the system, or is not. If there are consistencies, the system runs an analysis to identify these consistencies and assign them the correct names. If the is no consistency, the system will not execute further analysis stages as the lack of consistency may be due to various reasons.
44
L.Ogiela
The most frequent ones include: • • • • •
the wrong definition of the formal grammar, no definition of the appropriate semantic reference, an incompletely defined pattern, a wrongly defined pattern, a representative from outside the recognisable data class accepted for analysing.
All these reasons may cause a failure at the stage of determining the semantic consistency of the analysed specimen and the formal language adopted for the analysis. In this case, the whole definition process should be reconsidered, as the error could have occurred at any stage of it. Cognitive systems carry out the correct semantic analysis by applying a linguistic approach developed by analogy to cognitive/decision-making processes taking place in the human brain. These processes are aimed at the in-depth analysis of various data sets. Their correct course implies that the human cognitive system is successful. The system thus becomes the foundation for designing cognitive data analysis systems. Cognitive systems designed for semantic data analysis used in the cognitive informatics field execute data analysis in three “stages”. This split is presented in Figure 4.
Fig. 4 The three-stage operation of cognitive systems in the cognitive informatics field
Pattern Classifications in Cognitive Informatics
45
The above diagram shows three separate data analysis stages. The first of them is the traditional data analysis, which includes qualitative and quantitative analyses. The results of this analysis are supplemented with the linguistic presentation of the analysed data set, which forms the basis for extracting the semantic features from these sets. Extracting the meaning of the sets from them is the starting point for the second stage of the analysis, referred to as the semantic analysis. The end of this stage at the same time forms the beginning of the next analysis process, referred to as the cognitive data analysis. During this stage, the results obtained are interpreted using the semantic data notations generated previously. The interpretation of results consists in not just their simple description or in recognising the situation being analysed, but it is in particular the stage at which data, the situation and information is understood, the stage of reasoning based on the results obtained and forecasting the changes that may appear in the future.
4 Example of a Cognitive UBIAS System A cognitive data analysis system analysing lesions occurring within foot bones shown in X-ray images is an example of a cognitive system applied to the semantic analysis of image data. The UBIAS system carries out the analysis by using mathematical linguistic algorithms based on graph formalisms proposed in publications [24]-[27], [45]. The key aspect in introducing the right definition of the formal grammar is to adopt names of bones found within the foot, which include: • • • • • • • • • • •
talus (t), calcaneus (c), os cuboideum (cu), os naviculare (n), os cuneiforme laterale (cl), os cuneiforme mediale (cm), os cuneiforme intermedium (ci), os sesamoidea (ses), os metatarsale (tm), os digitorum (dip), phalanx (pip).
The healthy structure of foot bones is shown in Figure 5. Figure 5 presents all the foot bones, divided into metatarsus, tarsus and phalanx bones, which will be analysed by the proposed UBIAS system. To provide the right insight into the proposed way of analysing foot bone lesions in X-ray images, an X-ray of a foot free of any lesions within it, which represents the pattern defined in the UBIAS system, is shown below (Fig. 6).
46
Fig. 5 Healthy foot structure
Fig. 6 Healthy foot structure – X-ray images
L.Ogiela
Pattern Classifications in Cognitive Informatics
47
The example UBIAS system discussed in this chapter is used for the image analysis of foot bone lesions in the dorsopalmar projection of the foot, including tarsus, metatarsus and phalanx bones. The grammatical graph formalism for the semantic analysis has been defined as the Gfoot grammar taking the following form:
G foot = ( N f , T f , Γ f , S , P ) where: The set of non-terminal labels of apexes: Nf = {ST, TALUS, CUBOIDEUM, NAVICULARE, LATERALE, MEDIALE, INTERMEDIUM, SES1, SES2, TM1, TM2, TM3, TM4, TM5, MP1, MP2, MP3, MP4, MP5, PIP1, PIP2, PIP3, PIP4, PIP5, DIP2, DIP3, DIP4, DIP5, TPH1, TPH2, TPH3, TPH4, TPH5, ADD1, ADD2, ADD3, ADD4, ADD5, ADD6, ADD7, ADD8, ADD9, ADD10, ADD11, ADD12, ADD13, ADD14} The set of terminal labels of apexes: Tf = {c, t, cu, n, cl, cm, ci, s1, s2, tm1, tm2, tm3, tm4, tm5, mp1, mp2, mp3, mp4, mp5, pip1, pip2, pip3, pip4, pip5, dip2, dip3, dip4, dip5, tph1, tph2, tph3, tph4, tph5, add1, add2, add3, add4, add5, add6, add7, add8, add9, add10, add11, add12, add13, add14}. Γf – {p, q, r, s, t, u, v, w, x, y, z} – the graph shown in Fig. 7.
Fig. 7 Definitions of elements of the set Гf
48
L.Ogiela
S – The start symbol P – set of productions (Fig.8).
Fig. 8 Set of productions P
Pattern Classifications in Cognitive Informatics
49
Figure 9 shows the graph of relations between individual tarsus, metatarsus and phalanx bones.
Fig. 9 A relation graph of foot bones in the dorsopalmar projection
Figure 10 shows the graph of relations between individual tarsus, metatarsus and phalanx bones including the angles of slopes between individual foot bones.
Fig. 10 A relation graph in the dorsopalmar projection
This definition method allows the UBIAS system to start analysing image data. Selected results of its operation are illustrated by Figures 11-16, which, to comprehensively present the universality of the analysis, show selected examples of automatic image data interpretation and their semantic interpretation. For a comparison, the authors have chosen the following medical images used for cognitive data interpretation with the application of graph formalisms to analyse images showing various foot bone lesions. Figure 11 shows an example of the automatic analysis of an image depicting a fracture of os naviculare.
50
L.Ogiela
Fig. 11 Image data analysis by UBIAS systems to understand data showing foot bone deformations – a fracture of os naviculare
Figure 12 illustrates the method of UBIAS system operation when this system analyses image data to automatically understand data showing a foot deformation.
Fig. 12 Image data analysis by UBIAS systems to understand data showing foot bone deformations
Figure 13 is an example of an attempt to automatically analyse image data illustrating a fracture of the neck of the talus.
Fig. 13 Image data analysis by UBIAS systems to understand data a fracture of the neck of the talus
Pattern Classifications in Cognitive Informatics
51
Figure 14 is an image data analysis carried out by an UBIAS system on an example showing a foot deformation caused by diabetes.
Fig. 14 Image data analysis by UBIAS systems to understand data showing a foot deformation caused by diabetes
Figure 15 is an image data analysis carried out by an UBIAS system on an example showing a osteoarthritis of foot.
Fig. 15 Image data analysis by UBIAS systems to understand data showing a osteoarthritis
Figure 16 is an image data analysis carried out by an UBIAS system on an example showing a osteomyelitis of foot.
Fig. 16 Image data analysis by UBIAS systems to understand data showing a osteomyelitis
52
L.Ogiela
All the above examples of automatic image data analysis demonstrate the essence of UBIAS cognitive system operation, namely the correct understanding of the analysed lesion using series of productions defined in the system and the semantics of the analysed images.
5 Conclusions Examples of the automatic understanding of image data presented in this chapter demonstrate the extent to which semantic analysis can be used for cognitive data analysis problems in cognitive informatics. This type of reasoning systems play quite a notable role and are quite significant as they use robust formalisms of linguistic description and analysis of data. These formalisms, based on the formal grammar presented in this chapter, meet the requirements of an in-depth analysis and a cognitive interpretation of analysed data sets. Due to the semantic analysis carried out by UBIAS systems, cognitive systems are becoming increasingly important in data analysis processes. Apart from UBIAS systems, other systems of cognitive data analysis are also being developed, which the Reader can find in publications including [26], [42]. The approach to the subject of cognitive data analysis systems presented in this chapter is discussed by reference to methods of semantically analysing image-type data. The essence of this approach is to apply cognitive/interpretation/reasoning processes in the operation of systems. Systems can be built based on cognitive and decision-making processes only if the system will analyse and interpret data as well as conduct the reasoning and projecting stages using the semantic characteristics of the analysed data. The semantics of the analysed sets makes in-depth analysis processes possible and at the same time becomes the starting point for projecting changes that may occur in the future, thus allowing errors that could occur in the future to be eliminated. Traditional data analysis systems frequently cannot identify such errors to be eliminated. So the characteristic feature which also distinguishes cognitive systems from others is the process of reasoning on the basis of analysed data and the process of projecting based on the data analysis conducted. Acknowledgement. This work has been supported by the National Science Center, Republic of Poland, under project number N N516 478940.
References 1. Albus, J.S., Meystel, A.M.: Engineering of Mind – An Introduction to the Science of Intelligent Systems. A Wiley-Interscience Publication John Wiley & Sons Inc. (2001) 2. Berners-Lee, T.: Weaving the Web. Texere Publishing (2001)
Pattern Classifications in Cognitive Informatics
53
3. Berners-Lee, T., Fensel, D., Hendler, J.A., Lieberman, H., Wahlster, W. (eds.): Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press (2005) 4. Branquinho, J. (ed.): The Foundations of Cognitive Science. Clarendon Press, Oxford (2001) 5. Brejl, M., Sonka, M.: Medical image segmentation: Automated design of border detection criteria from examples. Journal of Electronic Imaging 8(1), 54–64 (1999) 6. Burgener, F.A., Kormano, M.: Bone and Joint Disorders. Thieme, Stuttgart (1997) 7. Chomsky, N.: Language and Problems of Knowledge: The Managua Lectures. MIT Press, Cambridge (1988) 8. Cohen, H., Lefebvre, C. (eds.): Handbook of Categorization in Cognitive Science. Elsevier, The Netherlands (2005) 9. Davis, L.S. (ed.): Foundations of Image Understanding. Kluwer Academic Publishers (2001) 10. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. 2nd edn. A WileyInterscience Publication John Wiley & Sons, Inc. (2001) 11. Jurek, J.: On the Linear Computational Complexity of the Parser for Quasi Context Sensitive Languages. Pattern Recognition Letters 21,179–187 (2000) 12. Jurek, J.: Recent developments of the syntactic pattern recognition model based on quasi-context sensitive language. Pattern Recognition Letters 2(26), 1011–1018 (2005) 13. Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.): Intelligent Information Processing and Web Mining. Proceedings of the International IIS: IIP WM 2004 Conference, Zakopane, May 17-20. Springer, Poland (2004) 14. Köpf-Maier, P.: Wolf_Heidegger’s Atlas of Human Anatomy, Part 1. Systemic Anatomy, Warszawa (2002) 15. Lassila, O., Hendler, J.: Embracing web 3.0. IEEE Internet Computing, 90–93 (MayJune 2007) 16. Meystel, A.M., Albus, J.S.: Intelligent Systems – Architecture, Design, and Control. A Wiley-Interscience Publication John Wiley & Sons, Inc., Canada (2002) 17. Ogiela, L.: Usefulness assessment of cognitive analysis methods in selected IT systems, Ph. D. Thesis, AGH Kraków (2005) 18. Ogiela, L.: Cognitive Systems for Medical Pattern Understanding and Diagnosis. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part I. LNCS (LNAI), vol. 5177, pp. 394–400. Springer, Heidelberg (2008) 19. Ogiela, L.: Modelling of Cognitive Processes for Computer Image Interpretation. In: Al-Dabass, D., Nagar, A., Tawfik, H., Abraham, A., Zobel, R. (eds.) EMS 2008 European Modelling Symposium, Second UKSIM European Symposium on Computer Modeling and Simulation, Liverpool, United Kingdom, September 8-10, pp. 209–213 (2008) 20. Ogiela, L.: Syntactic Approach to Cognitive Interpretation of Medical Patterns. In: Xiong, C.-H., Liu, H., Huang, Y., Xiong, Y.L. (eds.) ICIRA 2008. LNCS (LNAI), vol. 5314, pp. 456–462. Springer, Heidelberg (2008) 21. Ogiela, L.: Cognitive Computational Intelligence in Medical Pattern Semantic Understanding. In: Guo, M., Zhao, L., Wang, L. (eds.) Fourth International Conference on Natural Computation, ICNC 2008, Jinan, Shandong, China, October 18-20, vol. 6, pp. 245–247 (2008) 22. Ogiela, L.: Innovation Approach to Cognitive Medical Image Interpretation. In: 5th International Conference on Innovations in Information Technology, Innovation 2008, Al Ain, United Arab Emirates, December 16-18, pp. 722–726 (2008)
54
L.Ogiela
23. Ogiela, L.: UBIAS Systems for Cognitive Interpretation and Analysis of Medical Images. Opto-Electronics Review 17(2), 166–179 (2008) 24. Ogiela, L.: Computational intelligence in cognitive healthcare information systems. In: Bichindaritz, I., Vaidya, S., Jain, A., Jain, L.C. (eds.) Computational Intelligence in Healthcare 4. SCI, vol. 309, pp. 347–369. Springer, Heidelberg (2010) 25. Ogiela, L.: Cognitive Informatics in Automatic Pattern Understanding and Cognitive Information Systems. In: Wang, Y., Zhang, D., Kinsner, W. (eds.) Advances in Cognitive Informatics and Cognitive Computing. SCI, vol. 323, pp. 209–226. Springer, Heidelberg (2010) 26. Ogiela, L., Ogiela, M.R.: Cognitive Techniques in Visual Data Interpretation. SCI. Springer, Heidelberg (2009) 27. Ogiela, L., Ogiela, M.R.: ognitive Approach to Bio-Inspired Medical Image Understanding. In: Nagar, A.K., Thamburaj, R., Li, K., Tang, Z., Li, R. (eds.) Proceedings 2010 IEEE Fifth International Conference Bio-Inspired Computing: Theories and Applications, Liverpool, UK, September 8-10, pp. 1010–1013 (2010) 28. Ogiela, L., Ogiela, M.R., Tadeusiewicz, R.: Mathematical Linguistic in Cognitive Medical Image Interpretation Systems. Journal of Mathematical Imaging and Vision 34, 328–340 (2009) 29. Ogiela, L., Ogiela, M.R., Tadeusiewicz, R.: Cognitive reasoning in UBIAS systems supporting interpretation of medical. In: 2nd International Conference on Computer Science and its Applications CSA 2009, Jeju, Korea, December 10-12, vol. 2, pp. 448–451 (2009) 30. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Informatics in Automatic Pattern Understanding. In: Hang, D., Wang, Y., Kinsner, W. (eds.) Proceedings of the Sixth IEEE International Conference on Cognitive Informatics, ICCI 2007, Lake Tahoe, CA, USA, August 6-8, pp. 79–84 (2007) 31. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Computing in Analysis of 2D/3D Medical Images. In: The 2007 International Conference on Intelligent Pervasive Computing – IPC 2007, Jeju Island, Korea, October 11-13, pp. 15–18. IEEE Computer Society (2007) 32. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Linguistic Categorization for Medical Multi-dimensional Pattern Understanding. In: ACCV 2007 Workshop on Multi-dimensional and Multi-view Image Processing, Tokyo, Japan, November 18-22, pp. 150–156 (2007) 33. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Techniques in Medical Information Systems, Computers in Biology and Medicine 38, 502–507 (2008) 34. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Approach to Medical Image Semantics Description and Interpretation. In: INFOS 2008, The 6th International Conference on Informatics and Systems, Egypt, March 27-29, vol. 5, pp. HBI-1–HBI-5 (2008) 35. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Modeling in Medical Pattern Semantic Understanding. In: The 2nd International Conference on Multimedia and Ubiquitous Engineering MUE 2008, Busan, Korea, April 24-26, pp. 15–18 (2008) 36. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Al-Cognitive Description in Visual Pattern Mining and Retrieval. In: Second Asia Modeling & Simulation AMS, Kuala Lumpur, Malaysia, May 13-15, pp. 885–889 (2008)
Pattern Classifications in Cognitive Informatics
55
37. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Modeling in Computational Intelligence Methods for Medical Pattern Semantic Categorization and Understanding. In: Proceedings of the Fourth IASTED International Conference Advances in Computer Science and Technology (ACST 2008), Malaysia, Langkawi, April 2-4, pp. 368-371 (2008) 38. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Categorization in Medical Structures Modeling and Image Understanding. In: Li, D., Deng, G. (eds.) 2008 International Congress on Image and Signal Processing, CISP 2008, Sanya, Hainan, China, May 27-30, vol. 4, pp. 560–564 (2008) 39. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Approach to Medical Pattern Recognition, Structure Modeling and Image Understanding. In: Peng, Y., Zhang, Y. (eds.) First International Conference on BioMedical Engineering and Informatics, BMEI 2008, Sanya, Hainan, China, May 27-30, vol. 30, pp. 33–37 (2008) 40. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Methods in Medical Image Analysis and Interpretation. In: The 4th International Workshop on Medical Image and Augmented Reality, MIAR 2008, The University of Tokyo, Japan (2008) 41. Ogiela, L., Tadeusiewicz, R., Ogiela, M.R.: Cognitive Categorizing in UBIAS Intelligent Medical Information Systems. In: Sordo, M., Vaidya, S., Jain, L.C. (eds.) Advanced Computational Intelligence Paradigms in Healthcare 3. SCI, vol. 107, pp. 75–94. Springer, Heidelberg (2008) 42. Ogiela, M.R., Ogiela, L., Tadeusiewicz, R.: Cognitive Reasoning UBIAS & E-UBIAS Systems in Medical Informatics. In: INC 2010 6th International Conference on Networked Computing, Gyeonju, Korea, May 11-13, pp. 360–364 (2010) 43. Ogiela, M.R., Tadeusiewicz, R.: Modern Computational Intelligence Methods for the Interpretation of Medical Images. Springer, Heidelberg (2008) 44. Ogiela, M.R., Tadeusiewicz, R., Ogiela, L.: Image languages in intelligent radiological palm diagnostics. In: Pattern Recognition 39, 2157–2165 (2165) 45. Ogiela, M.R., Tadeusiewicz, R., Ogiela, L.: Graph image language techniques supporting radiological, hand image interpretations. In: Computer Vision and Image Understanding, vol. 103, pp. 112–120. Elsevier Inc. (2006) 46. Rutkowski, L.: New Soft Computing Technigues for System Modelling, Pattern Classification and Image Processing. Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2004) 47. Rutkowski, L.: Computational Intelligence, Methods and Techniques. Springer, Heidelberg (2008) 48. Skomorowski, M.: Use of random graph parsing for scene labeling by probabilistic relaxation. Pattern Recognition Letters 20(9), 949–956 (1999) 49. Skomorowski, M.: Syntactic recognition of distorted patterns by means of random graph parsing. Pattern Recognition Letters 28(5), 572–581 (2007) 50. Tadeusiewicz, R., Ogiela, L.: Selected Cognitive Categorization Systems. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 1127–1136. Springer, Heidelberg (2008) 51. Tadeusiewicz, R., Ogiela, L.: Categorization in Cognitive Systems. In: Svetoslav, N. (ed.) Fifth International Conference of Applied Mathematics and Computing, FICAMC 2008, Plovdiv, Bulgaria, August 12-18, vol. 3, p. 451 (2008) 52. Tadeusiewicz, R., Ogiela, L., Ogiela, M.R.: Cognitive Analysis Techniques in Business Planning and Decision Support Systems. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Żurada, J.M. (eds.) ICAISC 2006. LNCS (LNAI), vol. 4029, pp. 1027–1039. Springer, Heidelberg (2006)
56
L.Ogiela
53. Tadeusiewicz, R., Ogiela, L., Ogiela, M.R.: The automatic understanding approach to systems analysis and design. International Journal of Information Management 28, 38–48 (2008) 54. Tadeusiewicz, R., Ogiela, M.R.: Medical Image Understanding Technology, Artificial Intelligence and Soft-Computing for Image Understanding. Springer, Heildelberg (2004) 55. Tadeusiewicz, R., Ogiela, M.R.: New Proposition for Intelligent Systems Design: Artificial Understanding of the Images as the Next Step of Advanced Data Analysis after Automatic Classification and Pattern Recognition. In: Kwasnicka, H., Paprzycki, M. (eds.) Intelligent Systems Design and Applications, Proceedings 5th International Conference on Intelligent Systems Design and Application ISDA 2005, Wrocław, September 8-10, pp. 297–300. IEEE Computer Society Press, Los Alamitos (2005) 56. Tadeusiewicz, R., Ogiela, M.R.: Automatic Image Understanding – A New Paradigm for Intelligent Medical Image Analysis. Bio-Algorithms and Med-Systems 2(3), 5–11 (2006) 57. Tadeusiewicz, R., Ogiela, M.R.: Why Automatic Understanding? In: Beliczynski, B., Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007. LNCS, vol. 4432, pp. 477–491. Springer, Heidelberg (2007) 58. Tadeusiewicz, R., Ogiela, M.R., Ogiela, L.: A New Approach to the Computer Support of Strategic Decision Making in Enterprises by Means of a New Class of Understanding Based Management Support Systems. In: Saeed, K., Abraham, A., Mosdorf, R. (eds.) CISIM 2007 – IEEE 6th International Conference on Computer Information Systems and Industrial Management Applications Ełk, Poland, June 2830, pp. 9–13. IEEE Computer Society (2007) 59. Tadeusiewicz, R., Ogiela, M.R.: Automatic Understanding of Images. In: Adipranata, R. (ed.) Proceedings of International Conference on Soft Computing, Intelligent System and Information Technology (ICSIIT 2007), Bali-Indonesia, July 26-27, pp. 13–38. Special Book for Keynote Talks, Informatics Engineering Department, Petra Christian University, Surabaya (2007) 60. Tanaka, E.: Theoretical aspects of syntactic pattern recognition. Pattern Recognition 28, 1053–1061 (1995) 61. Wang, Y.: The Real-Time Process Algebra (RTPA). The International Journal of Annals of Software Engineering 14, 235–274 (2002) 62. Wang, Y.: On Cognitive Informatics. Brain and Mind: A Transdisciplinary Journal of Neuroscience and Neurophilosophy 4(2), 151–167 (2003) 63. Wang, Y.: The Theoretical Framework of Cognitive Informatics. International Journal of Cognitive Informatics and Natural Intelligence 1(1), 1–27 (2007) 64. Wang, Y.: The Cognitive Processes of Formal Inferences. International Journal of Cognitive Informatics and Natural Intelligence 1(4), 75–86 (2007) 65. Wang, Y.: On Concept Algebra: A Denotational Mathematical Structure for Knowledge and Software Modeling. International Journal of Cognitive Informatics and Natural Intelligence 2(2), 1–19 (2008) 66. Wang, Y.: On System Algebra: A Denotational Mathematical Structure for Abstract System modeling. International Journal of Cognitive Informatics and Natural Intelligence 2(2), 20–42 (2008) 67. Wang, Y.: Deductive Semantics of RTPA. International Journal of Cognitive Informatics and Natural Intelligence 2(2), 95–121 (2008)
Pattern Classifications in Cognitive Informatics
57
68. Wang, Y.: On Visual Semantic Algebra (VSA) and the Cognitive Process of Pattern Recognition. In: Proc. 7th International Conference on Cognitive Informatics (ICCI 2008). IEEE CS Press, Stanford University, CA (2008) 69. Wang, Y., Johnston, R., Smith, M. (eds.): Cognitive Informatics: Proceedings 1st IEEE International Conference (ICCI 2002). IEEE CS Press, Canada (2002) 70. Wang, Y., Kinsner, W.: Recent Advances in Cognitive Informatics. IEEE Transactions on Systems, Man, and Cybernetics (Part C) 36(2), 121–123 (2006) 71. Wang, Y., Wang, Y., Patel, S., Patel, D.: A Layered Reference Model of the Brain (LRMB). IEEE Transactions on Systems, Man, and Cybernetics (Part C) 36(2), 124–133 (2006) 72. Wang, Y., Zhang, D., Latombe, J.C., Kinsner, W. (eds.): Proceedings 7th IEEE International Conference on Cognitive Informatics (ICCI 2008). IEEE CS Press, Stanford University, USA (2008) 73. Wilson, R.A., Keil, F.C.: The MIT Encyclopedia of the Cognitive Sciences. MIT Press (2001) 74. Yao, Y., Shi, Z., Wang, Y., Kinsner, W. (eds.): Proc. 5th IEEE International Conference on Cognitive Informatics (ICCI 2006). IEEE CS Press, China (2006) 75. Zadeh, L.A.: Fuzzy Sets and Systems. In: Fox, J. (ed.) Systems Theory, pp. 29–37. Polytechnic Press, Brooklyn NY (1965) 76. Zadeh, L.A.: Fuzzy logic, neural networks, and soft computing. Communications of the ACM 37(3), 77–84 (1994) 77. Zadeh, L.A.: Toward human level machine intelligence–Is it achievable? In: Proc. 7th International Conference on Cognitive Informatics (ICCI 2008). IEEE CS Press, Stanford University, CA (2008) 78. Zajonc, R.B.: On the Primacy of Affect. American Psychologist 39, 117–123 (1984) 79. Zhong, N., Raś, Z.W., Tsumoto, S., Suzuki, E. (eds.): 14th International Symposium on Foundatation of Intelligent Systems ISMIS 2003, Maebashi City, Japan (2003)
Chapter 5
Optimal Differential Filter on Hexagonal Lattice Suguru Saito, Masayuki Nakajiama, and Tetsuo Shima Department of Computer Science, Tokyo Institute of Technology, Japan
Abstract. Digital two-dimensional images are usually sampled on square lattices, whose adjacent pixel distances in the horizontal-perpendicular and diagonal directions are not equal. On the other hand, a hexagonal lattice, however, covers an area with sampling points whose adjacent pixel distances are the same; therefore, it has te potential advantage that it can be used to calculate accurate two-dimensional gradient. The fundamental image filter in many image processing algorithms is used to extract the gradient information. For the extraction, various gradient filters have been proposed on square lattices, and some of them have been thoroughly optimized but not on a hexagonal lattice. In this chapter, consistent gradient filters on hexagonal lattices are derived, the derived filters are compared with existing optimized filters on square lattices, and the relationship between the derived filters and existing filters on a hexagonal lattice is investigated. The results of the comparison show that the derived filters on a hexagonal lattice achieve better signal-to-noise ratio and localization than filters on a square lattice.
1 Introduction Obtaining the differential image of a given input image is a fundamental operation in image processing. In most cases, the differential image is the result of a convolution of the input image with a differential filter. Accordingly, the more accurate the differential filter is, the better the convolution results will be. Many discrete differential filters[8, 13, 19, 18, 21] have been proposed; however, the gradient derived by them are not so accurate. Ando therefore proposed “consistent gradient filters,” which are optimized differential filters on a square lattice[2]. These filters are derived by minimizing the difference between the ideal differential and the differential obtained with filters in the frequency domain, and they have succeeded in obtaining more accurate differential values. On the other hand, image processing on hexagonal lattices has also been studied for many years. Fundamental research on hexagonal lattices, for example, research M.R. Ogiela and L.C. Jain (Eds.): Computational Intelligence Paradigms, SCI 386, pp. 59–87. c Springer-Verlag Berlin Heidelberg 2012 springerlink.com
60
S. Saito, M. Nakajiama, and T. Shima
on signal processing[14], geometric transform[10], co-occurrence matrix[15], efficient algorithms for Fourier transforms[9], and FIR filter banks[11] has been reported. Moreover, image processing on hexagonal lattices, ranging from hardware to application algorithms [3, 4, 5, 12, 22, 24, 26, 27, 28] has also been widely researched. Moreover, a book that collects researches about image processing on hexagonal lattices has been published[16]. It is also known that the human eye receptors follow a hexagonal alignment, there are works related to human perception and image processing on hexagonal lattice. Overington[17] proposed lightweight image processing methods on hexagonal lattices. Gabor filters on hexagonal lattice, believed to be performed in the lower levels of the human vision system, have also been proposed[25]. In this chapter, consistent gradient filters on hexagonal lattices are described, adding to our paper [20]. The filters are derived on the basis of a previously proposed on square lattices [2]. The relationship between the derived filters and existing filters on a hexagonal lattice designed in another way is then discussed. After that, the derived filters are compared with conventional optimized filters on square lattices.
2 Preliminaries First, F(u, v) is taken as the Fourier transform of f (x, y): F(u, v) =
∞ ∞ −∞ −∞
f (x, y)e−2π i(ux+vy) dxdy
(1)
In this chapter, The pixel placement on hexagonal lattices and square lattices is defined as Figure 1. It is assumed that the input image contains frequency components only in the hexagonal region which is described in Figure 2. According to the sampling theorem, the impulse response of an image sampled by an hexagonal lattice is repeated and the repeating unit region is illustrated in Figure 2 ([7]).
3 Least Inconsistent Image Let the x-axis and the y-axis correspond to the horizontal axis (0◦ ) and the vertical axis (90◦ ) respectively. Let ha (x, y), hb (x, y), and hc (x, y) be elements of discrete gradient filters in the directions 0◦ , 60◦ , and 120◦ respectively, on hexagonal lattices, while the arguments (x, y) of them indicate a point in a traditional orthogonal coordinate system. The gradients of image f (x, y) in the directions of the filters, which are derived by the convolution, are denoted by fa (x, y), fb (x, y), fc (x, y), fa (x, y) = ha (x, y) ∗ f (x, y) =
∑
x1 ,y1 ∈RH
ha (x1 , y1 ) f (x − x1 , y − y1 ) (2)
Optimal Differential Filter on Hexagonal Lattice
fb (x, y) = hb (x, y) ∗ f (x, y) =
fc (x, y) = hc (x, y) ∗ f (x, y) =
61
∑
hb (x1 , y1 ) f (x − x1 , y − y1 ) (3)
∑
hc (x1 , y1 ) f (x − x1 , y − y1 ), (4)
x1 ,y1 ∈RH
x1 ,y1 ∈RH
where RH is a set of pixels inside the filters. They are described in the frequency domain as follows: Fa (u, v) = Ha (u, v)F(u, v) Fb (u, v) = Hb (u, v)F(u, v) Fc (u, v) = Hc (u, v)F(u, v).
(a) Hexagonal lattices
(5) (6) (7)
(b) Square lattices
Fig. 1 Hexagonal and square lattices. The √ distance between adjacent pixels is always 1 on hexagonal lattices, while it is either 1 or 2 on square lattices.
The least-inconsistent image of discrete gradient images fa (x, y), fb (x, y), and fc (x, y) is denoted as g(x, y). In a similar manner to Ando’s definition of g for square lattices, g(x, y) is determined by minimizing the following criterion: 2 √ 2 ∂ 1 ∂ 3 ∂ g(x, y) − fa (x, y) + + g(x, y) − fb (x, y) 2 ∂x ∂x 2 ∂y −∞
∞ ∞ −∞
2 √ 1 ∂ 3 ∂ + − + g(x, y) − fc (x, y) dxdy. (8) 2 ∂x 2 ∂y
62
S. Saito, M. Nakajiama, and T. Shima
In the frequency domain, by using Parseval’s theorem, and by considering that the same shape of spectra appears repeatedly due to the sampling theorem, the problem is transformed into the minimization of the following on a single unit of spectra:
2 √ 1 3 |2π iuG(u, v) − Fa(u, v)| + 2π i u+ v G(u, v) − Fb(u, v) 2 2 √ 2 1 3 + 2π i − u + v G(u, v) − Fc(u, v) dudv, (9) 2 2 2
D
where D
•dudv ≡ +
1/2 −1/2√3
√ √ •dudv v/ 3−1/ 3 √ √ 1/2 −v/ 3+1/ 3
0
0
√ 1/2 3
+
•dudv +
0
0
−1/2√3
√ √ •dudv −1/2 −v/ 3−1/ 3 √ √ 1/2 1/2√3 v/ 3+1/ 3
√ −1/2 1/2 3
•dudv +
√ •dudv. −1/2 −1/2 3
(10)
This area D is represented in Figure 2. The integrand of (9) is expanded as follows (where •∗ denotes the complex conjugate of •):
Fig. 2 Hexagonal region to be integrated
Optimal Differential Filter on Hexagonal Lattice
4π 2 u2 GG∗ − 2π iuGFa∗ + 2π iuG∗Fa + Fa Fa∗ √ 2 √ 1 3 1 3 + 4π 2 u+ v GG∗ − 2π i u+ v GFb∗ 2 2 2 2 √ 1 3 + 2π i u+ v G∗ Fb + Fb Fb∗ 2 2 √ 2 √ 1 3 1 3 2 ∗ + 4π − u + v GG − 2π i − u + v GFc∗ 2 2 2 2 √ 1 3 + 2π i − u + v G∗ Fc + Fc Fc∗ 2 2
63
(11)
In a similar manner to Ando’s method, differentiating with respect to G and G∗ gives the following conditions for g(x, y) to be the least inconsistent gradient image. √ √ 1 3 1 3 6π 2 (u2 +v2 )G∗ −2π iuFa∗ −2π i( u + v)Fb∗ − 2π i(− u + v)Fc∗ = 0 (12) 2 2 2 2 √ √ 1 3 1 3 6π (u + v )G + 2π iuFa + 2π i( u + v)Fb + 2π i(− u + v)Fc = 0 (13) 2 2 2 2 2
2
2
The sum and difference of these two expressions are given respectively as follows: 6π 2 (u2 + v2 )(G + G∗) + 2π iu(Fa − Fa∗ ) √ √ 1 3 1 3 ∗ + 2π i( u + v)(Fb − Fb ) + 2π i(− u + v)(Fc − Fc∗ ) = 0 (14) 2 2 2 2 6π 2 (u2 + v2 )(G − G∗) + 2π iu(Fa + Fa∗ ) √ √ 1 3 1 3 ∗ + 2π i( u + v)(Fb + Fb ) + 2π i(− u + v)(Fc + Fc∗ ) = 0 (15) 2 2 2 2 Equation (14) consists only of real parts, while Equation (15) consists of imaginary terms only. For g to be least inconsistent, both equations must equal zero. Summing these two expressions, gives the following condition: G(u, v) =
−i G1 (u, v), 3π (u2 + v2 )
(16)
64
S. Saito, M. Nakajiama, and T. Shima
where √ √ 1 3 1 3 G1 (u, v) ≡ (uHa (u, v) + ( u + v)Hb (u, v) + (− u + v)Hc (u, v))F(u, v). 2 2 2 2 (17) (16) gives the condition on g for (9) to be minimal. Errors of gradient filters are attributed to the inconsistency and to the smoothing effect. Applying a gradient filter on discrete image means that the resultant gradient value is actually smoothed, as the gradient filter for discrete images is not equal to the gradient defined on the continuous domain. As Ando described[2], the smoothing effect of gradient filters is not important in comparison to the inconsistency. Similarly, in the present derivation, the smoothing effect is therefore, also discarded. To reduce the inconsistency, the minimization of (9) is thus targeted.
4 Point Spread Function The aim of this chapter is to derive gradient filters for the three orientations on the hexagonal lattice: 0◦ , 60◦ and 120◦ . It is supposed that the gradient filters for 60◦ and 120◦ are obtained by rotating the gradient filter derived for 0◦ . It is also supposed that the gradient filter for 0◦ is symmetric with respect to the x-axis and antisymmetric with respect to the y-axis (See Figure 4). And amn is taken as a discrete coefficient of a gradient filter. Using a δ -function as the common impulse function, makes it possible to write a set of elements of a gradient filter as follows. mn mn The point spread functions hmn a , hb and hc , which define elements of gradient ◦ ◦ filters in the respective directions of 0 , 60 and 120◦ are described as hmn a (x, y)
√ √ 3n 3n m m = amn {−δ (x − , y − ) + δ (x + , y − ) 2 4 2 4 √ √ m 3n m 3n − δ (x − , y + ) + δ (x + , y + )}, (18) 2 4 2 4 √ √ 3 3n 2 (− 4 ), y + √ √ x + 12 ( m2 ) − 23 (− 43n ), y + √ √ x + 12 (− m2 ) − 23 ( 43n ), y + √ √ x + 12 ( m2 ) − 23 ( 43n )}, y +
1 m hmn b (x, y) = amn {−δ (x + 2 (− 2 ) −
+δ ( −δ ( +δ (
√ √ 3 m 1 3n 2 (− 2 ) + 2 (− 4 )) √ √ 3 m 3n 1 2 ( 2 ) + 2 (− 4 )) √ √ 3 3n m 1 2 (− 2 ) + 2 ( 4 )) √ √ 3 m 3n 1 2 ( 2 ) + 2 ( 4 ))}
(19)
Optimal Differential Filter on Hexagonal Lattice
65
Fig. 3 Positions of amn of hmn a where the distance between the center and any element is less than or equal to 1. Note that the elements on the horizontal axis are overlapped. The resultant derived filter with radius of 1 is shown in Figure 4(a).
and √ √ √ √ 3 3n 3 3n m 1 (− ), y + (− ) − (− 2 4 2 2 2 4 )) √ √ √ √ x − 12 ( m2 ) − 23 (− 43n ), y + 23 ( m2 ) − 12 (− 43n )) √ √ √ √ x − 12 (− m2 ) − 23 ( 43n ), y + 23 (− m2 ) − 12 ( 43n )) √ √ √ √ x − 12 ( m2 ) − 23 ( 43n ), y + 23 ( m2 ) − 12 ( 43n ))},
m 1 hmn c (x, y) = amn {−δ (x − 2 (− 2 ) −
+δ ( −δ ( +δ (
(20)
where m = 0, 1, 2, . . ., n = (m%2)(2k + 1) + ((m + 1)%2)2k(k = 0, 1, 2, . . . ) and the δ -function is a common impulse function. Point spread function hmn a of the gradient filter with a radius of 1, which means that every element of the filter is located at a distance from the center of the filter less than or equal to 1, is illustrated in Figure 3. a , η b and η c , are defined as To simplify notations, three functions ηmn mn mn √ 3 a nv) (21) ηmn (u, v)≡sin(π mu) cos(π √2 √ 1 3 3 3 b ηmn (u, v)≡sin(π m( u + v)) · cos(π n( u − v)) (22) 2 2√ 4 4√ 1 3 3 3 c ηmn (u, v)≡sin(π m(− u + v)) · cos(π n( u + v)) (23) 2 2 4 4 The Fourier transforms of ha , hb and hc are described by using η as follows: a Ha (u, v) = ∑ Hamn (u, v) = 4i ∑ amn ηmn (u, v)
(24)
b Hb (u, v) = ∑ Hbmn (u, v) = 4i ∑ amn ηmn (u, v)
(25)
c Hc (u, v) = ∑ Hcmn (u, v) = 4i ∑ amn ηmn (u, v)
(26)
m,n
m,n
m,n
m,n
m,n
m,n
66
S. Saito, M. Nakajiama, and T. Shima
where a Hamn (u, v)=4amn iηmn (u, v)
(27)
b Hbmn (u, v)=4amn iηmn (u, v) mn c Hc (u, v)=4amn iηmn (u, v).
(28) (29)
To ease the numerical optimization, the condition (16) is transformed. That is, Function G(u, v) is rewritten as G(u, v) =
−i G1 (u, v), 3π (u2 + v2 )
(30)
where √ √ 1 3 1 3 v)Hb (u, v) + (− u + v)Hc (u, v))F(u, v) G1 (u, v) ≡ (uHa (u, v) + ( u + 2 2 2 2 = 4i ∑ amn σmn (u, v)F(u, v) (31) m,n
and 1 a σmn (u, v) ≡ uηmn (u, v) + ( u + 2
√
√ 3 1 3 b c v)ηmn (u, v) + (− u + v)ηmn (u, v). (32) 2 2 2
The expression to be minimized, Equation (9), is then rewritten as D
where
Ψ (u, v)|F(u, v)|2 dudv
a b c Ψ (u, v)=16 ∑ ∑ akl amn (τkla τmn + τklb τmn + τklc τmn ) ,
(33)
(34)
k,l m,n
and 2u a σmn (u, v) − ηmn (u, v) 3(u2 + v2 ) √ u + 3v b b τmn (u, v)≡ 2 σmn (u, v) − ηmn (u, v) 3(u + v2 ) √ −u + 3v c c τmn (u, v)≡ σmn (u, v) − ηmn (u, v). 3(u2 + v2 ) a τmn (u, v)≡
(35) (36) (37)
Optimal Differential Filter on Hexagonal Lattice
67
5 Condition for Gradient Filter The condition for determining the amplitude of the gradient filters is obtained next. The following expression represents the gradient of function k(x, y) in direction of θ .
∂ ∂ cos θ + sin θ k(x, y) (38) ∂x ∂y In the Fourier domain, this equation is transformed to 2π i (u cos θ + v sin θ ) K(u, v).
(39)
As ha (x, y), hb (x, y) and hc (x, y) are gradient filters in the 0◦ , 60◦ and 120◦ directions, the following conditions hold: Ha (u, v)=2π iu
√
1 3 u+ v 2 2 √ 1 3 Hc (u, v)=2π i − u + v . 2 2
Hb (u, v)=2π i
(40) (41) (42)
It thus follows that
∑ Hamn (u, v)=2π iu
m,n
√ 1 3 u+ v ∑ 2 2 m,n √ 1 3 ∑ Hcmn (u, v)=2π i − 2 u + 2 v , m,n
(43)
Hbmn (u, v)=2π i
(44) (45)
must hold for any u and v. By taking the limit as |u|, |v| → 0 and using first-order approximations for the trigonometric functions, makes it possible to rewrite these expressions as 4i ∑ amn π mu = 2π iu (46) m,n
√ √ 1 3 1 3 4i ∑ amn π m( u + v) = 2π i( u + v) 2 2 2 2 m,n √ √ 1 3 1 3 4i ∑ amn π m(− u + v) = 2π i(− u + v). 2 2 2 2 m,n
(47)
(48)
68
S. Saito, M. Nakajiama, and T. Shima
These three expressions are respectively equivalent to 2 ∑ mamn = 1.
(49)
m,n
6 Numerical Optimization The goal here to minimize (33) under the derived condition (49). |F(u, v)|2 in (33) is replaced with its ensemble average P(u, v) in a similar manner to that reported by Ando as follows P(u, v) = E[|F(u, v)|2 ]. (50) That is, Equation (33), repeated below as
Ψ (u, v)|F(u, v)|2 dudv,
(51)
Ψ (u, v)P(u, v)dudv = 16 ∑ ∑ akl amn Rkl,mn
(52)
D
is rewritten as D
where Rkl,mn ≡
k,l m,n
D
a b c (τkla τmn + τklb τmn + τklc τmn )P(u, v)dudv.
(53)
Our objective is now to minimize J0 ≡ 16 ∑ ∑ akl amn Rkl,mn
(54)
k,l m,n
under the condition (49). The optimal values of amn are computed by a traditional gradient-descent optimization as follows. Taking i as the number of steps of optimization, gives J0 ≡ 16 ∑ ∑ akl amn Rkl,mn . (i)
(i) (i)
(55)
k,l m,n
Differentiating this expression, gives
∂
(i) J (i) 0 ∂ amn
= 16 ∑ akl Rkl,mn . (i)
(56)
k,l
To satisfy condition (49), the values are updated at each iteration as follows: (i)
atmp mn = amn − α
∂
(i) J , (i) 0 ∂ amn
(57)
Optimal Differential Filter on Hexagonal Lattice (i+1)
amn
69 tmp
=
amn 2 ∑m,n matmp mn
(58)
where α is a constant and atmp mn is a temporary variable. The minimization stops when all amn that compose the gradient filter satisfy the following condition. (i+1)
|amn
(i)
− amn | < ε ,
(59)
where ε is a constant. Under the assumption that P(u, v) = 1, the actual values of the gradient filter are obtained, as Ando described on square lattices. This assumption means that the input image’s spectrum is equivalent to white noise, so the derived filters are not specialized for a specific frequency. The optimization was performed with Octave[1] with α = 0.01 and ε = 10−12.
a = 0.333011 b = 0.166989 (a) Hex 1: Filter of radius 1
a = 0.272665 b = 0.136340 c = 0.030332 √ (b) Hex 3: Filter of √ radius 3
a = 0.182049 b = 0.091025 c = 0.050753 d = 0.024889 e = 0.012444 (c) Hex 2: Filter of radius 2 Fig. 4 Forms of derived gradient filters on hexagonal lattices
70
S. Saito, M. Nakajiama, and T. Shima
The resulting consistent gradient filters for the 0◦ , 60◦ , 120◦ directions are given in Figure 4.
7 Theoretical Evaluation 7.1 Signal-to-Noise Ratio In the evaluation of signal-to-noise ratio, first, fi (x, y) and f j (x, y) are taken as discrete gradient images in the x and y directions on square lattices, respectively, and Fi (u, v) and Fj (u, v) are taken as their Fourier transforms, respectively. Next, gsqr (x, y) is taken as the least inconsistent image of fi (x, y) and f j (x, y), and Gsqr (u, v) are taken as its Fourier transform. Ando[2] transformed the error 2 ∂ sqr g (x, y) − fi (x, y) ∂x −∞
∞ ∞ −∞
+
2 ∂ sqr g (x, y) − f j (x, y) dxdy (60) ∂y
by applying Parseval’s theorem and defined inconsistency as J sqr =
1/2 1/2 −1/2 −1/2
|2π uiGsqr (u, v) − Fi(u, v)|2 2 + 2π viGsqr (u, v) − Fj (u, v) dudv (61)
where the domain of integration is a unit of repeated spectra due to the sampling theory. The gradient intensity on square lattices is defined as J1sqr =
1/2 1/2 −1/2 −1/2
sqr 2 2 dudv, |Gsqr (u, v)| + G (u, v) x y
(62)
sqr sqr sqr where Gsqr x (u, v) and Gy (u, v) are the Fourier transforms of gx (x, y) and gy (x, y), which are the partial differentials of the least inconsistent image in the x and y directions, respectively, on square lattices. Similarly,
2 ∂ g(x, y) − 2 ( fa (x, y) + 1 ( fb (x, y) − fc (x, y))) ∂x 3 2 −∞
∞ ∞ −∞
2 ∂ 1 + g(x, y) − √ ( fb (x, y) + fc (x, y)) dxdy (63) ∂y 3
Optimal Differential Filter on Hexagonal Lattice
71
is transformed by using Parseval’s theorem to give 2 2π iuG(u, v) − 2 (Fa (u, v) + 1 (Fb (u, v) − Fc(u, v))) 3 2 −∞
∞ ∞ −∞
2 1 + 2π ivG(u, v) − √ (Fb (u, v) + Fc(u, v)) dudv. (64) 3
Inconsistency on hexagonal lattices is therefore defined as G˜ x (u, v)2 + G˜ y (u, v)2 dudv, J hex ≡
(65)
D
where 2 1 G˜ x (u, v) ≡ 2π iuG(u, v) − (Fa (u, v) + (Fb (u, v) − Fc(u, v))) 3 2
(66)
1 G˜ y (u, v) ≡ 2π ivG(u, v) − √ (Fb (u, v) + Fc(u, v)). 3
(67)
And, gradient intensity J1hex is defined as follows. J1hex ≡
D
(|Gx (u, v)|2 + |Gy (u, v)|2 )dudv =
D
(4π 2 (u2 + v2 )|G(u, v)|2 )dudv.
The following expressions are defined to simplify the above expressions. 1 2 1 b b c a c ψmn ≡ √ u(ηmn + ηmn ) − v(ηmn + (ηmn − ηmn )) 3 2 3
(68)
2 1 b 1 a c b c φmn ≡ u(ηmn + (ηmn − ηmn )) + √ v(ηmn + ηmn ) 3 2 3
(69)
g˜x (x, y) ≡
∂ 2 1 g(x, y) − ( fa (x, y) + ( fb (x, y) − fc (x, y))) ∂x 3 2
(70)
∂ 1 g(x, y) − √ ( fb (x, y) + fc (x, y)) ∂y 3
(71)
g˜y (x, y) ≡
|G˜ x (u, v)|2 + |G˜ y (u, v)|2 |F(u, v)|2 1 1 2 1 = 2 | √ u(Hb (u, v) + Hc (u, v)) − v(Ha (u, v) + (Hb (u, v) − Hc (u, v)))|2 3 2 u + v2 3 16 = 2 ( ∑ amn ψmn )2 (72) u + v2 m,n
Ψ0 (u, v)≡
72
S. Saito, M. Nakajiama, and T. Shima
Φ (u, v)≡
|Gx (u, v)|2 + |Gy (u, v)|2 |F(u, v)|2
√ √ 4π 2 (u2 + v2 ) 1 3 1 3 = 2 2 |uHa (u, v) + ( u + v)Hb (u, v) + (− u + v)Hc (u, v)|2 2 2 2 2 9π (u + v2 )2 16 = 2 ( ∑ amn φmn )2 (73) u + v2 m,n
Inconsistency J hex thus is rewritten as
J
hex
=
D
=
D
=
Ψ0 (u, v)P(u, v)dudv 16
| ∑ amn ψmn |2 P(u, v)dudv
D
u2 + v2
D
16 | ∑ amn ψmn |2 dudv. u2 + v2 m,n
=
(|G˜ x (u, v)|2 + |G˜ y (u, v)|2 )dudv
m,n
(74)
And gradient intensity J1hex is rewritten as J1hex = = = =
D
D
D
(|Gx (u, v)|2 + |Gy (u, v)|2 )dudv
Φ (u, v)P(u, v)dudv 16 { ∑ amn φmn (u, v)}2 P(u, v)dudv u2 + v2 m,n
16 { a φ (u, v)}2 dudv. 2 + v2 ∑ mn mn u D m,n
(75)
The ratio of J1 to J corresponds to signal-to-noise ratio (SNR), which is defined as follows: 1 J1 SNR ≡ log2 . (76) 2 J SNR is used to compare the consistent gradient filters on square lattices[2] with the gradient filters on hexagonal lattices derived here. The results for J, J1 and SNR are listed in Table 1. First, the table shows that the values stated in [2] for the square lattices could be reproduced. The derived filters on hexagonal lattices also achieved higher J1 than similar size filters on square lattices. Since J1 is the integration of signal intensity for each frequency that the filter lets pass, our filters on hexagonal lattices are superior to similar size filters on square lattices with respect to frequency permeability.
Optimal Differential Filter on Hexagonal Lattice
73
Table 1 Properties of derived filters Filter Sqr 3 × 3 Sqr 4 × 4 Sqr 5 × 5
J .000778989 .000016162 .000000376
J1 .40310040 .18895918 .11585859
SNR 4.51 6.76 9.12
Num. o f Pixels 9 16 25
Hex 1 √ Hex 3 Hex 2
.001432490 .000017713 .000000086
.74197938 .51425309 .25649444
4.51 7.41 10.75
7 13 19
7.2 Localization How the elements of a filter are “balanced” is evaluated next. In particular, how close to the center of a filter are the element values spread is investigated here. If the element values gather close to the center, the filter focuses on the information close to the center. Localization on square lattices is defined as Locsqr ≡ where sqr Pmn =
sqr ∑m,n Pmn dm,n sqr , ∑m,n Pmn
(hi (m, n))2 + (h j (m, n))2 ,
(77)
(78)
where hi (m, n) and h j (m, n) are the elements of the gradient filters derived in the x and y directions, respectively, and dm,n is distance from the center of the filter. Moreover, the localization on hexagonal lattices is defined as Lochex ≡
hex d ∑m,n Pmn m,n , hex ∑m,n Pmn
(79)
where hex Pmn
2 2 1 = ha (m, n) + (hb (m, n) − hc (m, n)) 3 2 2 1/2 1 + √ (hb (m, n) + hc (m, n)) , (80) 3
where ha (m, n), hb (m, n) and hc (m, n) are the elements of the gradient filters derived in the 0◦ , 60◦ and 120◦ directions, respectively. The resultant localizations are listed in Table 2 with resultant SNR, and they are plotted in Figure 5. It is clear that the smaller the filter is, the better its localization is. At the same time, the larger the filter is, the better the SNR is. Accordingly, there
74
S. Saito, M. Nakajiama, and T. Shima
Table 2 Localization and SNR Filter Sqr 3 × 3 Sqr 4 × 4 Sqr 5 × 5
Num. of Pixels 9 16 25
Loc 1.1522 1.2698 1.5081
SNR 4.51 6.76 9.12
Hex 1 √ Hex 3 Hex 2
7 13 19
1 1.0833 1.2553
4.51 7.41 10.75
12 10 SNR
8 6 4 2
Fig. 5 Filters on hexagonal lattices have better SNR despite localization increases in comparison to those on square lattices.
0 0.9
1
1.1
1.2
1.3
1.4
1.5
1.6
Loc Sqr
Hex
is a trade-off between localization Loc and SNR. However, for a given size, filters derived on hexagonal lattices achieve better SNR-to-Loc ratio than filters derived on square lattices, as shown in Figure 5.
8 Experimental Evaluation 8.1 Construction of Artificial Images The consistent gradient filters on hexagonal and square lattices are compared as follows. Specifically, to get both gradient intensity and orientation at any point on the image analytically, artificial images defined by mathematical functions were constructed. The error between the ideal value and the value obtained from the filtered image were the measured. Since it was assumed in the derivation of the element values of the gradient filters that the frequency characteristics of the input image are close to white noise, images that present other frequency profiles for evaluation were constructed. The artificial input image is taken as f (x, y), and f int (x, y) and f ori (x, y) are taken as its ideal gradient intensity and its ideal orientation, respectively. fex1 , fex2 and fex3 are defined as follows.
Optimal Differential Filter on Hexagonal Lattice
75
fex1 is composed mostly of low frequencies. It has smooth changes in luminance given as: fex1 (x, y) ≡
R21 − (x2 + y2 ).
The gradient intensity and orientation of this image are x2 + y2 int fex1 (x, y) ≡ − R21 − (x2 + y2 ) ori fex1 (x, y) ≡ arctan(x, y),
(81)
(82)
(83)
where R1 is a constant. fex2 is an image composed mostly of high frequencies. It has periodical changes in luminance given as: fex2 (x, y) ≡ A2 · cos2 (ω2 x2 + y2 ). (84) Its gradient intensity and orientation are int fex2 (x, y) ≡ |A2 · ω2 · sin (2ω2
x2 + y2 )|
ori fex2 (x, y) ≡ arctan(x, y),
(85) (86)
where A2 and ω2 are constants. fex3 is composed of low to high frequencies given as fex3 (x, y) ≡ A3 · cos2 (ω3 (x2 + y2 )).
(87)
Its gradient intensity and orientation are int fex3 (x, y) ≡ 2|A3 · ω3 · sin (2ω3 (x2 + y2 ))| x2 + y2
(88)
ori fex3 (x, y) ≡ arctan(x, y),
(89)
where A3 and ω3 are constants. For a given input image f , Int( f ) is the gradient intensity computed from the filtered image and Ori( f ) is the orientation computed from the filtered image. To evaluate the accuracy of the filters, the errors are then calculated as follows: | f int − Int( f )|
(90)
| f ori − Ori( f )|.
(91)
and
76
S. Saito, M. Nakajiama, and T. Shima
8.2 Detection of Gradient Intensity and Orientation Ando’s gradient filters (3 × 3 and 5 × 5) for square lattices and the derived gradient filters (radius 1 and 2) for hexagonal lattices are evaluated as follows. The gradient intensity and orientation are determined from a filtered image as follows. The differential values in the x and y directions are taken as fx and fy respectively, on square lattices, fr , fs and ft are taken as the differential values in the 0◦ , 60◦ , and 120◦ directions, respectively, on hexagonal lattices. The functions for calculating the gradient intensity on the square and hexagonal lattices, respectively, are taken as Int sqr and Int hex as follows: Int sqr ( f ) ≡ ( fx )2 + ( fy )2 (92)
Int
hex
2
2 1/2 2 1 1 (f) ≡ fr + ( fs − ft ) + √ ( f s + ft ) 3 2 3
(93)
Similarly, the orientation is given as Orisqr ( f ) ≡ arctan( fx , fy )
hex
Ori
2 1 1 ( f ) ≡ arctan fr + ( fs − ft ) , √ ( fs + ft ) . 3 2 3
(94)
(95)
Both functions rely on arctan, a function known for its high computational cost. In the following section, the accuracy of the orientation calculated by Overington’s method[17], which has a smaller calculation cost, is evaluated.
8.3 Overington’s Method of Orientation Detection Overington[17] proposed an orientation-detection method specially designed for hexagonal lattices. The method performs better in terms of calculation time than methods using arctangent, such as (94) and (95). While Overington focused on hexagonal lattices (six axes), his approach is extended here to any number of axes. First, it is supposed that n gradient filters are available in n directions, uniformly sampled. Each filter’s orientation is at angle π /n from its neighbors. The differential value in the direction θ f is taken as zθ f . Accordingly, the following equations are defined:
θ1f ≡argmaxθ f (|zθ f |)
(96)
f f θ0 ≡ θ1 θ2f ≡θ1f
− π /n
(97)
+ π /n
(98)
Optimal Differential Filter on Hexagonal Lattice
77
z0 ≡zθ f
(99)
z1 ≡zθ f
(100)
z2 ≡zθ f
(101)
0 1 2
Overington assumed that differential values for three directions can be fitted by a trigonometric function, where the three directions consist of a direction that gives the highest absolute differential value and two other adjacent directions. That is, it is assumed that these equations can be fitted as follows. z0 =B cos(θ1 − π /n),
(102)
z1 =B cos θ1 , z2 =B cos(θ1 + π /n),
(103) (104)
θ1 ≡θ1f − θ p
(105)
where
and θ p is the orientation to be detected. The differences between (103) and (102), and between (103) and (104) give z1 − z0 =B(cos θ1 − cos(θ1 − π /n)),
(106)
z1 − z2 =B(cos θ1 − cos(θ1 + π /n)).
(107)
Dividing the sum of them by the difference of them gives tan θ1 =
1 − cos(π /n) z0 − z2 · . sin(π /n) 2z1 − z0 − z2
(108)
To reduce computation cost, the first term of Maclaurin expansion is used for tangent, that is tan θ1 ≈ θ1 .
θ1 =
z0 − z2 1 − cos(π /n) · sin(π /n) 2z1 − z0 − z2
(109)
The orientation θ p is detected as follows. f
θ p = θ1 − θ1
(110)
As stated above, this method is good in situations where calculation time is more important than high accuracy, because it does not need to call for the time-consuming arctangent. Overington proposed this method for hexagonal lattices. For the sake of comparison, his method is also adopted here for square lattices, where n = 6 for both lattices were prepared. For hexagonal lattices, gradient filters in six directions were prepared. Filters for the 0◦ , 60◦ and 120◦ directions are derived in section 6. Filter
78
S. Saito, M. Nakajiama, and T. Shima
hd for the 90◦ direction is composed from the filters for the 60◦ and 120◦ directions as follows: 1 (111) hd ≡ √ (hb + hc ) . 3 The filters for the 30◦ and 150◦ directions are the obtained by rotating hd (Figure 6). In the same way, for square lattices, gradient filters for the 30◦ , 60◦ , 120◦ and 150◦ directions were prepared by composing Ando’s filters in the 0◦ and 90◦ directions.
a
0 -a
a
0 -a
-
0
-
0
0 a-
0
a-
-
0 a-
0 a-
0
a = 0.2886751 he
hd
hf
Fig. 6 Derived differential filters for 30, 90 and 150 degree from filters for 0, 60 and 120 degree
8.4 Relationship between Derived Filter and Staunton Filter
The relationship between the filters derived here (whose radius is 1) and the Staunton filters [23] is investigated as follows. Staunton designed a set of edge-detecting filters as illustrated in Figure 7. He mentioned that the element values in these filters, which are 1 or −1, are nearly optimal according to Davies' design principle [6]. This principle is based on super-sampling a disc whose center is the center of a filter in the image domain; by contrast, the present filters are derived in the frequency domain. The equations for detecting intensity and orientation from the convolution values with the Staunton filters are described in the following. The intensity Int^{hex}_{Staunton} is given as

Int^{hex}_{Staunton} \equiv \frac{1}{\sqrt{3}} \sqrt{f_{p_a}^2 + f_{p_c}^2 + f_{p_a} f_{p_c}}, \quad (112)

where f_{p_a} and f_{p_c} are the convolution values with filters p_a and p_c, respectively.¹ The orientation is given as

Ori^{hex}_{Staunton} \equiv \arctan\left( f_{p_a} + f_{p_c},\ \frac{1}{\sqrt{3}} (f_{p_a} - f_{p_c}) \right). \quad (113)

¹ Since we use the opposite-signed p_c, the sign of the third term differs from that in [23]. The constant 1/\sqrt{3} also differs because the present grid distance is 1, whereas it is 2/\sqrt{3} in [23].
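In code, (112) and (113) amount to the following illustrative sketch; `atan2` plays the role of the two-argument arctangent in (113), with the argument order taken as (perpendicular, horizontal), as discussed below.

```python
import math

def staunton_intensity(f_pa, f_pc):
    # Eq. (112): gradient intensity from convolutions with p_a and p_c.
    return math.sqrt(f_pa ** 2 + f_pc ** 2 + f_pa * f_pc) / math.sqrt(3.0)

def staunton_orientation(f_pa, f_pc):
    # Eq. (113): p_a + p_c acts as the perpendicular (vertical) filter p_b,
    # and (p_a - p_c) / sqrt(3) as the matching horizontal filter.
    return math.atan2(f_pa + f_pc, (f_pa - f_pc) / math.sqrt(3.0))
```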
Since our hexagonal filter assumes that the element values are distributed as shown in Figure 4(a), the above-described Staunton filters are not derived from it. However, h_d derived by (111), and h_e and h_f derived by rotating h_d, give filters with the same shape as the Staunton filters, though the element values are not the same. On the other hand, filters that have the same proportion of elements as h_a, h_b and h_c can be derived from p_a, p_b and p_c by

p_d = \frac{1}{\sqrt{3}} (p_a + p_b), \quad (114)

p_e = \frac{1}{\sqrt{3}} (p_b + p_c), \quad (115)

p_f = \frac{1}{\sqrt{3}} (p_c - p_a). \quad (116)
Our equation (93) for deriving the gradient intensity can be rewritten as

Int^{hex}(f) = \frac{2}{3} \left\{ \left[ f_r + \frac{1}{2}(f_s - f_t) \right]^2 + \left[ \frac{\sqrt{3}}{2}(f_s + f_t) \right]^2 \right\}^{1/2} \quad (117)

= \frac{2}{3} \left[ f_r^2 + f_r f_s - f_r f_t + \frac{1}{4}(f_s^2 - 2 f_s f_t + f_t^2) + \frac{3}{4}(f_s^2 + 2 f_s f_t + f_t^2) \right]^{1/2}

= \left[ \frac{1}{9} \left( 4 f_r^2 + 4 f_s^2 + 4 f_t^2 + 4 f_r f_s + 4 f_s f_t - 4 f_r f_t \right) \right]^{1/2}

= \frac{2}{3} \left[ f_r^2 + f_s^2 + f_r f_s + f_t (f_t + f_s - f_r) \right]^{1/2}

= \frac{2}{3} \left( f_r^2 + f_s^2 + f_r f_s \right)^{1/2},

where the last equality uses f_t + f_s − f_r = 0, i.e. f_r = f_s + f_t, which holds for the three directional responses under the sign convention adopted here. The result is the same as (112) except for the constant coefficient. The gradient-intensity detecting equation with the Staunton filters can therefore detect gradient intensities with the same accuracy as our filters.
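A quick numerical check of this algebra (a sketch only; the constraint f_r = f_s + f_t is exactly the assumption under which the last step above holds):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, ft = rng.standard_normal(2)
fr = fs + ft  # sign-convention identity used in the last step of (117)

full = (2.0 / 3.0) * np.sqrt((fr + 0.5 * (fs - ft)) ** 2
                             + 0.75 * (fs + ft) ** 2)     # eq. (117)
reduced = (2.0 / 3.0) * np.sqrt(fr ** 2 + fs ** 2 + fr * fs)
assert np.isclose(full, reduced)  # (117) and its reduced form agree
```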
Fig. 7 Staunton filters [23]: p_a, p_b and p_c, with element values 0 and ±a, a = 1. The original p_c has the opposite sign.
The equation for the orientation derivation, (113), also has the same meaning as (95). The filter p_a + p_c is equal to p_b, so it gives a perpendicular differential value. The filter p_a − p_c, shown in Figure 8, forms a horizontal differential filter; however, its norm is not the same as that of p_b. Therefore (113) applies the constant 1/\sqrt{3} to p_a − p_c to avoid an anisotropic error. Hence (113) has the same accuracy as (95), because the only difference between them is the presence or absence of a redundancy. The Staunton filters and the derived consistent gradient filters with radius 1 thus have the same accuracy.
Fig. 8 Horizontal filter composed as p_a − p_c, with element values 0, ±1 and ±2.
8.5 Experiment and Results
Java 1.5 was used for the experiments. The BigDecimal class, which provides fixed-point calculation, was used to avoid calculation error, and the StrictMath class was used for mathematical functions. Artificial images of 201 × 201 pixels on square lattices and 201 × 233 pixels on hexagonal lattices were generated so as to have similar area and shape. Pixels located at a distance of less than 90 from the center of both lattices were taken into consideration. To construct the artificial test images, the previously defined functions with the following parameters were used: R_1 = 1000 for f_ex1; A_2 = 255 and ω_2 = π/4 for f_ex2; and A_3 = 255 and ω_3 = π/(8 · 90) for f_ex3. The period of f_ex2 is 4, and the period of f_ex3 is 4 at the boundary of the evaluated region, whose radius is 90. That is, to avoid any aliasing errors, only artificial images well under the Nyquist frequency were sampled. These test images are shown in Figure 9. The experimental results are summarized as follows. For each image, the error statistics (mean, variance, maximum and minimum) are shown: Table 3 lists the results for f_ex1, Table 4 the results for f_ex2, and Table 5 the results for f_ex3. Visual versions of these results are shown in Figure 10, Figure 11 and Figure 12, where the error is mapped to the luminance value. To show the difference in accuracy among the four filters, the minimum and maximum errors are mapped to luminance values of 0 and 255, respectively. As stated above, only pixels located at a distance of less than 90 from the center are included in the evaluation; pixels outside this area are mapped to a luminance value of 0.
Fig. 9 Original test images (a) f_ex1, (b) f_ex2 and (c) f_ex3 on square lattices. Note that the luminance of f_ex1 varies slowly.
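The error statistics in Tables 3-5 are per-pixel statistics over the evaluation disc. A sketch of that bookkeeping for the square-lattice case is shown below (Python for illustration; the original experiments used Java's BigDecimal for fixed-point arithmetic, and hexagonal pixels would first have to be mapped to their Cartesian coordinates):

```python
import numpy as np

def error_stats(estimate, truth, center, radius=90.0):
    """Mean, variance, max and min of |estimate - truth| over pixels
    closer than `radius` to `center`, as tabulated in Tables 3-5."""
    h, w = estimate.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = center
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 < radius ** 2
    e = np.abs(estimate - truth)[mask]
    return e.mean(), e.var(), e.max(), e.min()
```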
Table 3 Errors for f_ex1, R_1 = 1000

Filter     Mean         Variance     Max          Min
Gradient intensity
Sqr 3×3    4.39986E-8   2.47255E-16  6.64841E-8   7.25474E-10
Sqr 5×5    3.33738E-7   1.40526E-14  5.02276E-7   5.53689E-9
Hex 1      3.03035E-8   1.17287E-16  4.58079E-8   5.00000E-10
Hex 2      5.78494E-8   4.12577E-16  8.64151E-8   9.71484E-10
Orientation by arctan (radian)
Sqr 3×3    1.15054E-10  8.60308E-21  3.60189E-10  0.00000E0
Sqr 5×5    1.72864E-11  1.93587E-22  5.41087E-11  0.00000E0
Hex 1      4.18671E-13  3.32438E-25  1.11029E-11  0.00000E0
Hex 2      2.45733E-13  1.12628E-25  6.12340E-12  0.00000E0
Orientation by Overington's method (radian)
Sqr 3×3    1.52959E-3   3.05383E-6   6.15061E-3   5.61363E-15
Sqr 5×5    1.52959E-3   3.05383E-6   6.15061E-3   5.61363E-15
Hex 1      1.52469E-3   3.01695E-6   6.14350E-3   0.00000E0
Hex 2      1.52469E-3   3.01695E-6   6.14350E-3   0.00000E0
Moreover, the mean errors of gradient intensity, orientation detection using arctangent, and orientation detection by Overington's method are shown in Figure 13(a), (b) and (c), respectively.
Table 4 Errors for f_ex2, ω_2 = π/4, A_2 = 255

Filter     Mean        Variance    Max         Min
Gradient intensity
Sqr 3×3    4.78122E1   6.04649E2   1.03478E2   1.36418E-1
Sqr 5×5    7.96775E1   1.67478E3   1.61632E2   5.08181E-1
Hex 1      3.54782E1   3.00115E2   7.64858E1   1.51197E-1
Hex 2      6.13548E1   8.97091E2   1.27053E2   3.68400E-1
Orientation by arctan (radian)
Sqr 3×3    2.48541E-2  4.62197E-4  3.12839E-1  0.00000E0
Sqr 5×5    2.36221E-3  3.37087E-5  7.15883E-2  0.00000E0
Hex 1      3.35539E-3  5.45738E-5  1.18910E-1  0.00000E0
Hex 2      3.67423E-4  5.04084E-7  9.71053E-3  0.00000E0
Orientation by Overington's method (radian)
Sqr 3×3    2.46246E-2  4.76264E-4  3.12654E-1  5.61363E-15
Sqr 5×5    3.19489E-3  3.44523E-5  7.43153E-2  5.61363E-15
Hex 1      3.67976E-3  6.23404E-5  1.19508E-1  0.00000E0
Hex 2      1.59309E-3  3.31283E-6  1.31418E-2  0.00000E0

Table 5 Errors for f_ex3, ω_3 = π/(8 · 90), A_3 = 255

Filter     Mean        Variance    Max         Min
Gradient intensity
Sqr 3×3    2.02389E1   3.60586E2   7.85115E1   2.91073E-4
Sqr 5×5    3.61619E1   1.05073E3   1.27089E2   1.04375E-4
Hex 1      1.45449E1   1.88399E2   5.56746E1   2.46549E-4
Hex 2      2.64606E1   5.85607E2   9.59810E1   7.97684E-3
Orientation by arctan (radian)
Sqr 3×3    1.17490E-2  9.75118E-5  8.30716E-2  0.00000E0
Sqr 5×5    1.20048E-3  3.78492E-6  3.03901E-2  0.00000E0
Hex 1      8.01422E-4  8.44267E-7  3.98812E-3  0.00000E0
Hex 2      1.11561E-4  2.05645E-8  1.69539E-3  0.00000E0
Orientation by Overington's method (radian)
Sqr 3×3    1.17166E-2  1.02811E-4  8.40553E-2  5.61363E-15
Sqr 5×5    2.14414E-3  5.88746E-6  3.01349E-2  5.61363E-15
Hex 1      1.79184E-3  3.83489E-6  1.01309E-2  0.00000E0
Hex 2      1.53078E-3  3.02628E-6  6.48015E-3  0.00000E0

9 Discussion
Generally speaking, the computational cost of extracting the gradient value of a pixel on a hexagonal lattice is higher than that on a square lattice, because hexagonal lattices use three axes while square lattices use two. However, the results show that our filters on hexagonal lattices have many advantages with respect to accuracy. The most accurate results for gradient intensity were obtained with the radius-1 gradient filter on hexagonal lattices for all test images. The smaller the value of localization is, the smaller the mean error of gradient intensity is. This result is expected, since the analytical gradient is defined by the first partial derivatives, and the derivative is defined as the limit of the difference quotient. For applications where a precise intensity is pursued, it is recommended to use the derived gradient filter with radius 1 on hexagonal lattices.

For orientation detection using arctangent, the larger the value of localization is, the smaller the mean error is, regardless of the image. This is consistent with the result presented in section 7; namely, the larger the filter size, the larger the theoretical SNR becomes. Moreover, the derived gradient filters on hexagonal lattices showed smaller mean errors than the consistent gradient filters on square lattices. For better orientation detection, it is concluded that the derived gradient filters on hexagonal lattices (with higher localization) are more appropriate.

The errors in orientation detection using Overington's method also become smaller as the value of localization of the filter gets larger, just as for orientation detection using arctangent. Filters on hexagonal lattices perform better than the square ones for f_ex2 and f_ex3. For f_ex1, where luminance varies slowly, the errors are similar for both types of lattices, though the results for hexagonal lattices are still better than those for square ones. Detecting the gradient orientation on hexagonal lattices with Overington's method performs better than detecting the orientation on square lattices using arctangent for f_ex2, which mainly consists of high-frequency components, and for f_ex3, which consists of low- to high-frequency components; for f_ex1, which mainly consists of low-frequency components, this is not the case. The main advantage of Overington's method is its low computational cost, since it does not need to call the arctangent. As a result, using Overington's method with the derived filters on hexagonal lattices has advantages over using arctangent with gradient filters on square lattices, with respect to accuracy, computational cost, and simplicity of circuit implementation. For applications where high accuracy in gradient intensity or orientation detection is needed, the filters derived in this chapter for hexagonal lattices are good solutions.
Fig. 10 Results for f_ex1. Errors of gradient intensity are mapped to luminance values in (a)-(d), errors of orientation detection by arctangent in (e)-(h), and errors of orientation detection by Overington's method in (i)-(l), for the filters Sqr 3×3, Sqr 5×5, Hex 1 and Hex 2, respectively.

Fig. 11 Results for f_ex2 (panels as in Fig. 10).

Fig. 12 Results for f_ex3 (panels as in Fig. 10).

Fig. 13 Mean errors from Table 3, Table 4 and Table 5: (a) gradient intensity, (b) orientation detection by arctangent, and (c) orientation detection by Overington's method, for the filters Sqr 3×3, Sqr 5×5, Hex 1 and Hex 2.

10 Summary
Consistent gradient filters on hexagonal lattices were derived, the relationship between the Staunton filters and the derived filters was investigated, and the derived filters were compared with existing filters on square lattices both theoretically and experimentally. The Staunton filters and the derived consistent gradient filters with radius 1 are related to each other by a 30-degree rotation, even though the Staunton filters were designed in the image domain while the present filters were derived in the frequency domain. Both filter sets can therefore compute intensities and orientations with the same accuracy. In a theoretical evaluation, the derived filters show a better SNR despite their smaller localization, in comparison to the consistent gradient filters on square lattices. This theoretical evaluation assumed that the frequency characteristics of the input image are flat; to complement it, the derived filters were also evaluated with artificial images presenting various frequency characteristics. The derived filters on hexagonal lattices achieve higher accuracy in gradient intensity and orientation detection than the consistent gradient filters on square lattices. Moreover, the derived filters on hexagonal lattices combined with Overington's method can detect the orientation with higher accuracy than gradient filters on square lattices with arctangent, while being computationally lighter. The derived filters thus reduce calculation time and simplify circuit implementation. The standard image processing framework is still, however, based on square lattices. We hope that, in the near future, hexagonal lattices will be widely adopted and will give better results when used with the derived gradient filters.
References
1. Octave, http://www.gnu.org/software/octave/
2. Ando, S.: Consistent gradient operators. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(3), 252–265 (2000)
3. Balakrishnan, M., Pearlman, W.A.: Hexagonal subband image coding with perceptual weighting. Optical Engineering 32(7), 1430–1437 (1993)
4. Chettir, S., Keefe, M., Zimmerman, J.: Obtaining centroids of digitized regions using square and hexagonal tilings for photosensitive elements. In: Svetkoff, D.J. (ed.) Optics, Illumination, and Image Sensing for Machine Vision IV, SPIE Proc., vol. 1194, pp. 152–164 (1989)
5. Choi, K., Chan, S., Ng, T.: A new fast motion estimation algorithm using hexagonal subsampling pattern and multiple candidates search. In: International Conference on Image Processing, vol. 1, pp. 497–500 (1996)
6. Davies, E.: Circularity – a new principle underlying the design of accurate edge orientation operators. Image and Vision Computing 2(3), 134–142 (1984)
7. Dubois, E.: The sampling and reconstruction of time-varying imagery with application in video systems. Proceedings of the IEEE 73(4), 502–522 (1985)
8. Frei, W., Chen, C.C.: Fast boundary detection: A generalization and a new algorithm. IEEE Transactions on Computers C-26(10), 988–998 (1977), doi:10.1109/TC.1977.1674733
9. Grigoryan, A.M.: Efficient algorithms for computing the 2-D hexagonal Fourier transforms. IEEE Transactions on Signal Processing 50(6), 1438–1448 (2002)
10. Her, I.: Geometric transformations on the hexagonal grid. IEEE Transactions on Image Processing 4(9), 1213–1222 (1995)
11. Jiang, Q.: FIR filter banks for hexagonal data processing. IEEE Transactions on Image Processing 17(9), 1512–1521 (2008), doi:10.1109/TIP.2008.2001401
12. Kimuro, Y., Nagata, T.: Image processing on an omni-directional view using a spherical hexagonal pyramid: vanishing points extraction and hexagonal chain coding. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems 1995. Human Robot Interaction and Cooperative Robots, Pittsburgh, PA, vol. 3, pp. 356–361 (1995)
13. Kirsch, R.A.: Computer determination of the constituent structure of biological images. Computers and Biomedical Research 4, 315–328 (1971)
14. Mersereau, R.M.: The processing of hexagonally sampled two-dimensional signals. Proceedings of the IEEE 67(6), 930–953 (1979)
15. Middleton, L.: The co-occurrence matrix in square and hexagonal lattices. In: ICARCV 2002, vol. 1, pp. 90–95 (2002)
16. Middleton, L., Sivaswamy, J.: Hexagonal Image Processing: A Practical Approach. Springer, Heidelberg (2005)
17. Overington, I.: Computer Vision: A Unified, Biologically-Inspired Approach. Elsevier Science Pub. Co., Amsterdam (1992)
18. Prewitt, J.M., Mendelsohn, M.L.: The analysis of cell images. Annals of the New York Academy of Sciences 128, 1035–1053 (1966), doi:10.1111/j.1749-6632.1965.tb11715.x
19. Roberts, L.G.: Machine Perception of Three-Dimensional Solids. Massachusetts Institute of Technology, Lincoln Laboratory, Lexington (1963)
20. Shima, T., Saito, S., Nakajima, M.: Design and evaluation of more accurate gradient operators on hexagonal lattices. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(6), 961–973 (2010)
21. Sobel, I.: Camera Models and Machine Perception. Stanford University, Computer Science Department (1970)
22. Staunton, R.: A one pass parallel hexagonal thinning algorithm. In: Seventh International Conference on Image Processing and Its Applications 1999, vol. 2, pp. 841–845 (1999)
23. Staunton, R.C.: The design of hexagonal sampling structures for image digitization and their use with local operators. Image and Vision Computing, 162–166 (1989)
24. Staunton, R.C., Storey, N.: Comparison between square and hexagonal sampling methods for pipeline image processing. In: Svetkoff, D.J. (ed.) SPIE Proc. Optics, Illumination, and Image Sensing for Machine Vision IV, vol. 1194, pp. 142–151 (1989)
25. Thiem, J., Hartmann, G.: Biology-inspired design of digital Gabor filters upon a hexagonal sampling scheme. In: 15th International Conference on Pattern Recognition, vol. 3, pp. 445–448 (2000)
26. Tremblay, M., Dallaire, S., Poussart, D.: Low level segmentation using CMOS smart hexagonal image sensor. In: Proceedings of Computer Architectures for Machine Perception, CAMP 1995, pp. 21–28 (1995)
27. Ulichney, R.: Digital Halftoning. MIT Press (1987)
28. Van De Ville, D., Blu, T., Unser, M., Philips, W., Lemahieu, I., Van de Walle, R.: Hex-splines: a novel spline family for hexagonal lattices. IEEE Transactions on Image Processing 13(6), 758–772 (2004)
Chapter 6
Graph Image Language Techniques Supporting Advanced Classification and Cognitive Interpretation of CT Coronary Vessel Visualizations

Mirosław Trzupek
AGH University of Science and Technology, Faculty of Electrical Engineering, Automatics, Computer Science and Electronics, Department of Automatics, al. A. Mickiewicza 30, 30-059 Krakow
Abstract. The aim of this chapter is to present graph image language techniques for developing a syntactic-semantic description of spatial visualizations of the coronary artery system. The proposed linguistic description makes it possible to intelligently model the examined structure and then to perform advanced classification and cognitive interpretation of the coronary arteries (automatically finding the locations of significant stenoses and identifying their morphometric diagnostic parameters). This description is formalised using ETPL(k) (Embedding Transformation-preserved Production-ordered k-Left nodes unambiguous) graph grammars, supporting the search for stenoses in the lumen of the arteries forming the coronary vascularisation. ETPL(k) grammars generate IE (indexed edge-unambiguous) graphs, which can unambiguously represent the 3D structures of heart muscle vascularisation visualised in images acquired during diagnostic examinations with spiral computed tomography.
1 Introduction
Coronary Heart Disease (CHD) is the leading cause of death in the industrialized world; early diagnosis and risk assessment are widely accepted strategies to combat it [25]. Recent years have seen a rapid development of stereovision as well as of algorithms for the visualisation and reconstruction of 3D objects, which has become noticeable in modern medical diagnostics as well [5][7][17]. It has become possible not just to reliably map the structures of a given organ in 3D, but also to accurately observe its morphology. Such modern visualisation techniques are now used in practically all types of image diagnostics as well as in many other medical problems. Below, the author discusses the results of his research on the
opportunities for using selected artificial intelligence methods to semantically analyse medical images. In particular, he presents attempts at using linguistic methods of structural image analysis to develop systems for the cognitive analysis and understanding of medical images, illustrated by the recognition of lesions in the coronary arteries of the heart. The development of such systems is aimed at supporting the early diagnostics of selected heart disorders through the automatic semantic interpretation of lesions. This goal can be achieved if the tools developed (e.g. using linguistic formalisms) allow the computer to penetrate the contents of the image, and not just its form. The image semantics produced by applying these formalisms support deeper reasoning about disease factors and the type of therapy. The problem undertaken is important because identifying and locating significant stenoses in coronary vessels is a widespread practical task. A broad range of opportunities for obtaining image data, e.g. by diagnosing the heart using computed tomography (CT), coronary artery angiography or intravascular ultrasonography (IVUS), together with the 3D reconstruction of the examined coronary vessels produced by rendering, provides huge amounts of information to the diagnostician [6][8][13][15] (fig. 1.).
Fig. 1 The 3D reconstruction of a beating heart produced using a Somatom Sensation Cardiac 64 spiral CT scanner [4]
Such a large volume of information is, on the one hand, useful, as it allows the right diagnosis to be formulated very competently; on the other hand, too much information frequently makes it difficult to take an unambiguous decision, because a data overload can be as much of a problem as a data deficit. This means that, regardless of the use of state-of-the-art, expensive medical apparatus, the images produced do not contribute to improving the accuracy of diagnoses to the extent expected when the decision was made to buy the apparatus. This paradoxical effect is mainly due to the fact that image data acquired using very advanced methods is usually assessed just visually and qualitatively by a physician. If the specialist is a doctor with high qualifications, this intelligent, although not completely formalised, interpretation method can yield splendid results. However, if the image is to be assessed by someone with less experience and possibly poorer intuition, the diagnostic decision taken may be far from the best. The steps of the visual assessment of the analysed image are frequently preceded by manual modifications of the image structure and its presentation method to improve the ability to visually assess the specific case. A large number of computer tools have been developed for this purpose, and their operation is satisfactory [9][10][17]. However, there are no similar IT tools for automatically supporting the intellectual processes of physicians when they analyse and interpret an image. Obviously, this is not about eliminating humans from the decision-making process, as this diagnostic stage requires in-depth consideration by a physician, during which he/she analyses the premises for the decision to be made and accounts for many facts that the computer could not grasp; he/she is also legally and morally responsible for the diagnosis made and the treatment undertaken based on it. However, at the stage of collecting premises for taking the decision, doctors should make better use of the opportunities offered by IT to intelligently analyse complex image data and attempt to automatically classify, interpret and even understand specific images and their components using computer technology. So far, there are no such intelligent IT systems to support the cognitive processes of doctors analysing complicated cases, although it should be noted that bolder and more productive attempts at developing them are being made [1][10][12][20][23][24], including by the author. The reason is that there are still a number of unsolved scientific and technical problems encountered by designers of intelligent cognitive systems. What is more, these problems multiply greatly if we move from computer programs supporting the analysis of data provided by relatively simple measuring devices to second-generation IT systems [6][8][15] coupled to medical diagnostic apparatus, e.g. the very 3D images of coronary vascularisation considered here (fig. 2.).
Fig. 2 The operator’s panel of the computer system coupled to SOMATOM Sensation Cardiac 64 (source: [4])
2 The Classification Problem
One of the main difficulties in developing universal, intelligent systems for medical image diagnostics is the huge variety of forms of images, both healthy and pathological, which have to be taken into account when supporting the physicians who interpret them. For this reason, the analysis should be made independent of the orientation and location of the examined structure within the image. In addition, every healthy person has an individual structure of their internal organs; everyone is somewhat different, which prohibits us from unambiguously and strictly saying what a given organ should look like, as it may look different and still be healthy (i.e. fall within the so-called physiological forms). In particular, the aforementioned varied shapes of morphological elements make it difficult to set a universal standard defining the model shape of a healthy organ, or a pathological one. All of this means that attempts to effectively assess the morphology using computer software are very complicated and frequently outright impossible, because too many cases would have to be analysed to unambiguously determine the condition of the structure being examined. Nor do classical methods of image recognition (i.e. simple classification) [9][18] produce satisfactory results, that is, a comprehensive analysis with complete recognition and interpretation of the disease symptoms sought, every time they are used to support medical diagnostics (fig. 3.).
Fig. 3 Simple pattern classification methods rely strongly on quantitative measurements and are not well suited to all problems
Consequently, it becomes necessary to introduce somewhat more advanced reasoning aimed at recognising lesions and interpreting their meanings. The author's research shows that mathematical linguistic formalisms, and in particular graph image grammars [12][23], can be quite successful in this area. However, using mathematical linguistic formalisms in the form of graph image grammars is not free of shortcomings either. One of the main difficulties is the need to define the linguistic apparatus, i.e. to develop a grammar for which deterministic syntax analysers exist that allow the lesions sought to be recognised [3][11][20][22]. Since, as a rule, it is very difficult to define an ideal, universal pattern, e.g. an image showing some model shape of a healthy or diseased organ, we are dealing with a situation in which we cannot define a complete and at the same time finite set containing all possible forms of the pathology that can occur. In addition, when there is a huge variety of shapes among the structures to be identified, their proper recognition (classification) may require defining a grammar that is very broad in terms of the number of productions introduced. This problem can be solved by using grammars with greater generating capacities; still, one has to remember that for some grammars of this type there may be problems with building deterministic syntax analysers. On the other hand, a computer using the well-known and frequently applied techniques of automatic image recognition needs such a pattern to be provided to it. This is because the information technologies applied rely to a significant extent on intuition to determine the measure of similarity between the currently considered case and such an abstract pattern, so if the shapes of the examined organs change unexpectedly as a result of a disease or individual differences, these technologies often fail. For this reason it is necessary to use advanced artificial intelligence and computational intelligence techniques that can generalise the recorded image patterns. What is particularly important is to use intelligent description methods for medical images that ignore the individual characteristics of the patient examined and the characteristics dependent on the specific form of the disease unit considered. Linguistic descriptions of this type, created using new image grammars modelling the shapes of healthy coronary vascularisation and the morphology of lesions, form the subject of the rest of this publication.
3 Stages in the Analysis of CT Images under a Structural Approach Utilising Graph Techniques
As 3D reconstructions of the coronary vascularisation can come in many shapes and be presented in various projections (angles of observation), modelling such arteries requires suitably advanced description and classification techniques. One such technique is image languages based on tree and graph formalisms [3][16][18]. Further down in this chapter, these methods will be used to discuss the basic stages in the analysis and recognition (classification) of lesions in CT scans of coronary vascularisation. Under the syntactic (structural) approach [3][11][16][18][20][21][22], a complex image is treated as a hierarchical structure made up of simpler sub-images, which can be broken down into even simpler ones until we get down to picture primitives. Then, depending on the relations between these primitives and using the appropriate formal grammar, the structure of the image can be represented as a string, a tree or a graph. Defining and identifying simple components of an image makes it possible to describe the shape of the analysed structure, and then to create a generalised, holistic description defining the shapes of the lesions sought in the item analysed. Thus, for a given image (showing a healthy structure or lesions), we obtain certain significant sequences describing the lesions of interest. The process of recognising (classifying) the image then boils down to a syntactic/semantic analysis (parsing) whose purpose is to determine whether the analysed input string is an element of the language generated by the grammar (fig. 4.).
Fig. 4 Diagram of a syntactic image recognition (classification) system
This publication presents the use of graph grammars, as they are a more robust tool for describing images than sequential or tree grammars. The basic assumption behind these methods is that it is possible to define a mechanism generating graph representations of the images considered. This mechanism is the appropriate graph grammar, and the set of all graph representations of images that it generates is treated as a certain language. We therefore have to build an automaton recognising elements of this language. This automaton, or more exactly its software implementation, the syntactic analyser (parser), is responsible for the recognition procedure and allows the image description written in the proposed language to be converted into a description reaching down to the semantic sphere, enabling all material medical facts associated with the examined image to be understood. Creating a graph model of the 3D structure of the analysed vessels and its linguistic description makes it possible for the computer to analyse the structure obtained in order to automatically detect the location of a stenosis, its extent and its type (concentric or eccentric). This representation yields a brief, unambiguous description of all elements of the vascular structure, thus supporting further reasoning about its correct function or functional irregularities. In the future, it will be possible to combine this type of description with haemodynamic modelling of the flow in vessels affected by the disease, and this will help to link the lesions found in the morphology of coronary vessels with pathological blood supply to, and hypoxia of, individual fragments of the heart muscle. In addition, using such semantic descriptions in integrated modules of intelligent medical diagnostics systems can help in the early detection of pathological stenoses leading to heart hypoxia or anginas. The road to finding the appropriate languages necessary to describe the semantic part of 3D coronary vascularisation reconstructions is long and hard. The description of semantically important aspects of an image or its part cannot depend on details which are of secondary significance from the point of view of understanding the image contents and which thus produce an information surplus that contributes nothing to the final assessment of the given image. This is why, apart from developing linguistic description methods for 3D images and coming up with intelligent methods of using experts' knowledge, it makes sense to use pre-processing and analysis techniques suited to the specific nature of 3D medical images. In this research, attempts were made to find methods of extracting and describing features of medical images that ignore the individual features characteristic of the patient examined, and are instead geared towards extracting and correctly representing the morphological features significant for understanding the pathology portrayed in the image. Only correctly identified elements of the image and their interrelations, as well as suitably selected components of the descriptions of these elements, can form the basis for writing the linguistic description of the image, which then, at the parsing stage, enables the semantics to be analysed and symptoms of the disease to be detected. The above elements of the 3D description, treated as letters of the alphabet (symbols) later used to build certain language formulas, must in particular be geared towards detecting lesions, allowing these lesions not only to be located but also their essence to be interpreted and their medical significance defined.
4 Parsing Languages Generated by Graph Grammars
The use of graph grammars to describe 2D or 3D images is mentioned in many scientific publications [10][16][18][20]. By contrast, publications dealing with the syntactic analysis of 3D images are sparse. This is due to the computational complexity of the parsing problem, which for the overwhelming majority of graph grammar classes is NP-complete [3][16][18]. As the methodology of recognising a specific type of images should be usable in practical applications, the grammar used for the description and then the recognition (classification) should ensure effective parsing. In this study, an ETPL(k) (Embedding Transformation-preserved Production-ordered k-Left nodes unambiguous) graph grammar has been proposed, because this class offers a strong descriptive capacity and a known, effective parsing algorithm of polynomial O(n²) complexity [3][16][18]. These grammars constitute a sub-class of edNLC (edge-labelled directed Node-Label Controlled) graph grammars, which represent a given image using EDG (indexed edge-unambiguous) graphs [3][16][18]. ETPL(k) grammars generate IE (indexed edge-unambiguous) graphs, i.e. graphs with oriented and labelled edges as well as indexed vertices, allowing images to be represented unambiguously and without deformations. However, distortions can sometimes occur at the image pre-processing stage (e.g. picture primitives or their interrelations are incorrectly located), in consequence preventing the further analysis of such a case, because a standard parser treats such an image as one not belonging to the language generated by the given graph grammar. This problem can be solved by defining a probabilistic model for the recognised image using random IE graphs, as proposed in [16]. The most difficult job, particularly for graph grammars, is to design a suitable parser. In a structural analysis of graph representations, the parser automatically provides the complete information defining the 3D topology of the analysed graph. The difficulty in implementing a syntactic analyser stems from the lack of ready grammar compilers like those available for context-free grammars [20], which means that the syntactic analysis procedures have to be derived independently. For the ETPL(k) graph grammar presented here, an effective parsing algorithm of polynomial complexity is known [3]. This allows us to develop very productive analysers that verify whether the analysed graph representations constitute elements of the language defined by the graph grammar introduced. A significant benefit of using an ETPL(k) graph grammar is the possibility of introducing derivational rules with simple semantic actions, which additionally makes it possible to determine significant morphometric parameters of the analysed 3D reconstructions of coronary arteries. The entire syntactic/semantic analysis is carried out in polynomial time, both for unambiguously defined patterns and for fuzzy, ambiguous patterns, as the above grammar classes can be extended into probabilistic forms [16]. This is a very desirable property, particularly when it is necessary to analyse cases not considered before.
5 Picture Grammars in Classification and Semantic Interpretation of 3D Coronary Vessel Visualisations

5.1 Characteristics of the Image Data
Research work was conducted on images from diagnostic examinations made using 64-slice spiral computed tomography [4], in the form of animations saved as AVI (MPEG4) files at a resolution of 512×512 pixels. Such sequences were obtained for various patients during diagnostic examinations of the heart and present very clearly all the morphologic changes of individual sections of arteries in any plane. The coronary vessels were visualised without the accompanying muscle tissue of the heart.
5.2 Preliminary Analysis of 3D Coronary Vascularisation Reconstructions
To enable the creation of linguistic representations of 3D reconstructions of coronary vascularisation, the images showing the examined coronary arteries undergo a series of operations as part of the image pre-processing stage. The first step of this preliminary analysis is segmentation, which allows areas meeting certain homogeneity criteria (e.g. brightness, colour, texture) to be delimited, and usually boils down to distinguishing the individual objects making up the image. In this case, segmentation consists in extracting the coronary arteries while obscuring needless elements of the image background, so that later it is possible to span the appropriate graph modelling the analysed structure. Importantly, this pre-processing stage is executed using dedicated software integrated with the CT scanner [4], which allows high-quality images of the coronary vascularisation of the examined patient, free from marginal elements, to be obtained. For this reason we can skip this pre-processing stage and focus straight away on the morphology of the examined arteries. This is an indisputable advantage of computed tomography over other diagnostic methods for acquiring these types of images, in which one cannot avoid using advanced segmentation techniques, and the resulting images may still contain fewer details and be of worse quality in the end. The heart vascularisation reconstructions analysed here were acquired using a SOMATOM CT scanner [4]. This apparatus offers a number of functionalities for data acquisition and for creating 3D reconstructions on its basis. It also has predefined procedures built in, which allow the vascularisation to be quickly extracted from the visible structures of the cardiac muscle. Since the image data has been saved in the form of animations showing the coronary vessels in various projections, for further analysis we should select the projection which shows the examined coronary vessels in the most transparent form, most convenient for description and interpretation. In our research we have attempted to automate the procedure of finding such a projection by using selected geometric transformations during image processing. Using the fact that the spatial layout of an object can be determined by projecting it onto the axes of the Cartesian coordinate system, the values of horizontal Feret diameters [19], which are a measure of the horizontal extent of the diagnosed coronary artery tree, are calculated for every subsequent animation frame during the image rotation (fig. 5.).
Fig. 5 The projection of the coronary arteries with the longest Feret diameter, obtained from an animation stored in the MPEG4 format
The projection for which the horizontal Feret diameter is the greatest is selected for further analyses, as this visualisation shows both the right and the left coronary artery in the most convenient view. In a small number of the analysed images, regardless of selecting the projection with the longest horizontal Feret diameter, vessels may obscure one another in space, which causes a problem at subsequent stages of the analysis. The best way to avoid this would be to use advanced techniques for determining mutually corresponding elements in every subsequent animation frame based on the geometric relations in 3D space.
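A sketch of this selection step, assuming each animation frame has already been reduced to a binary vessel mask (Python for illustration; the function names are hypothetical):

```python
import numpy as np

def horizontal_feret_diameter(mask):
    """Horizontal Feret diameter of a binary vessel mask: the extent of
    its projection onto the horizontal axis, in pixels."""
    cols = np.flatnonzero(mask.any(axis=0))
    return 0 if cols.size == 0 else int(cols[-1] - cols[0] + 1)

def select_projection(frames):
    # Choose the rotation frame that spreads the artery tree widest,
    # i.e. shows both coronary arteries most conveniently.
    return max(frames, key=horizontal_feret_diameter)
```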
5.3 Graph-Based Linguistic Formalisms in Spatial Modelling of Coronary Vessels
As the structure of coronary vessels is characterised by three basic types of artery distribution on the heart surface, the proposed methods should cover three basic cases: balanced artery distribution, right artery dominance and left artery dominance. In further considerations we will focus on the balanced distribution of coronary arteries, which is the type most frequently seen by diagnosticians (60-70% of all cases) [2]. To represent the examined structure of coronary vascularisation with a graph, it is necessary to define the primary components of the analysed image and their spatial relations, which will serve to extract and suitably represent the morphological characteristics significant for understanding the pathology shown in the image. It is therefore necessary to identify the individual coronary arteries and their mutual spatial relations. To ease this process, the projection selected for analysis was skeletonised. This made it possible to obtain the centre lines of the examined arteries, which are equidistant from their external edges and one unit wide (fig. 6.). This gives us the skeleton of each artery, which is much thinner than the artery itself but fully reflects its topological structure. Of the several skeletonising algorithms used to analyse medical images, the Pavlidis skeletonising algorithm [14] turned out to be one of the best. It generates regular, continuous skeletons with a central location and one unit width; it also leaves the fewest apparent side branches in the skeleton, and the lines generated during the analysis are only negligibly shortened at their ends. Skeletonising is aimed only at making it possible to find branching points in the vascularisation structures and then to introduce an unambiguous linguistic description of the individual coronary arteries and their branches. Lesions will be detected in the representation defined this way, even though their morphometric parameters have to be determined based on the image showing the appropriate vessel, and not just its skeleton. The centre lines of the analysed arteries produced by skeletonisation are then searched for informative points, i.e. points where artery sections intersect or end. These points will constitute the vertices of the graph modelling the spatial structure of the coronary vessels of the heart. The next step is to label them by giving each located informative point the appropriate label from the set of vertex labels. In the case of terminal points, the set of vertex labels comprises the abbreviated names of the arteries found in coronary vascularisation, as defined in Table 1.

Fig. 6 Coronary vascularisation projection and its skeleton produced using the Pavlidis skeletonising algorithm

Table 1 The set of vertex labels
For the left coronary artery:
LCA - left coronary artery
LAD - anterior interventricular branch (left anterior descending)
CX - circumflex branch
L - lateral branch
LM - left marginal branch

For the right coronary artery:
RCA - right coronary artery
RM - right marginal branch
PI - posterior interventricular branch
RP - right posterolateral branch
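The skeletonisation and informative-point search can be sketched as follows. scikit-image's `skeletonize` is substituted here for the Pavlidis algorithm used in the chapter (a readily available stand-in, not the authors' implementation):

```python
import numpy as np
from skimage.morphology import skeletonize  # stand-in for Pavlidis thinning

def informative_points(vessel_mask):
    """Return the skeleton plus its end points and branching points."""
    skel = skeletonize(vessel_mask > 0)
    ends, branches = [], []
    for y, x in zip(*np.nonzero(skel)):
        # Number of skeleton pixels in the 8-neighbourhood (centre excluded).
        nb = int(skel[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2].sum()) - 1
        if nb == 1:
            ends.append((y, x))        # terminal point of an artery section
        elif nb >= 3:
            branches.append((y, x))    # branching point
    return skel, ends, branches
```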
If a given informative point is a branching point, the vertex is labelled with the concatenation of the vertex labels of the arteries which begin at this point. In this way, all initial and final points of the coronary vessels, as well as all points where main vessels branch into lower-level vessels, have been determined and labelled appropriately. After this operation, the coronary vascularisation tree is divided into sections which constitute the edges of the graph modelling the examined coronary arteries. This makes it possible to formulate a description in the form of edge labels which determine the mutual spatial relations between the primary components, i.e. between the subsequent arteries shown in the analysed image. These labels have been assigned according to the following system. The mutual spatial relations that may occur between elements of the vascular structure represented by the graph are described by the set of edges. The elements of this set have been defined by introducing the appropriate spatial relations, vertical, defined by the set of labels {α, β,…, μ}, and horizontal, defined by the set of labels {1, 2,…, 24}, on a hypothetical sphere surrounding the heart muscle. These labels designate individual angular intervals, each with a spread of 15°. Then, depending on the location, terminal edge labels are assigned to all branches identified by the beginnings and ends of the appropriate sections of the coronary arteries. The presented methodology draws upon the method of determining the location of a point on the surface of our planet in the system of geographic coordinates, where a similar cartographic projection is used to make topographic maps. The application of this methodology to determine the spatial relations for the analysed projection is shown below (fig. 7.).
Fig. 7 Procedure of identifying spatial relations between individual coronary arteries
To determine the appropriate label for a vector W, its beginning should be placed at the zero point of the coordinate system, and then the location of its terminal point should be established. For this purpose, two angles have been defined: the azimuth angle A, which identifies the location of the given point as a result of rotation around the vertical axis, and the elevation angle E, which identifies the elevation of the given point above the horizon. This representation of the mutual spatial relations between the analysed arteries yields convenient access to an unambiguous description of all elements of the vascular structure. At subsequent analysis stages, this description is formalised using the ETPL(k) graph grammars defined in [3][16][18], supporting the search for stenoses in the lumen of the arteries forming the coronary vascularisation. ETPL(k) grammars generate a language L(G) in the form of IE graphs, which can unambiguously represent the 3D structures of heart muscle vascularisation visualised in images acquired during diagnostic examinations with spiral computed tomography. The formal definition of the IE graph [3][16][18] is quoted below:

H = (V, E, Σ, Γ, φ)
where:
V is a finite, non-empty set of graph nodes with unambiguously assigned indices,
Σ is a finite, non-empty set of node labels,
Γ is a finite, non-empty set of edge labels,
E is a set of graph edges of the form (v, λ, w), where v, w ∈ V, λ ∈ Γ, and the index of v is smaller than the index of w,
φ: V → Σ is a node-labelling function.

Before defining the representation of the analysed image in the form of IE graphs, we introduce the following order relation in the set Γ of edge labels: 1 ≤ 2 ≤ 3 ≤ … ≤ 24 and α ≤ β ≤ γ ≤ … ≤ μ. We then index all vertices according to the ≤ relation on the labels of the edges connecting the main vertex, indexed 1, to the adjacent vertices, indexing them in ascending order (i = 2, 3, …, n). After this operation, every vertex of the graph is unambiguously assigned an index, which is later used in the syntactic analysis of the examined graph representations. IE graphs generated using the presented methodology, modelling the analysed coronary vascularisation together with their characteristic descriptions (in tables), are presented in the figure below (fig. 8.).
Fig. 8 The representation of the right and the left coronary artery using IE graphs
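A sketch of how the edge labels and the IE-graph bookkeeping might be coded. The 15° quantisation follows the description above, but the exact azimuth/elevation conventions and the extent of the Greek-letter range are assumptions made for illustration only:

```python
import math

HORIZONTAL = [str(i) for i in range(1, 25)]          # {1, ..., 24}, 15 deg each
VERTICAL = [chr(c) for c in range(0x03B1, 0x03BD)]   # {α, ..., μ}, 15 deg each

def edge_label(w):
    """Quantise the direction of vector W = (x, y, z) between two
    informative points into a horizontal + vertical label, e.g. '15ι'."""
    x, y, z = w
    azimuth = math.degrees(math.atan2(y, x)) % 360.0                  # angle A
    elevation = math.degrees(math.atan2(z, math.hypot(x, y))) + 90.0  # angle E
    return (HORIZONTAL[min(int(azimuth // 15.0), 23)]
            + VERTICAL[min(int(elevation // 15.0), 11)])

# An IE graph: indexed, labelled nodes plus edges (v, label, w) with
# index(v) < index(w); the node labels here are only an example.
ie_graph = {
    "nodes": {1: "ST", 2: "RCA"},
    "edges": [(1, edge_label((0.9, 0.3, -0.2)), 2)],
}
```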
The graph structures created in this way form elements of a graph language defining the spatial topology of the heart muscle vascularisation, including its possible morphological changes. Formulating a linguistic description for determining the semantics of the lesions sought and identifying (locating) pathological stenoses supports the computer analysis of the obtained structure in order to automatically detect the number of stenoses, their location, type (concentric or eccentric) and extent. For IE graphs defined as above, in order to locate the places where stenoses occur in the case of a balanced artery distribution, the graph grammar may take the following form:

a) for the right coronary artery:

GR = (Σ, Δ, Γ, P, Z)

Σ = {ST, RCA, RM, RP_PI, RP, PI, C_Right, C_Right_post_int} is a finite, non-empty set of node labels
Δ = {ST, RCA, RM, RP_PI, RP, PI} is the set of terminal node labels
Γ = {15ι, 6ο, 10ο, 4λ, 10π, 12ξ} is a finite, non-empty set of edge labels

The start graph Z and the set of productions P are shown in fig. 9.
Fig. 9 Start graph Z and set of productions for grammar GR
b) for the left coronary artery:

GL = (Σ, Δ, Γ, P, Z)

Σ = {ST, LCA, L_LAD, CX, L, LAD, C_Left, C_Left_lad_lat}
Δ = {ST, LCA, L_LAD, CX, L, LAD}
Γ = {7κ, 2ο, 12ο, 14ξ, 12μ, 17μ, 15ν}

The start graph Z and the set of productions P are shown in fig. 10.
Fig. 10 Start graph Z and set of productions for grammar GL
This way, we have defined a mechanism in the form of ETPL(k) graph grammars which creates a linguistic representation of each analysed image in the form of an IE graph. The set of all image representations generated by a grammar is treated as a certain language; consequently, we can build a syntax analyser based on the proposed graph grammar which will recognise elements of this language. The syntax analyser is the actual program that recognises the changes sought in the lumen of the coronary arteries.
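Before the parser itself, a grammar of this kind can be encoded as data. The scaffold below mirrors the definition of GR given above; the productions P and the start graph Z appear only graphically in fig. 9, so they are left as placeholders rather than guessed:

```python
# Scaffold for grammar GR (balanced distribution, right coronary artery).
GR = {
    "node_labels": {"ST", "RCA", "RM", "RP_PI", "RP", "PI",
                    "C_Right", "C_Right_post_int"},           # Sigma
    "terminals": {"ST", "RCA", "RM", "RP_PI", "RP", "PI"},    # Delta
    "edge_labels": {"15ι", "6ο", "10ο", "4λ", "10π", "12ξ"},  # Gamma
    # Each production: (left-hand nonterminal, replacement subgraph,
    # embedding transformation); the contents are given in fig. 9.
    "productions": [],
    "start_graph": None,   # the start graph Z, also given in fig. 9
}
```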
5.4 Detecting Lesions and Constructing the Syntactic Analyser
The representation of the analysed image using graph structures, presented in the previous subsection and of key importance for syntactic methods, constitutes the preliminary stage of the recognition process. Defining the appropriate mechanism in the form of a graph grammar G yielded IE graph representations of the analysed images, and the set of all image representations generated by this grammar is treated as a certain language. The most important stage, which corresponds to the recognition procedure for the pathologies found, is the implementation of a syntactic analyser allowing the analysis to be carried out using cognitive resonance [20], which is key to understanding the image. This is the hardest part of the entire recognition process, particularly for graph grammars. One of the procedures in algorithms for parsing IE graphs with ETPL(k) graph grammars is the comparison of the characteristic descriptions of subsequent vertices of the analysed IE graph and of the derived IE graph which is to lead to generating the analysed graph. A one-pass generation-type parser carries out the syntactic analysis, at every step examining the characteristic description of the given vertex. If the vertices of the analysed and the derived graphs are terminal, their characteristic descriptions are compared; if the vertex of the derived graph is non-terminal, a production is searched for after whose application the characteristic descriptions are consistent. The numbers of the productions used during parsing form the basis for classifying the recognised structure. This methodology makes use of the theoretical aspects of syntactic analysis for ETPL(k) grammars described in [3][16][18]. Since three different types of topologies, characteristic of these vessels, can be distinguished in visualisations of coronary vascularisation, an appropriate ETPL(k) graph grammar can be proposed for each of the three types of topology. Each grammar generates a language of IE graphs modelling a particular type of coronary vascularisation. This representation was then subjected to a detailed analysis to find the places of morphological changes indicating the occurrence of pathology. This operation consists of several stages and uses, among others, context-free sequential grammars [11][20]. The subsequent steps of the analysis, on the example of the coronary arteries, are shown in fig. 11. The arteries between vertices ST1 – RCA2 and L_LAD3 – LAD6, represented by the edges 15ι and 17μ of the IE graph, have been subjected to the straightening transformation [11], which yields the width diagrams of the analysed arteries while preserving all their properties, including potential changes in morphology. In addition, such a representation allows the nature of the narrowing (concentric or eccentric) to be determined. Concentric stenoses appear on a cross-section as a uniform stricture of the whole artery and present symptoms characteristic of stable disturbances of heart rhythm, whereas eccentric stenoses occur on only one vascular wall and characterise an unstable angina pectoris [2]. The analysis of morphological changes was conducted on the basis of the obtained width diagrams, using context-free attributed grammars [11][20]. As a result of these operations, profiles of the analysed coronary arteries were obtained with the areas of existing pathologies marked, together with numerical values of their advancement level (fig. 11.). The methodology presented above was applied sequentially to the individual sections of coronary vascularisation represented by the particular edges of the introduced graph representation.
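The width-diagram analysis can be illustrated with a simple profile scan (a sketch only: the 50% threshold and the median reference are illustrative choices, not the chapter's attributed-grammar criteria):

```python
import numpy as np

def find_stenoses(widths, threshold=0.5):
    """Locate narrowed sections in an artery width diagram.

    `widths` holds one lumen width per point along the straightened
    centre line; returns (start, end, percent reduction) per stenosis.
    """
    w = np.asarray(widths, dtype=float)
    reference = np.median(w)              # crude healthy-lumen estimate
    narrowed = w < threshold * reference
    stenoses, start = [], None
    for i, flag in enumerate(np.append(narrowed, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            reduction = 100.0 * (1.0 - w[start:i].min() / reference)
            stenoses.append((start, i - 1, reduction))
            start = None
    return stenoses
```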
5.5 Selected Results In order to determine the operating efficiency of the proposed methods, a set of test data, namely visualisations obtained during diagnostic examinations using 64slice spiral computed tomography was used. This set consisted of 20 complete reconstructions of coronary vascularisation (table 2.) obtained during diagnostic examinations of various patients, mainly suffering from coronary heart disease at different progression stages. The test data also consisted of visualisations previously used to construct the grammar and the syntactic analyser. However, to avoid analysing identical images, from the same sequences frames were selected that
Graph Image Language Techniques Supporting Advanced Classification
105
Fig. 11 Next steps in the analysis and recognition of morphological changes occurring on the example of the right (a) and the left (b) coronary artery
However, to avoid analysing identical images, frames were selected from the same sequences several frames later than the projections used to construct the set of grammatical rules, and these later frames were used for the analysis. Owing to the different types of coronary vascularisation topologies, the set of analysed images was composed as follows:

Table 2 Number of images with particular coronary artery topologies
Balanced artery distribution: 9
Right artery dominant: 6
Left artery dominant: 5
The structure of the coronary vascularisation was determined by a diagnostician at the stage of image data acquisition. This distinction was intended to provide additional information about the health risk to the patient, which depends on the place where the pathology occurs and on the type of coronary vascularisation (e.g. a stenosis occurring in the left coronary artery constitutes a greater threat to the health of patients with a left artery dominant structure than to patients with a right artery dominant structure). The above set of image data was used to determine the percentage of correct recognitions of the stenoses present, using the methodology proposed here. The recognition consists in identifying the locations of the stenoses, their number, extent and type (concentric or eccentric). For the research data included in the experiment, 85% of recognitions were correct. This value is the proportion of images in which the occurring stenoses were correctly located, measured and properly interpreted, relative to the number of all analysed images in the experimental data set. No major differences in effectiveness were observed across the different structures of the coronary vascularisation. Figure 12 shows a CT image of coronary vascularisation together with a record describing the pathological changes occurring in the right and left coronary arteries. In order to assess whether the size of the stenosis was correctly measured, we used comparative values from the syngo Vessel View software forming part of the HeartView CI suite [4]. This programme is used in everyday clinical practice where examinations are made with the SOMATOM Sensation Cardiac 64 tomograph [4]. In order to confirm or reject the correctness of the stenosis type determination (concentric or eccentric) shown in the examined image, we decided to use a visual assessment, because the aforementioned programs did not have this functionality implemented. As the set of test data was small (a dozen or so elements), the results obtained are very promising, and this effectiveness is due, among other things, to the strong generalising properties of the algorithms applied. Further research on improving the presented techniques for analysing lesions in the morphology of coronary vessels might bring a further improvement in effectiveness and the future standardisation of these methods, obviously after they have first been tested on a much more numerous image data set.
Fig. 12 The result of the analysis in the search for pathological changes occurring in the CT image of coronary arteries
The results obtained in the research conducted show that graph languages for describing shape features can be effectively used to describe 3D reconstructions of coronary vessels and also to formulate semantic descriptions of lesions found in these reconstructions. Such formalisms, owing to their significant descriptive power (characteristic especially of graph grammars), can create models both of examined vessels whose morphology shows no lesions and of those with visible lesions bearing witness to early or more advanced stages of ischemic heart disease. By introducing the appropriate spatial relations into the coronary vessel reconstruction, it is possible to reproduce their biological role, namely the blood distribution within the whole coronary circulation system, which also facilitates locating lesions and determining their progression stage. All of this makes up a process of automatically understanding the examined 3D structure, which allows us to provide the physician with far more, and far more valuable, premises for
his/her therapeutic decisions than we could if we were using the traditional image recognition paradigm.
6 Conclusions Concerning the Advanced Classification and Cognitive Interpretation of CT Coronary Vessel Visualizations

The research so far has shown that one of the hardest tasks leading to the computer classification, and then the semantic interpretation, of medical visualisations is to create suitable representations of the analysed structures and to propose effective algorithms for reasoning about the nature of lesions found in these images. Visualisations of coronary vascularisation are difficult for computers to analyse due to the variety of projections of the arteries examined. The methods developed made use of ETPL(k) graph grammars and the IE graphs they generate as formalisms modelling coronary vascularisation structures. Such solutions make it possible to detect all symptoms important from the diagnostic point of view, appearing as various types of stenoses of the coronary arteries of the heart. An important stage in achieving this goal is to construct the right syntactic/semantic analysis procedure (i.e. a parser) which, when analysing graph representations created for 3D reconstructions of coronary vascularisation, automatically provides complete information defining the 3D topology of the analysed graph that describes the coronary vessels, including their individual components. Difficulties in implementing a syntactic analyser are due to the lack of ready-made (i.e. software) grammar compilers like those available for context-free grammars, for instance [11][20]. This means that syntactic analysis procedures have to be derived independently for the proposed grammars. A significant benefit of using ETPL(k) graph grammars is the possibility of extending them to the form of probabilistic grammars and the ability to introduce derivational rules with simple semantic actions [11][16]. This makes it possible, in addition, to determine significant morphometric parameters of the analysed spatial reconstruction of coronary vessels. Carrying out the semantic actions assigned to individual productions generates values or information resulting from the completed syntactic analysis. In the case of analyses of 3D coronary vascularisation reconstructions, the semantic actions of some productions will be aimed at determining numerical values defining the location and degree of the stenosis as well as its type (concentric or eccentric). These parameters will then be utilised as additional information useful in recognising doubtful or ambiguous cases or symptoms. Even though every analysed image has a different representation describing its morphology, the syntactic analysis is conducted based on the defined grammars, and this reduces the number of potential cases to the recognition and inclusion of the examined structure in one of the classes representing individual categories of disease units. If the completed analysis does not end in recognising any pathological symptom, the analysing system reports that the analysed structure is a healthy one. If symptoms which are ambiguous from the medical point of view are
recognised, the final diagnosis can only be made by applying additional analysing algorithms. The development of new methods of computer analysis and detection of stenoses inside coronary arteries helps not only to significantly improve diagnostic actions, but also greatly broadens the application spectrum of artificial intelligence in the computer understanding of diagnostic images and in determining the medical significance of the pathologies shown in them. The linguistic formalisms developed also add new types of grammars and their applications to the fields of artificial intelligence and image recognition. Such techniques are of major importance as they allow lesions not just to be recognised but also to have their semantics defined, which in the case of medical diagnostic images can lead to the computer understanding their significance. This is of key importance for detailing the best therapeutic possibilities, and if the proposed methods are perfected, they can significantly improve the ability to support the early recognition and diagnostics of heart lesions. This is of practical significance, as the identification of the locations of stenoses in coronary vessels is performed very widely, but manually, by an operator or a diagnostician; as the research has shown, such key stages of the diagnostic process based on the analysis of 3D images can, in the future, be successfully executed by an appropriately designed computer system. It is also worth noting that the methods presented in this publication are not just an attempt at assigning the examined image to an a priori defined class, but are also an attempt to imitate and automate the human process of the medical understanding of the significance of a shape found in the analysed image. This approach allows numerous medical conclusions to be drawn from the examined image, which, in particular, can lead to making the right diagnosis and recommending a specific type of therapy depending on the shape and location of the pathological stenosis described in the sets of productions. There is a deep analogy between the operation of the structural analysis model and the cognitive interpretation mechanisms occurring in the human mind. The analogy consists in using the interference between the expectations (knowledge collected in the set of productions in the form of graph grammatical rules) and the stream of data coming from the system analysing the examined image. This interference is characteristic of the human visual perception model [20]. Problems related to automating the process of generating new grammars for cases not included in the present language remain unsolved in the on-going research; it is worth noting that, in general, the problem of deriving grammatical rules is considered unsolvable, particularly for graph grammars. Such a case can appear if the image undergoing the analysis shows a coronary vascularisation structure different from the three most often occurring vessel topologies assumed so far, i.e. the balanced distribution of arteries, the dominant right artery or the dominant left artery. In those cases it will be necessary to define a grammar taking this new case into account. The processes of creating new grammars and enriching existing ones with new description rules will be pursued in further research on the presented methods.
Another planned element of further research is to focus on using linguistic artificial intelligence methods to create additional, effective mechanisms which can be used for indexing and quickly finding specialised image data in medical databases. Such searches
will use semantic keys and will allow finding cases meeting specified substantive conditions related to image contents. This can significantly contribute to solving at least some of the problems of intelligently archiving this type of data and of finding image data fulfilling semantic criteria defined by example image patterns from medical multimedia databases.

Acknowledgments. This work has been supported by the National Science Centre, Republic of Poland, under project number N N516 478940.
References
[1] Breeuwer, M., Johnson, P., Kouwenhoven, K.: Analysis of volumetric cardiac CT and MR image data. MEDICAMundi 47(2), 41–53 (2003)
[2] Faergeman, O.: Coronary Artery Disease. Elsevier Science B.V. (2003)
[3] Flasiński, M.: On the parsing of deterministic graph languages for syntactic pattern recognition. Pattern Recognition 26, 1–16 (1993)
[4] SOMATOM Sensation Cardiac 64 Brochure, Get the Entire Picture. Siemens Medical (2004)
[5] Higgins, W.E., Reinhardt, J.M.: Cardiac image processing. In: Bovik, A. (ed.) Handbook of Video and Image Processing, pp. 789–804. Academic Press (2000)
[6] Katritsis, D.G., Pantos, I., Efstathopoulos, E.P., et al.: Three-dimensional analysis of the left anterior descending coronary artery: comparison with conventional coronary angiograms. Coronary Artery Disease 19(4), 265–270 (2008)
[7] Lewandowski, P., Tomczyk, A., Szczepaniak, P.S.: Visualization of 3-D Objects in Medicine – Selected Technical Aspects for Physicians. Journal of Medical Informatics and Technologies 11, 59–67 (2007)
[8] Meijboom, B.W., Van Mieghem, C.A., Van Pelt, N., et al.: Comprehensive Assessment of Coronary Artery Stenoses: Computed Tomography Coronary Angiography Versus Conventional Coronary Angiography and Correlation With Fractional Flow Reserve in Patients With Stable Angina. Journal of the American College of Cardiology 52(8), 636–643 (2008)
[9] Meyer-Baese, A.: Pattern Recognition in Medical Imaging. Elsevier-Academic Press (2003)
[10] Ogiela, M.R., Tadeusiewicz, R.: Modern Computational Intelligence Methods for the Interpretation of Medical Images. Springer, Heidelberg (2008)
[11] Ogiela, M.R., Tadeusiewicz, R.: Syntactic reasoning and pattern recognition for analysis of coronary artery images. Artificial Intelligence in Medicine 26, 145–159 (2002)
[12] Ogiela, M.R., Tadeusiewicz, R., Trzupek, M.: Picture grammars in classification and semantic interpretation of 3D coronary vessels visualisations. Opto-Electronics Review 17(3), 200–210 (2009)
[13] Oncel, D., Oncel, G., Tastan, A., Tamci, B.: Detection of significant coronary artery stenosis with 64-section MDCT angiography. European Journal of Radiology 62(3), 394–405 (2007)
[14] Pavlidis, T.: Algorithms for Graphics and Image Processing. Computer Science Press, Rockville (1982)
[15] Sirol, M., Sanz, J., Henry, P., et al.: Evaluation of 64-slice MDCT in the real world of cardiology: A comparison with conventional coronary angiography. Archives of Cardiovascular Diseases 102(5), 433–439 (2009)
[16] Skomorowski, M.: A Syntactic-Statistical Approach to Recognition of Distorted Patterns. Jagiellonian University, Krakow (2000)
[17] Sonka, M., Fitzpatrick, J.M.: Handbook of Medical Imaging, vol. 2: Medical Image Processing and Analysis. SPIE, Bellingham, Washington (2004)
[18] Tadeusiewicz, R., Flasiński, M.: Pattern Recognition. PWN, Warsaw (1991) (in Polish)
[19] Tadeusiewicz, R., Korohoda, P.: Computer Analysis and Image Processing. Foundation of Progress in Telecommunication, Kraków (1997) (in Polish)
[20] Tadeusiewicz, R., Ogiela, M.R.: Medical Image Understanding Technology. Springer, Heidelberg (2004)
[21] Tadeusiewicz, R., Ogiela, M.R.: Structural Approach to Medical Image Understanding. Bulletin of the Polish Academy of Sciences – Technical Sciences 52(2), 131–139 (2004)
[22] Tanaka, E.: Theoretical aspects of syntactic pattern recognition. Pattern Recognition 28, 1053–1061 (1995)
[23] Trzupek, M., Ogiela, M.R., Tadeusiewicz, R.: Image content analysis for cardiac 3D visualizations. In: Velásquez, J.D., Ríos, S.A., Howlett, R.J., Jain, L.C. (eds.) KES 2009. LNCS (LNAI), vol. 5711, pp. 192–199. Springer, Heidelberg (2009)
[24] Wang, Y., Liatsis, P.: A Fully Automated Framework for Segmentation and Stenosis Quantification of Coronary Arteries in 3D CTA Imaging. In: DeSE 2009: Second International Conference on Developments in eSystems Engineering, pp. 136–140 (2009)
[25] Yusuf, S., Reddy, S., Ounpuu, S., Anand, S.: Global burden of cardiovascular diseases, Part I: General Considerations, the Epidemiologic Transition, Risk Factors, and Impact of Urbanization. Circulation 104, 2746–2753 (2001)
Chapter 7
A Graph Matching Approach to Symmetry Detection and Analysis
Michael Chertok and Yosi Keller
Bar-Ilan University, Israel
{michael.chertok,yosi.keller}@gmail.com
Abstract. Spectral relaxation was shown to provide an efficient approach for solving a gamut of computational problems, ranging from data mining to image registration. In this chapter we show that, in the context of graph matching, spectral relaxation can be applied to the detection and analysis of symmetries in n dimensions. First, we cast the symmetry detection of a set of points in $\mathbb{R}^n$ as the self-alignment of the set to itself. Thus, by representing an object by a set of points $S \in \mathbb{R}^n$, symmetry is manifested by multiple self-alignments. Second, we formulate the alignment problem as a quadratic binary optimization problem, solved efficiently via spectral relaxation. Thus, each eigenvalue corresponds to a potential self-alignment, and eigenvalues with multiplicity greater than one correspond to symmetric self-alignments. The corresponding eigenvectors reveal the point alignment and pave the way for further analysis of the recovered symmetry. We apply our approach to image analysis by using local features to represent each image as a set of points. Last, we improve the scheme's robustness by inducing geometrical constraints on the spectral analysis results. Our approach is verified by extensive experiments and was applied to two- and three-dimensional synthetic and real-life images.
1 Introduction

Symmetry is all around us. In nature it is commonly seen in living creatures such as butterflies, in still life such as flowers, and in the unseen world of molecules. Humans have been inspired by nature's symmetry for thousands of years in countless fields. In art and architecture, in scientific fields such as mathematics, and even in humanistic fields such as philosophy, symmetry is a prominent factor. From airplanes to kitchen cups, symmetry ideas are copied from nature in many man-made objects. The presence of visual symmetry in everyday life makes its detection and analysis one of the fundamental tasks in computer vision. When dealing with computer vision applications, the detection of symmetry is mainly used to reach a more advanced level of recognition, such as object detection or segmentation. It is known that the human visual system is more prone to detect symmetric patterns in
a given scenario than any other pattern: a person will first focus attention on an object with symmetry before other objects in a picture [EWC00]. Rotational and reflectional symmetries are the most common types of symmetry. An object is said to have rotational symmetry of order K if it is invariant under rotations of $\frac{2\pi}{K}k$, $k = 0,\dots,K-1$, about a point denoted the symmetry center, whereas an object has reflectional symmetry if it is invariant under a reflection transformation about a line, denoted the reflection axis. Figure 1 presents both types of symmetry.
Fig. 1 Rotational and reflectional symmetries. (a) Rotational symmetry of order eight without reflectional symmetry. (b) Reflectional symmetry of order one.
The problem of symmetry detection and analysis has been studied by many researchers [CPML07, DG04, KCD+02, KG98, LE06, Luc04, RWY95, SIT01, SNP05]. Most of them deal with two-dimensional symmetries, while a few analyze three-dimensional data [KCD+02]. A recent survey [PLC+08] by Chen et al. found that despite the significant research effort made, there is still a need for a robust, widely applicable "symmetry detector". We propose an effective scheme for the detection and analysis of rotational and reflectional symmetries in n dimensions. The scheme is based on the self-alignment of points using a spectral relaxation technique, as proposed by Leordeanu in [LH05]. Our core contribution is to show that the symmetry of a set of points $S \in \mathbb{R}^n$ is manifested by a multiplicity of the leading eigenvalues and corresponding eigenvectors. This leads to a purely geometric symmetry detection approach that only utilizes the coordinates of the set S, and can thus be applied to abstract sets of points in $\mathbb{R}^n$. Given the eigendecomposition, we show how to recover the intrinsic properties of reflectional and rotational symmetries (center of rotation, point correspondences, symmetry axes). In our second contribution, we derive a geometrical pruning measure by representing the alignments as geometrical transforms in $\mathbb{R}^n$ and enforcing a norm constraint. This allows us to reject erroneous matchings and analyze real data, where symmetries are often embedded in clutter. In our third contribution, we
analyze the case of perfect symmetry and show that it results in a degenerate eigendecomposition. We then resolve this issue and explain why perfect symmetry rarely appears in real data such as images. The proposed scheme requires no a priori knowledge of the type of symmetry (reflection/rotation) and holds for both. Our scheme, denoted Spectral Symmetry Analysis (SSA), can detect partial symmetry and is robust to outliers. In our last contribution we apply the SSA to the analysis of symmetry in images, for which we utilize local image features [Low03, SM97] to represent images as sets of points; the image descriptors also serve to reduce the computational complexity. The chapter is organized as follows: we start by presenting the geometrical properties of symmetries in Section 2 and then survey previous results on symmetry detection, local features and spectral alignment in Section 3. Our approach to symmetry analysis is presented in Section 4 and experimentally verified in Section 5. Concluding remarks are given in Section 6.
2 Symmetries and Their Properties

In this work we study the symmetry properties of sets of points $S = \{x_i\}$, such that $x_i \in \mathbb{R}^n$. The common types of symmetries are the rotational and reflectional symmetries. Sets having only the first type are described by the cyclic group $C_K$, while others have both rotational and reflectional symmetry and are described by the dihedral group $D_K$, where K is the order of the respective symmetry. In this section we define the rotational (cyclic) and reflectional (dihedral) symmetries, denoted $C_K$ and $D_K$, respectively. By considering the subsets of points $S_I^C, S_I^D \subset S$ that are invariant under the corresponding symmetry transforms $T_{C_K}$ and $T_{D_K}$, we are able to recover the rotation centers and reflection axes. These invariant sets are shown to be related to the spectral properties of $T_{C_K}$ and $T_{D_K}$. Finally, we derive an analytical relationship between $T_{C_K}$ and $T_{D_K}$ in the two-dimensional case, which allows us to infer the rotational symmetry transform $T_{C_K}$ given two reflectional transforms $T_{D_K}$.
2.1 Rotational Symmetry

Definition 1 (Rotational symmetry). A set $S \in \mathbb{R}^n$ is rotationally symmetric with a rotational symmetry transform $T_{C_K}$ of order K if

$$\forall x_i \in S,\ \exists x_j \in S \ \text{s.t.}\ x_j = T_{C_K} x_i. \qquad (1)$$
For $S \in \mathbb{R}^2$, $T_{C_K}$ is given by

$$T_{C_K}(x,y) = \begin{pmatrix} \cos\beta_k & -\sin\beta_k & 0 \\ \sin\beta_k & \cos\beta_k & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, \qquad (2)$$
where $\beta_k = \frac{2\pi}{K}k$, $k = 0,\dots,K-1$. Thus, for a symmetry of order K, there exists a set of K transformations $\{R_{\beta_k}\}_1^K$ for which Eq. 1 holds. The use of homogeneous coordinates in Eq. 2 allows us to handle non-centered symmetries. An example of rotational symmetry is given in Fig. 1a. Equation 2 implies that

$$\det(T_{C_K}) = 1, \qquad (3)$$
where the invariant set $S_I^C$ is the center of rotational symmetry. Given a rotation operator $T_{C_K}$, the center of rotation (which is also the center of rotational symmetry) $X_c$ is invariant under $T_{C_K}$ and can be computed as the eigenvector of $T_{C_K}$ corresponding to the eigenvalue $\lambda = 1$:

$$T_{C_K} X_c = X_c. \qquad (4)$$
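Equations 2-4 are easy to verify numerically. The following is a minimal NumPy sketch (the centre coordinates are arbitrary illustrative values) that builds a non-centred rotation in homogeneous coordinates, checks Eq. 3, and recovers the rotation centre as the eigenvector of eigenvalue 1:

```python
import numpy as np

K, k = 8, 1
beta = 2 * np.pi / K * k                # rotation angle beta_k = (2*pi/K) k
cx, cy = 3.0, -2.0                      # assumed (hypothetical) symmetry centre

# Rotation about (cx, cy): translate the centre to the origin, rotate, translate back.
C = np.array([[1, 0, cx], [0, 1, cy], [0, 0, 1]], dtype=float)
R = np.array([[np.cos(beta), -np.sin(beta), 0],
              [np.sin(beta),  np.cos(beta), 0],
              [0,             0,            1]])
T = C @ R @ np.linalg.inv(C)

assert np.isclose(np.linalg.det(T), 1.0)          # Eq. 3

# Eq. 4: the centre is the eigenvector of T with eigenvalue 1.
w, V = np.linalg.eig(T)
xc = np.real(V[:, np.argmin(np.abs(w - 1.0))])
xc = xc / xc[2]                                   # normalise the homogeneous scale
print(xc[:2])                                     # ~ [3.0, -2.0]
```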
2.2 Reflectional Symmetry

Definition 2 (Reflectional symmetry). A set $S \in \mathbb{R}^n$ is reflectionally symmetric with respect to the vector (reflection axis) $(\cos\alpha_0, \sin\alpha_0)$ with a reflectional transform $T_{D_K}$, if

$$\forall x_i \in S,\ \exists x_j \in S \ \text{s.t.}\ x_j = T_{D_K} x_i, \qquad (5)$$

where for $x_i \in \mathbb{R}^2$, $T_{D_K}$ is given by

$$T_{D_K}(x,y) = \begin{pmatrix} \cos 2\alpha_0 & \sin 2\alpha_0 & 0 \\ \sin 2\alpha_0 & -\cos 2\alpha_0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}. \qquad (6)$$
A set S has reflectional symmetry of order K if there are K angles $\alpha_k$ that satisfy Eq. 5. An example of reflectional symmetry is given in Fig. 1b, where $\alpha_0$ is the angle of the reflection axis. Eq. 6 implies that

$$\det(T_{D_K}) = -1. \qquad (7)$$
Similar to the rotational symmetry case, the points on the symmetry axis form an invariant set $X_R$ that corresponds to the eigenspace of $T_{D_K}$:

$$T_{D_K} X_R = X_R. \qquad (8)$$
Conversely to Eq. 4, the eigenspace corresponding to $\lambda = 1$ is of rank 2, in accordance with $S_I^D$ being a line.
2.3 Interrelations between Rotational and Reflectional Symmetries

Theorem 1. If a set S has rotational symmetry of order K, then it either has reflectional symmetry of order K or has no reflectional symmetry at all [Cox69, Wey52]. If S has both rotational and reflectional symmetry, then the axes of reflectional symmetry are given by

$$\alpha_k = \alpha_0 + \frac{1}{2}\beta_k, \qquad k = 0,\dots,K-1, \qquad (9)$$
where $\alpha_0$ is the angle of one of the reflection axes, and $\beta_k$ are the angles of rotational symmetry.

Theorem 2. Given two distinct reflectional transforms $T_{D_1}$ and $T_{D_2}$, one can recover the corresponding rotational symmetry transform $T_{C_K}$:

$$T_{C_K} = T_{D_1} \cdot T_{D_2}. \qquad (10)$$

The proof is given in Appendix A.
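Theorem 2 can likewise be checked numerically; the sketch below (with illustrative axis angles) composes two reflections of the form of Eq. 6 and confirms that the product is a rotation by twice the angle between the axes:

```python
import numpy as np

def reflection(alpha):
    """Reflection about a line through the origin at angle alpha (Eq. 6)."""
    c, s = np.cos(2 * alpha), np.sin(2 * alpha)
    return np.array([[c,  s, 0],
                     [s, -c, 0],
                     [0,  0, 1.0]])

a1, a2 = np.deg2rad(75.0), np.deg2rad(30.0)   # assumed reflection axes
TD1, TD2 = reflection(a1), reflection(a2)

TC = TD1 @ TD2                                # Theorem 2: a rotational transform
assert np.isclose(np.linalg.det(TD1), -1.0)   # Eq. 7
assert np.isclose(np.linalg.det(TC), 1.0)     # Eq. 3: the product is a rotation
# The rotation angle is twice the angle between the two reflection axes.
angle = np.arctan2(TC[1, 0], TC[0, 0])
assert np.isclose(angle, 2 * (a1 - a2))
```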
2.4 Discussion

The geometrical properties presented above pave the way for a computational scheme for the analysis of a given set of prospective symmetry transforms $\{T_i\}$. By imposing Eqs. 3 and 7, erroneous transforms can be discarded, and the spectral analysis given in Eqs. 4 and 8 can be used to recover the center of rotational symmetry and the axis of the reflectional one. Moreover, based on Theorem 2, one can start with a pair of reflection transforms $\{T_{D_1}, T_{D_2}\}$ and recover the rotational transform $T_{C_K}$. Note that, in practice, the prospective transforms $\{T_i\}$ are computed by the spectral matching algorithm discussed in Section 4. That procedure provides an alternative geometrical approach to recovering the symmetry centers and axes; using both methods, we are able to cross-validate the results. The norm tests of the symmetry transforms (Eqs. 3 and 7) can be applied to higher dimensional data sets, as can the spectral analysis in Eqs. 4 and 8. The equivalent for three-dimensional data is given by Euler's theorem [TV98].
3 Previous Work

This section overviews previous work related to our scheme. Section 3.1 provides a survey of recent results in symmetry analysis, while Section 3.2 presents the notion of local features, which are used to represent an image as a set of salient points. A combinatorial formulation of the alignment of sets of points in $\mathbb{R}^n$ is discussed in Section 3.3, and a computationally efficient solution via spectral relaxation is
presented. The latter approach paves the way for the proposed Spectral Symmetry Analysis (SSA) algorithm presented in Section 4.
3.1 Previous Work in Symmetry Detection and Analysis

Symmetry has been thoroughly studied in the literature from theoretical, algorithmic, and applicative perspectives. Theoretical analyses of symmetry can be found in [Mil72, Wey52]. The algorithmic approaches to its detection can be divided into several categories, the first of which consists of intensity-based schemes that compute numerical moments of image patches. For instance, detection of vertical reflectional symmetry using a one-dimensional odd-even decomposition is presented in [Che01]. The authors assume that the symmetry axis is vertical and thus scan each horizontal line in the image. Each such line is treated as a one-dimensional signal that is normalized and decomposed into odd and even parts. From the odd and even parts, the algorithm constructs a target function that achieves its maximum at the point of mirror symmetry of the one-dimensional signal. When the image has a vertical symmetry axis, all symmetry points of the different horizontal lines lie along a vertical line in the image. A method that estimates the relative rotation of two patterns using the Zernike moments is suggested in [KK99]. This problem is closely related to the problem of detecting rotational symmetry in images. Given two patterns, where one pattern is a rotated replica of the other, the Zernike moments of the two images will have the same magnitude and some phase differential. The phase differential can be used to estimate the relative rotation of the two images. In order to detect large symmetric objects, such schemes require an exhaustive search over all potential symmetry axes and locations in the image, requiring excessive computation even for small images. An efficient search algorithm for detecting areas with high local reflectional symmetry, based on a local symmetry operator, is presented in [KG98]. It defines a two-dimensional reflectional symmetry measure as a function of four parameters x, y, θ, and r, where x and y are the center of the examined area, r is its radius, and θ is the angle of the reflection axis. Examining all possible values of x, y, r, and θ is computationally prohibitive; therefore, the algorithm formulates the search as a global optimization problem and uses a probabilistic genetic algorithm to find the optimal solution efficiently. A different class of intensity-based algorithms [DG04, KS06, Luc04] utilizes the Fourier transform to detect global symmetric patterns in images. The unitarity of the Fourier transform preserves the symmetry of images in the Fourier domain: a symmetric object in the intensity domain will also be symmetric in the Fourier domain. Derrode et al. [DG04] analyze the symmetries of real objects by computing the Analytic Fourier-Mellin transform (AFMT). The input image is interpolated on a polar grid in the spatial domain before computing the FFT, resulting in a polar Fourier representation. Lucchese [Luc04] provides an elegant approach to analyzing the angular properties of an image without computing its polar DFT. An angular histogram is computed by detecting and binning the pointwise zero crossings of the
difference of the Fourier magnitude in Cartesian coordinates along rays. The histogram's maxima correspond to the directions of the zero crossings. In [KS06], Keller et al. extended Lucchese's work by applying the Pseudo-Polar Fourier transform to compute algebraically accurate line integrals in the Fourier domain. The symmetry results in a periodic pattern in the line integrals, which is detected by spectral analysis (MUSIC). These algorithms are global by nature, being able to effectively detect fully symmetric images, such as synthetic symmetric patterns; yet some of them [KS06] struggle at detecting small localized symmetric objects embedded in clutter. The frequency domain was also utilized by Lee et al. in [LCL08], where Frieze expansions were applied to the input image, thus converting planar rotational symmetries into periodic one-dimensional signals whose period corresponds to the order of the symmetry. This period is estimated by recovering the maxima of the Fourier spectrum. Recent work emphasizes the use of local image features, where the local information is agglomerated to detect the global symmetry. Reisfeld et al. [RWY95] suggested a low-level operator for interest point detection where symmetry is considered a cue. This symmetry operator constructs the symmetry map of the image by computing an edge map, where the magnitude and orientation of each edge depend on the symmetry associated with each of its pixels. The proposed operator is able to process different symmetry scales, enabling it to be used in multi-resolution schemes. A related approach was presented in [LW99], where both reflectional and rotational symmetries can be detected, even under a weak perspective projection. A Hough transform is used to derive the symmetry axes from edge contours, and a refinement algorithm discards erroneous symmetry axes by imposing geometrical constraints using a voting scheme. An approach related to our work was introduced in [ZPA95], where the symmetry is analyzed as a symmetry of a set of points. For an object given by a sequence of points, the symmetry distance is defined as the minimum distance by which we need to move the points of the original object in order to obtain a symmetric object. This also defines the symmetry transform of an object as the symmetric object that is closest to the given one. This approach requires finding point correspondences, which is often difficult, and an exhaustive search over all potential symmetry axes is performed. Shen et al. [SIT01] used an affine invariant feature vector, computed over a set of interest points. The symmetry was detected by analyzing the cross-similarity matrix of these vectors; rotational and reflectional symmetries can be analyzed by finding the loci corresponding to its minima. The gradient vector flow field was used in [PY04] to compute a local feature vector. For each point, its location, orientation, and magnitude were retained. Local features in the form of Taylor coefficients of the field were computed, a hashing algorithm was then applied to detect pairs of points with symmetric fields, and a voting scheme was used to robustly identify the location of the symmetry axis.
Three-dimensional symmetry was analyzed in [KCD+02, MSHS06]. The scheme computes a reflectional symmetry descriptor that measures the amount of reflectional symmetry of 3D volumes for all planes through the center of mass. The descriptor maps any 3D volume to a sphere, where each point on the sphere represents the amount of symmetry in the object with respect to the plane perpendicular to the direction of the point. As each point on the sphere also represents an integration over the entire volume, the descriptor is resilient to noise and to small variations between objects. We show that our approach is directly applicable to three-dimensional meshes. SIFT local image features [Low03] were applied to symmetry analysis by Loy and Eklundh in [LE06]. In their scheme, a set of feature points is detected over the image, and the corresponding SIFT descriptors are computed. Feature points are then matched in pairs by the similarity of their SIFT descriptors. These local pairwise symmetries are then agglomerated in a Hough voting space of symmetry axes. The vote of each pair in the Hough domain is given by a weight function that measures the discrepancy in the dominant angles and scales [Low03] of the feature points. As the SIFT descriptors are not reflection invariant, reflections are handled by mirroring the SIFT descriptors. In contrast, our scheme is based on a spectral relaxation of the self-alignment problem and recovers the self-assignment directly. Thus, we avoid quantizing the Hough space, and our scheme can be applied as is to analyzing higher dimensional data without suffering the curse of dimensionality manifested by a density (voting) estimation scheme such as the Hough transform. Also, our scheme does not require a local symmetry measure, such as the dominant angle, and is purely geometric; it can be applied with any local feature, such as correlators and texture descriptors [OPM02]. The work of Hays et al. in [HLEL06] is of particular interest to us, as it combines the use of local image descriptors and spectral high-order assignment for translational symmetry analysis. Translational symmetry is a problem in texture analysis, where one aims to identify periodic or near-regular repeating textures, commonly known as lattices. Hays et al. propose to detect translational symmetry by detecting feature points and computing a single, high-order spectral self-alignment. The assignments are then locally pruned and regularized using thin-plate spline warping. The corresponding motion field is elastic and nearly translational, hence the term translational symmetry. In contrast, our scheme deals with rotational and reflectional symmetries, where the estimated self-alignments relate to rotational motion. The core of our work is the analysis of multiple self-assignments and their manifestation via multiple eigenvectors and eigenvalues. Moreover, based on the spectral properties of geometric transform operators, we introduce a global assignment pruning measure able to detect erroneous self-assignments. This turns out to be essential in analyzing symmetries in real images, which are often embedded in clutter.
3.2 Local Features

The use of local features is one of the cornerstones of modern computer vision. They were found to be instrumental in a diverse set of computer vision applications such as image categorization [ZMLS07], mosaicking [BL03] and tracking [TT05], to name a few. Originating from the seminal works of Cordelia Schmid [SM97] and David Lowe [Low03], local features are used to represent an image I by a sparse set of salient points $\{x_i\}$, where each point is represented by a vector of parameters $D_i$ denoted a descriptor. The salient set $\{x_i\}$ is detected by a detector. The detector and descriptor are designed to maximize the number of interest points that will be redetected in different images of the same object and reliably matched using the descriptors. The descriptor characterizes a small image patch surrounding a pixel. Due to their locality, local features are resilient to geometrical deformations and appearance changes. For instance, a globally complex geometrical deformation can be normalized locally by estimating a dominant rotation angle and characteristic scale per patch, or by estimating local affine shape moments [MS04]. The choice of the geometrical normalization measures depends on the geometrical deformation we aim to handle. The set of local features $\{D_i\}$ is then denoted the image model of a particular image.

Definition 3 (Image model). The image model M consists of the set of N interest points $S = \{x_i\}_1^N$, the corresponding set of local descriptors $\{D_i\}_1^N$ and a set of local attributes $\{\theta_i, \sigma_i\}_1^N$, where $\theta_i$ and $\sigma_i$ are the local dominant orientation and scale, respectively, of the point i. Denote $M = \{M_i\}_1^N$, where $M_i = \{x_i, d_i, \theta_i, \sigma_i\}$.

A myriad of region detectors and descriptors can be found in the literature [MTS+05], one of the most notable being David Lowe's SIFT descriptor [Low03]. Various objects might require different combinations of local detectors and descriptors [MTS+05], depending on the object's visual properties. For instance, SIFT [Low03] excels in detecting and describing naturally textured images, while large piecewise-constant objects are better detected by the affine covariant MSER [MCUP02], as shown in Fig. 2b.
Fig. 2 Feature point detectors. (a) A Hessian-based scale-invariant detector is more suitable for non structured scenes (b) MSER responds best to structured scenes.
The ellipses in the figure represent the second moment matrices of the detected regions. In contrast, non-structured objects are better characterized by affine-adapted Hessian-like detectors, as depicted in Fig. 2a. The common solution is to use multiple descriptors simultaneously [NZ06]. In the context of symmetry detection, in contrast to object recognition, the local features are all extracted from the same image. Hence, one can assume that symmetric points within the same image would respond well to the same type of local detector/descriptor. This allows us to use one detector-descriptor pair at a time.
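In code, an image model in the sense of Definition 3 is simply a list of per-point records; the following minimal sketch (names illustrative) mirrors that definition:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LocalFeature:
    x: np.ndarray      # interest point x_i (here, a 2D image coordinate)
    d: np.ndarray      # descriptor D_i (e.g., a 128-dim SIFT vector)
    theta: float       # dominant orientation theta_i
    sigma: float       # characteristic scale sigma_i

# An image model M is simply the collection {x_i, d_i, theta_i, sigma_i}.
ImageModel = list  # i.e., list[LocalFeature]
```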
3.3 Spectral Matching of Sets of Points in $\mathbb{R}^n$

Given two sets of points in $\mathbb{R}^n$, $S_1 = \{x_i^1\}_1^{N_1}$ and $S_2 = \{x_j^2\}_1^{N_2}$, where $x_j^k \in \mathbb{R}^n$, $k = 1,2$, we aim to find a correspondence map $C = \{c_{i_k j_k}\}$, such that $c_{i_k j_k}$ implies that the point $x_{i_k}^1 \in S_1$ corresponds to the point $x_{j_k}^2 \in S_2$. Figure 3 presents an example of two sets being matched. Spectral point matching was first presented in the seminal work of Scott and Longuet-Higgins [SLH91], who aligned point-sets by performing singular value decomposition on a point association weight matrix. In this work we follow a different formulation proposed by Berg et al. [BBM05] and its spectral relaxation introduced by Leordeanu et al. in [LH05]. We start by formulating a binary quadratic optimization problem, where the binary vector Y represents all possible assignments of a point $x_{i_k}^1 \in S_1$ to the points in the set $S_2$. The assignment problem is then given by:

$$Y^* = \arg\max_Y\, Y^T H Y, \qquad Y \in \{0,1\}^{N_1 N_2}, \qquad (11)$$
Fig. 3 Toy example for matching two sets of points
where H is an affinity matrix, such that $H(k_1,k_2)$ is the affinity between the matchings $c_{i_{k_1} j_{k_1}}$ and $c_{i_{k_2} j_{k_2}}$: $H(k_1,k_2) \to 1$ implies that both matchings are consistent, while $H(k_1,k_2) \to 0$ implies that the matchings are contradictory. In practice, we use

$$H(k_1,k_2) = \exp\left(-\frac{1}{\sigma}\left(d(x_{i_{k_1}}^1, x_{i_{k_2}}^1) - d(x_{j_{k_1}}^2, x_{j_{k_2}}^2)\right)^2\right), \qquad (12)$$
where σ is a scale factor. This binary quadratic optimization is known to be NP-hard [SM00], and its solution can be approximated by the spectral relaxation

$$Z^* = \arg\max_Z \frac{Z^T H Z}{Z^T Z}, \qquad Z \in \mathbb{R}^{N_1 N_2}. \qquad (13)$$
Thus, $Z^*$ is given by the eigenvector corresponding to the largest eigenvalue $\lambda_1$, as this maximizes Eq. 13. This approach can be considered as normalized-cut clustering applied to the set of correspondences $\{c_{i_k j_k}\}_1^{N_1 N_2}$: namely, we assume that the true correspondences $\{c_{i_k j_k}\}_1^M$ ($M \ll N_1 N_2$) form a tightly connected cluster under the affinity measure in Eq. 12. In [CSS07], Cour and Shi proposed a doubly-stochastic normalization of the affinity matrix H, adding an affinity constraint over the solution Z to enforce one-to-one matchings. Given the relaxed solution $Z^*$, we apply the discretization procedure given in [LH05] to derive an approximation $\hat{Y}$ to the binary vector $Y^*$. Note that since we are interested in symmetry analysis, we expect to recover multiple solutions (eigenvectors) $\{\hat{Y}_i\}_1^K$, where K is the order of symmetry. In the following section we show how to estimate K based on the eigenvalues of the affinity matrix H, and how to recover the symmetry correspondences. We then compute the symmetry centers for rotational symmetry, and the axes of reflection for reflectional symmetry.
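The following is a minimal NumPy sketch of Eqs. 11-13 for small point sets: all candidate correspondences are enumerated, the affinity matrix of Eq. 12 is built, and the relaxed solution is discretized greedily in the spirit of [LH05]. It is an illustration of the formulation, not the authors' implementation.

```python
import numpy as np

def spectral_match(S1, S2, sigma=1.0):
    """Match two point sets via the leading eigenvector of H (Eqs. 11-13)."""
    n1, n2 = len(S1), len(S2)
    cand = [(i, j) for i in range(n1) for j in range(n2)]  # candidates c_ij
    m = len(cand)
    H = np.zeros((m, m))
    for a, (i1, j1) in enumerate(cand):
        for b, (i2, j2) in enumerate(cand):
            if a == b or i1 == i2 or j1 == j2:
                continue  # keep conflicting candidates unconnected
            d1 = np.linalg.norm(S1[i1] - S1[i2])   # distance within S1
            d2 = np.linalg.norm(S2[j1] - S2[j2])   # distance within S2
            H[a, b] = np.exp(-(d1 - d2) ** 2 / sigma)   # Eq. 12
    # Relaxed solution Z* (Eq. 13): eigenvector of the largest eigenvalue.
    w, V = np.linalg.eigh(H)
    z = np.abs(V[:, -1])
    # Greedy discretization of Z* into one-to-one matchings.
    matches, used_i, used_j = [], set(), set()
    for a in np.argsort(-z):
        i, j = cand[a]
        if i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i); used_j.add(j)
    return matches

# Toy usage: S2 is a rotated copy of S1, so the identity map should be found.
rng = np.random.default_rng(0)
S1 = rng.normal(size=(6, 2))
ang = np.pi / 5
R = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
S2 = S1 @ R.T
print(spectral_match(S1, S2))
```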
4 Spectral Symmetry Analysis

In this section we apply the spectral matching to symmetry analysis and derive the spectral symmetry analysis scheme. We start in Section 4.1 by presenting a general computational approach for the detection and analysis of symmetries of sets of points in n-dimensional spaces. We then elaborate on the analysis of symmetry in two-dimensional images in Section 4.2.
4.1 Spectral Symmetry Analysis of Sets in $\mathbb{R}^n$

Given a set of points $S \in \mathbb{R}^n$ with a symmetry of order K, it follows from Section 2 that there exists a set of symmetry transformations $\{T_{C_K}\}$ and $\{T_{D_K}\}$ that map S to itself. The main issue is how to detect these multiple transformations simultaneously. In terms of numerical implementation, this implies that one has to look for a set of local solutions of the corresponding optimization problem. Most matching and alignment schemes, such as RANSAC and ICP, lock on to a single optimal alignment, and it is unclear how to modify them to search for multiple solutions. Spectral relaxation provides an elegant solution to this issue: the multiple self-alignments are manifested by the multiple maxima of the binary formulation in Eq. 11 and of its corresponding relaxation in Eq. 13. The maxima of the Rayleigh quotient in Eq. 13 are the leading eigenvalues of the affinity matrix H, and the corresponding arguments are the corresponding eigenvectors. This beautiful
property allows us to recover multiple assignments simultaneously and independently by computing the eigendecomposition of H. Such an example is given in Fig. 4, where the four self-alignments are manifested by four dominant eigenvalues. Note that the largest eigenvalue corresponds to the identity transform, which maps each point to itself. Hence, given the set of interest points $S \in \mathbb{R}^n$, we apply the spectral alignment algorithm of Section 3.3 and compute the eigendecomposition $\{\psi_i, \lambda_i\}_1^K$ of Eq. 13. The overall number of symmetry axes is given by the number of large eigenvalues K, and the correspondence maps $\{C_i\}_1^K$ are derived from the discretized binary eigenvectors $\{\psi_i\}$. As the spectral alignment matches Euclidean distances between points, it can be used to compute multiple non-parametric alignments. This implies that we do not have to predefine which symmetry type to analyze; those that do exist will be detected. But this also implies that the scheme might detect erroneous self-alignments which are unrelated to the symmetry. This phenomenon becomes evident during the analysis of symmetric sets embedded in clutter. The problem is resolved in the next section by incorporating the geometrical constraints of the symmetry transforms discussed in Section 2, which allows us to prune the erroneous self-alignments.
4.1.1 Perfect Symmetry and Spectral Degeneracy
When analyzing perfectly symmetric sets of points, multiple alignments might be manifested by the same eigenvalue, and the corresponding eigenvectors become degenerate: each eigenvector is then a linear combination of several assignments. This phenomenon never occurs with data sources other than synthetic sets of points. For instance, in images, the feature point detectors have a certain subpixel accuracy, and corresponding feature points do not create perfect symmetries. This applies to synthetic images and, even more so, to real images and three-dimensional objects, which are never perfectly symmetric. In order to generalize our approach to perfectly symmetric sets of points, we propose adding Gaussian random noise $N(0,\sigma_n)$ to the nonzero elements of the affinity matrix. This breaks the perfect symmetry, if it exists, and does not influence the analysis of regular data. In order to retain the symmetry of the affinity matrix, we add a symmetric pattern of noise. As the nonzero affinities are of $O(10^{-1})$ for a well-chosen value of σ in Eq. 12, we used $\sigma_n = 10^{-3}$.
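The symmetric perturbation takes only a few lines; a sketch, assuming an affinity matrix H as above:

```python
import numpy as np

def perturb_symmetric(H, sigma_n=1e-3, seed=0):
    """Add a symmetric Gaussian noise pattern to the nonzero affinities,
    breaking the eigenvalue degeneracy caused by perfect symmetry."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma_n, size=H.shape)
    noise = (noise + noise.T) / 2            # keep H symmetric
    return H + noise * (H != 0)              # perturb only nonzero entries
```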
4.2 Spectral Symmetry Analysis of Images

An image I is a scalar or vector function defined over $\mathbb{R}^2$. As such, the spectral matching scheme cannot be used directly, as it applies to sets of points. Hence, we turn to image modeling by means of local features, as discussed in Section 3.2. This allows us to represent the input image as a set of salient points. The rotation invariance of the detectors guarantees that corresponding symmetric points are detected as salient points simultaneously. We then present in Section 4.2.2 a scheme for pruning erroneous spectral alignments and identifying valid matchings as $C_K$ or $D_K$. Last,
given the pruned valid transforms, we show in Section 4.2.3 how to recover the intrinsic geometrical properties (center of rotation and axis of reflection for $C_K$ and $D_K$, respectively) of each detected symmetry.
4.2.1 Image Representation by Local Features
Given an input image I, we compute an image model M for each type of local detector/descriptor. A reflected replica of the features is then added to M [LE06]. This allows us to handle reflections, recalling that the local features are rotationally, but not reflectionally, invariant. The features are then progressively sampled [ELPZ97] to reduce their number to a few thousand. The progressive sampling spreads the points evenly over the image, thus reducing the number of image regions with high numbers of features, which are prone to produce local partial matches. We also utilize the dominant scale property of the local features [Low03] to prune false pairwise assignments and further sparsify the affinity matrix H. As we analyze self-alignments within the same image, corresponding features will have similar dominant scales. Hence, we prune the pairwise affinities in Eq. 12 by

$$\hat{H}(k_1,k_2) = \begin{cases} 0 & \left|\log\!\left( Sc(x_{i_{k_1}}^1) / Sc(x_{j_{k_1}}^2) \right)\right| > |\log(\Delta S)| \\ & \text{or } \left|\log\!\left( Sc(x_{i_{k_2}}^1) / Sc(x_{j_{k_2}}^2) \right)\right| > |\log(\Delta S)| \\ H(k_1,k_2) & \text{else,} \end{cases} \qquad (14)$$

where $Sc(x)$ is the dominant scale of the point x, and ΔS is a predefined scale differential. We use the $|\log(\cdot)|$ to symmetrize the scale discrepancy with respect to both points. This implies that if a pair of corresponding points is scale-inconsistent, all of its pairwise affinities are pruned. In order to effectively utilize the sparsity of the affinity matrix H, we threshold it by $T = 10^{-5}$. Namely, the affinity in Eq. 12 is always nonzero, but for geometrically inconsistent pairs the affinity is of $O(10^{-7})$, while for consistent pairs it is of $O(10^{-1})$.
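The scale test of Eq. 14 reduces to a symmetric log-ratio threshold; a sketch (the scale values and ΔS are illustrative):

```python
import numpy as np

def scale_consistent(s_i, s_j, delta_s=1.5):
    """Eq. 14 test: a candidate pair is kept only if the dominant scales
    agree up to the differential delta_s, symmetrically in both points."""
    return abs(np.log(s_i / s_j)) <= abs(np.log(delta_s))

# If the pair fails this test, all of its pairwise affinities are zeroed,
# i.e., H[a, b] = 0 for every b whenever scale_consistent(...) is False.
```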
4.2.2 Symmetry Categorization and Pruning
Given the image model M, we apply the spectral matching scheme described in Section 3.3 and derive P tentative self-alignments of the image, denoted $\{C_i\}_1^P$. Typically $P > K$, K being a typical order of image symmetry; in practice $K < 10$ and $P \approx 15$. The reason is that in real images the symmetric patterns might be embedded in clutter, resulting in the recovery of spurious self-alignments (eigenvectors) unrelated to the symmetry. To address this issue we propose an assignment pruning scheme that is based on the norm property of symmetry transforms in $\mathbb{R}^2$. Namely, by computing the projective transform corresponding to the recovered matching, and recalling that the norm of a symmetry transform is ±1 (Section 2), erroneous matchings can
be pruned. Moreover, this provides a means for categorizing transforms as either $C_K$ or $D_K$. We analyze each correspondence map $C_i$ by applying a normalized DLT algorithm [HZ04] and fitting a projective motion model $T_i$:

$$T_i X_1 = \begin{pmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \\ t_{31} & t_{32} & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ y_1 \\ 1 \end{pmatrix} = \begin{pmatrix} x_2 \\ y_2 \\ 1 \end{pmatrix} = X_2, \qquad (15)$$

$X_1$ and $X_2$ being the spatial coordinates of corresponding points in $C_i$. Equation 15 can also be solved for an affine motion model, where the choice of the model (projective vs. affine) depends on the expected distortion within the image. The correspondence map $C_i$ can be pruned of erroneous point matchings by applying a robust least-squares scheme such as RANSAC [FB81]. Given the transform $T_i$, we can now apply Eqs. 3 and 7 and classify a transform $T_i$ as a cyclic symmetry $C_K$ if $\det(T_i) \approx 1$, as a reflectional symmetry $D_K$ if $\det(T_i) \approx -1$, or discard it as an erroneous self-alignment otherwise. Algorithm 1 summarizes the first two steps of the SSA. We emphasize that the spectral matching and pruning schemes can also be applied to sets of points in higher dimensions; an example of symmetry analysis of a three-dimensional object is provided in Section 5.
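The categorization step can be sketched with an affine least-squares fit standing in for the normalized DLT (ε is the norm tolerance; the names are illustrative):

```python
import numpy as np

def classify_alignment(X1, X2, eps=0.1):
    """Fit x2 ~ A x1 + t by least squares and classify via det(A):
    ~ +1 -> rotational (C_K), ~ -1 -> reflectional (D_K), else erroneous."""
    n = len(X1)
    P = np.hstack([X1, np.ones((n, 1))])         # rows [x, y, 1]
    M, *_ = np.linalg.lstsq(P, X2, rcond=None)   # 3x2 affine parameters
    A = M[:2, :].T                               # linear part of the transform
    d = np.linalg.det(A)
    if abs(d - 1) < eps:
        return "rotation"
    if abs(d + 1) < eps:
        return "reflection"
    return "erroneous"
```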
4.2.3 Computing the Geometrical Properties of the Symmetry
The center of symmetry and the axis of reflection can be computed by complementary geometrical and analytical approaches. The axis of reflection can be computed analytically, given the corresponding transform $T_{D_K}$, by applying Eq. 8: the reflection axis is the line connecting the two points corresponding to the two eigenvectors of $T_{D_K}$ with an eigenvalue of $\lambda_i = 1$. We denote this the analytical solution. In addition, one can apply a geometrical solution, where we connect the corresponding points in $D_K$ found by the spectral matching; these are the points which were used to estimate $T_{D_K}$ in the previous section. The reflection axis is the line that fits through the middle point of each such line segment. Given the transform $T_{C_K}$ corresponding to some rotational symmetry $C_K$, the center of rotation can be recovered by applying Eq. 4 and computing the eigenvector of $T_{C_K}$ corresponding to $\lambda_i = 1$. The center of rotation can also be computed geometrically by connecting matching points: for each such line, consider the normal passing through its middle; all such normals intersect at the center of rotation. Thus, the center of rotation is derived by solving an overdetermined set of equations in the least-squares sense, similar to the robust fitting of the reflection axes. Theorems 1 and 2 provide the foundation for inferring the complete set of symmetries given a subset of them detected by the spectral analysis. Given two reflectional symmetry transforms $\{D_{K_1}, D_{K_2}\}$, the order of symmetry can be derived by computing the angle $\Delta\alpha$ between the reflection axes. The order of symmetry is then given by solving

$$\frac{2\pi}{\Delta\alpha} Z = K, \qquad Z, K \in \mathbb{Z}.$$
Algorithm 1. Spectral Symmetry Analysis of an Image
1: Create an image model M of the input image. Suppose it contains N points.
2: Set P, the number of eigenvectors to analyze, and ε, the norm error of the symmetric transform.
3: Progressively sample the interest regions.
4: Reflect all the local descriptors {Di}1..N in the image while preserving the originals.
5: Compute an affinity matrix based on putative correspondences drawn by similar descriptors {Di}1..N and the dominant scale pruning measure.
6: Add random noise to the affinity matrix.
7: Solve Eq. 13 and compute the eigendecomposition {ψi, λi}.
8: while i < P do
9:   Derive correspondence map Ci from ψi
10:  Estimate transformation matrix Ti
11:  if |det(Ti) − 1| < ε then
12:    Rotation detected: use Eq. 4 to find the center
13:  else if |det(Ti) + 1| < ε then
14:    Reflection detected: use Eq. 8 to find the reflection axis
15:  end if
16: end while
For instance, two reflectional axes with a relative angle $\Delta\alpha = \pi/2$ imply that the object has at least two reflectional symmetry axes; but there might also be four or even eight symmetry axes, which would imply that the spectral scheme identified only a subset of the axes. Hence, the symmetry order can only be estimated up to an integer scale factor Z. Given more than two reflection axes, one can form a set of equations over the integer variables $\{Z_i\}$:

$$\frac{2\pi}{\Delta\alpha_1} Z_1 = K, \quad \dots, \quad \frac{2\pi}{\Delta\alpha_n} Z_n = K. \qquad (16)$$
As K < 8 for most natural objects, Eq. 16 can be solved by iterating over K = 1..8 and looking for the value of K for which all of the {Zi } are integers.
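The integer search over K is a few lines of code; the sketch below implements Eq. 16 literally, assuming the axis-angle differences are given in radians:

```python
import numpy as np

def symmetry_order(delta_alphas, k_max=8, tol=0.05):
    """Solve Eq. 16: the smallest K <= k_max such that Z = K*dA/(2*pi)
    is a (nonzero) integer for every measured angle difference dA."""
    for K in range(1, k_max + 1):
        ok = True
        for dA in delta_alphas:
            z = K * dA / (2 * np.pi)
            if round(z) < 1 or abs(z - round(z)) > tol:
                ok = False
                break
        if ok:
            return K
    return None

print(symmetry_order([np.pi / 2]))   # two axes at 90 degrees -> K = 4 under Eq. 16
```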
5 Experimental Results

In this section we experimentally verify the proposed Spectral Symmetry Analysis scheme by applying it to real images and volumes. In Section 5.1 we apply the SSA to a set of real images, where the detection of the symmetries becomes
increasingly difficult. This allows us to exemplify the different aspects of the SSA, as the simple examples point out the core of the SSA, while the more difficult ones require the more elaborate components. In Section 5.2 we apply our scheme to the BioID face database [Bio01]. This dataset consists of 1521 face images with ground-truth symmetry axes, which allows us to assess the scheme's accuracy and compare it to Loy and Eklundh [LE06], whose results are considered state-of-the-art. Last, we detect symmetries in three-dimensional objects in Section 5.3.
5.1 Symmetry Analysis of Images

Figure 4 presents the analysis of a synthetic image with rotational symmetry of order four. Applying Algorithm 1 produces the sets of corresponding eigenvalues and eigenvectors $\lambda_i$ and $\psi_i$, respectively. The spectral gap in Fig. 4a is evident and makes it possible to estimate the order of symmetry. Note that the different self-alignments are manifested by non-equal eigenvalues, despite the image being synthetic. We attribute this to the imperfectness of the feature point detectors, which detect the feature points at slightly different locations, and to the addition of noise to the feature point coordinates. The transformation $T_{C_1}$, corresponding to the leading eigenvalue $\lambda_1$, is found to correspond to the identity transform and is thus discarded.
Fig. 4 Rotational symmetry. (a) Eigenvalues λi (b) The rotation corresponding to λ2 (c) The rotation corresponding to λ3 (d) The rotation corresponding to λ4
Fig. 5 Reflectional symmetry. (a) Eigenvalues $\lambda_i$; (b) the reflection corresponding to $\lambda_2$
By estimating $T_{C_2}$, the symmetry is found to be a rotational symmetry, as $\det(T_{C_2}) \approx 1$. Applying Eq. 4 to $T_{C_2}$ recovers the center of symmetry, which is marked by a red dot in Fig. 4b. Eigenvalues $\lambda_3, \lambda_4$ uncover the remaining symmetrical self-alignments, which are drawn in Figs. 4c and 4d, respectively. The analysis of {λi }i>4,i
$CC_{BR}(A,B)$, and if $CC_{error}(A,B) < 0$ it means that $CC_{AR}(A,B) < CC_{BR}(A,B)$. The registration algorithm with the highest value of $CC_{error}(A,B)$ will be chosen as the algorithm for the considered problem. The author's brain atlas, constructed for the purpose of labeling two-dimensional CT images, consisted of 11 brain templates and 11 different brain tissue labels (this complexity proved to be sufficient). Labels for those templates were created by adapting the Talairach Atlas labels [23], a popular and broadly used standard brain atlas template. The comparison of $CC_{error}(A,B)$ for the three previously presented algorithms with different parameters (Table 3) was performed on a set of 93 CT images. The average $CC_{error}(A,B)$ for the considered algorithms is presented in Fig. 4. Because the $CC_{error}$ distribution appeared to be normal, the minimally important difference (MIR) criterion was used in order to compare the algorithms (Fig. 4).
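A sketch of the selection criterion, consistent with the sign convention above: CC is taken here as the standard normalized cross-correlation, and CC_error as the difference of the two correlations (the function and argument names are illustrative, not the author's code):

```python
import numpy as np

def cc(a, b):
    """Normalized cross-correlation CC(A, B) of two images."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float((a * b).mean())

def cc_error(image, atlas_after_A, atlas_after_B):
    """CC_error(A, B) = CC_AR(A, B) - CC_BR(A, B): positive values favour
    registration algorithm A, negative values favour algorithm B."""
    return cc(image, atlas_after_A) - cc(image, atlas_after_B)
```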
CCerror between
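A minimal sketch of this criterion, assuming CC denotes the normalized cross-correlation between the static image A and the moving image B, evaluated before registration (BR) and after registration (AR); the function names are illustrative.

```python
import numpy as np

def cc(a, b):
    """Normalized cross-correlation of two equally shaped, non-constant images."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

def cc_error(static, moving, registered):
    """CC_error(A, B) = CC_AR(A, B) - CC_BR(A, B): the correlation gain
    achieved by registration (negative if registration made things worse)."""
    return cc(static, registered) - cc(static, moving)

# The algorithm with the highest average cc_error over the test images wins.
```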
Table 3 Image registration algorithms and their parameters (σ is the value of the variation coefficient in the Gaussian kernel)

  Designation   Registration algorithm
  A             Affine algorithm
  B             TDA, σ = 0.5, 2000 iterations
  C             TDA, σ = 1, 2000 iterations
  D             TDA, σ = 2, 2000 iterations
  E             TDA, σ = 5, 2000 iterations
  F             TDA, σ = 10, 2000 iterations
  G             TDA, σ = 15, 2000 iterations
  H             FFD, initial grid size 6x6, 12 iterations
  I             FFD, initial grid size 8x8, 12 iterations
  J             FFD, initial grid size 10x10, 12 iterations
  K             FFD, initial grid size 16x16, 12 iterations
  L             FFD, initial grid size 32x32, 12 iterations
  M             FFD, initial grid size 6x6, 12 iterations, affine rotation and scaling
  N             FFD, initial grid size 8x8, 12 iterations, affine rotation and scaling
  O             FFD, initial grid size 10x10, 12 iterations, affine rotation and scaling
  P             FFD, initial grid size 16x16, 12 iterations, affine rotation and scaling
  Q             FFD, initial grid size 32x32, 12 iterations, affine rotation and scaling
  R             FFD, initial grid size 6x6, 12 iterations, 1 mesh refinement
  S             FFD, initial grid size 8x8, 12 iterations, 1 mesh refinement
  T             FFD, initial grid size 10x10, 12 iterations, 1 mesh refinement
  U             FFD, initial grid size 6x6, 12 iterations, 1 mesh refinement, affine rotation and scaling
  V             FFD, initial grid size 8x8, 12 iterations, 1 mesh refinement, affine rotation and scaling
  W             FFD, initial grid size 10x10, 12 iterations, 1 mesh refinement, affine rotation and scaling
  X             FFD, initial grid size 6x6, 12 iterations, 2 mesh refinements
  Y             FFD, initial grid size 8x8, 12 iterations, 2 mesh refinements
  Z             FFD, initial grid size 10x10, 12 iterations, 2 mesh refinements
  1             FFD, initial grid size 6x6, 12 iterations, 2 mesh refinements, affine rotation and scaling
  2             FFD, initial grid size 8x8, 12 iterations, 2 mesh refinements, affine rotation and scaling
  3             FFD, initial grid size 10x10, 12 iterations, 2 mesh refinements, affine rotation and scaling
Fig. 4 The minimally important difference, the standard error and the average value of $CC_{error}(A,B)$ for all considered algorithms
The highest value of $CC_{error}$ (0.521) was reached with algorithm (1), that is, FFD with an initial grid size of 6 x 6, 12 iterations and two adaptive mesh refinements with additional affine rotation and scaling. According to the MIR test there is no significant difference between algorithms (1), (2), (3), (v) and (w); all of these are FFD with mesh refinement and additional rotation and scaling. The worst results were achieved for the affine transform ($CC_{error} = 0.220$) and for FFD with a dense mesh (32 x 32, $CC_{error} = 0.134$). This happens because the affine transform cannot model local differences between the registered images, while a dense mesh at the beginning of the FFD registration process does not allow modeling global differences between objects (scaling / translations / rotations). The MIR test of Thirion's algorithm (cases (b) - (g)) did not show important differences between algorithms with different values of the variation parameter σ in the Gaussian kernel. The largest variety of $CC_{error}$ was observed between the different configurations of FFD. Neither a sparse ((h), (m)) nor a dense mesh ((l), (q)) leads to satisfactory results. A significant improvement of the algorithms' performance can be achieved by simultaneously utilizing mesh refinements with affine rotation and scaling. Finally, algorithm (1) was chosen for the author's solution because the highest value of $CC_{error}$ was reached with it.
Example results of the AA labeling process are presented in Fig 5. Each row presents two pairs of CT images. The left figure of each pair is the static CT image superimposed on the moving CT image; the right figure of each pair is the static CT image with tissue labels. The first row shows the state before registration; the next three rows show the results of affine registration with the Nelder-Mead simplex method, Thirion's algorithm and the FFD algorithm (initial grid 6x6, 12 iterations, 2 mesh refinements).
Fig. 5 Example results of the AA labeling process
5 Classification of Detected Abnormalities
After the detection of asymmetry regions (lesions) in the CBF and CBV maps, semantic interpretation of the extracted features is performed (a diagnosis based on the detected symptoms [4]). The algorithm decides what kind of lesion (ischemic / hemorrhagic) was detected and in which hemisphere. This is done by comparing the averaged CBF / CBV with the normal values from Table 1 (Fig 6). In the next step the algorithm analyzes both perfusion maps simultaneously in order to detect:
1. Tissues that can be salvaged (tissues are present in the CBF and CBV asymmetry maps and the rCBF values did not drop below 0.48 - Table 2)
2. Tissues that will eventually become infarcted (tissues are present in the CBF and CBV asymmetry maps and the rCBF values did drop below 0.48)
3. Tissues with an auto-regulation mechanism in the ischemic region (decreased CBF with correct or increased CBV)
The visualized effect of these rules is presented in Fig 7, and a minimal sketch of them follows.
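A compact sketch of the three rules above for a single candidate region; the region representation and the function interface are illustrative assumptions, while the 0.48 rCBF threshold is the one quoted from Table 2.

```python
def tissue_prognosis(in_cbf_asym, in_cbv_asym, rcbf, cbf_decreased, cbv_ok_or_up):
    """Classify one region according to the three rules above.

    in_cbf_asym / in_cbv_asym: region appears in the CBF / CBV asymmetry map
    rcbf: relative CBF of the region (lesion side vs. healthy side)
    """
    if in_cbf_asym and in_cbv_asym and rcbf >= 0.48:
        return "salvageable"            # rule 1
    if in_cbf_asym and in_cbv_asym and rcbf < 0.48:
        return "will become infarcted"  # rule 2
    if cbf_decreased and cbv_ok_or_up:
        return "auto-regulation"        # rule 3
    return "no finding"
```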
Fig. 6 The process of determining the lesion type and position (top row: CBF map, bottom row: CBV map). (A) Input images. (B) CBF and CBV images with the detected dissimilarity. The perfusion value in the right region of interest (ROI, left side of the image) is 8.96 ml/100 g/min, and in the left ROI 69.01 ml/100 g/min; according to Table 1 this is an ischemic lesion in the right hemisphere. The CBV in the right ROI equals 0.88 ml/100 g, and in the left ROI 3.54 ml/100 g; according to Table 1 this is an ischemic lesion in the right hemisphere. (C) Input image with the potential lesion marked.
Fig. 7 The visualized prognosis for brain tissues superimposed on the CT image. Red region: infarcted tissues; blue region: tissues with an auto-regulation mechanism.
6 System Validation and Results
The DMD system was validated on a set of 37 triplets of medical images acquired from 30 different adult patients (men and women) with suspected ischemia / stroke. Each triplet consisted of the perfusion CBF and CBV maps and a "plain" CT image (one of the images from the perfusion acquisition, taken before the contrast arrival became visible). The algorithm's response was compared to the image description made for each case by a radiologist. The hypothesis to verify was whether there are any lesions in the perfusion map and whether the algorithm found their correct position (if the algorithm gave a wrong answer for either of those conditions, the case was considered an "error"). The test results were summed up according to the terminology presented in Table 4.

Table 4 Classification of test results according to the presence / absence of errors (according to [2])

                   Actual condition
  Test result      Present                     Absent
  Positive         True positive               False positive
  Negative         False (invalid) negative    True (accurate) negative
• True positive (TP): confirmation of the hypothesis (presence / position of lesions); the algorithm's results matched the radiologist's description.
• False positive (FP): the error of rejecting a hypothesis (presence / position of lesions) when it is actually true (the algorithm did not find lesions or situated them in the wrong place).
• True negative (TN): rejecting a hypothesis (presence of lesions) when the perfusion maps did not show lesions.
• False negative (FN): the error of failing to reject a hypothesis (presence of lesions) when the perfusion maps did not show lesions.
The obtained results are presented in tables. Table 5 contains the results of fully automatic detection (without manual correction of the position of the symmetry axis). Table 6 contains the results of semi-automatic detection (with manual correction of the position of the symmetry axis).

Table 5 Fully automatic detection results (without correction of the position of the symmetry axis)

                   Actual condition
  Test result      Present    Absent
  Positive         11         21
  Negative         14         28
The false positive rate (the proportion of negative instances that were erroneously reported as being positive) is:

$$FPR = \frac{FP}{FP + TP} \cdot 100\% \approx 65.6\%$$

The false negative rate (the proportion of positive instances that were erroneously reported as negative) is:

$$FNR = \frac{FN}{FN + TN} \cdot 100\% \approx 33.3\%$$

The total error rate (the proportion of error instances to all instances) is:

$$TER = \frac{FN + FP}{FP + TP + FN + TN} \cdot 100\% \approx 47.3\%$$
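These rates follow directly from Table 5 (TP = 11, FP = 21, FN = 14, TN = 28); a quick numerical check using the author's definitions, which normalize FP by the positive test results and FN by the negative ones:

```python
tp, fp, fn, tn = 11, 21, 14, 28  # Table 5

fpr = fp / (fp + tp) * 100                   # -> 65.6
fnr = fn / (fn + tn) * 100                   # -> 33.3
ter = (fn + fp) / (tp + fp + fn + tn) * 100  # -> 47.3
print(f"FPR = {fpr:.1f}%, FNR = {fnr:.1f}%, TER = {ter:.1f}%")
```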
It can be clearly seen that the automatic detection of the symmetry axis (ADOSM) can be used only to simplify further manual detection of the symmetry axis. The high FPR is mostly caused by overestimation of asymmetry regions (i.e. additional regions between the brain hemispheres or at the top / bottom of the perfusion map). ADOSM can be very helpful in the early estimation of visible lesions because of its computation speed and because the "real" axis differs only by a few degrees, but further manual correction may still be necessary.

Table 6 Semi-automatic detection results (with correction of the position of the symmetry axis)

                   Actual condition
  Test result      Present    Absent
  Positive         23         9
  Negative         8          34
The false positive rate (the proportion of negative instances that were erroneously reported as being positive) is:

$$FPR = \frac{FP}{FP + TP} \cdot 100\% \approx 28.1\%$$

The false negative rate (the proportion of positive instances that were erroneously reported as negative) is:

$$FNR = \frac{FN}{FN + TN} \cdot 100\% \approx 19.0\%$$
The total error rate (the proportion of error instances to all instances) is:

$$TER = \frac{FN + FP}{FP + TP + FN + TN} \cdot 100\% \approx 23.0\%$$
Semi-automatic detection gives far better results than fully automatic detection: 77.0% of the tested maps were correctly classified, and the visible lesions were detected and described identically to the radiologist's diagnosis. The FPR error appears more often than the FNR. This happens because it is a more complicated task to estimate the presence of a lesion and its position than merely to reject the hypothesis of the presence of a lesion. Most of the errors were caused, similarly to the fully automatic detection case, by overestimation of asymmetry regions (i.e. additional regions between the brain hemispheres or at the top / bottom of the perfusion map). These errors might be eliminated by a more accurate detection algorithm with additional adaptive factors, e.g. taking into account the brain volume (to adapt the minimal lesion size factor), the noise ratio in the perfusion map (an adaptive median filter size) or the average perfusion in the whole hemisphere (an adaptive threshold for lesion acceptance).
7 Data Visualization – Augmented Reality Environment
Augmented reality (AR) is a technology that allows the real-time fusion of computer-generated digital content with the real world. Unlike virtual reality (VR), which completely immerses users inside a synthetic environment, augmented reality allows the user to see three-dimensional virtual objects superimposed upon the real world [11]. Analyzing the articles and conference proceedings of the leading AR research symposia, several significant research directions in this field can be identified: tracking techniques, display technologies, mobile augmented reality and interaction techniques. Tracking techniques are a group of computer methods used to achieve a robust and accurate overlay of virtual imagery. There are various optical tracking techniques, including fitting a projection of a 3D model onto detected features in the video image, matching a video frame with photos, and real-time image registration algorithms [35]. One of the most popular methods is the calculation of the 3D pose of a camera relative to a single square tracking marker. The task performed by the tracking software is the detection of the presence of a tracking marker and the determination of its position, based on the interpretation of a special pattern printed on it. The commonly used display technologies are head-mounted, handheld and projection displays for AR. Many contemporary mobile devices like iPods and cell phones can execute sophisticated algorithms and are equipped with high-resolution color displays and cameras; for those devices, AR applications for outdoor usage are being developed. The last group of methods is interaction techniques, which enable user interaction with AR content. Augmented reality shows its usefulness especially in the field of medicine [37]. The most notable examples are deformable body atlases, AR surgical navigation
systems [21], interfaces and visualization systems. Augmented reality aims at lifting the support for pre-, intra- and post-operative visualization of clinical data to a new level by presenting more informative and realistic three-dimensional visualizations [14]. Not many years ago, medical AR systems required expensive hardware configurations and a dedicated OS to handle the real-time superimposing and visualization process [35]. In this section the author presents a portable augmented reality interface for the visualization of medical data (including volumetric data) for the Windows OS environment that can be run on an off-the-shelf computer, and tests its capabilities. The main advantages of the proposed solution are speed, the quality of rendering of virtual objects and low hardware cost.
7.1 Augmented Reality Environment
For the identification of the position of a virtual object in the real scene, the author used size-known square markers. The transformation matrices from these marker coordinates to the camera coordinates are estimated by image analysis. After thresholding of the input image, regions whose outline contour can be fitted by four line segments are extracted. The parameters of these four line segments and the coordinates of the four vertices of the regions, found from the intersections of the line segments, are stored for later processing. The regions are normalized, and the sub-image within each region is compared by template matching with patterns that were given to the system beforehand, in order to identify specific user ID markers. Further details of the algorithm can be found in [11]. The markers for the author's system are cardboards with image patterns printed on one side (Fig 8).
Fig. 8 Set of image patterns (ARToolkit markers) used for AR software printed on one side of cardboard
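A hedged OpenCV sketch of the detection steps described in Section 7.1 (thresholding, extracting contours that can be fitted by four line segments, and normalizing each region for template matching); the threshold value, minimum area and output size are illustrative assumptions, and this is not the actual ARToolkit implementation.

```python
import cv2
import numpy as np

def find_marker_quads(frame, min_area=1000):
    """Return 4-vertex contours that may be square tracking markers."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 100, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    quads = []
    for c in contours:
        # fit the outline with four line segments
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) == 4 and cv2.contourArea(approx) > min_area:
            quads.append(approx.reshape(4, 2).astype(np.float32))
    return quads

def normalize_region(gray, quad, size=64):
    """Warp the marker interior to a canonical square for template matching."""
    dst = np.float32([[0, 0], [size, 0], [size, size], [0, size]])
    h = cv2.getPerspectiveTransform(quad, dst)
    return cv2.warpPerspective(gray, h, (size, size))
```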
The author used NyARToolkit CS [40], which is an implementation of the well-known ARToolkit [15] in the C# programming language. An important fact is that this module uses no native code (managed code only).
7.2 Real Time Rendering of 3D Data
The two most popular methods capable of real-time rendering of 3D volume data are texture-based volume rendering and volume ray casting. The author has chosen the GPU-accelerated ray casting algorithm described in [17]. The algorithm uses the standard front-to-back blending equations in order to find the color and opacity of the rendered pixels:

$$C_{dst} = C_{dst} + (1 - \alpha_{dst})\,\alpha_{src}\,C_{src} \qquad (7.1)$$

$$\alpha_{dst} = \alpha_{dst} + (1 - \alpha_{dst})\,\alpha_{src} \qquad (7.2)$$

where $C_{dst}$ and $\alpha_{dst}$ are the color and opacity values of the rendered pixels, and $C_{src}$ and $\alpha_{src}$ are the color and opacity values of the incoming fragment. The approach proposed in [17] includes standard acceleration techniques for volume ray casting, such as early ray termination and empty-space skipping. By means of these acceleration techniques, the framework is capable of efficiently rendering large volumetric data sets, including opaque structures with occlusion effects and empty regions. For many real-world data sets, the method is significantly faster than previous texture-based approaches while achieving the same image quality.
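A minimal sketch of the front-to-back compositing loop of Eqs. 7.1 and 7.2 for a single ray, including early ray termination; the sample representation and the termination threshold are simplifying assumptions (the actual implementation in [17] runs in a GPU fragment shader).

```python
import numpy as np

def composite_ray(samples, opacity_threshold=0.99):
    """samples: iterable of (C_src, alpha_src) pairs, ordered front to back.
    Returns the accumulated color and opacity per Eqs. 7.1 and 7.2."""
    c_dst = np.zeros(3)
    a_dst = 0.0
    for c_src, a_src in samples:
        c_dst = c_dst + (1.0 - a_dst) * a_src * np.asarray(c_src)  # Eq. 7.1
        a_dst = a_dst + (1.0 - a_dst) * a_src                      # Eq. 7.2
        if a_dst > opacity_threshold:  # early ray termination
            break
    return c_dst, a_dst
```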
7.3 Augmented Desktop - System Performance Test
One of the possible AR visualization interfaces is the Augmented Desktop (the module for AR visualization of clinical data created by the author). It is capable of rendering in real time not only 2D perfusion images but also all types of visual data that are used during perfusion treatment (for example, 3D volume data from angio-CT) (Fig 9). The image data can be intuitively zoomed in and out just by changing its distance from one's eyes. If an image is near enough to the camera, the description of the image produced by the DMD algorithm is displayed on the left side of the screen (Fig 9 (B), (C), (D)). The rendering algorithm was implemented in the XNA Framework with the High Level Shader Language (HLSL). The XNA Framework was originally designed as a set of tools that facilitates computer game development and management for Windows and Xbox. The basic window frame of the XNA Framework is called "Game", and the author uses this notation in this section.
Fig. 9 Example visualization of data acquired from a patient with suspected brain stroke, after identification of potential brain lesions with the algorithm described in this article. (A) Set of different images. (B) Volumetric data from a head-neck protocol. (C) Detailed view of a CBV map with perfusion lesions marked as white regions; on the left side of the screen, the description of the image produced by the DMD algorithm. (D) CT image and the same CT image with the prognosis for brain tissues.
The author's solution was tested on an off-the-shelf computer with an Intel Core 2 Duo CPU 3.00 GHz processor, 3.25 GB RAM and an Nvidia GeForce 9600 GT graphics card, running 32-bit Windows XP Professional. A Creative pd 1120 USB web cam was used as the video capture device. Because volume rendering is the most time-demanding algorithm, the system performance tests were done for AR visualization of a single volume (the time for rendering a 2D square image is relatively small). The speed of the marker detection algorithm was 30 frames per second (fps) for a camera resolution of 320 x 240 and 15 fps for 640 x 480. These are the maximal image capture speeds of the tested web camera, so the fps limit of the marker detection software could not be found without a better (faster) camera. The speed of 3D rendering depends on the size of the rendered object: the smaller the object (the fewer pixels to compute), the faster the rendering process runs. In the author's research, the virtual object was localized as near to the camera as possible in order to generate the maximal possible size of the rendered model. Tests were performed on three 3D models generated from computed tomography (CT) data acquired from three different patients. The data were stored in collections of DICOM files. The sizes of the rendered volumes were 256 x 256 x 248 pixels,
256 x 256 x 240 pixels and 256 x 256 x 207 pixels. Each model was rendered in two modes: with no semi-transparent pixels and with semi-transparent pixels. Semi-transparent pixels were used for simulating tissues with low density (for example, skin). The speed of rendering was measured (in frames per second) for different popular screen resolutions. The results are summarized in Table 7 and in Fig 10.

Table 7 Dependence of the rendering speed (fps) of three different 3D models on the XNA Game window size (megapixels)

  XNA Game resolution    Without semi-transparent pixels    With semi-transparent pixels
  (megapixels)           Model 1   Model 2   Model 3        Model 1   Model 2   Model 3
  800 x 600 (0.48)       69        72        83             45        47        45
  832 x 624 (0.52)       66        67        76             42        44        43
  960 x 600 (0.54)       67        71        81             44        47        45
  1088 x 612 (0.67)      57        69        79             44        44        44
  1024 x 768 (0.79)      51        53        62             32        33        32
  1280 x 720 (0.92)      54        55        65             34        36        35
  1280 x 768 (0.98)      49        54        59             33        33        33
  1152 x 864 (1.00)      43        43        51             27        27        28
  1280 x 800 (1.02)      47        47        55             30        29        31
  1280 x 960 (1.23)      38        39        45             24        23        24
  1280 x 1024 (1.31)     34        35        42             21        21        22
Fig. 10 Data from Table 7 presented as charts: dependence of the rendering speed (fps) of three different 3D models on the XNA Game window size (megapixels). (A) Models with no semi-transparent pixels, (B) models with semi-transparent pixels
As can be seen, the rendering speed decreases with the size of the rendered object, but not necessarily with the resolution of the screen. For example, for model 1 with no semi-transparent pixels, we obtain 49 fps at 1280 x 768, 43 fps at 1152 x 864 and 47 fps at 1280 x 800. This is because the model at 1152 x 864 is bigger (visualized with a higher number of pixels) than in the two other cases. For the author's hardware, the best configurations of application window size / camera resolution and capture speed are a 1024 x 768 (or 1280 x 768) window with a 320 x 240 (30 fps) camera image, or a 1280 x 1024 window with 640 x 480 (15 fps). In those two configurations, the positions of all detected markers are instantly converted into the proper linear transformation of the virtual 3D model and rendered. The rendering speed is higher for models with no semi-transparent pixels because they require less computation time in the shaders (GPU processors). An example visualization performed by the author's solution is presented in Fig 11.
Fig. 11 Example augmented reality visualization of three different models constructed from CT data. From left to right: visualization of dense tissues (bones), a model without semi-transparent pixels, and a model with semi-transparent pixels for tissues with low density.
8 Summary
In this chapter the author has presented the application of pattern classification methods to the task of automatic interpretation of dynamic PCT maps. The identification process consists of three steps that utilize different pattern classification approaches. First, the author uses a heuristic algorithm that decides whether the considered perfusion map contains a perfusion abnormality and determines its position (image processing and abnormality detection). In the second step, brain tissues are labeled with the brain AA using the FFD algorithm. The last step is distinguishing between normal and abnormal perfusion regions that might be detected in both hemispheres, and classifying an abnormal region into the ischemic or hemorrhagic class. This classification task is based on previously gathered medical knowledge. The author also presents the test results of the proposed solution, obtained on real diagnostic data. Advanced pattern classification methods can also be utilized in the process of AR visualization of medical data. The tracking algorithm detects the presence of markers, determines their type with a template matching algorithm and measures their relation to the camera (translation and rotation). The author describes one of many AR solutions that can be easily attached to nearly any medical or non-medical application. AR lifts the support to a new level by presenting more informative and realistic
two- and three-dimensional visualizations at low cost. It should be noted that using AR tracking software often requires not only knowledge of pattern classification methods but also advanced skills in the field of computer graphics.
Acknowledgments. This work has been supported by the Ministry of Science and Higher Education, Republic of Poland, under project number N N516 511939.
References [1] Aksoy, F., Lev, M.: Dynamic Contrast-Enhanced Brain Perfusion Imaging: Technique and Clinical Applications. Semin Ultrasound CT MR 21, 462–477 (2000) [2] Allchin, D.: Error Types. Perspectives on Science 9, 38–59 (2001) [3] Bardera, A., Boada, I., Feixas, M.: A Framework to Assist Acute Stroke Diagnosis, Vision, Modeling, and Visualization (VMV 2005), Erlangen (2005) [4] Bodzioch, S., Ogiela, M.R.: New Approach to Gallbladder Ultrasonic Images Analysis and Lesions Recognition. Computerized Medical Imaging and Graphics 33, 154– 170 (2009) [5] Eastwood, J.D., et al.: CT perfusion scanning with deconvolution analysis: pilot study in patients with acute middle cerebral artery stroke. Radiology 222(1), 227–236 (2002) [6] Hachaj, T.: An algorithm for detecting lesions in CBF and CBV perfusion maps. BioAlgorithms and Med-Systems / Collegium Medicum - Jagiellonian University (6) (2007) [7] Hachaj, T.: Artificial Intelligence Methods for Understanding Dynamic Computed Tomography Perfusion Maps. In: International Conference on Complex, Intelligent and Software Intensive Systems (cisis), pp. 866–871 (2010) [8] Hachaj, T.: The unified algorithm for detection potential lesions in dynamic perfusion maps CBF, CBV and TTP. Journal of Medical Informatics & Technologies 12 (2008) [9] Hachaj, T.: The registration and atlas construction of noisy brain computed tomography images based on free form deformation technique. Bio-Algorithms and MedSystems, Collegium Medicum - Jagiellonian University (6) (2007) [10] Hachaj, T., Ogiela, M.R.: Automatic detection and lesion description in cerebral blood flow and cerebral blood volume perfusion maps. Journal of Signal Processing Systems for Signal, Image, and Video Technology 61, 317–328 (2010) [11] Haller, M., Billinghurst, M., Thomas, B.: Emerging Technologies of Augmented Reality: Interfaces and Design. Idea Group Publishing (2006) [12] Hoeffner, E.G., et al.: Cerebral Perfusion CT: Technique and Clinical Applications. Radiology 231(3), 632–644 (2004) [13] Koenig, M., Klotz, E., Heuser, L.: Perfusion CT in acute stroke: characterization of cerebral ischemia using parameter images of cerebral blood flow and their therapeutic relevance. Clinical experiences, Electromedica 66, 61–67 (1998) [14] Kalkofen, D., et al.: Integrated Medical Workflow for Augmented Reality Applications. In: International Workshop on Augmented environments for Medical Imaging and Computer-aided Surgery, AMI-ARCS (2006)
[15] Kato, H., Billinghurst, M.: Marker Tracking and HMD Calibration for a video-based Augmented Reality Conferencing System. In: Proceedings of the 2nd International Workshop on Augmented Reality (IWAR 1999), pp. 85–94 (1999) [16] Koenig, M., Kraus, M., Theek, C., Klotz, E., Gehlen, W., Heuser, L.: Quantitative assessment of the ischemic brain by means of perfusion-related parameters derived from perfusion CT. Stroke; a Journal of Cerebral Circulation 32(2), 431–437 (2001) [17] Krüger, J., Westermann, R.: Acceleration Techniques for GPU-based Volume Rendering. IEEE Visualization (2003) [18] Latchaw, R.E., Yonas, H., Hunter, G.J.: Guidelines and recommendations for perfusion imaging in cerebral ischemia: a scientific statement for healthcare professionals by the writing group on perfusion imaging, from the Council on Cardiovascular Radiology of the American Heart Association. Stroke 34, 1084–1104 (2003) [19] Lev, M.H., et al.: Utility of Perfusion-Weighted CT Imaging in Acute Middle Cerebral Artery Stroke Treated With Intra-Arterial Thrombolysis: Prediction of Final Infarct Volume and Clinical Outcome. Stroke 32, 2021 (2001) [20] Lucas, E., Sánchez, E., et al.: CT Protocol for Acute Stroke: Tips and Tricks for General Radiologists. Radiographics 28(6), 1673–1687 (2008) [21] Marmulla, R., et al.: An augmented reality system for image-guided surgery. Medical Image Analysis 34 (2005) [22] Muir, K.W., Buchan, A., von Kummer, R., Rother, J., Baron, J.-C.: Imaging of acute stroke. Lancet Neurol. 5, 755–768 (2006) [23] Nowinski, W.L., et al.: Analysis of Ischemic Stroke MR Images by Means of Brain Atlases of Anatomy and Blood Supply Territories. Acad. Radiol. 13, 1025–1034 (2006) [24] Nowinski, W.L., et al.: Fast talairach transformation for magnetic resonance neuroimages. Journal of Computer Assisted Tomography 30 (2006) [25] Przelaskowski, A., Ostrek, G., Sklinda, K.: Ischemic stroke monitor as diagnosis assistant for CT examinations. Elektronika: konstrukcje, technologie, zastosowania 49, 104–114 (2008) [26] Przelaskowski, A., Ostrek, G., Sklinda, K., Walecki, J., Jóźwiak, R.: Stroke Slicer for CT-based Automatic Detection of Acute Ischemia. Computer Recognition Systems 3, 447–454 (2009) [27] Przelaskowski, A., Sklinda, K., Ostrek, G., Jóźwiak, R., Walecki, J.: Computer Aided Diagnosis in Hyper-acute Ischemic Stroke. Progress in Neuroradiology, 69–78 (2009) [28] Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L.G., Leach, M.O., Hawkes, D.J.: Nonrigid Registration Using Free-Form Deformations: Application to Breast MR Images. IEEE Transaction on Medical Imaging 18(8) (1999) [29] Sasaki, M., et al.: CT perfusion for acute stroke: Current concepts on technical aspects and clinical applications. International Congress Series 1290, 30–36 (2006) [30] Sasaki, M.: Joint Committee for the Procedure Guidelines for CT/MR Perfusion Imaging (2006), http://mr-proj2.umin.jp/data/guidelineCtpMrp2006-e.pdf [31] Thirion, J.-P.: Image matching as a diffusion process: an analogy with Maxwell’s demons. Medical Image Analysis 2(3), 243–260 (1998) [32] Thompson, P., Mega, M., Narr, K., Sowell, E., Blanton, R., Toga, A.: Brain image analysis and atlas construction. In: Handbook of Medical Imaging, SPIE. ch. 17, pp. 1066–1119 (2000)
[33] Tietke, M., Riedel, C.: Whole brain perfusion CT imaging and CT angiography with a 64-channel CT system. Medica Mundi 52/1(07), 21–23 (2008) [34] Wang, H., et al.: Validation of an accelerated ’demons’ algorithm for deformable image registration in radiation therapy. Phys. Med. Biol. (2005) [35] Warfield, S., et al.: Advanced Nonrigid Registration Algorithms for Image Fusion. In: Brain Mapping: The Methods, 2nd edn., pp. 661–690. Academic Press, San Diego (2002) [36] Wintermark, M., Reichhart, M., Thiran, J.P., et al.: Prognostic accuracy of cerebral blood flow measurement by perfusion computed tomography, at the time of emergency room admission, in acute stroke patients. Ann. Neurol. 51, 417–432 (2002) [37] Yang, G., Jiang, T.: Medical Imaging and Augmented Reality. In: Yang, G.-Z., Jiang, T.-Z. (eds.) MIAR 2004. LNCS, vol. 3150. Springer, Heidelberg (2004) [38] Young, I.T., van Vliet, L.J.: Recursive implementation of the Gaussian filter. Signal Processing 44(2), 139–151 (1995) [39] Zierler, K.L.: Equations for measuring blood flow by external monitoring of radioisotopes. Circ. Res. 16, 309–321 (1965) [40] NyARToolkit CS home page, http://nyatla.jp/nyartoolkit/wiki/index.php?NyARToolkitCS [41] Brilliance Workspace for CT. Medica Mundi 50/3(12), 21–23 (2006) [42] Siemens AG: Clinical Applications. Application Guide. Software Version syngo CT 2007, Siemens Medical (2006)
Chapter 9
Inference of Co-occurring Classes: Multi-class and Multi-label Classification Tal Sobol-Shikler Ben-Gurion University of the Negev
Abstract. The inference of co-occurring classes, i.e. multi-class and multi-label classification, is relevant to various aspects of human cognition, human-machine interactions and to the analysis of knowledge domains and processes that have traditionally been investigated in the social sciences, life sciences and humanities. Human knowledge representations usually comprise multiple classes which are rarely mutually exclusive. Each instance (sample) can belong to one or more of these classes. However, full labeling is not always possible, and the size of the consistently labeled set is often limited. The level of existence of a class often varies between instances or sub-classes. The features that distinguish the classes are not always known, and can differ between classes. Hence, methods should be devised to perform multi-class and multi-label classification, and to approach the challenges entailed in these complex knowledge domains. This chapter surveys current approaches to multi-class and multi-label classification in various knowledge domains, and approaches to data annotation (labeling). In particular, it presents a classification algorithm designed for inferring the levels of co-occurring affective states (emotions, mental states, attitudes etc.) from their non-verbal expressions in speech.
1 Introduction
Large volumes of domain knowledge are available, but they are not always constructed in a manner that can be processed by machines. The selection of classes and the relations between the classes have an immense effect on the classification goals, design and capabilities [22, 46, 57, 61]. A large number of labels often means that each label is represented by only a small number of samples. Manual annotation and the number of samples per label in the training data pose limitations on the robustness and generality of the "ground truth" for the entire classification system. The consistency of the annotation defines the reliability of the system and its applicative scope. The data samples are represented mostly by a single modality or by multiple (synchronized or aligned) modalities, such as text, images, audio, video, and data from multiple sensors and various measurement equipment [1, 3, 50, 59, 62]. The selection of the application and of the modality to be used
often also dictates the choice of the representation manner of the information or knowledge domain, the type and number of features or attributes that serve as inputs to the classification system, and the classification method. This chapter briefly describes the common classification process, reviews challenges in the annotation of data with multiple class-labels, and surveys recent approaches to multi-class and multi-label classification in various knowledge domains. The chapter then presents the knowledge and behavioral domain of affective states (emotions, mental states, attitudes and the like), and a classification algorithm which was used for inferring the levels of co-occurring affective states from their non-verbal expressions in speech. Unlike other fields, the field of affective states has no definite "ground truth" for verification of the annotation. In addition, the choice of taxonomy has an important effect on the design. In this case of multi-label classification, the annotation of the data includes only one label per sample; therefore the algorithm performs semi-blind multi-label classification. In addition, there is an inherent sparsity, because different sets of features distinguish different pairs of classes.
2 Applications
The most commonly researched applications of multi-class and multi-label techniques relate to the annotation, search and retrieval of documents in huge repositories of data of various modalities, such as text [13, 23, 43, 49, 67, 77], images [8, 18, 29, 33, 56, 68, 77, 79], video [7, 73] and music [31, 39, 71]. Annotation means assigning items to categories, or labeling data samples with class-labels (the semantic terms associated with the class concepts). Annotation in the context of applications refers to problems in which the manually labeled sets have to be extended; otherwise, most classification problems can also be referred to as annotation, i.e. automatically assigning labels to samples. Search refers to finding items that belong to the required categories in large datasets, and retrieval to returning relevant items that belong to these categories. However, multi-class and multi-label classification schemes also apply to a very wide variety of current and foreseeable applications in other knowledge domains, for example, the analysis of human behavioral cues [17, 37, 60, 73, 78]. These include not only the analysis of verbal speech but also the analysis of cues such as non-verbal expressions in speech (prosody and paralinguistic cues), facial expressions, posture, hand and full-body gestures and actions, as well as the analysis of physiological cues from various measurement equipment, ranging from electrocardiograms (ECG) to functional magnetic resonance imaging (fMRI). These behavioral cues can be analyzed for a single person or for groups. Multi-class and multi-label classification of human behavioral cues can be used for identifying and tracking individuals, as well as for the analysis of affective and social cues of emotional and mental states, moods, attitudes, and the like. This has a wide scope of other applications, such as human-machine interactions (computers, robots, smart and mobile environments, etc.), content-based context retrieval, video summarization, assistive technology, technology aimed at improving the
abilities of specific groups in the population, medical diagnostics, surveillance, extended speech recognition and synthesis, animation, feedback systems, personalization of services (i.e. adjusting their parameters to individual requirements and preferences), authentication of a user's identity for security applications, and scientific purposes, i.e. extending the knowledge of physiological, neurological and cognitive processes, and more. Other applications refer to the analysis of the behavior of masses, which is relevant to sociology and to biology. This can be extended to the general case of representation, inference or mining, and analysis of human knowledge. Another example is the contribution to various scientific fields, with and without practical applications. Multi-class and multi-label classification methods have also been extensively researched in the field of bioinformatics, in applications such as gene function categorization [5, 77]. They are also applicable to fields such as the analysis of geographical information [50]. Search and recognition of objects and scene analysis can be useful for intelligent systems and robots, for applications that range from classifying product quality to autonomous and semi-autonomous mobile robots and unmanned vehicles, for exploratory applications, and for domestic, public and emergency services which require situation awareness, retrieval of objects and avoidance of others, and more [1, 62]. Additional research fields can benefit from analyses based on multi-class and multi-label classification methods. These could eventually develop to enable the analysis of complex knowledge domains and processes that have traditionally been investigated in different fields of the social sciences, life sciences, arts and humanities. Examples from the art domains include the analysis of music genre and mood [39], color retrieval [25], and the analysis of aesthetic characteristics and genres, with applications for research, the annotation of art archives and retrieval from them, and also for applications in fields such as architecture and commercial multimedia design.
3 The Classification Process
The classification process does not start with the classification itself, but is preceded by several preliminary stages that may affect the classification results. These include the definition, recording or selection of the input data; annotation or labeling, i.e. defining the classes to be found and giving each data sample one or more of the class names or class labels; and pre-processing, the extraction of features from the raw data that can be automatically processed. The first stage of the classification process usually concerns the recording and gathering of data or the choice between existing datasets. The type and quality of the gathered information and the definition of the available classes in it dictate, to a large extent, the capabilities of the classification system ("garbage in - garbage out" being the undesired case). There are several issues that cannot always be controlled, or that can be overcome only at great cost (time, space, etc.), and therefore methods have to be devised in order to work around them. These include gathering enough accurate data, accurate annotation or labeling of the data, and the combination of both, i.e. gathering enough annotated data to represent each class. Alternatively,
methods are presented for automatically or semi-automatically extending the number of annotated samples, using the relatively few manually annotated samples. In most of the applications and for most types of input modalities, such as audio and visual information, the classification is preceded by a pre-processing stage. Pre-processing refers to the extraction of processable features from the raw data, as described in Figure 1. The definition of the knowledge domain, the collected data and its annotation influence most of the processing and pre-processing stages of the classification. In some cases, a second level of abstraction exists between the basic extracted features and the classification system or algorithm; sometimes this also implies an intermediate level of abstraction. This involves a group of features that define a domain of the modality which can be described by itself, and define aspects of the knowledge domain, such as rhythm and melody in music analysis [39, 31], color and shape in computer vision [73], and the like. In other cases, the intermediate processing stage is required in order to describe spatial and/or temporal properties of the extracted features, for example, extending the features extracted per pixel in images or per time-frame in audio samples to patches, segments or regions that share characteristic values, and analyzing the features in these areas and the relations between these areas. In most cases the number of available features is large; this "dimensionality curse" may cause over-fitting and extend the classification complexity and processing time. Many classification processes therefore involve a stage of selecting a reduced set of features which are found to be relevant according to the available data. Sometimes, though, different sets of features distinguish different pairs or sets of classes; in this case, finding a single set of features for distinguishing all the classes is suboptimal [60].
Fig. 1 A schematic description of the classification process: definition of the knowledge domain → data collection & annotation → feature extraction → high-level feature extraction and feature selection → classification
Some researchers focus mainly on classification methods and apply them to various available datasets which present certain types of complexity; others focus on a certain knowledge domain and devise methods that are appropriate for, or suited to, the complexity presented by the specific knowledge domain and data. As can be seen from the discussion above, in many cases "getting intimate with the data" [70] is a ground rule that may affect the classification results. Therefore many of the classification solutions, although usually based on a few common basic methods, differ by a treatment which is tailored to the processed data. However, these tailored solutions are often suitable for various types of data and applications that share similar properties. The similarities may include the number of available classes, the relations between classes (the taxonomic method), the desired number of output classes per sample, the amount of available data per class, the connections between adjacent samples or between different parts of a single sample, the number of features that distinguish between classes, the number of sets of features that can be used for classification, the levels of feature abstraction or the number of pre-processing stages, and the like. The next section discusses a few of the considerations related to annotation or labeling methods. These have a considerable effect on the classification process, defining the inputs and outputs, and therefore also the demands on the classification algorithm.
4 Data and Annotation
Large volumes of domain knowledge are available, but they are not always constructed in a manner that can be processed by machines. The selection of classes and the association of items with these classes have an immense effect on the classification goals, design and capabilities [22, 46, 57, 61]. Ontologies and taxonomies are used in order to label the datasets, i.e. to assign concepts (labels, groups, categories) to instances (individual concrete objects or samples), which are the inputs to the classification system, and its outputs. Therefore, they also define many of the required properties of the classification system. Much effort is put into building domain-specific ontologies. Ontologies present a limited vocabulary specific to the knowledge domain, and specify "what exists" and therefore also what could be derived; they often also include synonyms and nested terms or conjunctions. In other words, an ontology is a formal explicit specification of a shared conceptualization of a domain of interest [24]. The term taxonomy refers to the manner in which the terms are organized and presented; these represent concepts and the relations between them. Taxonomies usually refer to a formal organization, such as the organization of families and species in biology. For knowledge domains that have been characterized, there are often several taxonomic models for representing and organizing the knowledge. These range from a few mutually exclusive and easily defined categories (the categorical approach), through representations of the knowledge space and the individual categories on a system of a few dimensions or facets (the dimensional approach), to prototypes, a hierarchical organization of groups, or tree-like taxonomies in which the trees have several root categories (the prototypical approach) [25, 57]. For example, in the case of affective states [60], the categorical approach refers to
basic emotions [16], such as anger, happiness, fear, sadness, surprise and disgust, while the dimensional approach refers to dimensions such as active-passive and positive-negative. The Mind Reading taxonomy [6], which is an example of the prototypical approach, comprises 24 meaning groups such as unfriendly, kind, happy, romantic, thinking and more; each meaning group comprises various affective states that share a concept or a meaning, such as comprehending, considering, choosing, thinking, thoughtful, and more [6]. For multi-disciplinary fields and for dynamic processes, the issue of taxonomic representation is still to be developed. The manner in which knowledge is intuitively grasped and organized by humans is called a schema or mental schema. For example, the manner in which students generate schemas at the end of a course is a way to assess their understanding of the course material and the interconnections between the various taught subjects, as is the manner by which different people annotate data in data repositories and pose queries for data search and retrieval. The prototypical taxonomic approach most resembles mental schemas. However, for simplicity of the models, most taxonomic methods associate each item with a single category or class, and each class with a single location in the defined space. This is a rather simplistic representation of the knowledge domain, which is not always suitable for the complexity required by real-world applications. Taxonomic methods are often used in order to reduce the large number of terms into manageable and processable knowledge, both for people and for machines. On the other hand, methods have been devised for dealing with irreducible and even extendable huge numbers of labels, such as the ever-extending personal image archives, and all of the documents on the web. Ontologies can also bridge the gap between datasets which are based on different schemas or taxonomic representations [22]. For scene analysis, i.e. the existence of independent objects, the term "bag of words" is commonly used to describe the scene content [7]. Finding statistical relations or co-occurrences between these objects is one of the explored approaches to multi-label classification in these cases. In "real world" applications, the issue of annotation or labeling poses various problems. Manual labeling is time consuming. In addition, manual annotation of data is subject to subjective schemas and to human error. This brings forward the question of the reliability and consistency of the "ground truth", i.e. the training set, upon which the entire classification is built. This arises when the represented knowledge is implicit, i.e. is not directly represented, and annotators have to draw on their own representations of the knowledge domain in order to annotate. This is the case for most human non-verbal behavior, which is used to express various types of information and most of the inter-personal communication, and for the mining of scientific knowledge, such as in geography [50]. Manual labeling [15, 57, 71] relies either on a relatively large group of human annotators from the wide population, such as world wide web (Internet) users in various web domains, using a closed set of labels taken from a limited vocabulary (taxonomy or ontology) or free labeling, or on a relatively small group of annotators, mostly experts, who either verify the previous labeling or generate the initial set of labels or rules. Software tools have been devised in
order to facilitate manual annotation. Some of them present various hierarchical levels of the taxonomy, such as the tool presented by Woitek et al. [71]. The size of the set of labels can be limited and pre-defined, or extendable [11, 14, 28, 56, 78]. With the existence of multiple classes and multiple labels, the number of data samples with these labels is not always statistically significant. The problem of a relatively small number of annotated samples is also associated with the problem of consistent labeling of the training data, or the "ground truth". The issue of obtaining training data for meta-level classifiers, such as classifiers that combine binary classifiers, has also been addressed. The simple method, if there is enough data, is to split the data so that one part is used for the training, testing and validation of the binary classifiers, and another part is used for training or examination of the combining or meta classifier. Shiraishi and Fukumizu [55] discuss two other methods: reusing the data used for the binary classifiers also for the combining algorithm, which can lead to over-fitting, and stacking via cross-validation, bootstrap or bagging [41]. In most cases, the output for each label is binary in nature, i.e. a label either exists or not. However, there are cases in which the level of recognition of a class is important for tracking spatial and temporal processes and tendencies, such as the level of occurrence of each label between successive data samples recorded during sustained human-computer interactions, or in the analysis of geospatial information [1, 50, 59]. In such cases, the classes are not always mutually exclusive, and the data samples are not necessarily independent. There are fields in which the situation is reversed, such as geography, in which "everything is related to everything else but nearby things are more related than distant things" [50]. All these considerations affect the classification process: they define the input and output of the classification system and pose requirements on the classification method.
5 Classification Approaches This section summarizes the definitions of binary, multi-class and multi-label classifications in order to establish a common ground before getting into more details in the following sections. Schematic descriptions of these classification methods are presented in Fig 2.
5.1 Binary Classification In binary classification, a classifier has to decide between two possible choices: YES/NO answers, i.e. a sample either belongs to a class or not, or choose between two disjoint classes. This is the most common behavior of well known classifiers [41].
5.2 Multi-class Classification
In multi-class classification, a classifier or a classification system has to choose between more than two classes, but the sample must still be assigned to one target class only. In other words, each sample is assigned a single class-label from a set of n labels, where n > 2. In comparison to binary classification, the set of possible classes increases (to more than two).
Fig. 2 A schematic description of the relations between classes in (a) binary classification, (b) multi-class, and (c) multi-label classification. The class boundaries can be of any shape
One approach to multi-class classification is to use classifiers which are specifically designed for multi-class classification. Another choice is to reduce the multi-class problem into a set of binary classifiers [2] and then return the class with the highest classification status or membership value (CSV), or optimize the margin-based loss function, i.e., return the closest class or the class best associated with the examined sample. This is a two-stage process, the first stage comprises the binary classifiers, the second stage is the combination or the comparison of the classification results in order to find the best candidate. The approach of binary classifiers can be divided in turn into two main approaches: one-against-all, in which each class is compared to all others, and one-against-one [54], in which each class is compared to each of the other classes [63]. A schematic description of these two approaches can be seen in Figure 3. A common challenge to both of these approaches is how to combine the binary classification results to a meaningful and coherent inference result. An issue that arises from dividing the classification problem into several smaller classifiers is if the same features have to participate in all the classifiers or if different sets of features can yield better results.
5.3 Multi-label Classification
In multi-label classification, a classifier or a classification system has to choose between more than two classes, and each sample can be associated with a variable number of classes, i.e. each sample is assigned at least one class-label. In comparison to multi-class classification, the output of the classification, i.e. the set of target classes, increases: each sample can be associated with a variable number of classes at the same time. Therefore, the number of target classes has to be selected. As in multi-class classification, multi-label classification can either be constructed of binary classifiers or be specifically tailored for multi-label classification.
The case of multi-label classification increases the challenges involved in the consistent annotation or labeling of the data used for training and testing. Labels can be ranked, nested, statistically and logically related, and can occur simultaneously; in many cases these connections are not known in advance. Various methods of multi-label classification refer to the relations between labels, because in various modalities and applications these relations may affect the number of labels and the classification method in use. Note: The term multi-label classification sometimes refers to co-occurring sets of features and of feature values that construct high-level features, or to different sets of values of features, used as an intermediate level of abstraction prior to the actual classification, which serve as input to classifiers such as Hidden Markov Model (HMM) classifiers and the like. This chapter does not approach this kind of problem, unless the output of the entire system comprises the assignment of multiple labels (concepts) to a sample.
Fig. 3 A schematic description of the relations of a class to the other classes in the two paradigms of multi-class classification based on binary classifiers: (a) one-against-all, (b) one-against-one
6 Multi-class Classification
Multi-class classification refers to classification methods in which there are more than two classes to represent the data, yet the classification process associates each sample with a single class-label. One approach to multi-class classification is to decompose the problem into multiple binary classifiers. The second approach aims to solve the multi-class problem directly. This section reviews some of the more commonly used approaches and a few of the more recent methods suggested for multi-class classification.
Note: Several of the methods based on “decision by a committee” for binary classification problems have been extended to accommodate the multi-class case. There are many books and reviews of these algorithms for the binary case [41, 45].
6.1 Multiple Binary classifiers The two common approaches to multi-class classification based on binary classifiers are “one-against-all” and “one-against-one”. 6.1.1 One-Against-All Classification Each class is compared to all the other classes, using a single classifier or a single decision, i.e. for n classes, there are n classifiers. Each of the n classifiers can be built of a number of weak classifiers, as in the AdaBoost algorithm [52]. The problem is the size and content of the training and testing sets of the “all” group. This has to fully characterize all the “other” classes, and at the same time be of a manageable size. In many cases it has to be of a similar size to the groups of samples that represents the examined class, in order to avoid bias. In addition, the second stage implies that the membership value or the classification status must be comparable, which is not always true. In addition, contradictory decisions or cases of no decision may occur, because the binary classifiers solve separate classification problems [63]. Many of the methods based on binary classifiers also assume that a single set of features can characterize all the border or transitions between the examined class and all the other classes. 6.1.2 One-Against-One (Pair-Wise) Classification One-against-one classification refers to methods in which each class is compared to each of the other classes, i.e. for n classes, there are n(n-1)/2 classifiers or comparisons [27, 54, 60]. The common method or methods for combining the results of the binary classifiers or comparisons is to use various ranking or voting paradigms. This approach can also be regarded as a graph in which each class has its own connection or connections to each of the other classes. This means that the transitions between classes can be emphasized rather than getting a complete characterization of the class or its center. This becomes especially important when the transitions between classes are continuous and only a threshold distinguishes between them. In this case, the direction of transition can also be significant. This approach allows optimization of each pair-wise classifier in terms of both feature set and classification algorithm. In addition, it is easier to construct balanced datasets for training and testing. More classifiers are required in one-against-one classification in comparison to the one-against-all classification. This can be a problem if the number of classes is large. On the other hand, such classifiers can be more simple in terms of the number of feature, and the construction of the training and testing data, and they have a potential to be more accurate for each pair of classes.
6.1.3 Combining Binary Classifiers
This section presents a few of the methods designed to combine binary classifiers. Wu et al. [72] review several of the methods for combining pair-wise classifiers. The most commonly used set of methods is based on voting [21, 30]: first all the pair-wise comparisons are made, and then the class that won the highest number of these comparisons is selected. Majority vote, as in politics, comprises various methods that enable choosing one candidate or more and aim to resolve conflicts [40, 60, 72]. Voting selects only the class label, not the probability of this selection. Another approach comprises various methods for summation of the pair-wise probabilities [26, 72]. Allwein et al. [2] suggest exponential loss decoding. Wu et al. [72] present two methods: the first obtains the probability estimates via an approximate solution to an identity, based on finite Markov chain theory; the second is a coupling method. Hastie and Tibshirani [26] use the Bradley-Terry model for combining binary classifiers. Shiraishi and Fukumizu [55] review combination methods for binary Support Vector Machines (SVMs); combining binary SVMs is computationally more feasible than, and not inferior to, direct multi-class SVMs. Shiraishi and Fukumizu propose a method for combining relatively strong binary classifiers, based on statistical techniques such as penalized logistic regression, stacking, and a sparsity-promoting penalty, while the binary classifiers do not have to return probabilistic values. This method is effective for both the one-against-one and the one-against-all paradigms. The benefit is that an estimate of the conditional probability of each class can be obtained. They also propose selecting only the relevant binary classifiers by adding a group-lasso-type penalty while training the combining method. Fernández et al. [19] present fuzzy rule-based pair-wise classification, in which the combination stage is treated as a decision-making problem, solved by rules based on the maximal non-dominance criterion. Fuzzy classification systems combine fuzzy logic with statistical machine learning. They are widely used to solve classification problems (in medicine, unmanned vehicles, battlefield analysis and intrusion detection) because of their interpretable models based on linguistic variables, which are easier for experts or end-users to understand.
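As a minimal illustration of the voting scheme described above, the following sketch tallies the pairwise decisions for a single sample; the helper name and the decision encoding are hypothetical.

```python
import numpy as np

def vote_from_pairwise(decisions, n_classes):
    """Combine one-against-one results by voting (hypothetical helper).

    `decisions` maps each class pair (i, j), i < j, to the index of the
    winning class for one sample; the predicted label is the class that
    wins the largest number of its pairwise comparisons.
    """
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), winner in decisions.items():
        votes[winner] += 1
    return int(np.argmax(votes)), votes

# Example: 3 classes -> 3 pairwise decisions for one sample.
decisions = {(0, 1): 0, (0, 2): 2, (1, 2): 2}
label, votes = vote_from_pairwise(decisions, n_classes=3)
print(label, votes)   # 2 [1 0 2]
```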
6.2 Direct Multi-class Classification
Schapire [52] proved that a strong classifier can be generated by combining weak classifiers through boosting. This result originated the AdaBoost family of algorithms [20]. The design of the classifier is iterative: in each iteration, a higher weight is assigned to the data samples not yet accurately classified. AdaBoost can identify outliers, i.e. examples that are either mislabeled or inherently ambiguous and hard to categorize, and it is robust against over-fitting. However, the actual performance of boosting depends on the data and on the performance of the basic learning algorithm. AdaBoost.MH and AdaBoost.MR were devised in order to deal with multi-class problems. AdaBoost.MH is based on minimization of the loss
function, and uses real or discrete values for the classification status value (CSV) of every weak learner, with the Hamming loss as the error bound. AdaBoost.MR relies on ranking loss minimization; it was designed to rank the correct classes at the top, ranking all classes by their CSV and selecting the ones with the highest rank. Zhu et al. [80] presented stage-wise additive modeling using a multi-class exponential loss function (SAMME), an algorithm that directly extends AdaBoost to the multi-class case without reducing it to multiple two-class problems. Their algorithm combines weak classifiers and only requires that the performance of each weak classifier be better than random guessing (i.e. better than 1/k accuracy for k classes, rather than better than 1/2). Another approach is based on Support Vector Machines (SVMs) [41, 65] with various kernels. This approach considers loss functions that treat more than two classes and minimizes them directly with various algorithms [10, 12, 32, 79]. These methods make it easier to analyze properties such as the consistency of the Bayes error rate, but they are not always feasible for a large number of classes and samples [76]. Zhang et al. [78] present a combination of AdaBoost.MH and a co-EM semi-supervised multi-class labeling scheme for the classification of human actions from video, using spatio-temporal features in two "views": optical flow and histograms of oriented gradients. They refer to human actions such as kissing, sitting down, answering the phone, jogging, hand clapping, and more. In this case the labeling is both time consuming and prone to errors. The data are described as a finite hierarchical Gaussian mixture model (GMM). Weighted multiple discriminant analysis (WMDA) is used to enable co-EM to work efficiently, regardless of the number of features, and to generate sufficient training data. A set of weak hypotheses is generated and combined using a linear combination.
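The SAMME-style update of Zhu et al. [80] described above can be sketched compactly. The following is a simplified illustration assuming decision stumps as weak learners; note the stopping condition, which only requires each weak learner to beat random guessing (error below 1 - 1/k).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, y, n_rounds=50):
    """Multi-class AdaBoost (SAMME-style) with decision stumps."""
    n, k = len(y), len(np.unique(y))
    w = np.full(n, 1.0 / n)                     # sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 1 - 1.0 / k:                  # no better than random guessing
            break
        alpha = np.log((1 - err) / max(err, 1e-10)) + np.log(k - 1)
        w *= np.exp(alpha * (pred != y))        # up-weight misclassified samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def samme_predict(X, learners, alphas, k):
    scores = np.zeros((len(X), k))
    for stump, alpha in zip(learners, alphas):
        scores[np.arange(len(X)), stump.predict(X)] += alpha
    return scores.argmax(axis=1)                # weighted vote over learners

X, y = load_iris(return_X_y=True)
learners, alphas = samme_fit(X, y)
print((samme_predict(X, learners, alphas, 3) == y).mean())
```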
6.3 Associative Classification
Another approach to multi-class classification is associative classification, which combines terms and techniques from the field of data mining with classification methods. Data mining has long dealt with finding association rules in data. Originally, classification aimed to find a single target class, while data mining aimed to find any attribute in the data. In recent years, integrative methods called associative classification have been presented, such as CBA [36], CMAR [34], CPAR [75], and more.
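The flavor of associative classification can be conveyed by a toy sketch that mines item-set to class rules under support and confidence thresholds and classifies by the most confident matching rule; this is a drastic simplification of CBA-style systems, with hypothetical data.

```python
from collections import Counter
from itertools import combinations

def mine_class_rules(transactions, labels, min_support=0.3, min_conf=0.7):
    """Mine rules {item set} -> class that pass minimum support and confidence."""
    n = len(transactions)
    counts, joint = Counter(), Counter()
    for items, label in zip(transactions, labels):
        for r in (1, 2):                         # item sets of size 1 and 2
            for subset in combinations(sorted(items), r):
                counts[subset] += 1
                joint[(subset, label)] += 1
    rules = []
    for (subset, label), c in joint.items():
        support, conf = c / n, c / counts[subset]
        if support >= min_support and conf >= min_conf:
            rules.append((subset, label, conf))
    return sorted(rules, key=lambda r: -r[2])    # most confident rules first

def classify(items, rules, default):
    for subset, label, _ in rules:               # fire the best matching rule
        if set(subset) <= set(items):
            return label
    return default

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
labels = ["pos", "neg", "pos", "neg"]
rules = mine_class_rules(transactions, labels)
print(classify({"a", "b"}, rules, default="pos"))   # -> pos
```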
7 Multi-label Classification
Multi-label learning refers to problems in which a sample can be assigned to multiple classes simultaneously. It is an extension of the multi-class classification problem to the case where classes are not mutually exclusive and may co-occur. This differs from multi-class learning, where every sample can be assigned to only one class even though the number of classes is more than two. Multi-label learning tasks are common in real-world problems. For instance, in text categorization, each document in a corpus may
belong to several (predefined) topics, such as government and health [42, 44, 63, 77]; in computational biology, for instance in functional genomics, each gene may be associated with a set of functional classes, such as metabolism, transcription and protein synthesis [5, 77]; in computer vision, in applications such as scene classification, each scene image may belong to several semantic classes, such as beach and urban [8, 68, 77]; in social computing, as in the case of affective computing, people simultaneously experience and express emotions, mental states, attitudes and moods, for example thinking and experiencing uncertainty at the same time. Such affective states can be expressed and analyzed through various behavioral cues, such as facial expressions [17] and non-verbal speech [60]. In all these cases, each sample in the training set is associated with a set of labels, and the task is to output a label set; the size of the output label set for each unseen sample or instance is unknown in advance. Multi-label classification techniques are mostly extensions and adaptations of multi-class classification techniques (which were themselves extensions of methods for the binary problem). Some of these methods deal directly with the multi-label problem, for example variants of AdaBoost such as AdaBoost.MH and AdaBoost.MR [4, 54]. Zhang [77] adapts the k-Nearest Neighbor model by utilizing statistical information from the neighborhood of the examined sample. McCallum [42] uses Bayesian mixture models to select the most probable classes. Wang et al. [67] use a hierarchical generative probabilistic model. Other techniques are variations of the methods based on combinations of binary classifiers, as in multi-class classification, but with variations of the combination methods. Methods of the first type are sometimes called adaptation methods, while methods that decompose the classification problem into binary classifiers are called transformation methods [64]. Multi-label classification methods have to deal with issues such as how to choose the most probable labels and how to use the inherent semantic or probabilistic (probability of co-occurrence) connections or relations between the probable sets of labels. Semantic relations can be synonyms, hierarchical connections (for example: vehicle, car, car model; flora, flower, daisy), ranking [13], or the fact that the probability of both an office and a car appearing in a single image is relatively low. For example, Qi et al. [47] review three paradigms: individual concept annotation, in which each concept is detected independently of other concepts; context-based conceptual fusion annotation, which explores the relations between the results of the individual detectors; and their proposed integrated paradigm for correlative multi-label video annotation, in which the individual concepts and their correlations are learned together. Wang et al. [67] present a model for correlated multi-labeling. Tsoumakas et al. [64] present label power-sets (LP), which represent frequent combinations of labels in the training set as new labels; a minimal sketch of two such transformation methods follows.
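The sketch below illustrates binary relevance (one independent binary classifier per label) and the label power-set of Tsoumakas et al. [64], assuming scikit-learn and toy data of illustrative shape.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy multi-label data: 4 samples, 3 labels (1 = label applies).
X = np.array([[0.1, 1.0], [0.9, 0.8], [0.2, 0.1], [1.0, 0.2]])
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])

# Binary relevance: one independent binary classifier per label.
br = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
pred_br = np.column_stack([clf.predict(X) for clf in br])

# Label power-set (LP): every distinct label combination in the training
# set becomes one class of an ordinary multi-class problem.
combos, y_lp = np.unique(Y, axis=0, return_inverse=True)
lp = LogisticRegression().fit(X, y_lp)
pred_lp = combos[lp.predict(X)]          # map the class back to a label set

print(pred_br)
print(pred_lp)
```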
Three types of approaches can be recognized from reviewing techniques for image annotation, and these can be extended to other modalities and applications. The first is multi-label classification on the entire sample, i.e. finding the classes most associated with each given sample. The second is multi-instance classification: identifying the (annotated) samples most similar to the examined sample (in terms of feature values) and extracting from their associated labels the relevant set of labels for the examined sample. The third is to use arbitrary patches or automatically and semi-automatically extracted segments [69, 74, 79] of the samples, find the labels relevant to each of them according to their features, and then combine these findings to describe the entire sample. Instead of area patches (images) or temporal segments (music, video), an intermediate level of abstraction can be used, in which domain parameters, such as rhythm in music or shape in images, are defined, and classes are characterized according to them. Another approach to multi-label classification is to trade accuracy for efficiency, using various methods to reduce the number of classes and labels by removing redundant labels that rarely appear in the training set [29]. A large number of labels means both longer processing time and a more complicated taxonomy for human users to deal with. The next paragraphs review a few of the methods published in recent years for various knowledge domains, modalities and applications. Applications of annotation and retrieval and of gene function recognition have many similarities, because a set of gene functions or a "bag of words" can have a binary representation: each element either appears or not. Dekel et al. [13] present four variations of a boosting-based algorithm that ranks the goodness of labels. Montejo-Ráez et al. [44] suggest an Adaptive Selection of Base Classifiers approach for multi-label classification of text documents based on independent binary classifiers, which provides the possibility of choosing the best of a given set of binary classifiers. They assign a higher weight when only a few documents are available for a class; the factor overweighs positive samples in comparison to negative samples of a class. To make the classification process fast, they add a threshold for the minimum performance allowed for a weak binary classifier; below this value both the classifier and the class are discarded, hence discarding rare classes. For the base algorithms they use either the Rocchio or the PLAUM algorithm [45]. Rak et al. [49] present an associative classification method, which they apply to medical documents. Thabtah et al. [63] present MMAC, a method based on associative classification. It assumes that for each instance that frequently occurs in the training data and passes certain frequency and confidence thresholds, there is a rule associated with each of the class labels; hence, each instance is associated with a ranked list of labels. The algorithm involves a single pass over the training data to generate the rules, followed by a regression process that generates more rules and sets the relations between them. Three evaluation methods are also presented: top-label, which examines only the label best associated with the data, as in the multi-class case; any-label, which counts how many times all the expected labels were recognized for all the instances in the test set; and label-weight, in which each label in the classification results is assigned a rank according to the number of times it is associated with the same instance in the training set.
For scene analysis, the goal is to understand what a scene contains. Automatic image annotation is the process of automatically producing words that describe the content of a given image; in other words, the goal is to describe a previously unseen image with a subset of words from a given vocabulary [38]. Reviews of various aspects of image annotation are presented in many recent papers [7, 18, 29, 67, 68, 77]. There is no simple mapping from raw images or videos to dictionary terms. One approach builds a dictionary using vector quantization over a large set of visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in the documents to be encoded (a sketch of this encoding appears after this passage). More robust approaches have been proposed that represent each visual descriptor as a sparse weighted combination of dictionary words; while favoring a sparse representation at the level of visual descriptors, those methods do not, however, ensure that images have a sparse representation. Moreover, visual similarity does not guarantee semantic similarity: for example, visual similarities exist between images of bears in snow and airplanes in the sky. Wang et al. [68] cluster correlated keywords into topics in the local neighborhood in order to reduce the number of labels, and develop an iterative method to maximize the margins between classes in both the visual and the semantic spaces. Bengio et al. [7] use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. Many methods have been suggested in recent years to solve the multi-instance, multi-label scene annotation problem. For example, Zhou et al. [80] present two methods for multi-instance and multi-label classification. Feng and Xu [18] review previous work and present a transductive method (TMIML) for automatic multi-label, multi-instance image annotation, combined with graph-based semi-supervised learning. In most multi-label classification methods, all the labels come from the same category type (a general pool), and a detector is built to differentiate between classes. Zhu [81] uses semantic scene concept learning for autonomous agents that interact with human users, and for robotic systems that aim to navigate in real environments and to recognize and retrieve objects based on natural-language commands. In such cases, the agents have to learn to deal with mutually non-exclusive scene labels such as red, ball and cat, without being told their category types in advance. Zhu adds an intermediate level of concept categories (labels), such as color and shape, built of joint probability density functions of the visual features. Objects such as a Pepsi can are then associated with these concept categories using a two-level Bayesian inference network. However, each object is associated with only a single color and a single shape from the concept categories. Video recordings pose additional challenges due to the conceptual spatial-temporal relations between entities in consecutive frames [35, 47, 73]. el Kaliouby and Robinson [17] present a relatively simple continuous monitoring of six real-time HMMs, each recognizing a certain affective state from the recent string of head gestures and facial expressions. Continuous monitoring of the summed combinations of binary classifiers, used for tracking subtle changes and nuances of vocal non-verbal expressions of affective states during sustained interactions, has also been presented [59].
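A minimal sketch of the dictionary-based (bag-of-visual-words) encoding mentioned above, assuming k-means for the vector quantization and random vectors as stand-ins for real visual descriptors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-ins for local visual descriptors (e.g. SIFT would give 128-D).
train_descriptors = rng.normal(size=(500, 16))

# Vector quantization: the cluster centers form the visual dictionary.
dictionary = KMeans(n_clusters=32, n_init=10, random_state=0)
dictionary.fit(train_descriptors)

def encode(image_descriptors):
    """Histogram of nearest dictionary words (bag of visual words)."""
    words = dictionary.predict(image_descriptors)   # nearest-center index
    hist = np.bincount(words, minlength=32).astype(float)
    return hist / hist.sum()                        # normalized word counts

new_image = rng.normal(size=(40, 16))               # descriptors of one image
print(encode(new_image).shape)                      # (32,) codeword histogram
```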
Similar methods can be seen in music genre and mood annotation. Lukashevich et al. [39] review methods for genre annotation, such as binary SVM-based classification and a modified k-Nearest Neighbors classifier that directly handles multi-label data. They suggest a method to disambiguate complex labels like "world music", which in practice refers to a large variety of genres, using an intermediate abstraction level they call domains, based on sets of features that represent a certain aspect of the music, such as rhythm and melody.
7.1 Semi-supervised (Annotation) Methods
Semi-automated and semi-supervised methods, such as active learning [56], have been devised to deal with the labeling of large volumes of data using only a relatively small number of manually labeled samples. Another type of semi-supervised method refers to cases in which the samples are only partially annotated, i.e. each sample is associated with a single label or a few labels, but not with all the relevant labels [60]. The first type is discussed in this section; an example of the second type is presented in the next section. An example of the first type is the multi-class problem of identifying people in surveillance video recordings [73, 78]. One method is co-training, in which a relatively small training set of manually annotated samples is extended automatically and iteratively: at each iteration, the un-annotated samples that are most similar to the already annotated samples are given the labels of the nearest classes and added to the training set [2, 73] (a simplified sketch appears at the end of this subsection). Other methods are co-EM and boosting, or combinations of co-training and boosting. Co-EM does not commit to a label for the unlabeled data; instead, it uses probabilistic labels that may change from one iteration to the next. Co-EM converges as quickly as EM does, and it is applicable to situations where plenty of training data are available, especially when the feature space is of high dimensionality [78]. Feng and Xu [18] present a method for semi-supervised multi-label and multi-instance annotation, in which each label in an image is associated with a different instance or area of the image, rather than with the entire image. They also review semi-supervised graph-based learning algorithms.
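The following sketch shows a single-view self-training loop, a simplified relative of co-training (which would maintain two feature views); the data, the confidence heuristic and the classifier choice are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, n_iter=5, k=2):
    """Iteratively move the k most confident unlabeled samples into the
    training set: a single-view simplification of co-training."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(n_iter):
        if len(X_unlab) == 0:
            break
        clf = LogisticRegression().fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        top = np.argsort(proba.max(axis=1))[-k:]   # most confident samples
        X_lab = np.vstack([X_lab, X_unlab[top]])
        y_lab = np.concatenate([y_lab, proba[top].argmax(axis=1)])
        X_unlab = np.delete(X_unlab, top, axis=0)
    return LogisticRegression().fit(X_lab, y_lab)

rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(4, 1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
model = self_training(X_lab, y_lab, X_unlab)
print(model.predict([[0, 0], [4, 4]]))             # -> [0 1]
```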
8 Inference of Co-occurring Affective States from Non-verbal Speech
This section describes a multi-class, semi-blind multi-label classification method with feature sparsity, which was designed for inferring complex affective states from their non-verbal expressions in speech. The domain of affective states and human behavior shares many characteristics with other domains of human knowledge and behavior. The requirements of the classification process, its implementation, and the further analysis based on it may provide a relatively simple answer to a wide range of classification problems in various knowledge domains. More details can be found in [58-60].
The term affective states represents a wide range of human emotional and social behavior, including emotions, moods, attitudes, mental and knowledge states, beliefs, desires and more. This means that several affective states are likely to occur simultaneously and to change asynchronously over time. Nuances of affective states and subtle dynamic changes also occur frequently. The recording and annotation of affective data pose various challenges. Although it is desirable to annotate data recorded in real settings, most settings evoke only a limited range of affective states. The majority of affective states and affective behavior comprises nuances of expression that relate to subtle affective states; often, threshold values of various features distinguish between affective states when gradual transitions are tracked over time. The various affective states can occur simultaneously and change asynchronously during the course of a sustained interaction, and even during the course of a single utterance or sentence. Therefore, the annotation of affective states is more complicated and less reliable than, for example, the annotation of objects in an image. The annotation depends not only on the variety of affective behavioral expressions due to personality and context, but also on the perception of the annotator, which is affected by personality, cultural background, emotional and social intelligence, and so on. So the problem is not only one of finding synonyms within an extended ontology, but rather of reaching agreement regarding the conceptual content; this is exacerbated by the co-occurring nature of affective states. Various methods have been devised for the annotation of affective states. These usually rely on a pre-defined taxonomy, either associating data samples with the given labels, or recording samples for each label in advance and then verifying that the recorded samples indeed represent their labels. The aim for the technology, though, is to represent a large variety of affective states, their variations, transitions and nuances. As described in the section on annotation, there are various taxonomic methods for describing affective states and the relations between them [57]. The taxonomic method chosen here is the prototype method, which provides the widest range of affective states in a manner comprehensible both to humans and to machines. The Mind Reading taxonomy and database comprise 514 affective states grouped into 24 meaning groups (the results presented here are based on a beta version which comprised 712 affective states). Each affective state in the database is represented by six sentences, which is not enough for statistical learning. On the other hand, each meaning group comprises many affective states that share a meaning. Using groups of affective states as the classes has several advantages [17, 31, 59, 60]: it increases the number of samples per class, and since the annotation of single affective states is not accurate, gathering several affective states that share a meaning, together with their respective samples, increases the overall accuracy of the wider class. It also means that the application can be shared between cultures that share the broader class definitions, if not the individual affective-state concepts; i.e. the term thinking exists in most cultures, but various types of thinking sometimes belong only to a specific culture, for example wool-gathering in UK English.
Recognizing the entire set of 712 affective states is not necessarily desirable for human-computer interfaces; in addition, descending to such a detailed description level may reduce the accuracy of the annotation and reduces the number of labeled samples per label. Therefore:
• A set of classes to be recognized should be defined to facilitate various applications of the classification system.
• The system should be easily adaptable to new speakers.
Instead, an extension of the prototype approach was used. Nine affective-state groups were chosen for the described system. These were recognized as common and relevant to the analysis of various human-computer interactions and of everyday life: confidence, concentration, disagreement, excitement, interest, joy, stress, thinking and uncertainty. Different applications may need different sets of affective states, therefore:
• The classification system should be built in a modular and adaptive manner.
The dynamic nature and co-occurrences of affective states translate into two demands on the classification system:
• Multi-label classification is required.
• For tracking dynamic changes, the classification output should present multiple levels of each of the inferred affective states.
All the taxonomic methods, including the chosen Mind Reading taxonomy and annotated database, associate each speech sample (sentence) with a single label, therefore:
• The classification should be semi-blind multi-label classification, i.e. the input is a single label for each sample, while the output can comprise several labels for each sample.
Initial observations on a database of naturally evoked sustained interactions, Doors, revealed two characteristics that affected the classification [58]:
• Different sets of features distinguish between different pairs of affective states.
• Transitions between affective states are gradual, i.e. there is a continuous transition along the features that distinguish them, and a threshold signifies the actual transition (when listeners recognize a different affective state).
All these issues define the requirements of the classification process. The classification process and its relations to the choice of taxonomy, to the requirements posed by the desired range of applications, and to the data characteristics are schematically described in Figure 4.
Fig. 4 A schematic description of the connections between the various stages of the classification process, the chosen taxonomy, and the acquisition of training, testing and validation data.
Pre-processing
As in most other modalities, the first stage of the pre-processing was the extraction of vocal features from each sentence or utterance (a meaningful speech unit whose duration is limited by the length of the sentence or by breathing stops). The extracted features include the fundamental frequency (pitch, the vibration rate of the vocal cords, whose modulation creates the intonation); spectral content, using a Bark filter-bank, which divides the spectrum into bands of increasing width according to loudness perception models; the intensity of the speech signal (energy); and harmonic properties such as consonance and dissonance, combinations of frequencies that are pleasant or unpleasant, respectively. These harmonic properties define the voice quality. The features were extracted over short overlapping time frames [58]. The values of these features change asynchronously during an utterance, as can be seen in Figure 5, which shows intonation curves and spectral content of sentences with the same text, spoken by the same speaker but with different affective content. Each utterance was therefore parsed into different temporal regions according to combinations of the extracted features. For classification, temporal metrics were defined, based on definitions from linguistics and musicology, to describe: the values of the features in the temporal regions; the size and shape of these regions; size and shape relations between the different regions; and the relations between feature values in different regions. A few of these metrics correspond to properties
such as tempo and melody; others correspond to linguistic definitions such as syllables and stress. In total, 173 statistical characteristics (median, standard deviation, etc.) of these metrics were used as input attributes to the classification system.
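As a rough illustration of this frame-based processing, the sketch below frames a toy signal, estimates a crude per-frame fundamental frequency by autocorrelation together with a per-frame energy, and then computes statistical characteristics of these metrics. The chapter's actual feature set (Bark-band spectra, harmonic properties, temporal-region parsing) is considerably richer; everything here is an illustrative stand-in.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a signal into short overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def f0_autocorr(frame, sr=16000, fmin=75, fmax=400):
    """Crude fundamental-frequency estimate via autocorrelation."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(sr) / sr                       # one second of toy "speech"
x = np.sin(2 * np.pi * 120 * t) * np.hanning(len(t))

frames = frame_signal(x)
f0 = np.array([f0_autocorr(f, sr) for f in frames])
energy = (frames ** 2).mean(axis=1)

# Statistical characteristics of the per-frame metrics, in the spirit of
# the 173 input attributes described above (median, std, etc.).
features = [np.median(f0), f0.std(), np.median(energy), energy.std()]
print(np.round(features, 3))
```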
Fig. 5 Curves of features extracted over overlapping time frames for the duration of a sentence. The sentences were uttered by the same speaker with the same text but with different affective states. (a) Spectral content of 5 sentences (short-time Fourier transform over overlapping time frames; red regions mean higher energy in specific frequency ranges along different parts of each sentence). (b) Intonation curves of 4 sentences (labeled uncertain, testing, cheered and down): the fundamental frequency calculated over overlapping time frames.
Although these metrics can be grouped according to intermediate-level labels, such as tempo and melody, these do not supply enough information for direct conceptual categorization, as used in music genre annotation or conceptual object recognition, or for generating rules and HMM classifiers, as in the case of facial expressions, because such direct relations were not accurately defined in the psycho-acoustic literature and could not be directly recognized in the available data. Normalization (μ=0, σ=1) of each metric according to the speaker was performed, as sketched below. This made it possible to use metrics with values of various magnitudes and avoided bias. It also reduced inter-speaker variability, which means that changes between expressions could be compared between speakers, and the system could be used for different speakers with different speech characteristics (language, accent, gender, age) without re-training the classification system.
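Per-speaker z-score normalization of this kind can be written directly; the array shapes and values below are illustrative assumptions.

```python
import numpy as np

def normalize_per_speaker(features, speaker_ids):
    """Z-score each metric (column) within each speaker's own samples,
    so that every metric has mean 0 and standard deviation 1 per speaker."""
    out = np.empty_like(features, dtype=float)
    for spk in np.unique(speaker_ids):
        rows = speaker_ids == spk
        mu = features[rows].mean(axis=0)
        sigma = features[rows].std(axis=0)
        out[rows] = (features[rows] - mu) / np.where(sigma > 0, sigma, 1.0)
    return out

# 6 utterances x 173 metrics from two speakers (toy values).
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=3.0, size=(6, 173))
speakers = np.array([0, 0, 0, 1, 1, 1])
z = normalize_per_speaker(features, speakers)
print(np.round(z[speakers == 0].mean(axis=0)[:3], 6))   # ~0 per speaker
```

Classification
The classification process was based on one-against-one (pair-wise) classification, as schematically described in Figure 6.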
Different features distinguish different pairs of affective-state groups. Therefore, each binary classifier was trained and optimized independently, in terms of both its feature set and its classification algorithm. An iterative process of feature selection and training was repeated until a "good enough" classification, in terms of cross-validation, precision and symmetry of the true-positive recognition between the two compared classes, was achieved [60]. The two classification algorithms finally used for the binary classifiers were linear SVM [65] and the C4.5 decision tree [48]. These yielded results comparable to many other methods and are simple to implement. In addition, these two methods are based on finding the border between classes rather than focusing on class centers, in accordance with the observation that thresholds distinguish between classes. The feature selection process confirmed the observation that different sets of features distinguish different pairs of classes. On average, ten features were required per binary classifier, but overall nearly all the features were used. The average tenfold cross-validation accuracy for the 36 machines was 75%.
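A sketch of this per-pair optimization follows, with scikit-learn stand-ins: a CART decision tree in place of C4.5, univariate feature selection in place of the chapter's iterative selection procedure, and a toy dataset in place of the 173 speech metrics.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)           # stand-in for the 173 metrics
machines = {}
for a, b in combinations(np.unique(y), 2):  # n(n-1)/2 pairwise machines
    rows = np.isin(y, [a, b])
    Xp, yp = X[rows], y[rows]
    best = None
    for clf in (LinearSVC(), DecisionTreeClassifier()):
        # Per-pair feature selection and per-pair choice of algorithm.
        pipe = make_pipeline(SelectKBest(f_classif, k=2), clf)
        score = cross_val_score(pipe, Xp, yp, cv=10).mean()
        if best is None or score > best[0]:
            best = (score, pipe.fit(Xp, yp))
    machines[(a, b)] = best[1]
    print((a, b), round(best[0], 3))
```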
Fig. 6 A schematic description of the one-against-one classification process.
Because several affective states can co-occur, a threshold was defined as the selection criterion, i.e. all the affective-state groups (labels) whose ranking was above the threshold were associated with the expression in the examined sentence or utterance. The selection threshold was set at over one standard deviation above the mean number of machines, i.e. a label was selected if it won in at least six of its eight binary classifiers. From a set of samples of 93 affective states chosen to represent the nine affective-state groups, 60 percent were used for training and testing the binary classifiers, and 40 percent were used for verification of the combined classifier. The overall accuracy was 83 percent true-positive recognition for the original class labels. The combination of inferred labels was compared to the lexical definitions of the examined affective states. Combinations of labels (multi-label classification) were found even for affective states and samples that were used for training with a single label.
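The threshold selection over the pairwise votes can be sketched as follows; the vote counts are toy numbers (each of the nine groups takes part in eight pairwise comparisons, 36 machines in total), and the exact ranking procedure of the chapter may differ in detail.

```python
import numpy as np

def multilabel_from_votes(votes):
    """Select all labels whose pairwise-vote count exceeds the mean vote
    count by more than one standard deviation (semi-blind multi-label
    output from the one-against-one machines)."""
    votes = np.asarray(votes, dtype=float)
    threshold = votes.mean() + votes.std()
    return np.flatnonzero(votes > threshold)

# Votes for 9 affective-state groups from 36 pairwise machines.
votes = [7, 2, 3, 6, 1, 4, 2, 8, 3]
print(multilabel_from_votes(votes))   # -> [0 7]
```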
Verification and Generalization
The validation and generalization were an important part of the methodology, because the data were not fully annotated with multiple labels, and there is no inherent connection between the labels, such as a nested hierarchical connection or an expected logical co-occurrence, that could be statistically analyzed. In order to check the validity of the classifier for a wide range of affective states, the system was applied to the entire Mind Reading database, which comprises 4400 sentences of 712 affective states, grouped into 24 meaning groups. A Friedman test was used to check the ranking resulting from the inference results of all six sentences that represent each of the affective states. This requires that all the labels be ranked in a consistent manner. The test yielded significant results for about 360 affective states. The Friedman test is not general enough for the case of co-occurring affective states, because not all the labels can be consistently recognized in all the samples that represent an affective state: each sentence represents a variation of the conveyed concept, as perceived by the speaker and by the annotators, and depends on the context in which the concept appeared. Therefore, a second threshold was applied, in order to locate the labels that were consistently recognized, and the labels that were consistently not recognized, over the majority of the sentences representing an affective state. The selection threshold was again set at over one standard deviation above the mean number of sentences, i.e. the double-threshold procedure picked the labels that were selected by at least six machines in at least four of the six sentences. A complementary criterion was also applied, choosing the labels that were recognized by fewer than three machines in at least four of the six sentences (a sketch of this double-threshold selection appears at the end of this subsection). Although this may seem like statistics over small numbers, the double threshold was applied to the inference results of the entire Mind Reading database, and around 570 affective states were consistently characterized by at least one affective-state-group label. The label combinations inferred by the double-threshold procedure were compared to the lexical definitions of the characterized affective states, and were found to be correct in over 80 percent of the 570 cases. These are not small numbers. This analysis also revealed that concepts of affective states that can be described as combinations of other affective states are also expressed as combinations of their expressions, in addition to other connections between affective states and their expressions. The distinguishing capabilities of the system were compared to human performance on an independent test [60]. The system, trained on an English database, was later applied to the analysis of sustained human-computer interactions in Hebrew [59]. In this case, the sum of the pair-wise comparisons was considered directly, to monitor the level of the nine affective-state groups along the consecutive utterances. To verify that the system can indeed be used in this case, test repetitions by each speaker during the interactions were analyzed and were found to be statistically correlated with events during the interaction. In addition, simultaneous occurrences were found between the verbal content of spontaneous speech utterances and the inference results, and between the inference results and major changes in the values of physiological cues such as skin conductivity.
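A sketch of the double-threshold procedure, under the assumption that the per-sentence machine votes are available as a matrix; the counts are toy numbers.

```python
import numpy as np

def double_threshold(machine_counts, hi_machines=6, lo_machines=3,
                     min_sentences=4):
    """Per affective state: machine_counts[s, l] is the number of binary
    machines (out of 8) that voted for label l in sentence s (6 sentences).
    Returns labels consistently present and consistently absent."""
    counts = np.asarray(machine_counts)
    present = (counts >= hi_machines).sum(axis=0) >= min_sentences
    absent = (counts < lo_machines).sum(axis=0) >= min_sentences
    return np.flatnonzero(present), np.flatnonzero(absent)

# Toy example: 6 sentences x 4 labels for one affective state.
counts = np.array([[7, 1, 4, 0],
                   [6, 2, 5, 1],
                   [8, 1, 3, 2],
                   [6, 0, 6, 1],
                   [5, 2, 4, 0],
                   [7, 1, 2, 3]])
present, absent = double_threshold(counts)
print(present, absent)   # -> [0] [1 3]
```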
9 Summary
This chapter presented the main methods of multi-class and multi-label classification. These methods can be applied to a large variety of applications and research fields that relate to human knowledge, cognition and behavior. An example was given of a system that infers co-occurring affective states from their non-verbal expressions in speech. Unlike other fields, the field of affective states has no definite "ground truth" for verification of the annotation. In addition, the choice of taxonomy has an important effect on the design. The classification method had to accommodate two further requirements: first, different features distinguish different pairs of classes (feature sparsity); second, several levels of recognition were required. These requirements are common to various cues of human behavior and knowledge, and to a very wide variety of applications in various modalities. The classification results reflect shades of affective states and nuances of expression, and not only their detection. They also reveal connections between complex concepts and complex behavioral expressions, which may contribute to the understanding of human affective and social behavior and its development. This example shows that such methods can indeed contribute to a wide range of research and applications.
Acknowledgments. The author thanks Deutsche Telekom Laboratories at Ben-Gurion University of the Negev for their partial support of this research.
References
[1] Aggarwal, C.C.: Data streams: An overview and scientific applications. In: Gaber, M.M. (ed.) Scientific Data Mining and Knowledge Discovery: Principles and Foundations. Springer, Heidelberg (2010)
[2] Allwein, E.L., Schapire, R., Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. J. of Machine Learning Research 1, 113–141 (2001)
[3] Al-Naymat, G.: Data mining and discovery of astronomical knowledge. In: Gaber, M.M. (ed.) Scientific Data Mining and Knowledge Discovery: Principles and Foundations. Springer, Heidelberg (2010)
[4] Amit, Y., Dekel, O., Singer, Y.: A boosting algorithm for label covering in multilabel problems. In: Proc. AISTATS (2007)
[5] Barutcuoglu, Z., Schapire, R.E., Troyanskaya, O.G.: Hierarchical multi-label prediction of gene function. Bioinformatics 22(7), 830–836 (2006)
[6] Baron-Cohen, S., Golan, O., Wheelwright, S., Hill, J.J.: Mindreading: The interactive guide to emotions. Jessica Kingsley Limited, London (2004), http://www.jkp.com
[7] Bengio, S., Pereira, F., Singer, Y., Strelow, D.: Group sparse coding. In: Proc. NIPS (2009)
[8] Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004)
[9] Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 324–345 (1952)
[10] Bredensteiner, E.J., Bennett, K.P.: Multicategory classification by support vector machines. Computational Optimization and Applications 12, 53–79 (1999)
[11] Chen, M.Y., Christel, M., Hauptmann, A., Wactlar, H.: Putting active learning into multimedia applications: dynamic definition and refinement of concept classifiers. In: Proc. of the 13th Annual ACM International Conference on Multimedia, MULTIMEDIA 2005, pp. 902–911. ACM Press, New York (2005)
[12] Crammer, K., Singer, Y.: On the algorithmic implementation of multi-class kernel-based vector machines. J. Machine Learning Research 2, 265–292 (2001)
[13] Dekel, O., Manning, C.D., Singer, Y.: Log-linear models for label ranking. In: Proc. NIPS (2003)
[14] Dekel, O., Shamir, O.: Multiclass-multilabel learning when the label set grows with the number of examples. Technical Report MSR-TR-2009-163, Microsoft Research (2009)
[15] Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.: Emotional speech: towards a new generation of databases. Speech Communication 40, 33–60 (2003)
[16] Ekman, P.: Basic emotion. In: Power, M., Dalgleish, T. (eds.) Handbook of Cognition and Emotion. Wiley, Chichester (1999)
[17] el Kaliouby, R., Robinson, P.: Real-time inference of complex mental states from facial expressions and head gestures. In: Real-Time Vision for HCI, pp. 181–200. Springer, Heidelberg (2005)
[18] Feng, S., Xu, D.: Transductive multi-instance multi-label learning algorithm with application to automatic image annotation. Expert Systems with Applications 37, 661–670 (2010)
[19] Fernández, A., Calderón, M., Barrenechea, E., Bustince, H., Herrera, F.: Solving multi-class problems with linguistic fuzzy rule based classification systems based on pairwise learning and preference relations. Fuzzy Sets and Systems 161, 3064–3080 (2010)
[20] Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1997)
[21] Friedman, J.: Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University (1996), http://www-stat.stanford.edu/reports/friedman/poly.ps.Z
[22] Grimm, S.: Knowledge representation and ontologies. In: Gaber, M.M. (ed.) Scientific Data Mining and Knowledge Discovery: Principles and Foundations. Springer, Heidelberg (2010)
[23] Ghamrawi, N., McCallum, A.: Collective multi-label classification. In: Proc. CIKM 2005, Bremen, Germany (2005)
[24] Gruber, T.R.: J. Knowledge Acquisition 6(2), 199–221 (1993)
[25] Grundland, M., Dodgson, N.A.: Color search and replace. In: Computational Aesthetics, EUROGRAPHICS, Girona, Spain, pp. 101–109 (2005)
[26] Hastie, T., Tibshirani, R.: Classification by pairwise coupling. The Annals of Statistics 26(1), 451–471 (1998)
[27] Hsu, C.W., Lin, C.J.: A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13(2), 415–425 (2002)
[28] Hsu, D., Kakade, S., Langford, J., Zhang, T.: Multi-label prediction via compressed sensing. In: Proc. NIPS (2009)
[29] Hu, J., Lam, K.M., Qiu, G.: A hierarchical algorithm for image multi-labeling. In: Proc. IEEE 17th International Conference on Image Processing, Hong Kong (2010)
[30] Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing: Algorithms, Architectures and Applications. Springer, Heidelberg (1990)
[31] Laurier, C., Meyers, O., Serra, J., Blech, M., Herrera, P.: Music mood annotator design and integration. In: Proc. 7th International Workshop on Content-Based Multimedia Indexing, pp. 156–161 (2009)
[32] Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. American Statistical Association 99, 67–81 (2004)
[33] Lellmann, J., Becker, F., Schnörr, C.: Convex optimization for multi-class image labeling with a novel family of total variation based regularizers. In: Proc. IEEE 12th International Conference on Computer Vision, ICCV (2009)
[34] Li, W., Han, J., Pei, J.: CMAR: Accurate and efficient classification based on multiple class association rules. In: Proc. ICDM 2001, San Jose, CA, pp. 369–376 (2001)
[35] Li, Y., Tian, Y., Duan, L.Y., Yang, J., Huang, T., Gao, W.: Sequence multi-labeling: a unified video annotation scheme with spatial and temporal context. IEEE Trans. Multimedia 12(8), 814–828 (2010)
[36] Liu, B., Hsu, H., Ma, Y.: Integrating classification and association rule mining. In: Proc. KDD 1998, New York (1998)
[37] Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: Proc. CVPR (2009)
[38] Liu, J., Li, M., et al.: Image annotation via graph learning. Pattern Recognition 42, 218–228 (2009)
[39] Lukashevich, H., Abeßer, J., Dittmar, C., Grossmann, H.: From multi-labeling to multi-domain-labeling: A novel two-dimensional approach to music genre classification. In: Proc. 10th International Society for Music Information Retrieval Conference (ISMIR 2009), pp. 459–464 (2009)
[40] Malkevitch, J.: The process of electing a president. AMS, American Mathematical Society (April 2008), http://www.ams.org/featurecolumn/archive/elections.html
[41] Marsland, S.: Machine Learning: An Algorithmic Perspective. Chapman & Hall/CRC Machine Learning & Pattern Recognition Series, FL, USA (2009)
[42] McCallum, A.: Multi-label text classification with a mixture model trained by EM. In: Proc. AAAI 1999 Workshop on Text Learning (1999)
[43] Montejo-Ráez, A., Ureña-López, L.A.: Binary classifiers versus AdaBoost for labeling of digital documents
[44] Montejo-Ráez, A., Steinberger, R., Ureña-López, L.A.: Adaptive selection of base classifiers in one-against-all learning for large multi-labeled collections, vol. 3230, pp. 1–12 (2004)
[45] Polikar, R.: Ensemble based systems in decision making. IEEE Circuits and Systems Magazine (3rd quarter) (2006)
[46] Peng, W., Gero, J.S.: Concept formation in scientific knowledge discovery. In: Gaber, M.M. (ed.) Scientific Data Mining and Knowledge Discovery: Principles and Foundations. Springer, Heidelberg (2010)
[47] Qi, G.J., et al.: Correlative multi-label video annotation. In: Proc. MM 2007, Bavaria, Germany, pp. 17–26 (2007)
[48] Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
[49] Rak, R., Kurgan, L., Reformat, M.: Multi-label associative classification of medical documents from MEDLINE. In: Proc. 4th International Conference on Machine Learning and Applications, ICMLA 2005 (2005)
[50] Sahli, N., Jabeur, N.: Knowledge discovery and reasoning in geospatial applications. In: Gaber, M.M. (ed.) Scientific Data Mining and Knowledge Discovery: Principles and Foundations. Springer, Heidelberg (2010)
[51] Scaringella, N., Zoia, G., Mlynek, D.: Automatic genre classification of music content: a survey. IEEE Signal Processing Magazine 23, 133–141 (2006)
[52] Schapire, R.E.: The strength of weak learnability. Machine Learning 5(2), 197–227 (1990)
[53] Schapire, R., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37, 297–336 (1999)
[54] Schapire, R., Singer, Y.: BoosTexter: A boosting-based system for text categorization. Machine Learning 39(2/3), 135–168 (2000)
[55] Shiraishi, Y., Fukumizu, K.: Statistical approaches to combining binary classifiers for multi-class classification. Neurocomputing 74, 680–688 (2011)
[56] Singh, M., Curran, E., Cunningham, P.: Active learning for multi-label image annotation. Technical Report UCD-CSI-2009-01, University College Dublin (2009)
[57] Sobol-Shikler, T.: Automatic inference of complex affective states. Computer Speech and Language 25, 45–62 (2011); doi:10.1016/j.csl.2009.12.005
[58] Sobol-Shikler, T.: Analysis of affective expressions in speech. Tech. report, University of Cambridge (2009)
[59] Sobol-Shikler, T.: Multi-modal analysis of human computer interaction using automatic inference of aural expressions in speech. In: Proc. IEEE International Conference on Systems, Man & Cybernetics (SMC), Singapore (2008)
[60] Sobol-Shikler, T., Robinson, P.: Classification of complex information: Inference of co-occurring affective states from their expressions in speech. IEEE Trans. Pattern Analysis and Machine Intelligence 32(7), 1284–1297 (2010); doi:10.1109/TPAMI.2009.107
[61] Sowa, J.F.: Knowledge Representation. Brooks Cole Publishing, CA (2000)
[62] Tanner, S., Stein, C., Graves, S.J.: On-board data mining. In: Gaber, M.M. (ed.) Scientific Data Mining and Knowledge Discovery: Principles and Foundations. Springer, Heidelberg (2010)
[63] Thabtah, F.A., Cowling, P., Peng, Y.: MMAC: A new multi-class, multi-label associative classification approach. In: Proc. 4th IEEE International Conference on Data Mining, ICDM 2004 (2004)
[64] Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification. IEEE Trans. Knowledge and Data Engineering (2010)
[65] Vapnik, V.: Estimation of Dependences Based on Empirical Data. Springer, Heidelberg (1982)
[66] Wang, J.Z., Li, J., Wiederhold, G.: SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Trans. Pattern Analysis and Machine Intelligence 23(9), 947–963 (2001)
[67] Wang, H., Huang, M., Wang, X.Z.: A generative probabilistic model for multi-label classification. In: Proc. 8th IEEE International Conference on Data Mining (2008)
[68] Wang, M., Zhou, X., Chua, T.S.: Automatic image annotation via local multi-label classification. In: Proc. CIVR 2008, Niagara Falls, Ontario, Canada (2008)
[69] Warrell, J., Prince, S.J.D., Moore, A.P.: Epitomized priors for multi-labeling problems
[70] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco (2000)
[71] Woitek, P., Brauer, P., Grossmann, H.: A novel tool for capturing conceptualized audio annotations. In: Proc. AM 2010, Piteå, Sweden (2010)
[72] Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
[73] Yan, R., Yang, J., Hauptmann, A.: Automatically labeling video data using multi-class active learning. In: Proc. 9th International Conference on Computer Vision (ICCV 2003), Nice, France, pp. 516–523 (2003)
[74] Yang, F., Shi, F., Wang, J.: An improved GMM-based method for supervised semantic image annotation, pp. 506–510 (2009)
[75] Yin, X., Han, J.: CPAR: Classification based on predictive association rules. In: Proc. SDM 2003, San Francisco, CA (2003)
[76] Zhang, T.: Statistical analysis of some multi-category large margin classification methods. J. Machine Learning Research 5, 1225–1251 (2004)
[77] Zhang, M.L., Zhou, Z.H.: ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 2038–2048 (2007)
[78] Zhang, T., Liu, S., Xu, C., Lu, H.: Boosted multi-class semi-supervised learning for human action recognition. Pattern Recognition (2010); doi:10.1016/j.patcog.2010.06.018
[79] Zhu, J., Hastie, T.: Kernel logistic regression and the import vector machine. J. Computational and Graphical Statistics 14, 185–205 (2005)
[80] Zhu, J., Rosset, S., Zou, H., Hastie, T.: Multi-class AdaBoost (accessed February 2011), http://www.stanford.edu/~hastie/Papers/samme.pdf
[81] Zhu, W.: Semantic scene concept learning by an autonomous agent. In: Proc. AAAI 2005 (2005)
Author Index
Bunke, Horst 5
Chertok, Michael 113
Dornaika, Fadi 25
Gerónimo, David 25
Graves, Alex 5
Hachaj, Tomasz 145
Jain, Lakhmi C. 1
Keller, Yosi 113
Liwicki, Marcus 5
López, Antonio M. 25
Nakajima, Masayuki 59
Ogiela, Lidia 39
Ogiela, Marek R. 1
Rouhani, Mohammad 25
Saito, Suguru 59
Sappa, Angel D. 25
Shima, Tetsuo 59
Sobol-Shikler, Tal 171
Trzupek, Mirosław 89