Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4673
Walter G. Kropatsch Martin Kampel Allan Hanbury (Eds.)
Computer Analysis of Images and Patterns 12th International Conference, CAIP 2007 Vienna, Austria, August 27-29, 2007 Proceedings
Volume Editors
Walter G. Kropatsch
Martin Kampel
Allan Hanbury
Pattern Recognition and Image Processing Group
Institute of Computer-Aided Automation
Vienna University of Technology
Favoritenstr. 9/1832, 1040 Vienna, Austria
E-mail: {krw,kampel,hanbury}@prip.tuwien.ac.at
Library of Congress Control Number: 2007932885
CR Subject Classification (1998): I.5, I.4, I.3.5, I.2.10, I.2.6, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-74271-9 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-74271-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12108824 06/3180 543210
Preface
It was an honor and a pleasure to organize the 12th international conference on Computer Analysis of Images and Patterns (CAIP 2007) in Vienna, Austria. CAIP has been held biennially since 1985:
Year  Location                     Organizers
1985  Berlin, Germany              R. Klette
1987  Wismar, Germany              L.P. Iaroslavskii, A. Rosenfeld, W. Wilhelmi
1989  Leipzig, Germany             K. Voss, D. Chetverikov, G. Sommer
1991  Dresden, Germany             R. Klette
1993  Budapest, Hungary            D. Chetverikov, W. Kropatsch
1995  Prague, Czech Republic       V. Hlavac, R. Sara
1997  Kiel, Germany                G. Sommer, K. Daniilidis, J. Pauli
1999  Ljubljana, Slovenia          F. Solina, A. Leonardis
2001  Warsaw, Poland               W. Skarbek
2003  Groningen, The Netherlands   N. Petkov, M. Westenberg
2005  Versailles, France           A. Gagalowicz
2007  Vienna, Austria              W. Kropatsch, M. Kampel
This year, 251 full scientific papers were submitted, of which 120 were accepted for presentation at the conference on the basis of the scientific reviews. Consequently, the competition for acceptance in the final program was tough. The accepted papers were presented either as oral presentations or as posters in the non-overlapping scientific program. Oral presentations allowed the authors to reach a large number of participants, while posters allowed for more intense scientific interaction. We tried to continue the tradition of CAIP in providing a forum for scientific exchange at a high quality level.
Three internationally recognized speakers accepted our invitation to present a stimulating research topic this year: Zygmunt Pizlo, Purdue University; Arnold Smeulders, University of Amsterdam; and Steven W. Zucker, Yale University.
We would like to thank all the program committee members and additional reviewers for their valuable feedback, which enabled the authors to further improve the quality of their work. We are grateful to our sponsors: the International Association for Pattern Recognition, the Austrian Association for Pattern Recognition, the Austrian Computer Society, and the Vienna Convention Bureau.
Many thanks go to our local support team: Sigrid Elsinger, Martin Lettner, Ernestine Zolda and Alexander Dorfmeister. We also appreciate the help of the staff members of PRIP.
June 2007
Walter Kropatsch
Martin Kampel
Allan Hanbury
CAIP 2007 Organization
General Chairs
Walter G. Kropatsch     Vienna University of Technology
Martin Kampel           Vienna University of Technology
Steering Committee
André Gagalowicz        INRIA Rocquencourt, France
Reinhard Klette         The University of Auckland, New Zealand
Walter G. Kropatsch     Vienna University of Technology, Austria
Nicolai Petkov          University of Groningen, The Netherlands
Gerald Sommer           Christian-Albrechts-Universität zu Kiel, Germany
Organizing Committee
Sigrid Elsinger         Vienna University of Technology
Martin Lettner          Vienna University of Technology
Ernestine Zolda         Vienna University of Technology
Alexander Dorfmeister   Vienna University of Technology
Sponsors
CAIP 2007 was sponsored by the following organizations:
– Austrian Association for Pattern Recognition
– Austrian Computer Society
– International Association for Pattern Recognition (IAPR)
– Vienna Convention Bureau
Program Committee
Ahmed, S.               Kingston University, UK
Andres, E.              Université de Poitiers, France
Bayro-Corrochano, E.    Cinvestav, Mexico
Brun, L.                ENSICAEN, France
Chetverikov, D.         Hungarian Academy of Sciences, Hungary
De Stefano, C.          Università di Cassino, Italy
Del Bimbo, A.           Università degli Studi di Firenze, Italy
Di Gesu, V.             University of Palermo, Italy
Eklundh, J.             Royal Institute of Technology, Sweden
Ercil, A.               Sabanci University, Turkey
Fuchs, S.               TU Dresden, Germany
Gimel’farb, G.          University of Auckland, New Zealand
Hanbury, A.             Vienna University of Technology, Austria
Hlaváč, V.              Czech Technical University, Czech Republic
Iwanowski, M.           Warsaw University of Technology, Poland
Jiang, X.               University of Münster, Germany
Jolion, J.              Lyon Research Center, France
Kampel, M.              Vienna University of Technology, Austria
Klette, R.              The University of Auckland, New Zealand
Kozera, R.              University of Western Australia, Australia
Kropatsch, W.           Vienna University of Technology, Austria
Leonardis, A.           University of Ljubljana, Slovenia
Levine, M.              McGill University, Canada
Li, X.                  University of London, UK
Niemann, H.             University of Erlangen-Nürnberg, Germany
Nyström, I.             Uppsala University, Sweden
Perner, P.              IBaI Institute, Germany
Petkov, N.              University of Groningen, The Netherlands
Pitas, I.               Aristotle University of Thessaloniki, Greece
Pizlo, Z.               Purdue University, USA
Radeva, P.              Universitat Autònoma de Barcelona, Spain
Roerdink, J.B.T.M.      University of Groningen, The Netherlands
Rousson, M.             Siemens Corporate Research, USA
Sablatnig, R.           Vienna University of Technology, Austria
Sagerer, G.             University of Bielefeld, Germany
Sanniti di Baja, G.     Istituto di Cibernetica “E.Caianiello”, Italy
Sauerbier, M.           ETH Zürich, Switzerland
Schizas, C.             University of Cyprus, Cyprus
Sebe, N.                Universiteit van Amsterdam, The Netherlands
Siebel, N.              Christian-Albrechts-University of Kiel, Germany
Smeulders, A.           Universiteit van Amsterdam, The Netherlands
Soille, P.              EC Joint Research Centre, Italy
Sommer, G.              University of Kiel, Germany
Stathaki, T.            Imperial College London, UK
Suk, M.                 Sung Kyun Kwan University, Korea
Tan, T.                 Chinese Academy of Sciences, China
Tao, D.                 The Hong Kong Polytechnic University, Hong Kong, China
ter Haar Romeny, B.     TU Eindhoven, The Netherlands
van Vliet, L.           Delft University of Technology, The Netherlands
Vento, M.               University of Salerno, Italy
Villanueva, J.          Computer Vision Center, Spain
Wechsler, H.            George Mason University, USA
Weickert, J.            Saarland University, Germany
Wilkinson, M.           University of Groningen, The Netherlands
Wojciechowski, K.       Silesian Technical University, Poland
Yuan, Y.                Aston University, UK
Additional Reviewers
Blauensteiner, P.       Vienna University of Technology, Austria
Donner, R.              Vienna University of Technology, Austria
Gedda, M.               Uppsala University, Sweden
Haxhimusa, Y.           Purdue University, USA
Ion, A.                 Vienna University of Technology, Austria
Karlsson, P.            Uppsala University, Sweden
Langs, G.               Vienna University of Technology, Austria
Lettner, M.             Vienna University of Technology, Austria
Mičušík, B.             Vienna University of Technology, Austria
Niazi, K.               Uppsala University, Sweden
Nordin, B.              Uppsala University, Sweden
Reiter, M.              Vienna University of Technology, Austria
Strand, R.              Uppsala University, Sweden
Vidholm, E.             Uppsala University, Sweden
Wildenauer, H.          Vienna University of Technology, Austria
Table of Contents
Invited Talks
Human Perception of 3D Shapes . . . . . Zygmunt Pizlo
1
Connection Geometry, Color, and Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ohad Ben-Shahar, Gang Li, and Steven W. Zucker
13
Motion Detection and Tracking
Adaptable Model-Based Tracking Using Analysis-by-Synthesis Techniques . . . . . Harald Wuest, Folker Wientapper, and Didier Stricker
20
Mixture Models Based Background Subtraction for Video Surveillance Applications . . . . . Chris Poppe, Gaëtan Martens, Peter Lambert, and Rik Van de Walle
28
Applicability of Motion Estimation Algorithms for an Automatic Detection of Spiral Grain in CT Cross-Section Images of Logs . . . . . . . . . Karl Entacher, Christian Lenz, Martin Seidel, Andreas Uhl, and Rudolf Weiglmaier
36
Deterministic and Stochastic Methods for Gaze Tracking in Real-Time . . . . . Javier Orozco, F. Xavier Roca, and Jordi Gonzàlez
45
Integration of Multiple Temporal and Spatial Scales for Robust Optic Flow Estimation in a Biologically Inspired Algorithm . . . . . . . . . . . . . . . . . Cornelia Beck, Thomas Gottbehuet, and Heiko Neumann
53
Classification of Optical Flow by Constraints . . . . . . . . . . . . . . . . . . . . . . . . Yusuke Kameda and Atsushi Imiya
61
Target Positioning with Dominant Feature Elements . . . . . . . . . . . . . . . . . . Zhuan Qing Huang and Zhuhan Jiang
69
Speeding-Up Differential Motion Detection Algorithms Using a Change-Driven Data Flow Processing Strategy . . . . . . . . . . . . . . . . . . . . . . . Jose A. Boluda and Fernando Pardo
77
Foreground and Shadow Detection Based on Conditional Random Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yang Wang
85
n-Grams of Action Primitives for Recognizing Human Behavior . . . . . Christian Thurau and Václav Hlaváč
93
Human Action Recognition in Table-Top Scenarios: An HMM-Based Analysis to Optimize the Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pradeep Reddy Raamana, Daniel Grest, and Volker Krueger
101
Grouping of Articulated Objects with Common Axis . . . . . . . . . . . . . . . . . Levente Hajder
109
Decision Level Multiple Cameras Fusion Using Dezert-Smarandache Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Esteban Garcia and Leopoldo Altamirano
117
Occlusion Removal in Video Microscopy . . . . . . . . . . . . . . . . . . . . . . . . . . . . Brian Eastwood and Russell M. Taylor II
125
A Modular Approach for Automating Video Analysis . . . . . . . . . . . . . . . . . Gayathri Nadarajan and Arnaud Renouf
133
Rectified Reconstruction from Stereo Pairs and Robot Mapping . . . . . Antonio Javier Gallego, Rafael Molina, Patricia Compañ, and Carlos Villagrá
141
Estimation Track–Before–Detect Motion Capture Systems State Space Spatial Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Przemyslaw Mazurek
149
Medical Imaging
Real-Time Active Shape Models for Segmentation of 3D Cardiac Ultrasound . . . . . Jøger Hansegård, Fredrik Orderud, and Stein I. Rabben
157
Effects of Preprocessing Eye Fundus Images on Appearance Based Glaucoma Classification . . . . . Jörg Meier, Rüdiger Bock, Georg Michelson, László G. Nyúl, and Joachim Hornegger
165
Flexibility Description of the MET Protein Stalk Based on the Use of Non-uniform B-Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Magnus Gedda and Stina Svensson
173
Virtual Microscopy Using JPEG2000 . . . . . Francisco Gómez, Marcela Iregui, and Eduardo Romero
181
A Statistical-Genetic Algorithm to Select the Most Significant Features in Mammograms . . . . . Gonzalo V. Sánchez-Ferrero and Juan Ignacio Arribas
189
Biomarker Selection System, Employing an Iterative Peak Selection Method, for Identifying Biomarkers Related to Prostate Cancer . . . . . Panagiotis Bougioukos, Dionisis Cavouras, Antonis Daskalakis, Ioannis Kalatzis, Spiros Kostopoulos, Pantelis Georgiadis, George Nikiforidis, and Anastasios Bezerianos
197
Automatic Segmentation of Femur Bones in Anterior-Posterior Pelvis X-Ray Images . . . . . Feng Ding, Wee Kheng Leow, and Tet Sen Howe
205
Assessing Artery Motion Compensation in IVUS . . . . . Debora Gil, Oriol Rodriguez-Leor, Petia Radeva, and Aura Hernàndez
213
Assessing Estrogen Receptors’ Status by Texture Analysis of Breast Tissue Specimens and Pattern Recognition Methods . . . . . Spiros Kostopoulos, Dionisis Cavouras, Antonis Daskalakis, Ioannis Kalatzis, Panagiotis Bougioukos, George Kagadis, Panagiota Ravazoula, and George Nikiforidis
221
Multimodal Evaluation for Medical Image Segmentation . . . . . Rubén Cárdenes, Meritxell Bach, Ying Chi, Ioannis Marras, Rodrigo de Luis, Mats Anderson, Peter Cashman, and Matthieu Bultelle
229
Automated 3D Segmentation of Lung Fields in Thin Slice CT Exploiting Wavelet Preprocessing . . . . . Panayiotis Korfiatis, Spyros Skiadopoulos, Philippos Sakellaropoulos, Christina Kalogeropoulou, and Lena Costaridou
237
Reconstruction of Heart Motion from 4D Echocardiographic Images . . . . . Michal Chlebiej, Krzysztof Nowiński, Piotr Ścislo, and Piotr Bala
245
Quantification of Bone Remodeling in the Proximity of Implants . . . . . . . Hamid Sarve, Carina B. Johansson, Joakim Lindblad, Gunilla Borgefors, and Victoria Franke Stenport
253
Delaunay-Based Vector Segmentation of Volumetric Medical Images . . . . . Michal Španěl, Přemysl Kršek, Miroslav Švub, Vít Štancl, and Ondřej Šiler
261
Non-uniform Resolution Recovery Using Median Priors in Tomographic Image Reconstruction Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Munir Ahmad and Andrew Todd-Pokropek
270
Detection of Postmenopausal Alteration of Bone Structure in Digitized X-rays . . . . . Constantin Vertan, Ion Ştefan, and Laura Florea
278
Blood Detection in IVUS Images for 3D Volume of Lumen Changes Measurement Due to Different Drugs Administration . . . . . David Rotger, Petia Radeva, Eduard Fernández-Nofrerías, and Josepa Mauri
285
Eigenmotion-Based Detection of Intestinal Contractions . . . . . Laura Igual, Santi Seguí, Jordi Vitrià, Fernando Azpiroz, and Petia Radeva
293
Brain Tissue Classification with Automated Generation of Training Data Improved by Deformable Registration . . . . . Daniel Schwarz and Tomas Kasparek
301
On Simulating 3D Fluorescent Microscope Images . . . . . David Svoboda, Marek Kašík, Martin Maška, Jan Hubený, Stanislav Stejskal, and Michal Zimmermann
309
Hierarchical Detection of Multiple Organs Using Boosted Features . . . . . Samuel Hugueny and Mikaël Rousson
317
Biometrics
Monitoring of Emotion to Create Adaptive Game for Children with Mild Autistic . . . . . P. Ravindra S. De Silva, Masatake Higashi, Stephen G. Lambacher, and Minetada Osano
326
A Simplified Human Vision Model Applied to a Blocking Artifact Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hantao Liu and Ingrid Heynderickx
334
Estimating Reflectance Functions Using a Cyberware 3030 Scanner . . . . . Matthew P. Dickens and Edwin R. Hancock
342
Are Younger People More Difficult to Identify or Just a Peer-to-Peer Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wai Han Ho, Paul Watters, and Dominic Verity
351
Lip Biometrics for Digit Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maycel Isaac Faraj and Josef Bigun
360
An Embedded Fingerprint Authentication System Integrated with a Hardware-Based Truly Random Number Generator . . . . . Murat Erat, Kenan Danışman, Salih Ergün, Alper Kanak, and Mehmet Kayaoglu
366
A New Manifold Representation for Visual Speech Recognition . . . . . Dahai Yu, Ovidiu Ghita, Alistair Sutherland, and Paul F. Whelan
374
Fingerprint Hardening with Randomly Selected Chaff Minutiae . . . . . Alper Kanak and İbrahim Soğukpınar
383
Wavelet-Based Fingerprint Region Selection . . . . . . . . . . . . . . . . . . . . . . . . . Almudena Lindoso, Luis Entrena, and Judith Liu-Jimenez
391
Face Shape Recovery and Recognition Using a Surface Gradient Based Statistical Model . . . . . Mario Castelán and Edwin R. Hancock
399
Representation of Facial Features by Catmull-Rom Splines . . . . . . . . . . . . Marco Maggini, Stefano Melacci, and Lorenzo Sarti
408
Automatic Quantitative Mouth Shape Analysis . . . . . Augusto Salazar, Jorge Hernández, and Flavio Prieto
416
Color
Color Adjacency Histograms for Image Matching . . . . . Allan Hanbury and Beatriz Marcotegui
424
Segmentation of Distinct Homogeneous Color Regions in Images . . . . . . . Daniel Mohr and Gabriel Zachmann
432
Estimating the Color of the Illuminant Using Anisotropic Diffusion . . . . . Marc Ebner
441
Restoration of Color Images Degraded by Space-Variant Motion Blur . . . . . Michal Šorel and Jan Flusser
450
Real-Time Elimination of Brightness in Color Images by MS Diagram and Mathematical Morphology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Francisco Ortiz
458
Curves and Surfaces Beyond 2 Dimensions
Surface Reconstruction Using Polarization and Photometric Stereo . . . . . Gary A. Atkinson and Edwin R. Hancock
466
Curvature Estimation in Noisy Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thanh Phuong Nguyen and Isabelle Debled-Rennesson
474
3D+t Reconstruction in the Context of Locally Spheric Shaped Data Observation . . . . . Wafa Rekik, Dominique Béréziat, and Séverine Dubuisson
482
Robust Fitting of 3D Objects by Affinely Transformed Superellipsoids Using Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frank Ditrich and Herbert Suesse
490
Fast and Precise Weak-Perspective Factorization . . . . . Levente Hajder, Ákos Pernek, and Csaba Kazó
498
A Graph-with-Loop Structure for a Topological Representation of 3D Objects . . . . . Rocio Gonzalez-Diaz, María José Jiménez, Belen Medrano, and Pedro Real
506
Reading Characters, Words, Lines...
Print Process Separation Using Interest Regions . . . . . Reinhold Huber-Mörk, Dorothea Heiss-Czedik, Konrad Mayer, Harald Penz, and Andreas Vrabl
514
Histogram-Based Lines and Words Decomposition for Arabic Omni Font-Written OCR Systems; Enhancements and Evaluation . . . . . . . . . . . Mohamed Attia and Mohamed El-Mahallawy
522
Semi-automatic Training Sets Acquisition for Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jerzy Sas and Urszula Markowska-Kaczmar
531
Gabor-Based Recognizer for Chinese Handwriting from Segmentation-Free Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tong-Hua Su, Tian-Wen Zhang, De-Jun Guan, and Hu-Jie Huang
539
Image Based Recognition of Ancient Coins . . . . . . . . . . . . . . . . . . . . . . . . . . Maia Zaharieva, Martin Kampel, and Sebastian Zambanini
547
Text Area Detection in Digital Documents Images Using Textural Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ilktan Ar and M. Elif Karsligil
555
Image Segmentation
Optimal Threshold Selection for Tomogram Segmentation by Reprojection of the Reconstructed Image . . . . . K. Joost Batenburg and Jan Sijbers
563
A Level Set Bridging Force for the Segmentation of Dendritic Spines . . . . . Karsten Rink and Klaus Tönnies
571
Knowledge from Markers in Watershed Segmentation . . . . . Sébastien Lefèvre
579
Image Segmentation Using Topological Persistence . . . . . . . . . . . . . . . . . . . David Letscher and Jason Fritts
587
Image Modeling and Segmentation Using Incremental Bayesian Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Constantinos Constantinopoulos and Aristidis Likas
596
Model-Based Segmentation of Multimodal Images . . . . . . . . . . . . . . . . . . . . Xin Hong, Sally McClean, Bryan Scotney, and Philip Morrow
604
Image Segmentation Based on Height Maps . . . . . . . . . . . . . . . . . . . . . . . . . Gabriele Peters and Jochen Kerdels
612
Shape
Measuring the Orientability of Shapes . . . . . Paul L. Rosin
620
A 3–Subiteration Surface–Thinning Algorithm . . . . . Kálmán Palágyi
628
Extraction of River Networks from Satellite Images by Combining Mathematical Morphology and Hydrology . . . . . . . . . . . . . . . . . . . . . . . . . . . Pierre Soille and Jacopo Grazzini
636
Fractal Active Shape Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Polychronis Manousopoulos, Vassileios Drakopoulos, and Theoharis Theoharis
645
Decomposition for Efficient Eccentricity Transform of Convex Shapes . . . Adrian Ion, Samuel Peltier, Yll Haxhimusa, and Walter G. Kropatsch
653
Euclidean Shortest Paths in Simple Cube Curves at a Glance . . . . . . . . . . Fajie Li and Reinhard Klette
661
A Fast and Robust Ellipse Detection Algorithm Based on Pseudo-random Sample Consensus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ge Song and Hong Wang
669
A Definition for Orientation for Multiple Component Shapes . . . . . Joviša Žunić and Paul L. Rosin
677
Definition of a Model-Based Detector of Curvilinear Regions . . . . . Cédric Lemaître, Johel Miteran, and Jiří Matas
686
A Method for Interactive Shape Detection in Cattle Images Using Genetic Algorithms . . . . . Horacio M. González–Velasco, Carlos J. García–Orellana, Miguel Macías–Macías, Ramón Gallardo–Caballero, and Fernando J. Álvarez–Franco
694
A New Phase Field Model of a ‘Gas of Circles’ for Tree Crown Extraction from Aerial Images . . . . . Peter Horváth and Ian H. Jermyn
702
Shape Signature Matching for Object Identification Invariant to Image Transformations and Occlusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stamatia Giannarou and Tania Stathaki
710
Junction Detection and Multi-orientation Analysis Using Streamlines . . . Frank G.A. Faas and Lucas J. van Vliet
718
Decomposing a Simple Polygon into Trapezoids . . . . . . . . . . . . . . . . . . . . . . Fajie Li and Reinhard Klette
726
Shape Recognition and Retrieval: A Structural Approach Using Velocity Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hamidreza Zaboli, Mohammad Rahmati, and Abdolreza Mirzaei
734
Improved Morphological Interpolation of Elevation Contour Data with Generalised Geodesic Propagations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jacopo Grazzini and Pierre Soille
742
Image Registration and Matching
Cylindrical Phase Correlation Method . . . . . Jakub Bican and Jan Flusser
751
Extended Global Optimization Strategy for Rigid 2D/3D Image Registration . . . . . Alexander Kubias, Frank Deinzer, Tobias Feldmann, and Dietrich Paulus
759
A Fast B-Spline Pseudo-inversion Algorithm for Consistent Image Registration . . . . . Antonio Tristán and Juan Ignacio Arribas
768
Robust Least-Squares Image Matching in the Presence of Outliers . . . . . . Patrice Delmas, Georgy Gimel’farb, Al Shorin, and John Morris
776
A Neural Network String Matcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Abdolreza Mirzaei, Hamidreza Zaboli, and Reza Safabakhsh
784
Incorporating Spatial Information into 3D-2D Image Registration . . . . . . Guoyan Zheng
792
Performance Evaluation and Recent Advances of Fast Block-Matching Motion Estimation Methods for Video Coding . . . . . . . . . . . . . . . . . . . . . . . Berenice Ramirez
801
Spectral Eigenfeatures for Effective DP Matching in Fingerprint Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Boris Danev and Toshio Kamei
809
Registering Long-Term Image Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Detlev Droege and Dietrich Paulus
817
Graph Similarity Using Interfering Quantum Walks . . . . . . . . . . . . . . . . . . David Emms, Edwin R. Hancock, and Richard C. Wilson
823
Signal Decomposition and Invariants
Visual Speech Recognition Using Motion Features and Hidden Markov Models . . . . . Wai Chee Yau, Dinesh Kant Kumar, and Hans Weghorn
832
Feature Extraction of Weighted Data for Implicit Variable Selection . . . . . Luis Sánchez, Fernando Martínez, Germán Castellanos, and Augusto Salazar
840
Analysis of Prediction Mode Decision in Spatial Enhancement Layers in H.264/AVC SVC . . . . . Koen De Wolf, Davy De Schrijver, Wesley De Neve, Saar De Zutter, Peter Lambert, and Rik Van de Walle
848
Object Recognition by Implicit Invariants . . . . . Jan Flusser, Jaroslav Kautsky, and Filip Šroubek
856
An Automatic Microarray Image Gridding Technique Based on Continuous Wavelet Transform . . . . . Emmanouil Athanasiadis, Dionisis Cavouras, Panagiota Spyridonos, Ioannis Kalatzis, and George Nikiforidis
864
Image Sifting for Micro Array Image Enhancement . . . . . . . . . . . . . . . . . . . Pooria Jafari Moghadam and Mohamad H. Moradi
871
Wavelet Based Local Coherent Tomography with an Application in Terahertz Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiao-Xia Yin, Brian W.-H. Ng, Bradley Ferguson, and Derek Abbott
878
Nonlinear Approximation of Spatiotemporal Data Using Diffusion Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marie Wild
886
A New Wavelet-Based Texture Descriptor for Image Retrieval . . . . . . . . . Esther de Ves, Ana Ruedin, Daniel Acevedo, Xaro Benavent, and Leticia Seijas
895
Space-Variant Restoration with Sliding Discrete Cosine Transform . . . . . Vitaly Kober and Jacobo Gomez Agis
903
Features and Classification
Comparative Evaluation of Classical Methods, Optimized Gabor Filters and LBP for Texture Feature Selection and Classification . . . . . Jaime Melendez, Domenec Puig, and Miguel Angel Garcia
912
A Multiple Classifier Approach for the Recognition of Screen-Rendered Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Steffen Wachenfeld, Stefan Fleischer, and Xiaoyi Jiang
921
Improving Stability of Feature Selection Methods . . . . . Pavel Křížek, Josef Kittler, and Václav Hlaváč
929
A Movie Classifier Based on Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . Hui-Yu Huang, Weir-Sheng Shih, and Wen-Hsing Hsu
937
An Efficient Method for Filtering Image-Based Spam E-mail . . . . . . . . . . Ngo Phuong Nhung and Tu Minh Phuong
945
SVM-Based Active Feedback in Image Retrieval Using Clustering and Unlabeled Data . . . . . Rujie Liu, Yuehong Wang, Takayuki Baba, Yusuke Uehara, Daiki Masumoto, and Shigemi Nagata
954
Hierarchical Classifiers for Detection of Fractures in X-Ray Images . . . . . Joshua Congfu He, Wee Kheng Leow, and Tet Sen Howe
962
An Efficient Nearest Neighbor Classifier Using an Adaptive Distance Measure . . . . . Omid Dehzangi, Mansoor J. Zolghadri, Shahram Taheri, and Abdollah Dehzangi
970
Accurate Identification of a Markov-Gibbs Model for Texture Synthesis by Bunch Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Georgy Gimel’farb and Dongxiao Zhou
979
Texture Defect Detection . . . . . Michal Haindl, Jiří Grim, and Stanislav Mikeš
987
Extracting Salient Points and Parts of Shapes Using Modified kd-Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Bauckhage
995
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1003
Human Perception of 3D Shapes

Zygmunt Pizlo
Department of Psychological Sciences, Purdue University, West Lafayette, IN 47907-2081, U.S.A.

Abstract. This paper reviews the main approaches to 3D shape perception in both human and computer vision. The approaches are evaluated with respect to how plausibly they can generate adequate explanations of human vision. The criterion for plausibility is provided by existing psychophysical results. A new theory of 3D shape perception is then outlined. According to this theory, human perception of shapes critically depends on a priori shape constraints: symmetry and compactness. The role of depth cues is secondary, at best.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 1–12, 2007.
© Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Reconstruction of a 3D shape from one or more 2D images has been a major challenge in computer vision during its 50-year history. Understanding how the human visual system accomplishes this task has been a major challenge in psychology for about a thousand years. The first written account of human shape perception can be found in Alhazen’s book published in 1083 [1]. Alhazen defined shape constancy as the fact that the percept of the shape of a given object remains constant despite changes in the shape of the object’s retinal image. The shape of the retinal image changes when the viewing orientation of the object relative to the observer changes. The shape constancy phenomenon has played a fundamental role in the study of shape perception, and has proved to be essential in understanding how humans perceive shapes veridically, i.e., as the 3D shapes really are in the physical world.

Despite the fact that the human and computer vision communities are working on the same problem, viz., formulating a theory of a seeing system, there has not been much interaction between the two. We begin with a brief explanation of why the two communities have worked separately. The main part of the paper presents the history of research on shape constancy, emphasizing the concepts, tools and theories used in both communities. The paper concludes with the presentation of a new theory of 3D shape perception.

1.1 Computer vs. Human Vision

The beginning of computer vision coincided with the Cognitive Revolution in psychology. The common assumption was that cognitive psychologists, neuroscientists, engineers and computer scientists would work together in formulating a theory of the human mind and brain, and in implementing the theory in artificial systems. It became clear, very quickly, that this task is extremely difficult. First, psychologists and
neuroscientists did not really know much about how the mind and brain work. Second, computer scientists and engineers did not really know how to implement even the simplest mental functions, such as reading handwritten text. These two difficulties were responsible for the split between the two communities. The psychological community proceeded with studying mental functions that were simple enough to be modeled, such as perceptual coding of color, contrast and depth, whereas the computer vision community proceeded with attempts to solve simple practical problems regardless of whether or not the solutions had anything to do with how the human visual system works. There was only a brief period in the 1970s during which David Marr tried to bring the two communities together by emphasizing the importance of understanding human vision as a general-purpose vision system [2]. He did not succeed, but his views on the field of vision had a lasting influence on the psychological community.

Now, after 50 years of research, the two communities are trying to join forces again and to learn about each other’s methods, results and approaches. This is especially true in Europe, as exemplified by the launch of interdisciplinary efforts under the names ECVision and euCognition. In order to understand where we are now, how we got here, and what should be done next, a review of the main approaches to 3D shape perception is presented next. The review is organized around the following topics: (i) the nature of 3D shape representation; (ii) the role of learning; (iii) the role of invariants; (iv) the role of surfaces; (v) shape parts; (vi) purposive vs. general-purpose vision; and (vii) the role of a priori constraints.
2 Review of Main Approaches

Most approaches were tried by both communities, although with different degrees of success. Only one or at most two approaches were limited to one community. This fact alone indicates the high degree of synergy between the efforts of these two communities.

2.1 Nature of Perceptual Representation of 3D Shapes

Perceptual representation of 3D shapes in the human mind is three-dimensional. There are a number of results and observations supporting this statement. For example, humans are able to mentally rotate 3D objects, as shown by Shepard and his colleagues [3]. It does not matter whether or not mental rotation is used in object recognition and whether the speed of mental rotation is constant, issues that have been discussed vigorously for the last 35 years. What really matters is that when presented with a novel 3D object, the observer has little difficulty imagining what the object looks like from a different viewing orientation. Similarly, given two different views of the same 3D object, the observer has no difficulty recognizing that they represent the same object [4]. Both phenomena are related to shape constancy. Shape constancy in the case of unfamiliar objects cannot be achieved without a 3D representation of the shapes.

There was a small group of psychologists, starting with Rock, who claimed that the perceptual representation of 3D shapes is not 3D [5]. Rock himself claimed that perceptual representation of 3D objects is only 2.5D. Recall that according to Marr, 2.5D
representation refers to visible surfaces in the viewer-centered coordinate system. According to Rock, the human visual system does not represent an object in the object-centered coordinate system. In essence, Rock claimed that humans do not perceive 3D shapes, only their surfaces. Rock’s followers modified this view by claiming that perceptual representation is only 2D [6] and that shape constancy with 3D shapes is either not achieved at all or requires learning a large number of 2D views of each object (the so-called multiple view theory [7]). This theory of shape perception was first proposed by Helmholtz [8]. It had been ignored by most subsequent psychologists, including Gestalt psychologists [9], cognitive psychologists [10], Gibson [11], and Marr [2]. The reason was that this theory was inconsistent with the fact that shape constancy is universally achieved in everyday life (regardless of whether or not the 3D shapes are familiar). The only objects with which shape constancy is not achieved (unless extensive learning is allowed) are crumpled newspapers, amoebas and paperclips [5, 6]. It is difficult to justify the need for a multiple view theory if the theory is supported only by a handful of unstructured or unnatural objects. Shape constancy performance does improve with learning when unstructured objects are used. But is learning a part of the theory of perception or of memory? More importantly, can learning do the job?

2.2 Role of Learning

The claim that the human mind at birth is a tabula rasa, and that perceptual mechanisms (algorithms) have to be learned during one’s life, originated with the British empiricists in the 17th century [12]. According to this view, a newborn infant is not able to perceive shapes, depth, motion, colors, etc. This extreme view gradually changed when confronted with the philosophy of nativism, as well as with empirical results.
Elaborating the ideas of such philosophers as Descartes, Reid and Kant, and of the physiologist Johannes Müller, Hering [13] offered a nativistic theory of the perception of color and space. Shortly after that, empirical results showed that the percept of depth and space is present very early in animal and human life. This line of research started with Thorndike, who demonstrated depth discrimination in newborn chicks [14]. Later, Hess showed that visual space perception, including binocular vision, is present at birth in chicks [15]. Experiments with young infants started with Gibson & Walk [16]. They showed that infants perceive depth very early in life. Reports of adults, born blind, whose vision had been restored and who could not recognize objects without extensive learning, were originally taken as critical proof supporting the empiricist position [17]. But now we know that the visual system can process visual information at birth [18], and that it loses this ability when visual stimulation is absent for many years. So, it is not the absence of learning, but degeneration of the neural structure, that is responsible for poor vision in such patients. Similarly, experiments with distorting and inverting prisms, which were supposed to demonstrate that visual perception of space can be re-learned, showed that if there is any learning or adaptation, it takes place in the motor, not in the visual system [19]. Recent brain imaging experiments confirmed these results. The spatial representation in the brain is not reorganized when the visual input is altered by wearing inverting prisms [20]. It is now well established and commonly accepted in the psychological community that the mechanisms underlying perception of contrast, color, motion, texture, depth,
and shape are innate, not learned. In particular, results with newborn infants demonstrated that they do perceive objects in 3D space, the way we do [21]. The percept of a 3D object produced by its 2D retinal image involves a priori constraints, and these constraints are innate, not learned. Early demonstrations by Ames [22] and Bruner & Goodman [23], suggesting the operation of learning in size and shape perception, have been shown to be flawed (e.g., see Bruner’s 1973 comments about his own early experiments [24]). Some cross-cultural studies reported differences in visual perception, but these differences can most likely be explained the same way as the results on adults born blind (see above). Namely, we all see things the same way when we are born, but if some class of stimuli is absent for an extended period after birth, the visual system may lose the ability to process it adequately. Losing perceptual abilities due to impoverished visual input is not exactly what proponents of learning theories had in mind.

So, why are we still talking about perceptual learning? Why is learning a dominant approach in the computer vision community? Learning theories are convenient: instead of figuring out how the stimulus should be analyzed by a computer, we let the computer learn. In psychophysics, perceptual learning can be reliably demonstrated only in a handful of tasks, such as hyperacuity. Discrimination thresholds do improve with learning, and this improvement is related to the visual system’s ability to “learn” the magnitude of visual noise in a given part of the retina. Perceptual learning is also present when viewing fragmented pictures, like the Dalmatian. Considering the fact that these memory effects do not generalize to new images, it is likely that they represent the modulation of visual attention by memory. So, the effect of memory is not on how the image is analyzed in the brain, but rather on which parts of the image the observer attends to.
If perceptual mechanisms are innate, they had to be acquired (learned) during evolution. It follows that “learning” may be understood more broadly, avoiding the traditional distinction between nativism and empiricism. The problem with this view is that we do not have a good theory of evolutionary learning. Therefore, talking about this as “learning” is likely to lead to confusion. If learning were an important factor in perception, learning would have been the main experimental methodology in visual psychophysics. And it is not. The fact that we all perceive our environment the same way (i.e., individual differences are extremely small), regardless of where we were born and what education we received, suggests one more time that visual experience plays little or no role in visual perception. There are no “good vs. bad perceptionists” in the same sense as there are good vs. bad football or chess players.

2.3 Invariants

Gibson [25] was the first to propose that visual space and shape perception involve projective invariants, such as the cross ratio of four collinear points. According to Gibson, the retinal images of a binocular, actively moving observer provide direct information about the 3D scene. Depth cues, as advocated by Marr, and a priori constraints, as advocated by Gestaltists, are not needed because invariants can be computed from the images themselves (or, as Gibson put it, from the visual array). Systematic research on the use of invariants in computer vision did not start until 1988 [26]. In the case of planar figures, projective invariants can be computed from a single image. In the
case of solid objects, a single 2D image allows computation of projective invariants for only some classes of objects, such as a subset of polyhedra [27]. These invariants are called model-based. In order to compute projective invariants for an arbitrary object, one needs at least two images. An advantage of projective invariants is that they do not require the cameras to be calibrated. However, the cameras are often calibrated, and in such cases two perspective images allow the reconstruction of Euclidean invariants, such as angles and ratios of distances [28]. In fact, the human eyes are “calibrated cameras”: the visual system “knows” the geometry of the eyeballs. It follows that projective invariants have limited application in human and computer vision. More specifically, they have never been seriously considered as a model of human vision because they cannot distinguish among shapes that are easily distinguished by humans [29]. Perspective invariants provide a better model of human shape perception. They are model-based in the sense that they can establish perspective equivalence between an object and its image without reconstructing the object [30-31]. These invariants have been shown to provide an adequate theory of shape recognition by humans [32]. Invariants, however, do not explain how a 3D percept is formed in the case of a single 2D image of an unfamiliar object. But we know that humans always perceive the 3D scene as 3D, even if the scene is viewed with one eye. A theory of 3D shape reconstruction is needed.

2.4 Surfaces and Depth Cues

The first attempt to provide a general theory of vision was made by Marr [2]. Marr’s approach was motivated by Julesz’ work [33]. Julesz, using random dot stereograms, demonstrated that binocular perception of 3D surfaces is possible without any monocular information about the surfaces. Marr took this as evidence that figure-ground organization is not needed and, in fact, not performed by the visual system.
Recall that figure-ground organization refers to establishing which regions and contours in the image represent individual objects, as opposed to background. Instead, according to Marr, the visual system assigns 3D distances and orientations to every point on the retina, based on depth cues such as binocular disparity, motion, texture, and shading. These local distances and orientations represent a description of visible surfaces of objects “out there.” Marr called this description the 2.5D sketch, and it was a critical element in his theory of shape reconstruction. Once the visible surfaces of an object are known, the visual system should be able to represent them in the object-centered coordinate system and use this representation to build a 3D model of the shape of this object.

But is 3D shape perceptually derived from its surfaces? Mathematically, this seems like a natural approach. Object-centered properties, such as the curvature of the surface, are derived from the orientations of the surfaces. But perceptually, shape may not be a derivative of surfaces. The first direct test of Marr’s claim was performed only recently. Li & Pizlo [34] asked subjects to judge (i) the 3D orientations of the faces of a polyhedron, and (ii) the shapes of the faces. Each type of information is sufficient to reconstruct the 3D shape of the polyhedron. These reconstructions should represent the subject’s percept, as long as the visual system actually performs such reconstructions. Specifically, if 3D shape percept is derived from orientations of surfaces, both types
of judgments should lead to the same 3D shapes. They did not. The perceived 3D shape was consistent with the perceived shapes of its faces, but not with the perceived 3D orientations of the faces. The difference was large and systematic. The authors concluded that the 3D shape percept is produced by applying simplicity constraints, such as symmetry, to the 2D shapes on the retina, not by using orientations of surfaces. In fact, the 3D shape, once perceived, provides information about its surfaces. So, contrary to the common wisdom, shape is a cue to surfaces, not the other way around. But note that before simplicity constraints can be applied, the 2D shape on the retina must be established. Human 3D shape perception involves figure-ground organization, not a 2.5D sketch.

2.5 Shape Parts

Theories of shape perception before 1985 treated shape as one of many perceptual properties and assumed that shape would be explained the same way as depth, speed and size. It was assumed that contextual information is always critical because the retinal image of a given property is always ambiguous. Consider the size of an object. The size in the camera or retinal image depends on the viewing distance. If the viewing distance is not known, the object’s size is also unknown. So, before the size can be judged, the viewing distance has to be estimated. The viewing distance has nothing to do with the physical size of the object. Instead, it represents context, which, in the case of size perception, must be “taken into account.” But shape is different from all other perceptual properties. Shape is unique because it is complex and structured. Shape, unlike size, distance or color, requires very many parameters to describe. It follows that not all information about shape is destroyed by perspective projection to the retina. In fact, only a small fraction of this information is destroyed, at least if something can be assumed about the 3D objects “out there”, for example that the objects are symmetric.
This idea was first implemented by Biederman [35] in his Recognition-By-Components (RBC) theory of object recognition. Biederman assumed that most objects consist of simple volumetric parts. The parts formed a small subset of generalized cones and each was characterized by one or more symmetries. Because of the simplicity of the parts, they could be recognized from a single image. Once several parts of an object are recognized, they can be used to identify the entire object. Similar theories were proposed by Pentland [36] and Dickinson et al. [37]. Note that RBC theory is very different from a theory of any other perceptual property. The theory uses simple parts, but there are a number of them and their spatial arrangement, relative to one another, can vary a lot. As such, the theory capitalizes on the complexity of the shapes of objects. Surfaces do not have to be reconstructed and depth cues do not play any role. Despite the initial promise, however, this theory did not lead to a working algorithm. No one was able to show how to actually find simple parts in real images. One possible explanation of the failure is that parts of real objects are not necessarily simple. As a result, the fit between the predefined simple parts and images of real parts was never good enough. What proponents of this approach seem to have missed is that a priori simplicity constraints may be effective not only when 3D objects are simple. The only thing that is required is that 3D objects be simpler than their 2D images. This is definitely true in the case of symmetric objects: any symmetric objects, including complex ones.
2.6 Purposive Vision

Difficulties in solving the big problem of a general-purpose vision system, posed by Marr, suggested to students of computer vision that it may be more realistic to solve smaller problems for individual applications. In particular, a robot should be active and should intelligently search for new information relevant to the task at hand, and test perceptual hypotheses, rather than passively try to reconstruct the entire 3D scene [38]. The term “purposive” was used in 1932 by Tolman, a behaviorist, who tried to revise behaviorism by adding cognitive elements to it [39]. Later on, the term was used by Wiener and his colleagues when they were trying to convince fellow engineers and psychologists that goal-directed behavior can be studied scientifically [40]. The idea that the visual system is like a scientist who tests hypotheses was put forth by Gregory [41]. At present, it is quite well established that human behavior and thinking are purposive. But there is no evidence that vision is purposive, that is, that the perceptual representation of a 3D scene depends on the purposes or motivations of the observer. It is true that visual attention can be affected by the observer’s purposes; the observer will look where the information is most likely to be found. Furthermore, the fact that the visual system is able to analyze only a few objects at a time implies that the observer will form the 3D percept of the attended objects first, before reconstructing the rest of the 3D scene. But except for these two limitations, the human visual system is a general-purpose system.

2.7 A Priori Constraints

The fact that perception involves the operation of a priori simplicity constraints had been recognized by Gestalt psychologists in the first half of the previous century [9,42]. Gestalt ideas did not lead, right away, to formal theories because the mathematical tools were not there.
A precise definition of simplicity requires the language of information theory, which was not presented until 1948 [43]. Computer simulations, which are essential in testing computational models of vision, could not be performed before 1941. The theory of inverse problems and a general method of solving them were not formulated until 1963, and were made available in the English literature much later [44]. Finally, the explicit claim that inverse problems, regularization theory, Bayesian inference and simplicity constraints are all needed to explain vision was not put forth until 1985 [45]. More than half a century passed between the origin of Gestalt ideas and the first serious implementation of them. During that time, these ideas had been criticized and dismissed for various reasons.

Is human visual perception a difficult inverse problem? Does it have to be regularized? Does it involve simplicity constraints? The answer to all of these questions is in the affirmative, and this is true especially in the case of 3D shape perception [46]. When presented with a 2D image of a natural 3D scene, an observer perceives the scene as 3D. One can verify this by closing one eye and looking around. An empiricist argument, according to which we see 3D objects as 3D because we know these objects, can easily be rejected by looking at photographs that contain unfamiliar objects. Clearly, the visual system solves a difficult inverse problem because there are infinitely many 3D scenes that could produce any given 2D image. Would the problem get easier if more images were provided? This question was tested in several
studies [47-49]. For unstructured objects, like 3D polygonal lines and asymmetric polyhedra in which the visible contours are non-planar, shape constancy completely fails regardless of whether the subject views an object with one eye closed or with two eyes, and whether the object is rotating or not. For other objects, such as symmetric polyhedra, shape constancy is close to perfect under all viewing conditions. These results suggest that depth cues are neither necessary nor sufficient for achieving shape constancy, that is, for seeing shapes as they really are “out there.” What is both necessary and sufficient for achieving shape constancy is the availability of effective a priori constraints. When the retinal or camera image is produced by a symmetric object whose contours are planar and whose compactness is large, shape constancy is reliably achieved. Do real objects in the natural environment satisfy these constraints? It is difficult to answer this question at this point, but it is quite likely that the answer is “yes.” To the extent that this answer is accurate, the human visual system is a general-purpose visual system.

Obviously, there are cases when a priori constraints may lead to large perceptual errors. Such cases are not easy to find in the natural world, but are easy to design in the laboratory. This is what was done by Pizlo et al. [50]. The subjects in their experiment binocularly viewed a rotating stimulus. The stimulus was designed in such a way that a priori constraints (symmetry and compactness) were consistent with the percept of a rigidly rotating object, whereas motion and binocular disparity were consistent with a non-rigid object being stretched and then compressed along the line of sight of one eye. The subjects saw a rigidly rotating object, indicating that when a priori constraints and depth cues provide conflicting information, depth cues are ignored and the percept is determined by the constraints.
If the a priori shape constraints used by the visual system are indeed consistent with the shapes of real objects in the natural environment, the visual system should rely on these constraints, rather than on depth cues. It does.
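The inverse-problem view of Section 2.7 is often written in a generic regularization form, sketched below together with its Bayesian reading. This is a textbook-style illustration, not the specific model proposed in this paper, and every symbol is an assumption introduced here: X is a candidate 3D shape, I is the given 2D image, \Pi is the projection from 3D to 2D, C(X) is a cost that penalizes departures from the simplicity constraints (for example asymmetry or low compactness), and \lambda weighs the constraints against fidelity to the image.

```latex
% Regularized inverse problem: agree with the image, prefer simple shapes.
% All symbols are illustrative assumptions, not notation from the paper.
\hat{X} \;=\; \arg\min_{X} \; \bigl\| \Pi(X) - I \bigr\|^{2} \;+\; \lambda \, C(X)

% Bayesian reading: the data term plays the role of the likelihood,
% the constraint term plays the role of the prior.
p(X \mid I) \;\propto\; p(I \mid X)\, p(X)
```

Without the second term, infinitely many shapes X satisfy \Pi(X) = I; the constraint term is what selects one of them.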
3 A New Theory of 3D Shape Reconstruction

The main question becomes what the a priori constraints for shape are, rather than which cues are combined in the perception of 3D surfaces. Several constraints have already been identified. One of the most important is symmetry. When a 3D object is mirror symmetric, its single 2D orthographic image determines a one-parameter family of 3D shapes [51]. A single 2D perspective image leads to a unique reconstruction, although the reconstruction is expected to be unstable in the presence of visual noise [48,52]. Many objects are mirror symmetric: animals, humans, man-made objects. Symmetry is important: in the presence of gravity, mirror symmetric objects are mechanically stable. How difficult is it to determine whether a given image was produced by a mirror symmetric object? Not very difficult. If distinctive points can be identified, then the line segments connecting pairs of symmetric points are parallel in an orthographic image. If distinctive points cannot be identified, but surface contours can, and the contours are at least piecewise planar, then the planar parts of the contours are affine equivalent. Once it is known that a 2D image was produced by a symmetric object, the 3D reconstruction is substantially easier. The symmetry itself drastically reduces the family of possible 3D reconstructions from arbitrarily many
degrees of freedom to only one. There are other types of symmetries: translational and rotational. When a 2D contour is translated in 3D space, one produces an object called a generalized cone (or generalized cylinder, or sweep) [53]. Such objects are also easy to reconstruct because of their translational symmetry. The same is true of objects that have rotational symmetries. If a 3D object has two symmetries, then its reconstruction is unique [51]. Again, in the presence of visual noise, the reconstruction is likely to be unstable. This means that other constraints are needed as well, either to stabilize the reconstruction or to produce a unique solution in the first place. The most straightforward constraint is provided by the likelihood function [54]. The 3D reconstructed object should not be too long in the depth direction because small changes in the viewing orientation would then lead to large changes in its image. It follows that the likelihood function will favor objects that are short in the depth direction.

Another constraint, not related to likelihood, is maximum compactness. 3D compactness is defined as V²/S³, where V and S are the object’s volume and surface area. Maximal compactness corresponds to maximizing volume for a given surface area, or minimizing surface area for a given volume. The 3D compactness constraint has never been used before in shape reconstruction models. Its first application was presented by Li & Pizlo [34] and Pizlo et al. [55]. These studies showed that maximal compactness leads to unique and stable reconstructions. Furthermore, the reconstructions tend to be accurate. The 3D compactness is a generalization of the 2D compactness that was used in the past [56]. 3D compactness is also a generalization of the minimum variance of angles constraint. The main drawback of the minimum variance of angles constraint is that it can only be applied to polyhedra. Both compactness and minimum variance of angles give the object its volume.
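The compactness measure V²/S³ defined above can be illustrated with a small numerical sketch. The code below is a toy example, not the reconstruction model of [34,55]: the helper names and the family of rectangular boxes are assumptions chosen for illustration. It compares a sphere, a cube, and a box elongated in depth, and then searches a one-parameter family of boxes (fixed 1 x 1 front face, unknown depth) for the member with maximal compactness.

```python
import math


def compactness_3d(volume: float, surface_area: float) -> float:
    """Return the 3D compactness V^2 / S^3 (the ratio does not depend on scale)."""
    return volume ** 2 / surface_area ** 3


def box_volume_and_area(width: float, height: float, depth: float):
    """Volume and surface area of a rectangular box (hypothetical helper)."""
    volume = width * height * depth
    area = 2.0 * (width * height + height * depth + depth * width)
    return volume, area


def sphere_volume_and_area(radius: float):
    """Volume and surface area of a sphere (hypothetical helper)."""
    volume = 4.0 / 3.0 * math.pi * radius ** 3
    area = 4.0 * math.pi * radius ** 2
    return volume, area


# Compactness of three reference shapes.
sphere_c = compactness_3d(*sphere_volume_and_area(1.0))
cube_c = compactness_3d(*box_volume_and_area(1.0, 1.0, 1.0))
long_box_c = compactness_3d(*box_volume_and_area(1.0, 1.0, 10.0))

# Toy one-parameter family: the 1 x 1 front face is "given by the image",
# the extent in depth is the single unknown. Pick the most compact member.
candidate_depths = [0.1 * k for k in range(1, 51)]  # 0.1, 0.2, ..., 5.0
best_depth = max(
    candidate_depths,
    key=lambda d: compactness_3d(*box_volume_and_area(1.0, 1.0, d)),
)

print(f"sphere: {sphere_c:.6f}  cube: {cube_c:.6f}  1x1x10 box: {long_box_c:.6f}")
print(f"most compact depth for a 1 x 1 front face: {best_depth:.1f}")
```

In this toy family the search selects the cube (depth equal to the side of the front face), which mirrors, in a much simplified setting, how a compactness criterion can pick a single member out of a one-parameter family of candidate shapes. The sphere attains the value 1/(36π) and the cube 1/216.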
Because all objects have volume, volume does not have to be derived from depth cues such as binocular disparity or motion parallax. It can be assumed. Only such unstructured objects as a crumpled newspaper or a polygonal line do not have volume. Note that volume does not have to be physically present. When looking at a Necker cube, we “see” its volume and surfaces even though it is a wire object. The same is true of other objects such as chairs and tables. In such cases, compactness can be computed for the convex hull of the object, rather than for the object itself.

When a 3D object is asymmetric, its 3D shape can still be reconstructed by applying the compactness constraint and the likelihood function. Reconstruction is much easier if at least some contours on the object’s surface are planar or piecewise planar. Planarity plays the same role as symmetry: it reduces the parameter space of the family of possible 3D shapes. When an object is symmetric, the planarity constraint allows the reconstruction of those features on the symmetric object whose symmetric counterparts are occluded. As a result, one is often able to reconstruct the entire 3D shape, including its back (occluded) surfaces. It is expected that in the case of smoothly curved objects, planarity can be replaced by a smoothness constraint.

Preliminary psychophysical experiments showed that a 3D shape reconstruction model involving these three constraints (symmetry, compactness and planarity), in conjunction with the likelihood function, is able to account for the subjects’ percepts [29,34,48,52,54-55]. The model can explain not only the cases where the percept is veridical, but also cases where subjects make large errors. In this new model, depth cues are not used. They are not needed. The shape constraints are applied to a segmented 2D image. Specifically, the 3D shape constraints are applied to the 2D shapes
on the image. It follows that shape reconstruction presupposes that the figure-ground organization has been successfully established. The new model was tested on synthetic and real images. It works well in the presence of image noise. However, before we can claim that we have solved the big problem of formulating a theory of a general-purpose vision system, we have to develop a good model of figure-ground organization. Figure-ground organization is itself a difficult inverse problem whose solution depends on the operation of a priori constraints. Finding out what these constraints are is the main question in human and computer vision.

Acknowledgements. This work was partially supported by the National Science Foundation and the Department of Energy.
References
[1] Alhazen: The optics. Books 1–3 (Translated by Sabra, A.I., The Warburg Institute, London) (1083/1989)
[2] Marr, D.: Vision. W.H. Freeman, New York (1982)
[3] Shepard, R.N., Cooper, L.A.: Mental images and their transformations. MIT Press, Cambridge, MA (1982)
[4] Biederman, I., Gerhardstein, P.C.: Recognizing depth-rotated objects: Evidence and conditions for three-dimensional viewpoint invariance. Journal of Experimental Psychology: HP&P 19, 1162–1182 (1993)
[5] Rock, I., DiVita, J.: A case of viewer-centered object perception. Cognitive Psychology 19, 280–293 (1987)
[6] Tarr, M.J., Williams, P., Hayward, W.G., Gauthier, I.: Three-dimensional object recognition is viewpoint dependent. Nature Neuroscience 1, 275–277 (1998)
[7] Poggio, T., Edelman, S.: A network that learns to recognize three-dimensional objects. Nature 343, 263–266 (1990)
[8] von Helmholtz, H.: Treatise on Physiological Optics (translated from German, J.P.C. Southall). Thoemmes, Bristol (1910/2000)
[9] Koffka, K.: Principles of Gestalt Psychology. Harcourt, Brace, New York (1935)
[10] Hochberg, J.: Perception. Prentice-Hall, Englewood Cliffs, NJ (1978)
[11] Gibson, J.J.: The Ecological Approach to Visual Perception. Houghton Mifflin, Boston (1979)
[12] Locke, J.: An essay concerning human understanding. Clarendon, Oxford (1690/1975)
[13] Hering, E.: Beiträge zur Physiologie. Heft 1. Engelmann, Leipzig (1861)
[14] Thorndike, E.: The instinctive reactions of young chicks. Psychological Review 6, 191–282 (1899)
[15] Hess, E.H.: Space perception in the chick. Scientific American 195, 71–80 (1956)
[16] Gibson, E., Walk, R.: The visual cliff. Scientific American 202, 64–71 (1960)
[17] von Senden, M.: Space and sight. Methuen, London (1932/1960)
[18] Hubel, D.H., Wiesel, T.N.: Receptive fields of cells in striate cortex of very young, visually inexperienced kittens. Journal of Neurophysiology 26, 994–1002 (1963)
[19] Rock, I., Harris, C.S.: Vision and touch. Scientific American 216, 96–104 (1967)
[20] Linden, D.E.J., Kallenbach, U., Heinecke, A., Singer, W., Goebel, R.: The myth of upright vision. A psychophysical and functional imaging study of adaptation to inverting spectacles. Perception 28, 469–481 (1999)
Human Perception of 3D Shapes
[21] Slater, A., Morison, V.: Shape constancy and slant perception at birth. Perception 14, 337–344 (1985)
[22] Kilpatrick, F.P.: Explorations in transactional psychology. New York Univ. Press, NY (1961)
[23] Bruner, J.S., Goodman, C.C.: Value and need as organizing factors in perception. Journal of Abnormal and Social Psychology 42, 33–44 (1947)
[24] Bruner, J.S.: Beyond the information given; studies in the psychology of knowing. Norton, NY (1973)
[25] Gibson, J.J.: The perception of the visual world. Houghton Mifflin, Boston (1950)
[26] Weiss, I.: Projective invariants of shapes. In: Proceedings of DARPA Image Understanding Workshop, Cambridge, MA, pp. 1125–1134 (1988)
[27] Rothwell, C.A.: Object recognition through invariant indexing. Oxford University Press, Oxford (1995)
[28] Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1981)
[29] Pizlo, Z.: 3D shape: its unique place in visual perception. MIT Press, Cambridge, MA (2008)
[30] Pizlo, Z., Rosenfeld, A.: Recognition of planar shapes from perspective images using contour-based invariants. Computer Vision, Graphics & Image Processing: Image Understanding 56, 330–350 (1992)
[31] Pizlo, Z., Loubier, K.: Recognition of a solid shape from its single perspective image obtained by a calibrated camera. Pattern Recognition 33, 1675–1681 (2000)
[32] Pizlo, Z.: A theory of shape constancy based on perspective invariants. Vision Research 34, 1637–1658 (1994)
[33] Julesz, B.: Foundations of cyclopean perception. University of Chicago Press, Chicago (1971)
[34] Li, Y., Pizlo, Z.: Is viewer-centered representation necessary for 3D shape perception. Journal of Vision 6 (2006) (Abstract 268)
[35] Biederman, I.: Human image understanding: recent research and a theory. Computer Vision, Graphics and Image Processing 32, 29–73 (1985)
[36] Pentland, A.P.: Perceptual organization and the representation of natural form. Artificial Intelligence 28, 293–331 (1986)
[37] Dickinson, S., Pentland, A., Rosenfeld, A.: From volumes to views: An approach to 3-D object recognition. CVGIP: Image Understanding 55, 130–154 (1992)
[38] Aloimonos, Y.: Purposive active vision, CVGIP: Image Understanding, pp. 840–850 (1992)
[39] Tolman, E.C.: Purposive behavior in animals and men. Century, NY (1932)
[40] Rosenblueth, A., Wiener, N., Bigelow, J.: Behavior, purpose and teleology. Philosophy of Science 10, 18–24 (1943)
[41] Gregory, R.L.: Perceptions as hypotheses. Philosophical Transactions of the Royal Society, London B290, 181–197 (1980)
[42] Wertheimer, M.: Principles of perceptual organization. In: Beardslee, D.C., Wertheimer, M. (eds.) Readings in Perception, pp. 115–135. D. van Nostrand, NY (1923/1958)
[43] Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27, 379–423, 623–656 (1948)
[44] Tikhonov, A.N., Arsenin, V.Y.: Solutions of ill-posed problems. John Wiley & Sons, New York (1977)
[45] Poggio, T., Torre, V., Koch, C.: Computational vision and regularization theory. Nature 317, 314–319 (1985)
[46] Pizlo, Z.: Perception viewed as an inverse problem. Vision Research 41, 3145–3161 (2001)
[47] Pizlo, Z., Stevenson, A.K.: Shape constancy from novel views. Perception & Psychophysics 61, 1299–1307 (1999)
[48] Chan, M.W., Stevenson, A.K., Li, Y., Pizlo, Z.: Binocular shape constancy from novel views: the role of a priori constraints. Perception & Psychophysics 68, 1124–1139 (2006)
[49] Li, Y., Pizlo, Z.: Monocular and binocular perception of 3D shape: the role of a priori constraints. Journal of Vision 5 (2005) (Abstract 521)
[50] Pizlo, Z., Li, Y., Francis, G.: A new look at binocular stereopsis. Vision Research 45, 2244–2255 (2005)
[51] Vetter, T., Poggio, T.: Symmetric 3D objects are an easy case for 2D object recognition. In: Tyler, C.W. (ed.) Human symmetry perception and its computational analysis, pp. 349–359. Lawrence Erlbaum, Mahwah, NJ (2002)
[52] Chan, M.W., Pizlo, Z., Chelberg, D.M.: Binocular shape reconstruction: psychological plausibility of the 8 point algorithm. Computer Vision & Image Understanding 74, 121–137 (1999)
[53] Binford, T.O.: Visual perception by computer. IEEE Conference on Systems and Control. Miami (1971)
[54] Li, Y., Pizlo, Z.: Reconstruction of 3D symmetrical shapes by using planarity and compactness constraints. Annual meeting of the Vision Sciences Society, Sarasota, FL (Abstract 934) (2007)
[55] Pizlo, Z., Li, Y., Steinman, R.M.: A new paradigm for 3D shape perception. Perception 35 (2006) (ECVP abs)
[56] Brady, M., Yuille, A.: Inferring 3D orientation from 2D contour (an extremum principle). In: Richards, W. (ed.) Natural computation, pp. 99–106. MIT Press, Cambridge, MA (1983)
Connection Geometry, Color, and Stereo
Ohad Ben-Shahar1, Gang Li2, and Steven W. Zucker3
1 Computer Science, Ben Gurion University, Israel
[email protected]
2 Real-Time Vision and Modeling Dept., Siemens Corp. Research, Princeton, NJ
[email protected]
3 Computer Science, Yale University, New Haven, CT
[email protected]
Abstract. The visual systems in primates are organized around orientation with a rich set of long-range horizontal connections. We abstract this from a differential-geometric perspective, and introduce the covariant derivative of frame fields as a general framework for early vision. This paper overviews our research showing how curve detection, texture, shading, color (hue), and stereo can be unified within this framework.
1 Introduction
Early vision is normally thought of as a collection of tasks, including edge detection, texture analysis, and stereo. The tools that are brought to bear to solve these tasks differ as well, with edge detection normally conceptualized in a signal-detection context, texture in the context of statistics for local image patches, and stereo as a problem in projective geometry. The integration of these tasks is normally accomplished with higher-level models. However, since the individual tasks are formulated in terms that differ from one another, these higher-level models are difficult to formulate in formal terms.
We have been pursuing a more unified approach. The motivation originally derived from our study of the visual systems in primates, especially the early cortical visual area V1. This is where orientation, as in edge orientation, is first abstracted, and it plays a key role in specifying the functional architecture, or layout, of cortex. While other feature dimensions, such as direction of motion or spatial frequency, are also important, for space limitations we shall not consider them here. The remaining organizing element, eye of origin, is of course necessary for stereo.
In this short paper we simply overview our work, with a focus on the geometry that runs through the early vision problems of edge detection, oriented texture and shading analysis, stereo and color. Our goal is to highlight the (differential) geometric framework that is common to all of these problems.
In the next section we introduce several concepts from modern differential geometry, especially the covariant derivative, the Frenet equations, and frame fields. Curvature emerges as the central connection between tangent orientations at nearby positions. We then illustrate the quantization of curvature for curves, which gives rise to co-circularity. Extending this to 3-D allows us to formulate the stereo problem for space curves. This provides a new framework for stereo, integrating position and orientation disparity, and illustrates how the frame-field structure can elaborate our problem formulations. The full 2-D frame field is the natural setting for oriented textures, and this involves two curvatures, one in the tangential and one in the normal direction. Finally, the extension to color, or at least hue, is sketched.
We stress that this presentation is intended as an overview of our work and as an entry to the literature. All of the results are described in greater detail, with complete references to related research, in the papers listed as references at the end. We hope that placing these pieces in juxtaposition to one another will illustrate our confidence in, and excitement about, the frame-field representation for early vision. We recommend consulting these more complete papers for the full story.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 13–19, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Connection Geometry in the Plane
A unit length tangent vector $E(q)$ attached to point $q = (x, y) \in \mathbb{R}^2$ is the natural representation of orientation in the plane. Attaching such a vector to points of interest (e.g., along a smooth curve or oriented texture) yields a unit length vector field. Assuming smoothness, an infinitesimal translation along the vector $V$ from $q$ yields a small rotation in the vector $E(q)$. A frame $\{E_T, E_N\}$, placed at the point $q$ with $E_T$ identified with $E(q)$, allows us to apply techniques from differential geometry (Fig. 1).
[Figure 1: diagram labeled with the frame vectors $E_T = E(q)$ and $E_N$, the displacement $V$, the angle $\theta$, and the point $q$.]
Fig. 1. Illustration of the geometry behind the connection equation and the covariant derivative
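Nearby tangents are displaced both in position and orientation according to the covariant derivatives, $\nabla_V E_T$ and $\nabla_V E_N$, which can also be represented as vectors in the basis $\{E_T, E_N\}$:
\[
\begin{pmatrix} \nabla_V E_T \\ \nabla_V E_N \end{pmatrix}
=
\begin{pmatrix} w_{11}(V) & w_{12}(V) \\ w_{21}(V) & w_{22}(V) \end{pmatrix}
\begin{pmatrix} E_T \\ E_N \end{pmatrix}. \tag{1}
\]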
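Note: the 1-forms $w_{ij}(V)$ are functions of the displacement $V$. Since the basis $\{E_T, E_N\}$ is orthonormal, they are skew-symmetric: $w_{ij}(V) = -w_{ji}(V)$. Thus $w_{11}(V) = w_{22}(V) = 0$ and the system reduces to the connection equation formulated by Cartan [1]:
\[
\begin{pmatrix} \nabla_V E_T \\ \nabla_V E_N \end{pmatrix}
=
\begin{pmatrix} 0 & w_{12}(V) \\ -w_{12}(V) & 0 \end{pmatrix}
\begin{pmatrix} E_T \\ E_N \end{pmatrix}. \tag{2}
\]
$w_{12}(V)$, the connection form, is linear in $V$, so it can be represented in terms of the frame $\{E_T, E_N\}$:
\[
w_{12}(V) = w_{12}(a\,E_T + b\,E_N) = a\,w_{12}(E_T) + b\,w_{12}(E_N),
\]
giving rise to the scalars
\[
\kappa_T \doteq w_{12}(E_T), \qquad \kappa_N \doteq w_{12}(E_N), \tag{3}
\]
which we interpret as tangential ($\kappa_T$) and normal ($\kappa_N$) curvatures. Specializing to the one-dimensional case of curves, only $\nabla_{E_T}$ is necessary:
\[
\begin{pmatrix} \nabla_{E_T} E_T \\ \nabla_{E_T} E_N \end{pmatrix}
=
\begin{pmatrix} 0 & w_{12}(E_T) \\ -w_{12}(E_T) & 0 \end{pmatrix}
\begin{pmatrix} E_T \\ E_N \end{pmatrix}. \tag{4}
\]
Now $T$, $N$, and $\kappa$ can replace $E_T$, $E_N$, and $\kappa_T$, respectively, and this is the classical Frenet equation (primes denote derivatives with respect to arclength):
\[
\begin{pmatrix} T' \\ N' \end{pmatrix}
=
\begin{pmatrix} 0 & \kappa \\ -\kappa & 0 \end{pmatrix}
\begin{pmatrix} T \\ N \end{pmatrix}. \tag{5}
\]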
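The two curvatures can be made concrete with a minimal sketch. It assumes the frame is written as E_T = (cos θ, sin θ) and E_N = (−sin θ, cos θ) for a smooth orientation function θ(x, y); under that assumption the connection form is the differential of θ, so the tangential and normal curvatures are simply the directional derivatives of θ along E_T and E_N. The function name and the hand-supplied partial derivatives are illustrative choices, not part of the original formulation.

```python
import math


def frame_curvatures(theta, dtheta_dx, dtheta_dy):
    """Return (kappa_T, kappa_N) at one point of an orientation field.

    theta      -- orientation of E_T at the point (radians)
    dtheta_dx  -- partial derivative of theta with respect to x
    dtheta_dy  -- partial derivative of theta with respect to y
    """
    # Assumed frame: E_T = (cos t, sin t), E_N = (-sin t, cos t).
    e_t = (math.cos(theta), math.sin(theta))
    e_n = (-math.sin(theta), math.cos(theta))
    # w12(V) = dtheta(V): rotation of the frame per unit step along V.
    kappa_t = dtheta_dx * e_t[0] + dtheta_dy * e_t[1]
    kappa_n = dtheta_dx * e_n[0] + dtheta_dy * e_n[1]
    return kappa_t, kappa_n


# Concentric circles around the origin: theta = atan2(y, x) + pi/2.
x, y = 2.0, 0.0
r2 = x * x + y * y
kt, kn = frame_curvatures(math.atan2(y, x) + math.pi / 2, -y / r2, x / r2)
print(kt, kn)
```

For the concentric-circle flow the tangential curvature equals the reciprocal of the radius and the normal curvature vanishes, which matches the reduction to the Frenet case for a single curve.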
3 Curves and Co-circularity
The original application of these ideas was to develop compatibility coefficients for a relaxation labeling process for curve detection in images; see [2]. The basic idea is that local estimates of an image curve, or approximations to the tangent, are obtained at all positions and all orientations. (These correspond to the local measurements of orientation in visual cortex of primates.) Consistent tangent estimates support one another, while inconsistent ones detract support. (This corresponds to the computation supported by the neural substrate of long-range horizontal connections.) The goal is to find the collection of tangents that maximizes support. (For a technical definition of support, see [3].) From the geometric perspective above, consistency can be interpreted directly in terms of transport along the osculating circle (a local second-order approximation to the curve at each point); see Fig. 2.
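Transport along the osculating circle can be sketched as follows. This is a minimal geometric illustration, not the compatibility coefficients of [2]: it assumes a tangent at the origin, moves it a given arclength along a circle of curvature kappa, and measures how far a neighboring tangent deviates from the co-circular orientation, using the fact that two tangents of the same circle make equal angles with the chord joining them. The function names are hypothetical.

```python
import math


def transport_along_circle(theta, kappa, s):
    """Move arclength s along the osculating circle of a tangent at the origin.

    Returns the new position (x, y) and the transported orientation.
    """
    if abs(kappa) < 1e-12:            # straight-line limit
        dx, dy = s, 0.0
    else:
        dx = math.sin(kappa * s) / kappa
        dy = (1.0 - math.cos(kappa * s)) / kappa
    c, sn = math.cos(theta), math.sin(theta)
    # Rotate the canonical circle (tangent along +x) by theta.
    return (c * dx - sn * dy, sn * dx + c * dy), theta + kappa * s


def cocircularity_error(theta_i, offset, theta_j):
    """Angular mismatch in [0, pi/2] between theta_j and the orientation
    that is co-circular with theta_i at the given offset (x, y)."""
    phi = math.atan2(offset[1], offset[0])   # direction of the chord
    expected = 2.0 * phi - theta_i           # equal angles with the chord
    d = (theta_j - expected) % math.pi       # orientations are modulo pi
    return min(d, math.pi - d)


pos, th = transport_along_circle(0.3, 0.5, 1.2)
print(cocircularity_error(0.3, pos, th))
```

A transported tangent has (numerically) zero error; perturbing its orientation increases the error, which is the kind of quantity a compatibility function would penalize.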
4 (Oriented) Textures and Co-helicity
We now utilize the full differential geometry in the plane. In the neighborhood of an orientation within a texture, there is a full set of possible orientations; if we move in any direction within the texture the orientation can change. This implies the need for compatibility functions that are fully 2-D, and these are illustrated in Fig. 3. The construction is a direct extension of co-circularity with a helicoid (in position, orientation space) as the generalization of the osculating circle for curves. Two curvatures are necessary. See [6].
[Figure 2: diagram with labels "The osculating circle approximates a curve in the neighborhood of a point", "True image curve", "Local tangent", "Compatible tangent", "Incompatible tangent"; point q; axes x, y.]
Fig. 2. The geometry of co-circularity for image curves. (top) The osculating circle provides an approximation to a curve locally via curvature. (bottom) Different quantizations of curvature indicate which tangent estimates should reinforce one another. In effect these “precompute” the different transports.
Fig. 3. Analysis of oriented textures. (top) Co-helical compatibilities for oriented textures. These are defined in a local neighborhood around the central tangent. For textures this neighborhood is 2-D. Note that now there is variation in orientation (i.e., curvature) in both the tangential and the normal directions, and that singularities arise naturally. (bottom) A Brodatz texture; initial measurements of orientation within the center-of-interest; relaxed (consistent) tangent vector field.
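To illustrate what a helicoidal orientation model looks like, the following sketch evaluates one orientation function whose graph in (x, y, θ) space is a helicoid and whose tangential and normal curvatures at the origin are the two prescribed values. The specific formula is an assumption of this sketch and should not be read as a quotation of [6]; the function name is hypothetical.

```python
import math


def cohelical_orientation(x, y, k_t, k_n):
    """Orientation at offset (x, y) from a central tangent placed at the
    origin with orientation 0, tangential curvature k_t and normal
    curvature k_n (assumed helicoidal model)."""
    # atan2 keeps the value defined where the denominator changes sign,
    # which is where the model has its singularity.
    return math.atan2(k_t * x + k_n * y, 1.0 + k_n * x - k_t * y)


# Central differences at the origin recover the two curvatures.
h = 1e-6
k_t, k_n = 0.2, -0.1
d_dx = (cohelical_orientation(h, 0, k_t, k_n) - cohelical_orientation(-h, 0, k_t, k_n)) / (2 * h)
d_dy = (cohelical_orientation(0, h, k_t, k_n) - cohelical_orientation(0, -h, k_t, k_n)) / (2 * h)
print(d_dx, d_dy)
```

Because the central tangent points along the x axis, the x derivative at the origin is the tangential curvature and the y derivative is the normal curvature; setting the normal curvature to zero leaves a one-parameter family comparable to the co-circular case.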
5 Analysis of Hue
One normally thinks of color in (red, green, blue) coordinates. However, when the color is mapped to the psychologically more useful (intensity, hue, saturation) coordinates, the color circle emerges. Considering only hue, the value at each pixel can be represented as a vector (which points to the proper location on the hue circle). The geometry is now close to that for oriented textures, modulo π vs 2π. Compatibility fields, which earlier were drawn among vectors, can now be drawn among hues. Notice that hue can change slightly with movement in either the tangential or the normal directions so, again, two (hue) curvatures are needed. The resulting system can be used for denoising [4]; for segmentation; and to provide a basis for color constancy [5].
[Figure 4: color solid with axes V, H, S and labels White, Red, Yellow, Green, Cyan, Blue, Magenta.]
Fig. 4. Geometry of color and hue flows. (top) The intensity-hue-saturation representation for color makes the hue circle explicit. The hue at each pixel can therefore be represented as a vector, and similar geometry to that for oriented textures emerges. (middle) An apple image with the hue represented at each pixel. (bottom) Hue compatibilities now indicate how hue varies in the tangential and the normal directions around the central pixel. They can be used for noise cleaning and object segmentation.
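The mapping from a pixel color to a vector on the hue circle can be sketched with Python's standard colorsys module. This is only an illustration under the assumption that HSV hue is an acceptable stand-in for the hue coordinate used in the text; the helper names are hypothetical. Unlike orientation, hue is periodic with period 2π, so differences are wrapped to a full circle.

```python
import colorsys
import math


def hue_vector(r, g, b):
    """Unit vector pointing to the hue of an RGB color (components in [0, 1])."""
    h, _s, _v = colorsys.rgb_to_hsv(r, g, b)   # h lies in [0, 1)
    angle = 2.0 * math.pi * h
    return math.cos(angle), math.sin(angle)


def hue_difference(rgb_a, rgb_b):
    """Signed hue change from color a to color b, wrapped to [-pi, pi]."""
    h_a = colorsys.rgb_to_hsv(*rgb_a)[0]
    h_b = colorsys.rgb_to_hsv(*rgb_b)[0]
    d = 2.0 * math.pi * (h_b - h_a)
    return math.atan2(math.sin(d), math.cos(d))


print(hue_vector(1.0, 0.0, 0.0))
print(hue_difference((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```

Differences of this kind between a pixel and its neighbors in the tangential and normal directions are the raw material from which the two hue curvatures would be estimated.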
6 Stereo
Stereo correspondence involves both differential and projective geometry. Classical projective geometry is well known in computer vision; here we stress how continuity of smooth objects (in this case curves, but also surfaces) can supplement the epipolar constraint for matching, and can supersede the ordering and other heuristic constraints. We formulate the basic transport operation in $\mathbb{R}^3$, for which we need to extend the Frenet equations to add torsion, or deviation from the osculating plane:
\[
\begin{pmatrix} T' \\ N' \\ B' \end{pmatrix}
=
\begin{pmatrix} 0 & \kappa & 0 \\ -\kappa & 0 & \tau \\ 0 & -\tau & 0 \end{pmatrix}
\begin{pmatrix} T \\ N \\ B \end{pmatrix}. \tag{6}
\]
The central observation is that when the standard frontal-parallel plane assumption is violated, higher-order disparities are introduced. It is these new disparities that are most useful. To illustrate, consider the tangent component of the Frenet 3-frame. Note that this projects to a pair of (2-D) tangents, one in the left image and one in the right. They will have a classical spatial disparity as well as the higher-order orientation disparity; see Fig. 5. The compatibility functions and transport are defined in 3-D, and thus are implemented as relationships between pairs of tangents. The stereo system for space curves is described in [7]; the extension to surfaces is in [8].
Fig. 5. The geometry of stereo. (top) A space curve in 3-D projects into the left and the right images. Shown are two (space) tangents, each of which projects to a pair of (image) tangents. Compatibilities are thus defined over pairs of tangents, and include orientation as well as positional disparities. (bottom) A complex arrangement of twigs. The left and right images are shown, as is the stereo reconstruction from two viewpoints.
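The appearance of an orientation disparity can be checked numerically with a small sketch. It assumes an idealized rectified rig, two pinhole cameras with parallel optical axes along +z, focal length f, and centers separated by a baseline along x; none of these settings or function names come from [7], they are chosen only for illustration.

```python
import math


def project_tangent(p, t, cam_x, f=1.0):
    """Project a 3-D point p with tangent t into a pinhole camera centered
    at (cam_x, 0, 0). Returns the image point and the image tangent angle."""
    x, y, z = p[0] - cam_x, p[1], p[2]
    u, v = f * x / z, f * y / z
    # Derivative of the projection in the direction t (quotient rule).
    du = f * (t[0] * z - x * t[2]) / (z * z)
    dv = f * (t[1] * z - y * t[2]) / (z * z)
    return (u, v), math.atan2(dv, du)


def tangent_disparities(p, t, baseline=0.1, f=1.0):
    """Return (positional disparity, orientation disparity) of a space tangent."""
    (u_left, _), ang_left = project_tangent(p, t, -baseline / 2.0, f)
    (u_right, _), ang_right = project_tangent(p, t, +baseline / 2.0, f)
    d = ang_left - ang_right
    return u_left - u_right, math.atan2(math.sin(d), math.cos(d))


point = (0.0, 0.0, 2.0)
print(tangent_disparities(point, (0.0, 1.0, 0.0)))   # fronto-parallel tangent
s = math.sqrt(0.5)
print(tangent_disparities(point, (0.0, s, s)))       # tangent slanted in depth
```

For the fronto-parallel tangent only the positional disparity is nonzero; once the tangent has a depth component the two image tangents differ in orientation as well, which is the higher-order disparity discussed above.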
7 Summary and Conclusions
In this lecture we attempted to illustrate how the geometry of interactions for smooth objects provides a unifying theme for many of the problems in early vision. The application to the neurobiology of long-range horizontal connections in the first visual cortical area can be found in [9].
References
1. O’Neill, B.: Elementary Differential Geometry. Academic Press, San Diego (1966)
2. Parent, P., Zucker, S.W.: Trace inference, curvature consistency, and curve detection. IEEE Trans. Pattern Analysis and Machine Intelligence 11, 823–839 (1989)
3. Hummel, R.A., Zucker, S.W.: On the foundations of relaxation labeling processes. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-5, 267–287 (1983)
4. Ben-Shahar, O., Zucker, S.W.: Hue fields and color curvatures: A perceptual organization approach to color image denoising. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR 03, Madison, WI (2003)
5. Ben-Shahar, O., Zucker, S.W.: Hue geometry and horizontal connections. Neural Networks 17(5-6), 753–771 (2004)
6. Ben-Shahar, O., Zucker, S.W.: The Perceptual Organization of Texture Flow: A Contextual Inference Approach. IEEE Trans. Pattern Analysis and Machine Intelligence 25(4), 401–417 (2003)
7. Li, G., Zucker, S.W.: Contextual Inference in Contour-Based Stereo Correspondence. Int. J. of Computer Vision 69(1), 59–75 (2006)
8. Li, G., Zucker, S.W.: Differential Geometric Consistency Extends Stereo to Curved Surfaces. In: Proc. 9th European Conference on Computer Vision, Graz, Austria, May 7–13 (2006)
9. Ben-Shahar, O., Zucker, S.W.: Geometrical computations explain projection patterns of long-range horizontal connections in visual cortex. Neural Computation 16(3), 445–476 (2004)
Adaptable Model-Based Tracking Using Analysis-by-Synthesis Techniques
Harald Wuest1, Folker Wientapper2, and Didier Stricker2
1 Centre for Advanced Media Technology (CAMTech), Nanyang Technological University (NTU), 50 Nanyang Avenue, Singapore 649812
2 Department of Virtual and Augmented Reality, Fraunhofer IGD, TU Darmstadt, GRIS, Germany
[email protected]
http://www.igd.fhg.de/igd-a4/
Abstract. In this paper we present a novel analysis-by-synthesis approach for real-time camera tracking in industrial scenarios. The camera pose estimation is based on the tracking of line features which are generated dynamically in every frame by rendering a polygonal model and extracting contours out of the rendered scene. Different methods of the line model generation are investigated. Depending on the scenario and the given 3D model either the image gradient of the frame buffer or discontinuities of the z-buffer and the normal map are used for the generation of a 2D edge map. The 3D control points on a contour are calculated by using the depth value stored in the z-buffer. By aligning the generated features with edges in the current image, the extrinsic parameters of the camera are estimated. The camera pose used for rendering is predicted by a line-based frame-to-frame tracking which takes advantage of the generated edge features. The method is validated and evaluated with the help of ground-truth data as well as real image sequences.
1 Introduction
One of the key challenges for augmented reality applications is the real-time estimation of the camera pose. Several markerless tracking approaches exist which use either lines [1,2,3] or point features [4,5] or a mixture of both [6,7] to determine the camera pose. Tracking point features, as in the well-known KLT tracker [8], can be very promising when the scene consists of many well-textured planar regions. In industrial scenarios, unfortunately, the objects to be tracked are often poorly textured and consist of reflecting materials, so that methods based on point features often produce insufficient results. Line features are more robust against illumination changes and reflections and can therefore be more appropriate for the camera pose estimation.
In this paper we present a model-based tracking method which generates a 3D line model at runtime out of any arbitrary surface model of an object and uses this line model to track the object. The benefit of this approach is that the generated view-dependent line model does not contain any occluded edges and it always consists of lines with an appropriate level of detail. Furthermore, since a line model is generated in every frame, the tracking of silhouette edges of round objects like pipes is possible. In the line model generation step a polygonal model of the object is rendered with a predicted camera pose and contours are extracted by analyzing discontinuities of the z-buffer and the normal buffer or by detecting edges in the frame buffer. The fast creation of an edge map is performed on the graphics hardware. The generated 3D line model is the input for a line-based tracking method, where the extrinsic camera parameters are estimated by minimizing the error between the lines of the model and strong maxima of the image gradient. An application for this tracking approach is an augmented reality system for the maintenance of industrial facilities.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 20–27, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Line Model Generation
The main objective of the line model generation step is to acquire many good candidate lines that are likely to be visible in the image as well, while keeping the computational cost affordable. In [2] a manually constructed static line model is used for tracking, and the rendering step is only needed to perform a visibility test. A method for extracting line features from a textured scene is described by Reitmayr [9], where gradient edges are extracted from a rendered image. Drummond [10] presents an approach that is based on identifying edges in a rendered CAD model. In that approach a binary space partition tree is used to perform hidden line removal. Representing an object as such a tree, however, is a complex preprocessing step. Our approach uses an untextured surface model stored in a VRML file and extracts a reliable, view-dependent line model in every processed frame. In contrast to [9], one method for the real-time line model generation presented here is based only on the geometric properties of the object. This method is applied especially when no correct material properties of the model are given. It is related to the silhouette generation of polygonal models for the purpose of non-photorealistic rendering: Isenberg et al. [11] describe methods which create silhouettes of a model in both object space and image space. Hybrid algorithms which combine the advantages of both image space and object space methods exist [12]. As real-time performance is an important criterion in our application, we use only an image space method as detailed in [13]. The algorithm creates a view-dependent 3D line model as follows: First the surface model is rendered with the predicted camera pose. In a second step an edge map is created by analyzing discontinuities in the z-buffer and the normal-buffer.
Two types of edges are of interest: step edges (also called $C^0$ edges), which correspond to a surface partially occluded by another one, and crease edges ($C^1$), which are the locations where two adjacent surfaces with different orientation meet. For their detection we present and discuss several filtering methods in subsections 2.1 and 2.2. Another method, which extracts edges from the frame-buffer, is described in section 2.3.
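The distinction between the two edge types can be illustrated on toy buffers with a minimal CPU sketch. It is not the GPU implementation used in the paper: the buffers are plain nested lists, each pixel is compared only with its right and lower neighbor, the normals are assumed to be unit vectors, and the function name and both thresholds are hypothetical.

```python
import math


def classify_edges(depth, normals, depth_jump=0.5, crease_angle_deg=30.0):
    """Mark step edges (depth discontinuities) and crease edges (changes of
    surface orientation at continuous depth) in small toy buffers."""
    rows, cols = len(depth), len(depth[0])
    cos_limit = math.cos(math.radians(crease_angle_deg))
    step = [[False] * cols for _ in range(rows)]
    crease = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for rr, cc in ((r, c + 1), (r + 1, c)):   # right and lower neighbor
                if rr >= rows or cc >= cols:
                    continue
                if abs(depth[r][c] - depth[rr][cc]) > depth_jump:
                    step[r][c] = True
                else:
                    dot = sum(a * b for a, b in zip(normals[r][c], normals[rr][cc]))
                    if dot < cos_limit:               # angle above the threshold
                        crease[r][c] = True
    return step, crease


up, side = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)
print(classify_edges([[1.0, 1.0, 5.0, 5.0]], [[up, up, up, up]]))
print(classify_edges([[1.0, 1.0, 1.0, 1.0]], [[up, up, side, side]]))
```

The first call reports a step edge where the depth jumps, the second a crease edge where the normal turns by 90 degrees while the depth stays constant.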
Fig. 1. Edge map generation: (a) shows a rendered object and (b) the z-buffer of that object. (c) illustrates the silhouette edges. The normal map is shown in (d) and the edges of the normal map in (e). The combined edge image can be seen in (f).
The last step consists of extracting the 3D lines of the edge map by means of a Canny-like edge extraction algorithm. A 3D contour in the world coordinate system is computed by un-projecting every pixel in the edge map with the information stored in the z-buffer. During the contour following of the edges in the edge map, 3D control points and their 3D directions are generated directly and used as the input for the registration step, which is described in section 3.
2.1 Edge Map Generation Using Laplacian Filtering of the z-buffer
Discontinuities in a z-buffer image are changes in depth and can be regarded as points on an edge of an object as seen from the given camera viewing direction. Having rendered the surface model with the predicted camera pose, a second-order differential operator is applied to the z-buffer image to generate an edge image. As one approach in our implementation we use a simple 3 × 3 Laplacian filter mask in order to find points belonging to silhouette edges. For silhouette edges the Laplace operator returns a high absolute value both on the object border and on the neighboring pixel in the background. Since we are interested in the 3D coordinates of pixels which are located on the edge of an object and not on a background surface or the far plane, we only consider pixels with a negative response of the Laplacian filter as silhouette edge pixels. Therefore, it can be guaranteed that silhouette pixels are actually located on the rendered object. At strong depth discontinuities the Laplace filter produces only a one-pixel-thick contour, which is very desirable for the contour following later on. Figures 1(b) and (c) illustrate the z-buffer of a rendered scene and the resulting silhouette edge map.
2.2 Edge Map Generation Using the Normal Buffer
In contrast to the z-buffer method, which is mostly suitable for $C^0$ edges, another promising approach is the additional extraction of $C^1$ edges from the normal map as described in [11]. In [14] a method is described which creates a normal map by placing several light sources on the axes of the camera coordinate system. In our implementation we simply use a pixel shader to generate a normal map. To extract the edge map, the angle between neighboring normal vectors is calculated for every pixel and tested against a predefined threshold. Thereby an edge image is created which represents changes in surface orientation. Figures 1(d) and (e) show a normal map of a rendered scene and the resulting crease edge image. A combined edge map which includes both silhouette and crease edges can be seen in figure 1(f). The whole process of creating an edge image is done on the graphics hardware with two rendering passes. In the first pass the scene is rendered into a texture, where the normal vectors are stored in the (r, g, b) color components and the depth is stored in the alpha channel of the texture. In the second pass a pixel shader creates the combined edge map and stores the edge image in only one color channel. The depth information is thereby stored in the intensity value of that pixel.
2.3 Edge Map Generation Using the Frame Buffer
In highly detailed scenes with a model consisting of many triangles, the edge map obtained from geometric properties like the depth or the surface orientation gets very fuzzy and the resulting contours are not very distinct. If material properties of an object exist, edges can also be extracted from the frame-buffer by simply computing the image gradient. Again the edge image is created in two rendering passes. In the first pass the scene is rendered with only ambient lighting into a texture and the z-buffer value is stored in the alpha channel. In the second pass the image gradient of the (r, g, b) image is calculated. Pixels which are on the far side of a step edge are detected as described in section 2.1 and suppressed in the edge image. Thereby it can be guaranteed that all edge pixels are located on the object, which is necessary to get the correct 3D coordinates. In figure 2 the edge maps of a toy car are shown, which are generated with both the frame-buffer method and the approach using the normal- and the z-buffer. In this case the contours of the frame-buffer are more distinct, especially in very detailed areas like the wheels or the engine hood.
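The combination of an image gradient with the suppression of far-side pixels can be sketched on the CPU as follows. Several points are assumptions of this sketch rather than details given in the text: smaller depth values are taken to be nearer to the camera, the 3 × 3 Laplacian mask is taken with a positive center weight (so that pixels on the near, object side of a step edge respond negatively, as section 2.1 requires), the gradient is computed on a single intensity channel with central differences, and the function names and thresholds are hypothetical.

```python
def laplacian(img, r, c):
    """3x3 Laplacian response with positive center weight (assumed sign)."""
    return (4.0 * img[r][c] - img[r - 1][c] - img[r + 1][c]
            - img[r][c - 1] - img[r][c + 1])


def frame_buffer_edges(intensity, depth, grad_thresh=0.2, depth_thresh=0.1):
    """Edge map from the image gradient, with pixels on the far side of a
    depth step suppressed so that every edge pixel lies on the object."""
    rows, cols = len(intensity), len(intensity[0])
    edges = [[False] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = 0.5 * (intensity[r][c + 1] - intensity[r][c - 1])
            gy = 0.5 * (intensity[r + 1][c] - intensity[r - 1][c])
            strong = (gx * gx + gy * gy) ** 0.5 > grad_thresh
            # Positive response: the pixel is farther away than its
            # neighbors, i.e. it lies on the background side of a step edge.
            far_side = laplacian(depth, r, c) > depth_thresh
            edges[r][c] = strong and not far_side
    return edges


# Dark object (depth 1) on the left, bright background (depth 5) on the right.
intensity = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(3)]
depth = [[1.0, 1.0, 1.0, 5.0, 5.0, 5.0] for _ in range(3)]
print(frame_buffer_edges(intensity, depth)[1])
```

In the toy example the gradient is strong on both sides of the boundary, but only the pixel on the object side survives, so un-projecting it with its own depth value yields a point on the object.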
3 Line Model Registration
The results of the line model generation can now be used as the input for the tracking step. Our implementation is based on the approach of [1] and [6]. The camera pose computation is based on the minimization of the distance between projected control points on a 3D contour and a 2D edge in the image. As it is difficult to decide which of several gradient maxima along the scan line really corresponds to the control point on a model edge, more than one point is considered as a possible candidate. As presented in [6], we use multiple hypotheses to prevent the tracking from being perturbed by misleading strong contours. To make the pose estimation robust against outliers, an estimator function can be applied to the projection error. In our implementation we use the Tukey estimator function $\rho_{Tuk}$. More details can be found in [2].
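For reference, the standard form of the Tukey (biweight) estimator can be written down in a few lines. The sketch below gives the textbook definition together with the corresponding weight used in iteratively reweighted least squares; the tuning constant c = 4.6851 is the conventional default for residuals normalized to unit scale and is an assumption here, since the text does not state which value or normalization was used.

```python
def tukey_rho(x, c=4.6851):
    """Tukey biweight loss: grows like x^2/2 near zero, constant beyond c."""
    if abs(x) <= c:
        u = 1.0 - (x / c) ** 2
        return (c * c / 6.0) * (1.0 - u ** 3)
    return c * c / 6.0


def tukey_weight(x, c=4.6851):
    """Weight for iteratively reweighted least squares; zero for outliers."""
    if abs(x) <= c:
        return (1.0 - (x / c) ** 2) ** 2
    return 0.0


for residual in (0.0, 1.0, 4.0, 10.0):
    print(residual, tukey_rho(residual), tukey_weight(residual))
```

Because the loss saturates, a control point whose residual exceeds c receives zero weight and no longer influences the pose update, which is the robustness property referred to above.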
4 Prediction of the Camera Pose
The presented method for the line model registration is a local search method. The convergence towards local minima can be avoided if the initial camera pose used for the minimization is a very close approximation to the real camera pose. Therefore a prediction of the camera pose from frame to frame is very beneficial. As correspondences between 3D contour points and 2D image points exist, if the previous frame was tracked successfully, these correspondences can be used again to make a prediction of the current frame. Therefore the control points of the 3D contour generated in the previous frame are projected into the current image with the camera pose of the previous frame. Again a one-dimensional perpendicular search is performed, but instead of looking for gradient maxima, the point which is most similar to the 2D point in the last frame is regarded as the wanted 2D point. So for every 3D control point, the point with the highest result of a normalized cross correlation along the search line is regarded as the corresponding 2D image point. The calculation of the camera pose is done exactly as in section 3, except that only one 2D point for every 3D point exists.
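The correlation-based search can be sketched in one dimension. As a simplification that is not taken from the text, the appearance of a control point is represented here by a short intensity profile sampled along the search line in the previous frame, and this template is slid over the samples of the current search line; the function names and the sample data are hypothetical.

```python
import math


def ncc(a, b):
    """Normalized cross correlation of two equally long sample lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:          # constant window: correlation undefined
        return 0.0
    return sum(p * q for p, q in zip(da, db)) / denom


def best_match_along_line(template, scan_line):
    """Return (start index, score) of the window on the search line that is
    most similar to the template from the previous frame."""
    m = len(template)
    best_index, best_score = -1, -2.0
    for i in range(len(scan_line) - m + 1):
        score = ncc(template, scan_line[i:i + m])
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


template = [1.0, 5.0, 2.0]
scan_line = [0.0, 0.0, 2.0, 10.0, 4.0, 0.0, 0.0]
print(best_match_along_line(template, scan_line))
```

The normalization makes the score insensitive to gain and offset changes in brightness, so the window that is a scaled copy of the template is selected even though its absolute intensities differ.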
5 Experimental Results
To evaluate the accuracy, the algorithm is tested with a synthetic image sequence. In this sequence a virtual model of a toy car is rendered with a predefined camera path, where the camera moves half around the model and back again. The result of the pose estimation is stored and compared with the given ground truth data. We analyze how the edge maps generated with geometry edges and with material edges affect the camera pose estimation. Figure 2 shows the error of the 6 extrinsic camera parameters for the different line model generation methods. For this particular example it can be seen that the method using material edges produces more accurate results. This is not surprising, since this method produces a clearer edge map if correct material properties are given. The error of most of the parameters is alternating around
Adaptable Model-Based Tracking Using Analysis-by-Synthesis Techniques
[Figure 2, panels (a) to (f): plots of the translation error (x, y, z) and the rotation error (α, β, γ) over the frame number, 0 to 400]
Fig. 2. Comparison of the error between the estimated and the real camera pose for geometry edges (a) and material edges (d). The errors for the geometry edges are plotted in (b) and (c); (e) and (f) show the errors when using the material edges.
0, except the camera z-translation, which is clearly above 0. It seems that the estimated camera is further away from the tracked object. The reason for this artefact is that the extracted silhouette edges are always on the object, whereas the gradient edges in the image have their peak between the object border and the background. The extracted silhouette edges therefore have an error of half a pixel, which mostly affects the z-component of the camera translation.

An analysis of the processing time was carried out on a 2.8 GHz Pentium 4 with an ATI Radeon 9700 Pro graphics card. The average computational costs of the individual steps are shown in Table 1. Only one 8-bit buffer, holding the edge map with the depth information encoded in every pixel, is read back from the graphics card to main memory. This is a significant processing time benefit compared to [9], where both the frame-buffer and the z-buffer need to be accessed. Together with the image acquisition and the visualization, the system runs at a frame rate of more than 20 Hz for this particular example.

The algorithm is tested on three real image sequences showing different industrial objects. Some frames of the resulting videos¹ with different models can be seen in Figure 3. If the 3D models consist only of geometric data and no material properties, the frame-buffer method is inapplicable, and only the geometry-based approach is used. The edges used for tracking are generated with the combined z-buffer and normal-buffer method as depicted in Figure 1. After initializing

¹ Movies may be downloaded from http://www.igd.fhg.de/~hwuest/modeltracking.html
H. Wuest, F. Wientapper, and D. Stricker
Table 1. Average processing time of the individual steps of the tracking approach

step                                time in ms
prediction step
  create correspondences            11.42
  predict pose                       2.39
tracking step
  render model / create edge map     3.98
  read GL buffer                     6.92
  create correspondences             8.56
  estimate pose                      3.10
total time                          36.37
Fig. 3. Tracking objects in real image sequences of industrial scenarios
the first camera pose manually, it is possible to estimate the correct camera path throughout the whole sequences. The virtual toy car, which is used in the previous ground truth evaluation, is also tracked in a real image sequence. Both the method based on geometry and the one based on appearance are able to track the model successfully throughout the whole sequence. As in the synthetic image sequence, the frame-buffer method produced tracking results with a more accurate camera pose and less jitter for this model. All the sequences are tested without the prediction step as well. If large movements occur, especially fast rotations of the camera, the registration step does not produce correct results: the parameters of the camera pose get stuck in local minima and the overall tracking fails. Therefore a rough estimation of the camera pose is indispensable to handle fast camera movements.
6 Conclusion
We have presented a flexible tracking method, which is able to track an object with a given polygonal model by generating 3D contours on the fly and aligning these contours with the image gradient. Our method does not run into scaling problems, since a line model is generated in every frame with an adequate level of detail. A major improvement is the ability to track occluding edges and silhouette edges of objects like tubes or pipes. Depending on the scene, a method for the line model generation has to be chosen. When correct material properties exist, the frame-buffer method produces slightly better results. For large camera movements, a prediction of the camera pose for the model generation helps the pose estimation to converge in the next frame. Future work will concentrate on the integration of an inertial measurement unit to get an even better estimate of the predicted camera pose and thus stable results even in the case of large abrupt motions.
Acknowledgments The authors would like to thank their project partners Siemens Automation & Drives and Howaldtswerke Deutsche Werft for providing the test scenarios.
References
1. Comport, A., Marchand, E., Pressigout, M., Chaumette, F.: Real-time markerless tracking for augmented reality: the virtual visual servoing framework. IEEE Trans. on Visualization and Computer Graphics 12(4), 615–628 (2006)
2. Wuest, H., Vial, F., Stricker, D.: Adaptive line tracking with multiple hypotheses for augmented reality. In: ISMAR, pp. 62–69 (2005)
3. Lowe, D.G.: Robust model-based motion tracking through the integration of search and estimation. International Journal of Computer Vision 8(2), 113–122 (1992)
4. Bleser, G., Wuest, H., Stricker, D.: Online camera pose estimation in partially known and dynamic scenes. In: ISMAR, pp. 56–65 (2006)
5. Molton, N.D., Davison, A.J., Reid, I.D.: Locally planar patch features for real-time structure from motion. In: Proc. British Machine Vision Conference (2004)
6. Vacchetti, L., Lepetit, V., Fua, P.: Combining edge and texture information for real-time accurate 3d camera tracking. In: Proceedings of International Symposium on Mixed and Augmented Reality (ISMAR) (2004)
7. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking. In: IEEE International Conference on Computer Vision, pp. 1508–1511. IEEE Computer Society Press, Los Alamitos (2005)
8. Shi, J., Tomasi, C.: Good features to track. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR'94), pp. 593–600. IEEE Computer Society Press, Los Alamitos (1994)
9. Reitmayr, G., Drummond, T.: Going out: Robust model-based tracking for outdoor augmented reality. In: ISMAR, pp. 109–118 (2006)
10. Drummond, T., Cipolla, R.: Real-time tracking of complex structures with on-line camera calibration. In: British Machine Vision Conference, pp. 574–583 (1999)
11. Isenberg, T., Freudenberg, B., Halper, N., Schlechtweg, S., Strothotte, T.: A developer's guide to silhouette algorithms for polygonal models. IEEE Comput. Graph. Appl. 23(4), 28–37 (2003)
12. Northrup, J.D., Markosian, L.: Artistic silhouettes: A hybrid approach. In: Proceedings of the First International Symposium on Non-Photorealistic Animation and Rendering (NPAR) for Art and Entertainment, June 2000 (2000)
13. Nienhaus, M., Döllner, J.: Edge-enhancement - an algorithm for real-time non-photorealistic rendering. In: WSCG (2003)
14. Hertzmann, A.: Introduction to 3d non-photorealistic rendering: Silhouettes and outlines. In: SIGGRAPH 99, ACM Press, New York (1999)
Mixture Models Based Background Subtraction for Video Surveillance Applications
Chris Poppe, Gaëtan Martens, Peter Lambert, and Rik Van de Walle
Ghent University - IBBT, Department of Electronics and Information Systems - Multimedia Lab
Gaston Crommenlaan 8, B-9050 Ledeberg-Ghent, Belgium
{chris.poppe,gaetan.martens,peter.lambert,rik.vandewalle}@ugent.be
http://multimedialab.elis.ugent.be/
Abstract. Background subtraction is a method commonly used to segment objects of interest in image sequences. By comparing new frames to a background model, regions of interest can be found. To cope with highly dynamic and complex environments, a mixture of several models has been proposed in the literature. This paper proposes a novel background subtraction technique based on the popular Mixture of Gaussian Models technique. Moreover edge-based image segmentation is used to improve the results of the proposed technique. Experimental analysis shows that our system outperforms the standard system both in processing speed and detection accuracy.
Keywords: Object Detection, Mixture of Gaussian Models, Video Surveillance.
1 Introduction
The detection and segmentation of objects of interest in image sequences is the first major processing step in many computer vision applications such as visual surveillance, traffic monitoring, and semantic annotation. The outcome of this step is usually used by other processing modules. Therefore, it is important to achieve very high detection accuracy with the lowest possible false alarm rates. The detection of moving objects in dynamic scenes has been the subject of research for several years and different approaches exist [1]. One of the most popular techniques is background subtraction, in which a background model is created and dynamically updated. Moving objects of interest, called foreground objects, are represented by the pixels that differ significantly from this background model. The Mixture of Gaussian Models (MGM) is one of the most popular background subtraction techniques [2]. By using a mixture of Gaussian models and a dynamic update scheme, it can handle highly complex scenes with difficult situations like moving trees and bushes, clutter, noise, and permanent changes of the background. Although it gives good results, the use of the Gaussian models and the update scheme are complex. To overcome the complexity of the traditional MGM we present a simpler mixture of models technique (SMM). Moreover, we present a fast edge-based image segmentation which is used to improve the detection results. Section 2 elaborates on the mixture technique we propose and compares it with MGM. Subsequently, Sect. 3 discusses the spatial image segmentation we used. Section 4 elaborates on the combination of the proposed background subtraction technique with the spatial segmentation and shows experimental results. Finally, some conclusions and further work are formulated in Sect. 5.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 28–35, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Background Subtraction Using a Mixture of Models
2.1 Background Subtraction Using MGM
When using background subtraction, a background model is created which resembles the observed environment as closely as possible. A dynamic and complex environment typically introduces different background values for the same pixel (e.g., movement of branches), so a single background model is insufficient. Therefore, MGM was first proposed by Stauffer and Grimson in [2]. MGM is a time-adaptive per-pixel subtraction technique in which every pixel is represented by a vector, called I_p, consisting of three color components (red, green, and blue). For every pixel a mixture of multivariate normal distributions, which are the actual models, is maintained and each of these models is assigned a weight.

\[ G(I_p, \mu_p, \Sigma_p) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_p|}} \, e^{-\frac{1}{2} (I_p - \mu_p)^T \Sigma_p^{-1} (I_p - \mu_p)} \tag{1} \]

Equation (1) depicts the formula for a Gaussian distribution G, where n is the dimension of I_p. The parameters are μ_p and Σ_p, which are the mean and covariance matrix of the distribution respectively. For computational simplicity, the covariance matrix is assumed to be diagonal. For every new pixel a matching, an update, and a decision step are executed. The new pixel value is compared with the models of the mixture. A pixel is matched if its value lies inside a confidence interval of 2.5 standard deviations around the mean of the model. In that case, the parameters of the corresponding distribution are updated according to (2), (3), and (4).

\[ \mu_{p,t} = (1 - \rho)\, \mu_{p,t-1} + \rho\, I_{p,t} \tag{2} \]

\[ \Sigma_{p,t} = (1 - \rho)\, \Sigma_{p,t-1} + \rho\, (I_{p,t} - \mu_{p,t}) (I_{p,t} - \mu_{p,t})^T \tag{3} \]

\[ \rho = \alpha\, G(I_{p,t}, \mu_{p,t-1}, \Sigma_{p,t-1}) \tag{4} \]

α is the learning rate, which is a global parameter, and introduces a trade-off between fast adaptation and detection of slow moving objects. Each model has a weight, w, which is updated for every new image according to (5).

\[ w_t = (1 - \alpha)\, w_{t-1} + \alpha\, M_t \tag{5} \]
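Equations (1) to (5) can be sketched for a single pixel as follows, with M_t set to 1 for the matched model and 0 for all others. This is a minimal illustration, not the reference implementation: the dictionary layout, the function names, the default learning rate, and the per-channel form of the 2.5 standard deviation test are assumptions, and the replacement of a model when no match is found is left out.

```python
import math


def gaussian_pdf(pixel, mean, var):
    """Equation (1) for a diagonal covariance matrix (var holds its diagonal)."""
    n = len(pixel)
    det = math.prod(var)
    dist_sq = sum((x - m) ** 2 / v for x, m, v in zip(pixel, mean, var))
    return math.exp(-0.5 * dist_sq) / math.sqrt((2.0 * math.pi) ** n * det)


def matches(pixel, mean, var, k=2.5):
    """Assumed matching rule: every channel within k standard deviations."""
    return all(abs(x - m) <= k * math.sqrt(v) for x, m, v in zip(pixel, mean, var))


def update_mixture(pixel, models, alpha=0.01):
    """Update a list of models (dicts with mean, var, weight); return the match."""
    matched = next((m for m in models if matches(pixel, m["mean"], m["var"])), None)
    for m in models:
        hit = 1.0 if m is matched else 0.0
        m["weight"] = (1.0 - alpha) * m["weight"] + alpha * hit          # (5)
    if matched is not None:
        rho = alpha * gaussian_pdf(pixel, matched["mean"], matched["var"])  # (4)
        matched["mean"] = [(1.0 - rho) * mu + rho * x                       # (2)
                           for mu, x in zip(matched["mean"], pixel)]
        matched["var"] = [(1.0 - rho) * v + rho * (x - mu) ** 2             # (3)
                          for v, x, mu in zip(matched["var"], pixel, matched["mean"])]
    return matched
```

Note that ρ is computed from the parameters of the previous time step before the mean is changed, while the variance update uses the new mean, as in (3) and (4).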
If the corresponding model introduced a match, M_t is 1, otherwise it is 0. Formulas (2) to (5) represent the update step. Finally, in the decision step, the models are sorted according to their weights. MGM assumes that background pixels occur more frequently than actual foreground pixels. For that reason MGM defines a threshold based on these weights to decide which models of the mixture depict background or foreground. Indeed, if a pixel value occurs recurrently, the weight of the corresponding model increases and it is assumed to be background. The next section shows the changes we propose to MGM, which result in a technique called simple mixture of models (SMM).

2.2 Background Subtraction Using SMM
The authors of this paper believe that the true strength of MGM lies not in the use of the Gaussian distributions, but in the use of the mixture of weighted models which can deal with multimodal environments. Therefore, we inherit this concept in our system. The update step of MGM demands heavy calculations, since for every matched model the probability distribution function is used to calculate the parameter ρ. Previous research has shown that this calculation can be skipped since this value usually is very small and does not change much [3]. We adopt this in our system and use a constant parameter ρ. Since the distribution functions are now actually no longer used, we discard the conceptual use of the Gaussian models entirely and make use of models consisting of an average pixel value and a maximum difference between new values and the average. This difference will be evaluated according to two independent thresholds. An upper threshold is used for pixel values larger than the average, a lower one for pixels smaller than the average. This separation makes sense, since the symmetric structure of a Gaussian distribution gives a misinterpretation of real pixel values. New larger pixel values, which fall within the range of a model, do not infer that pixels with much smaller values should also be considered part of this model. Additionally, we incorporate the difference between a new pixel value and the last value for that pixel that was regarded as background. MGM only takes the generated models into account and does not regard the previous value for that pixel. As such, if values increase consistently, but with small intermediate steps, the pixel values will fall out of the range of the models. This effect can typically be experienced when a cloud gradually changes the lighting of the environment, which usually can occur in very small time spans (30 seconds), but it will affect MGM drastically. 
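The model structure described above can be sketched for a single gray value as follows. The class, the field names, the parameter `max_step`, and the way the two tests are combined with a logical OR are assumptions made for illustration; the text does not give these details.

```python
from dataclasses import dataclass


@dataclass
class SimpleModel:
    mean: float    # running average of the pixel value
    upper: float   # allowed deviation above the mean
    lower: float   # allowed deviation below the mean
    weight: float  # mixture weight, updated as in MGM


def smm_matches(value, model, last_background, max_step):
    """Match if the value lies in the asymmetric band around the mean, or if it
    differs by at most max_step from the last value regarded as background."""
    diff = value - model.mean
    in_band = diff <= model.upper if diff >= 0 else -diff <= model.lower
    slow_drift = abs(value - last_background) <= max_step
    return in_band or slow_drift
```

With `SimpleModel(100, 10, 4, 0.8)`, a value of 108 matches because it is within the upper threshold, whereas 94 only matches if the last background value was close to it, which is the gradual lighting change case.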
By checking if new pixel values differ only slightly from the previous value, one can assume that this small change is not due to the appearance of a foreground object. As such, we need to take the maximum difference between two consecutive pixel values into account in our model. We use a new threshold, which depends on the upper and lower thresholds of the model, to take the variation of pixel values into account. In this sense our approach resembles the methods applied in the W4 system [4]. Haritaoglu et al. use a minimum value, a maximum value, and a maximum difference with the previous pixel value to evaluate new pixel values. However, these parameters are defined during a training period and are only periodically updated. Moreover, they do not use different models to cope with the multimodal behaviour of complex environments. We use the same update steps as in MGM to adjust the mean, weights, and thresholds of our models. The separation into an upper and a lower threshold allows them to be updated independently.

The conceptual approach of sorting the models in MGM introduces additional complexity, so we alter this in our system. In most cases there is only one model which simulates the background. As such, the first step is to check whether the weight of the matched model is larger than the predefined threshold. If so, the model is regarded as background. If not, the next step is to check whether the weight of the matched model is smaller than all other weights. In this case the model surely depicts foreground. Since these two cases occur frequently, a costly sorting step can be avoided. If neither of the initial conditions is fulfilled, the sorting as introduced in MGM can be applied.

To further increase the processing speed we have chosen to use a checkerboard pattern for the evaluation of the frames, as shown in the following pseudo-code:

for each pixel(x,y) in frame t
    if mod(x+y+t,2) == 1
        result(x,y) = doBackgroundSubtraction(pixel(x,y))
    else
        result(x,y) = result(x-1,y)
For the pixels which are not evaluated at the current frame, a very simple but fast interpolation is used. We copy the decision made for the pixel situated to the left of the current pixel. More advanced interpolations could be made, by using the pixel values in previous frames or the pixels surrounding the current pixel, but these would involve increased memory consumption or multiple passes of the image. By using this checkerboard pattern at each frame, only half of the pixels introduce the complexity of the above presented detection algorithm.
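A runnable version of the checkerboard evaluation might look as follows. It is a sketch under assumptions: the frame is a list of rows, `classify` stands in for the per-pixel background subtraction, and pixels in the first column are always evaluated because they have no left neighbour, a case the pseudo-code leaves open.

```python
def checkerboard_subtraction(frame, t, classify):
    """Classify every second pixel of frame at time step t and copy the decision
    of the left neighbour for the remaining pixels."""
    result = [[0] * len(row) for row in frame]
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if (x + y + t) % 2 == 1 or x == 0:
                # Evaluated pixel; x == 0 is included by assumption.
                result[y][x] = classify(value)
            else:
                # The left neighbour has the opposite parity, so it was
                # evaluated in this frame and its decision is reused.
                result[y][x] = result[y][x - 1]
    return result
```

Because t enters the parity test, the set of evaluated pixels alternates from frame to frame, so every pixel is evaluated at least every second frame.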
3 Spatial Image Segmentation
SMM takes for each pixel only the past pixel values into account. Although the technique gives reasonable results, including spatial information (such as neighboring pixel values) promises to improve the results. We therefore incorporate spatial information into our system. For this approach we segment the image into different regions according to edge information. This segmentation is subsequently combined with SMM in Sect. 4 to improve the results of the object detection. We use the well-known Canny edge detection technique [5], since it runs very fast and achieves good results. Nevertheless, it is not an optimal edge detection technique and edges can be discontinuous. We want to segment the image based on these edges, so it is important to have as many closed regions as possible. Therefore, we built an extension on the output of the Canny edge detector. We
Fig. 1. Left: current frame, middle: Canny output, right: extended Canny output
detect loose ends with a 3x3 search window and extend the ends in the direction of the edge. Experimental results have shown that an extension of 5 pixels is sufficient for the test data we used. If more pixels are used, false segmentations occur more frequently due to noise. Considering that surveillance cameras typically capture an entire environment, it is assumed that many small regions will be found. Therefore 5 pixels should be sufficient to close possible holes in the edges. This parameter depends on the resolution of the images and the complexity of the monitored environment. Fig. 1 shows the result of applying the Canny edge detector and the extended version to an image. As can be seen, the extended version is able to close some regions which were left open. After this extension step, the image is divided into closed regions and we use a flood fill operation to give each segment a different value. The pixels representing the edges get the value of the smallest neighboring region. As such, every pixel in the image can be assigned to one and only one region. This image segmentation runs in real time; segmenting the PetsD2TeC2 sequence takes an average of 28 ms per frame. Fast segmentation is an important factor since the results will be used in the further processing of our surveillance system. Although image segmentation techniques exist which obtain better results, these are more complex and not suited for processing surveillance video data in real time [6].
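The labelling step after the edge extension can be sketched as below. Several points are assumptions: regions are taken to be 4-connected, "smallest neighboring region" is interpreted as the adjacent region with the fewest pixels, and an edge pixel without any labelled neighbour keeps the label 0. The function name is hypothetical.

```python
from collections import deque


def label_regions(edges):
    """edges: list of rows, 1 = edge pixel, 0 = free pixel.
    Returns (labels, sizes): a label image and the pixel count per region."""
    h, w = len(edges), len(edges[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 1
    for sy in range(h):
        for sx in range(w):
            if edges[sy][sx] or labels[sy][sx]:
                continue
            # Breadth-first flood fill of one 4-connected region.
            queue = deque([(sy, sx)])
            labels[sy][sx] = next_label
            count = 0
            while queue:
                y, x = queue.popleft()
                count += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not edges[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            sizes[next_label] = count
            next_label += 1
    # Edge pixels take the label of the smallest adjacent region.
    filled = [row[:] for row in labels]  # snapshot holding region labels only
    for y in range(h):
        for x in range(w):
            if edges[y][x]:
                near = {filled[ny][nx]
                        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= ny < h and 0 <= nx < w and filled[ny][nx]}
                if near:
                    labels[y][x] = min(near, key=lambda lab: (sizes[lab], lab))
    return labels, sizes
```

For an edge map with a single vertical edge that separates a six-pixel region from a three-pixel region, the edge column is assigned to the three-pixel region.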
4 Extended SMM
4.1 Combining SMM with Spatial Image Segmentation
The goal is to improve the result of SMM by using the edge-based segmentation. To accomplish this we use a two-pass matching step, as shown in the following pseudo-code:

for each pixel(x,y) in frame t
    if SMMresult(x,y) == 1
        nrPix(getSegment(x,y))++
for each pixel(x,y) in frame t
    if SMMresult(x,y) == 0
        if nrPix(getSegment(x,y)) > 0.6*size(getSegment(x,y))
            SMMresult(x,y) = 1
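The two passes of the pseudo-code above can be written in Python as follows. This is a sketch: the images are assumed to be lists of rows, the function name is hypothetical, and the result is returned as a new mask instead of being written back into the SMM result.

```python
def extend_with_segments(smm_result, segments, ratio=0.6):
    """smm_result: rows of 0/1 foreground flags; segments: rows of segment ids.
    A segment becomes foreground if more than ratio of its pixels already are."""
    size, foreground = {}, {}
    # First pass: count pixels and foreground pixels per segment.
    for res_row, seg_row in zip(smm_result, segments):
        for flag, seg in zip(res_row, seg_row):
            size[seg] = size.get(seg, 0) + 1
            foreground[seg] = foreground.get(seg, 0) + flag
    # Second pass: promote background pixels of mostly-foreground segments.
    return [[1 if flag == 1 or foreground[seg] > ratio * size[seg] else 0
             for flag, seg in zip(res_row, seg_row)]
            for res_row, seg_row in zip(smm_result, segments)]
```

Foreground pixels are never removed by this step, which is why it can only lower the number of false negatives and may raise the number of false positives.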
Fig. 2. False positive rate for every 50th frame of PetsD2TeC2 sequence
First, we store for each region in the segmented image the number of corresponding foreground pixels resulting from SMM. In a second step, we go over all the background pixels resulting from SMM and see which segments correspond with their location. The segment is considered a foreground region if the number of foreground pixels exceeds a certain fraction of the segment size. In that case the background pixels of this segment are denoted as foreground. As such the detection result of SMM is extended to comply with the segmentation result. This way, foreground pixels which were not detected by only using temporal information in SMM can still be found.

4.2 Experimental Results
To show the introduced improvements we compare MGM, SMM, and the combination of SMM with the edge-based image segmentation (eSMM). If a background pixel is misclassified as foreground, it is called a false positive. If a foreground pixel is not detected, it is called a false negative. Fig. 2 and Fig. 3 depict the false positive rate (FP rate) and the false negative rate (FN rate), respectively, for the PetsD2TeC2 sequence (384x288) [7]. The FP rate is the percentage of real background pixels which were misinterpreted as foreground. The FN rate is the percentage of real foreground pixels which were not detected. A manual ground truth annotation has been done for every 50th frame of the sequence. For each of these frames, the ground truth annotation has been matched with the output of the detection techniques to find the number of pixels which are erroneously classified. As can be seen in Fig. 2, SMM has fewer erroneously interpreted background pixels than MGM. This is mostly due to the use of the maximum difference between consecutive pixel values. The use of the edge-based
Fig. 3. False negative rate for every 50th frame of PetsD2TeC2 sequence

Table 1. Average execution times for MGM, SMM, edge-based image segmentation and eSMM in milliseconds per frame

sequence      MGM           SMM           image segmentation   eSMM
              avg.  stdev.  avg.  stdev.  avg.  stdev.         avg.  stdev.
PetsD1TeC1    147   24      65    8       22    7              99    10
PetsD1TeC2    145   23      64    6       26    8              102   10
PetsD2TeC1    155   38      67    7       25    8              108   13
PetsD2TeC2    159   43      66    8       28    8              105   12
image segmentation in eSMM increases the false positives, but it still performs better than MGM. The increase in false positives is mostly due to background pixels erroneously interpreted as foreground by SMM, which are extended by the segmentation technique. Indeed, if many of these misdetections are situated in the same segment, the segment will be regarded as foreground, introducing false detections. Better segmentation techniques can improve the FP rate of eSMM. Fig. 3 shows that SMM has more missed foreground pixels than MGM. However, the extension with the edge-based image segmentation decreases the FN rate considerably and on average it performs better than MGM. Table 1 shows a comparison of the execution times for different test sequences. We have recorded for each frame the time it takes to perform the object detection and present the average values and standard deviations. All sequences have a resolution of 352 by 288 pixels. As shown, SMM is much faster than MGM and can process 15 frames per second. Although eSMM improves the detection results, it also slows down the system. Nevertheless, eSMM is still faster than MGM. The processing time for segmenting the image according to the edges is also shown, since this step can be done
entirely independently of SMM. A parallel implementation would therefore allow the processing time to be reduced even further.
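The FP and FN rates used for Fig. 2 and Fig. 3 follow directly from the definitions given in this section. A minimal sketch, assuming the detection result and the ground truth annotation are binary masks stored as lists of rows (the function name and the handling of empty classes are assumptions):

```python
def error_rates(detected, truth):
    """Return (fp_rate, fn_rate) in percent for two 0/1 masks of equal size."""
    fp = fn = background = foreground = 0
    for det_row, gt_row in zip(detected, truth):
        for det, gt in zip(det_row, gt_row):
            if gt:
                foreground += 1
                if det == 0:
                    fn += 1  # real foreground pixel that was missed
            else:
                background += 1
                if det == 1:
                    fp += 1  # real background pixel reported as foreground
    fp_rate = 100.0 * fp / background if background else 0.0
    fn_rate = 100.0 * fn / foreground if foreground else 0.0
    return fp_rate, fn_rate
```

Both rates are normalized by the size of their own ground truth class, so a frame with few foreground pixels can show a high FN rate from only a handful of missed pixels.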
5 Conclusions
This paper presents a new background subtraction scheme based on the popular Mixture of Gaussian Models. We propose the use of simpler models and altered matching, update and decision steps. The system achieves comparable results with the original scheme but with faster response. Additionally, an image segmentation based on edge information has been introduced. The result of the segmentation is consequently used to improve the detection result of our system by reducing the number of false negatives. Experimental results show the gains in processing speed and detection results. An important advantage of our system is that current improvements of the MGM can be projected on our system too. The use of different color spaces, shadow removal techniques etc. can improve the results of MGM and SMM due to the similar structure.

Acknowledgments. The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.
References
1. Dick, A., Brooks, M.J.: Issues in automated visual surveillance. International Conference on Digital Image Computing: Techniques and Applications, pp. 195–204 (2003)
2. Stauffer, C., Grimson, W.E.L.: Learning Patterns of Activity Using Real-Time Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 747–757 (2000)
3. Zang, Q., Klette, R.: Evaluation of an Adaptive Composite Gaussian Model in Video Surveillance. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 165–172. Springer, Heidelberg (2003)
4. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 809–830 (2000)
5. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 679–698 (1986)
6. De Bock, J., Pires, R., De Smet, P., Philips, W.: A Fast Dynamic Border Linking Algorithm for Region Merging. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2006. LNCS, vol. 4179, pp. 232–241. Springer, Heidelberg (2006)
7. Brown, L.M., Senior, A.W., Tian, Y., Connell, J., Hampapur, A., Shu, C., Merkl, H., Lu, M.: Performance Evaluation of Surveillance Systems Under Varying Conditions. IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2005) http://www.research.ibm.com/peoplevision/performanceevaluation.html
Applicability of Motion Estimation Algorithms for an Automatic Detection of Spiral Grain in CT Cross-Section Images of Logs
Karl Entacher¹, Christian Lenz², Martin Seidel², Andreas Uhl², and Rudolf Weiglmaier²
¹ Salzburg University of Applied Sciences, Austria
[email protected]
² Department of Computer Sciences, Salzburg University, Austria
[email protected]
Abstract. Techniques for an automatic detection of spiral grain in cross-section CT images of logs are proposed and evaluated on sets of natural and artificial cross-section images. Explicit analysis of global rotation, block matching, and optical flow techniques are compared. Experimental results seem to indicate that spiral grain in fact cannot be modeled by a circular motion of luminance values in gray scale images.
1 Introduction
Spiral grain means that wooden fibers are not oriented in parallel along the axis of a tree but grow in a spiral manner around the pith. An occurrence of strong spiral grain heavily influences the quality of sawn timber, i.e. the mechanical properties may be reduced or boards show a tendency to twist after breakdown from a log or after a drying process. Spiral grain may, in different characteristics, occur in all wood species [1,2]. There are several different ways to measure spiral grain, which means a determination or estimation of the angle of fiber orientation relative to the pith. Traditional destructive methods cut out samples in radial direction to measure the slope of grain. Certain nondestructive ways predict spiral grain based on determination of the fiber orientation on the surface of the log or board. For examples, see the application of the tracheid effect, utilizing the light-conducting properties of the softwood tracheids to measure the direction of spiral grain [3,4], or an application of NIR spectroscopy for grain-angle determination [5]. The usage of computed tomography to predict spiral grain was studied in [6], where the slope of grain is estimated from pixel patterns of unwrapped growth ring images. Ekevad [7] determines local fiber directions from CT images using a method to calculate the principal directions of inertia (eigenvectors) of measurement spheres distributed throughout the axial direction of the wood object. In the present paper we study a different way for a possible automatic prediction of spiral grain from CT data of logs. When interpreting the CT cross-section images as temporally ordered and displaying them in a video style manner, a rotation type motion seems to be clearly present in the data. The aim of the present study is to compare different motion detection algorithms with respect to their usability for estimating spiral grain. In Section 2 we propose the different methods applied: global rotation, block matching and optical flow. Section 3 contains experimental results using DICOM data from large dimensioned specimens of Norway Spruce [8,9]. Section 4 concludes the findings.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 36–44, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Detection of Spiral Grain in CT Cross-Section Images
The first technique proposed (“rotation analysis”) assumes that spiral grain can be actually modeled by a global rotation of an entire cross-section image. For this purpose, all images in the set are aligned according to their pith. Subsequently, we compare each image (i) with rotated versions of image (i + 1) in an interval [−1.5 ◦ , +1.5 ◦] in s = 1 . . . 100 steps. The rotated version of image (i + 1) exhibiting the smallest error when compared to image (i) indicates the rotation angle between images (i) and (i + 1). The problem of rotating images about an arbitrary angle (except for multiples of 90 ◦ ) is, that the rotated pixels do not fit exactly into the pixel matrix, so that we have to interpolate the gray values. We use either bilinear or bicubic interpolation to cope with this problem. Block matching [10] is mainly used for motion estimation in hybrid video coding schemes like MPEG-1,2,4 or the H.26X family. This technique assumes constant motion within a block of pixels and is restricted to a local translational motion model on a block basis (motion is represented by a single motion vector for each block). Given a block of pixels in the image (i + 1), the corresponding motion vector is computed by searching for the most similar block within a specific search area in image (i) centered around the position of the block in image (i + 1). The matching block is only accepted in case the error between the two blocks is below a certain threshold. Locally constant translational motion is supposed to approximate spiral grain fairly well. Motion vectors are expected to point into the direction of the spiral grain if present. In our implementation we use 8 × 8 pixels blocks and a search area of 2 pixels around the original block (due to the low amount of spiral grain to be expected). Finally, optical flow is a method for estimating the motion of picture elements in a sequence of images without imposing any restriction about the type of motion present. 
As a result, for each image pair we get an optical flow field representing the motion of brightness information between the images. The Horn & Schunck [11] approach introduces an additional global smoothness constraint in order to suppress strong fluctuations in the flow field. For each pixel a set of equations has to be solved, which is usually done iteratively and explains the slowness of this technique. The Lucas & Kanade [12] approach relaxes the global smoothness constraint into a local one, which significantly reduces the computational demand. We apply both techniques to image pairs in our cross-section image sets, assuming that spiral grain may be modeled by motion of brightness information between image pairs.
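Of the techniques described above, block matching is the most mechanical to write down. The following is a minimal sketch, not the authors' implementation: it uses the 8 × 8 blocks and the search area of 2 pixels stated in the text, but the block error (mean absolute difference), the threshold value and all function names are assumptions made for illustration.

```python
def block_error(img_a, img_b, ax, ay, bx, by, size):
    """Mean absolute gray-value difference between two size x size blocks.
    (The error measure is an assumption; the text does not name one.)"""
    total = 0.0
    for dy in range(size):
        for dx in range(size):
            total += abs(img_a[ay + dy][ax + dx] - img_b[by + dy][bx + dx])
    return total / (size * size)


def block_matching(prev_img, next_img, block=8, search=2, threshold=10.0):
    """For every block of next_img (image i+1), search prev_img (image i)
    within +/- search pixels around the same position.

    Returns {(bx, by): (dx, dy)}. Blocks whose best match has an error
    above threshold (hypothetical value) get no motion vector at all.
    """
    h, w = len(next_img), len(next_img[0])
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    px, py = bx + dx, by + dy
                    # Skip candidate blocks that leave the image.
                    if px < 0 or py < 0 or px + block > w or py + block > h:
                        continue
                    err = block_error(next_img, prev_img, bx, by, px, py, block)
                    if best is None or err < best[0]:
                        best = (err, dx, dy)
            if best is not None and best[0] <= threshold:
                vectors[(bx, by)] = (best[1], best[2])
    return vectors
```

Blocks that are rejected by the threshold simply have no entry in the returned dictionary, which corresponds to the blocks without motion vectors mentioned in the results.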
3 Experiments
3.1 Experimental Settings
We have used three different image sets for our analysis, all of which are cross-section images of logs acquired by computed tomography (CT). We use two CT-image sets of logs known to exhibit spiral grain; each image in both sets is 512 × 512 pixels with 8 bpp gray values. Set A (see Fig. 1(a)) consists of 42 slices and set B (see Fig. 1(b)) of 21 slices. The distance between the cross-section images is 8 mm in set A and 16 mm in set B. The light gray lines below the log show the patient rest, because the CT images were acquired with a medical scanner.
Fig. 1. Cross-section images used in experiments: (a) sample from set A, (b) sample from set B
For a proof of concept of our techniques we have artificially generated an image series of a log with “regular spiral grain”. For this purpose, we have rotated a single cross-section CT image of an entire tree (see Fig. 2(a)) around the pith by 0.5° clockwise (−) per image, generating 18 images in this way (to avoid error accumulation, each image is obtained by rotating the first image by an increasing angle). The size of the images is 415 × 415 pixels, with 8 bpp gray values. We have generated two sets, one with bilinear and the other with bicubic pixel interpolation (see Section 2.1). After image alignment at the pith, an area entirely located within the wood area of the images is selected for rotation analysis (in order to avoid errors caused by black background pixels and noisy pixels from the bark, see Fig. 2(b)). Rotation analysis is conducted with image distances of 1 and 3, respectively. The image similarity measures employed in the experiments to determine the rotation angle are mean error (Diff), mean squared error (MSE), peak signal to noise ratio (PSNR), and normalized cross-correlation (X-corr).
3.2 Results
Fig. 2. Cross-section of entire tree and area used to measure spiral grain: (a) artificial spiral grain, (b) area to measure grain
With respect to rotation analysis, the values given in the columns of Tables 1–3 indicate the computed rotation between two slices [in °] (determined with different error measures in each column). All three tables display results where bilinear interpolation has been used when rotating and matching the images; however, the results are almost identical for bicubic interpolation. Also, there is hardly any difference whether slice distance 1 or 3 is employed. The error measures Diff, MSE, and PSNR deliver fairly consistent results; only X-corr often suggests a different amount of rotation as compared to the other measures.
For image set A (see Table 1), the results are rather disappointing. We notice a partially significant fluctuation of rotation angles among adjacent pairs of slices, where even the orientation of the rotation changes from positive to negative angles and back to positive ones (see e.g. slices 2–5 and 22–25). Of course, such a phenomenon cannot be caused by natural spiral grain, which means that we obviously measure nothing but artifacts, possibly caused by the high degree of similarity between the cross-section images even without rotation compensation.
The results for image set B (see Table 2) are more encouraging. We notice areas of more homogeneous rotation angles as compared to set A. For example, almost constant rotation is obtained in the sets of slices 1–5, 8–11, and 12–15. The log area between slices 15 and 21 seems to exhibit a low amount of rotation (and corresponding spiral grain). The pairs of slices where, similar to set A, a local fluctuation of rotation angles is observed are difficult to interpret. We assume that we are again confronted with artifacts of unclear provenance.
Finally, when applying our rotation analysis approach to the set of artificially rotated images (see Table 3), we are able to identify the induced rotation with high precision (except for X-corr). This result shows that this technique is in principle capable of detecting global rotation reliably (in case rotation is actually present).
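The four similarity measures compared above can be sketched as follows. The text names the measures but does not give their formulas, so the definitions here are assumptions: Diff is read as the mean absolute gray-value difference, PSNR uses a peak value of 255 for 8 bpp images, and X-corr is taken to be the zero-mean normalized cross-correlation. Images are passed as flat lists of gray values.

```python
import math


def diff(a, b):
    """Mean absolute gray-value difference (assumed reading of "Diff")."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mse(a, b):
    """Mean squared error."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, peak=255.0):
    """Peak signal to noise ratio in dB; infinite for identical images."""
    m = mse(a, b)
    return math.inf if m == 0 else 10.0 * math.log10(peak * peak / m)


def xcorr(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1] (assumed variant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0
```

Under these definitions PSNR is a strictly decreasing function of MSE, so both measures always select the same rotation angle. This is consistent with the identical MSE and PSNR columns in Tables 1–3.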
Taken together, the results of rotation analysis indicate that the assumption that spiral grain can be modeled by a global log rotation does not seem to be correct. When applied to the artificially rotated images, rotation angles are computed reliably, but when applied to natural cross-section slices we face high fluctuations of rotation angles even among adjacent slices, which cannot be explained by spiral grain.
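For reference, the rotation analysis procedure evaluated above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: images are lists of rows, the pith position (cx, cy) and the radius of the circular analysis area are given by the caller, only bilinear interpolation and MSE are shown, and the sign convention of the angle is arbitrary. The 101 candidate angles are spaced 0.03° apart, which matches the granularity of the values in the tables.

```python
import math


def bilinear(img, x, y):
    """Bilinearly interpolated gray value at real-valued (x, y); 0 outside."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    if x0 < 0 or y0 < 0 or x0 + 1 >= w or y0 + 1 >= h:
        return 0.0
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x0 + 1]
    bottom = (1 - fx) * img[y0 + 1][x0] + fx * img[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bottom


def rotate_point(x, y, cx, cy, cos_a, sin_a):
    """Rotate (x, y) about the pith (cx, cy)."""
    dx, dy = x - cx, y - cy
    return cx + cos_a * dx + sin_a * dy, cy - sin_a * dx + cos_a * dy


def rotate_image(img, angle_deg, cx, cy):
    """Rotate a whole image about (cx, cy) using inverse mapping."""
    a = math.radians(angle_deg)
    ca, sa = math.cos(a), math.sin(a)
    return [[bilinear(img, *rotate_point(x, y, cx, cy, ca, sa))
             for x in range(len(img[0]))] for y in range(len(img))]


def estimate_rotation(ref, nxt, cx, cy, radius, max_angle=1.5, steps=100):
    """Return the angle in [-max_angle, +max_angle] for which the rotated
    version of nxt (image i+1) has the smallest MSE against ref (image i),
    evaluated only inside a disc around the pith."""
    points = [(x, y) for y in range(len(ref)) for x in range(len(ref[0]))
              if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]
    best_angle, best_err = None, float("inf")
    for k in range(steps + 1):
        angle = -max_angle + 2.0 * max_angle * k / steps
        a = math.radians(angle)
        ca, sa = math.cos(a), math.sin(a)
        err = 0.0
        for x, y in points:
            rx, ry = rotate_point(x, y, cx, cy, ca, sa)
            d = ref[y][x] - bilinear(nxt, rx, ry)
            err += d * d
        err /= len(points)
        if err < best_err:
            best_angle, best_err = angle, err
    return best_angle
```

Restricting the comparison to a disc around the pith plays the role of the analysis area in Fig. 2(b): it keeps background pixels, and pixels that leave the image under rotation, out of the error.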
Table 1. Result of the rotation analysis on image set A (all values represent the rotation [in °] between the pair of cross-section slices denoted in the “Image” column)
Image set A, rotation analysis with bilinear interpolation
             distance 1                              distance 3
Image            Diff     MSE    PSNR  X-corr    Image    Diff     MSE    PSNR  X-corr
01-02             1.2     1.2     1.2     1.2    01-04   -0.75   -0.78   -0.78    4.08
02-03            0.93    0.87    0.87     1.5    02-05   -0.81   -0.78   -0.78     4.5
03-04            -0.9   -0.96   -0.96     1.5    03-06   -0.96   -0.96   -0.96   -0.33
04-05             1.2    1.17    1.17    1.17    04-07       0   -0.45   -0.45    0.12
05-06           -0.06   -0.54   -0.54   -0.15    05-08   -0.72   -0.72   -0.72   -0.36
06-07           -0.66   -0.54   -0.54   -0.51    06-09   -0.06   -0.63   -0.63   -0.18
07-08           -0.06   -0.57   -0.57    -0.6    07-10       0    0.06    0.06    2.58
08-09               0       0       0   -0.54    08-11    4.17    3.66    3.66     4.5
09-10               0       0       0     1.5    09-12    4.17    4.14    4.14     4.5
10-11            1.17    1.14    1.14    1.47    10-13   -0.06   -0.24   -0.24     4.5
...
22-23            0.03    0.06    0.06     1.5    22-25     3.9    3.87    3.87     4.5
23-24           -1.47   -1.05   -1.05   -0.87    23-26       0    0.09    0.09    0.15
24-25             1.2     1.2     1.2    1.23    24-27    3.66    2.85    2.85    3.69
25-26               0   -0.06   -0.06   -0.78    25-28     2.4    2.79    2.79    1.83
...
38-39            1.26    1.26    1.26     1.5    38-41     4.2     4.2     4.2    2.13
39-40               0       0       0       0    39-42    4.17    0.27    0.27    4.08
40-41            1.23    1.23    1.23    1.29
41-42            1.23    1.23    1.23    1.29
Sum [°]         11.10   13.20   13.20   27.15            14.10   11.66   11.66   30.07
° per sample     0.27    0.32    0.32    0.66             0.34    0.28    0.28    0.73
min value        -1.5    -1.5    -1.5    -1.5            -2.13   -1.59   -1.59   -0.84
max value        1.26    1.26    1.26     1.5              4.2     4.2     4.2     4.5
Table 2. Result of rotation analysis on image set B
Image set B, rotation analysis with bilinear rotation
             distance 1                              distance 3
Image            Diff     MSE    PSNR  X-corr    Image    Diff     MSE    PSNR  X-corr
01-02           -0.06    0.72    0.72     1.5    01-04    3.78    3.72    3.72     4.5
02-03            0.75    0.75    0.75     1.5    02-05    3.75    3.75    3.75    4.47
03-04            0.78    0.78    0.78     1.5    03-06    0.03    0.63    0.63     4.5
04-05            0.78    0.78    0.78     1.5    04-07    1.68    1.71    1.71     4.5
05-06           -0.96   -0.96   -0.96   -0.66    05-08       0   -0.06   -0.06       0
06-07             1.2    1.17    1.17     1.5    06-09    3.81    3.78    3.78     4.5
07-08           -0.06   -0.12   -0.12       0    07-10    3.75    3.75    3.75     4.5
08-09            1.23    1.23    1.23     1.5    08-11    4.17    4.17    4.17    4.47
09-10            1.23    1.26    1.26     1.5    09-12    3.78    3.78    3.78     4.5
10-11            1.26    1.26    1.26     1.5    10-13    2.91    2.94    2.94     4.5
11-12           -0.06   -0.09   -0.09       0    11-14    3.81    3.78    3.78     4.5
12-13            1.17    1.14    1.14     1.5    12-15   -0.57   -0.69   -0.69    4.29
13-14            1.23    1.26    1.26    1.47    13-16    4.17   -0.96   -0.96    4.38
14-15            1.26    1.26    1.26     1.5    14-17    4.17    4.17    4.17    4.47
15-16               0       0       0     1.5    15-18    1.26    1.05    1.05     1.2
16-17               0    0.03    0.03     1.5    16-19    1.65    1.68    1.68    4.47
17-18            0.03    0.06    0.06    0.15    17-20    1.26    1.23    1.23    1.26
18-19            0.69    1.11    1.11     1.5    18-21    0.57     0.6     0.6    1.17
19-20            0.03    0.03    0.03     1.5
20-21            0.03    0.06    0.06    0.12
Sum [°]         10.53   11.73   11.73   22.08           16.289   14.46   14.46   24.51
° per sample    0.526   0.526   0.526   1.104            0.814   0.723   0.723   1.226
min value       -0.96   -0.96   -0.96   -0.66            -0.57   -0.96   -0.96       0
max value        1.26    1.26    1.26     1.5             4.17    4.17    4.17     4.5
Table 3. Result of rotation analysis on our artificial tree set (with “bilinear rotation”)
Artificial image set, -0.5°/img, total 8.5°, rotation analysis with bilinear interpolation
             distance 1                              distance 3
Image            Diff     MSE    PSNR  X-corr    Image    Diff     MSE    PSNR  X-corr
01-02           -0.51   -0.48   -0.48   -0.57    01-04   -1.53    -1.5    -1.5   -1.62
02-03           -0.51   -0.48   -0.48   -0.57    02-05   -1.53   -1.53   -1.53   -1.65
03-04           -0.51   -0.51   -0.51   -0.63    03-06   -1.53   -1.53   -1.53   -1.68
04-05           -0.51   -0.48   -0.48   -0.66    04-07   -1.53   -1.53   -1.53   -1.59
05-06           -0.51   -0.45   -0.45   -0.57    05-08   -1.53   -1.53   -1.53   -1.68
06-07           -0.51   -0.51   -0.51   -0.63    06-09   -1.53   -1.53   -1.53   -1.62
...
15-16           -0.51   -0.51   -0.51    -0.6    15-18   -1.53   -1.53   -1.53   -1.59
16-17           -0.51   -0.48   -0.48   -0.57
17-18           -0.51   -0.48   -0.48   -0.51
Sum [°]         -8.61   -8.01   -8.01  -10.14           -22.95  -22.77  -22.77  -24.42
° per sample   -0.506  -0.471  -0.471  -0.596            -0.51  -0.506  -0.506  -0.543
min value       -0.51   -0.51   -0.51   -0.69            -1.53   -1.56   -1.56   -1.71
max value       -0.45   -0.21   -0.21   -0.45            -1.53   -1.44   -1.44   -1.56
Figs. 3(a)–3(c) display typical visual examples of the results when block matching is applied to our image sets. Whereas the motion vectors obtained when applying block matching to the artificially rotated image set clearly indicate circular motion, the results for image sets A and B as displayed in Figs. 3(a) and 3(b) show, besides vectors oriented in the correct tangential direction, also highly irregular motion where even adjacent motion vectors point in almost opposite directions. For many blocks, no motion vectors at all have been identified. Without the acceptable results for the artificial image set one might argue that the assumption of constant translational motion within a block is too crude an approximation for the actual rotation, but this does not seem to be the case. At present, the reason for the difference between the results for the artificial image set and the natural slices is not clear.
Fig. 3. Block matching applied to image sets: (a) image set A, (b) image set B, (c) artificial
In Figs. 4(a)–4(c) we show typical visual examples when applying the Horn & Schunck optical flow approach to our image sets. Improving even upon block matching, the rotation of the artificial image set is perfectly determined (see Fig. 4(c)). For the natural image sets A and B the result is not that clear, but we notice areas of homogeneous optical flow in the lower left part of the slice in set A (see Fig. 4(a)) and in the lower and upper right part of the slice in set B. The judgement whether the indicated motion corresponds to spiral grain or is caused by other growth phenomena has to be left to experts in the wood industry. If it does correspond to spiral grain, then spiral grain is actually not of “spiral” nature.
Fig. 4. Horn & Schunck optical flow applied to image sets: (a) image set A, (b) image set B, (c) artificial
Fig. 5. Lucas & Kanade optical flow applied to image sets: (a) image set A, (b) image set B, (c) artificial
Finally, Figs. 5(a)–5(c) display typical examples of the application of the Lucas & Kanade approach. Whereas rotation is again reliably detected for the artificial set (except for areas near the pith), the optical flow fields for sets A and B are very irregular and do not seem to indicate any relation to regular growth phenomena. Here the price for relaxing the global smoothness constraint is visualized in a drastic manner.
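To make the local nature of the Lucas & Kanade constraint concrete, the following sketch estimates a single flow vector from one window by least squares. It is a textbook-style illustration under stated assumptions, not the implementation used in the experiments: the window size, the central-difference gradients and the function name are chosen for illustration only.

```python
def lucas_kanade_window(img1, img2, cx, cy, half=5):
    """Estimate one flow vector (u, v) for the window centred at (cx, cy).

    Solves the 2 x 2 least-squares system built from the spatial gradients
    (Ix, Iy) of img1 and the temporal difference It = img2 - img1 over the
    window. Returns None if the system is (nearly) singular.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (img1[y][x + 1] - img1[y][x - 1]) / 2.0  # central difference
            iy = (img1[y + 1][x] - img1[y - 1][x]) / 2.0
            it = img2[y][x] - img1[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return None  # gradients in the window do not constrain the motion
    # Solve [[sxx, sxy], [sxy, syy]] (u, v)^T = (-sxt, -syt)^T.
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v
```

Each window is solved independently of its neighbours, so nothing ties adjacent vectors together. This is the relaxation of the global smoothness constraint discussed above.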
4 Conclusion
While artificial global rotation may be detected and estimated reliably by all three types of techniques considered, natural spiral grain is much harder to capture. Whereas block matching and Lucas & Kanade optical flow seem to be entirely unsuitable for this application (at least in the parameter ranges applied in our implementation), we obtain at least partially coherent optical flow fields with the Horn & Schunck technique and reasonable rotation angles for parts of the data set using rotation analysis. A comparison with true spiral grain angles from our samples was not carried out since the variation of the global rotation vectors is still too strong. In future work we will concentrate on applying preprocessing techniques to the gray-scale cross-section images in order to improve on the results found so far. An application of the algorithms limited to special areas of the cross section will be studied as well.
Acknowledgements. The authors would like to thank Alfred Teischinger and Ulrich Müller from the University of Natural Resources and Applied Life Sciences in Vienna for the possibility to use DICOM data of large dimensioned specimens of Norway Spruce recorded within the project XXL-Wood [8,9], and for their suggestion to study the problem. This work was partially supported by the Austrian Science Fund (FWF) Grant P17434-N13.
References
1. Forest Products Laboratory: Wood handbook – Wood as an engineering material. Gen. Tech. Rep. FPL-GTR-113. Madison, WI: U.S. Department of Agriculture, Forest Service (1999), Online Version (March 2007), http://www.fpl.fs.fed.us/
2. Harris, J.M.: Spiral grain and wave phenomena in wood formation. Springer, Heidelberg (1989)
3. Nyström, J.: Automatic measurement of compression wood and spiral grain for the prediction of distortion in sawn wood products. PhD thesis, Luleå University of Technology (2002), Available from: http://www.tt.luth.se/staff/jany/
4. Grönlund, A., Oja, J., Grundberg, S., Nyström, J., Ekevad, M.: Process control based on measurement of spiral grain and heartwood content. Draft to be presented at The 18th International Wood Machining Seminar, Vancouver, Canada (2007)
5. Gindl, W., Teischinger, A.: The potential of VIS- and NIR-spectroscopy for the nondestructive evaluation of grain-angle in wood. Wood and Fiber Science 34, 651–656 (2002)
6. Sepúlveda, P., Oja, J., Grönlund: Predicting spiral grain by computed tomography of Norway spruce. Journal of Wood Science 48, 479–483 (2002)
7. Ekevad, M.: Method to compute fiber directions in wood from computed tomography images. Journal of Wood Science 50, 41–46 (2004)
8. Teischinger, A., Patzelt, M.: XXL-Wood. Berichte aus Energie- und Umweltforschung 27/2006. BMVIT, Vienna, Austria (2006), Online Version (March 2007): www.fabrikderzukunft.at/
9. Teischinger, A., Buksnowitz, C., Müller, U.: Wood properties of old growth spruce and their technological potential. In: Kurjatko, S., Kudela, J., Lagaňa, R. (eds.) Proceedings of the 5th Symposium “Wood Structure and Properties 06”, pp. 413–416. Arbora Publishers, Zvolen, Slovakia (2006)
10. Furht, B.: Motion Estimation Algorithms for Video Compression. Kluwer Academic Publishers, Boston, MA (1997)
11. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
12. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of Imaging understanding workshop, pp. 121–130 (1981)
Deterministic and Stochastic Methods for Gaze Tracking in Real-Time
Javier Orozco¹, F. Xavier Roca¹, and Jordi Gonzàlez²
¹ Computer Vision Center & Dept. de Ciències de la Computació, Edifici O, Campus UAB, 08193 Bellaterra, Spain
² Institut de Robòtica i Informàtica Industrial (UPC – CSIC), C. Llorens i Artigas 4-6, 08028, Barcelona, Spain
Abstract. Psychological evidence demonstrates that eye gaze analysis is required for human computer interaction endowed with emotion recognition capabilities. Existing proposals analyse eyelid and iris motion by using colour information and edge detectors, but eye movements are quite fast and difficult to track precisely and robustly. Instead, we propose to reduce the dimensionality of the image-data by using multi-Gaussian modelling and to estimate transitions by applying partial differences. The tracking system can handle illumination changes, low image resolution and occlusions while estimating eyelid and iris movements as continuous variables. Therefore, this is an accurate and robust tracking system for eyelids and irises in 3D for standard image quality.
1 Introduction
Eyelid and iris motion description is required for the evaluation of human emotion, truth and deception by combining psychological and pattern recognition techniques. Ekman and Friesen [4] established that there are perceptible human emotions which can be detected early by analysing eyelid and iris movements. Applications in Human Computer Interaction (HCI) demand robustness and accuracy in real-time, which determine the performance evaluation of the techniques proposed so far.
In the literature, there are approaches for gaze analysis based on contour detectors, colour segmentation, the Hough transform and optical flow for eyelids and irises [10,6,9]. These methods are time-consuming and depend on the image quality. On the other hand, restricted detailed textures and templates have been proposed in order to apply template matching, skin colour detection and image energy minimization [1,7]; for example, Moriyama et al. [7] deal with three eyelid states, namely open, closed and fluttering, by constructing detailed templates of skin textures for eyelid, iris and sclera. This approach requires training and texture matching. These methods are therefore difficult to generalize to different image and environment conditions.
We propose in this paper a robust and accurate eyelid and iris tracking by combining stochastic and deterministic approaches with reduced image-data. To provide robustness, we reduce the input image by applying Appearance Models [3], which learn the skin texture on-line based on multi-Gaussian assumptions. To provide accuracy, we construct two Appearance-Based Trackers (ABT) to estimate the transition by partial differences with respect to appearance parameters. The first one excludes the sclera and iris information, while achieving fast and accurate eyelid adaptation for any kind of blinking and fluttering motion. The second one is able to track the iris movements, while retrieving the correct adaptation after eyelid occlusions or iris saccade movements. Both trackers agree on the best 3D mesh pose, which depends on the head position. The head pose estimation enhances the system capabilities for tracking eyelids and irises in different head positions, instead of being restricted to a frontal face.
Compared to existing gaze tracking methods, our proposed approach has several advantages. First, our system achieves an efficient eyelid and iris tracking, whose movements are encoded as continuous values. Second, this approach is suitable for real-time applications while handling occlusions, illumination changes, fast saccades and blinking movements. Third, we deal with small images and low resolution, which extends the capabilities of the system to different types of applications.
The paper is organized as follows: Section 2 describes the theoretical foundations for appearance-based trackers, the stochastic observation and deterministic transition models. Section 3 presents experimental results and discussion. Finally, Section 4 concludes the paper with the main conclusions and future avenues of research.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 45–52, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Appearance-Based Tracking
2.1 Image-Data Reduction
The components of the appearance model are the deformable model and the texture. In order to model the eye region, we construct a 3D model of both left and right eyes, which is composed of 36 vertices and 53 triangles, see Fig. 1(a). This mesh covers the eyeballs, the upper and the lower eyelids, the sclera and the iris. The mesh deformation is determined by the matrix $M$, which is an $n \times i$ matrix:
$M_{n,i} = m_{n,i} + G_{n,i,k} * \gamma_k$,   (1)
where $n$ is the number of vertices and $i$ corresponds to the Cartesian coordinates in the image plane. The matrix $m_{n,i}$ is determined by the biometry of each person in neutral position. The matrix $G_{n,i,k}$ deforms the mesh depending on the eyelid and iris movements. These movements are controlled by the vector $\gamma_k$, with $k = 0, 1, 2$ for eyelid, iris yaw and iris pitch, respectively. These variables are encoded according to the Facial Action Coding System (FACS), which follows the MPEG-4 codification. Thereby, the 3D mesh $M_{n,i}$ can be adapted to the eye region, see Fig. 1(b).
Fig. 1. The 3D mesh (a) is projected onto the input image (b) to construct the corresponding appearance (c)
The 3D pose of the mesh is taken into account through $\rho = [\theta_x, \theta_y, \theta_z, x, y, s]$, while assuming a weak perspective projection model. Therefore, each 3D point $P_i = (X_i, Y_i, Z_i) \in M$ is projected onto the image point $p_i = (u_i, v_i)$ by using a projection matrix $B$, that is, $(u_i, v_i)^T = B\,(X_i, Y_i, Z_i, 1)^T$. Consequently, given an input frame $F$ and the parameters that modify the mesh, encoded as $q = [\rho, \gamma]$, we construct an appearance model to represent the eye region [2]. The shape is provided by the 3D mesh and the texture is obtained by applying a warping function $\Psi(F, q)$, which transfers each pixel from the input frame $F$ into a reference texture according to the vector $q$. In this way, the final appearance $A(q) = [a_0, \ldots, a_l]$ is obtained, which depends on the mesh configuration $q$, see Fig. 1(c).
2.2 Stochastic Appearance Observation
The observation model aims to provide the expected appearances over time, based on previous estimations. A likelihood distribution is therefore an appropriate means of generating the expected appearances while learning from previous estimations. Given an appearance model $A(q) = [a_0, \ldots, a_l]$, which depends on the mesh configuration $q$, we assume that each pixel $a_i$ of the appearance follows a Gaussian distribution over time. Thus, we can collect all the variables in a multi-dimensional vector, which can be assumed to follow a Gaussian distribution as well, $N(\mu, \sigma^2)$. Here $\mu$ and $\sigma$ are vectors of $l$ values corresponding to the appearance variables $a_i$, which means that the variance must be computed component by component and not through the inner product of the vector $\sigma$. Therefore, the probability of each observation is given by the conditional likelihood function:
$P(A_t \mid q_{t-1}) = \prod_{i=0}^{l} N(a_i; \mu_i, \sigma_i)$.   (2)
The tracking goal is the estimation of the vector $q$ at each frame $t$. We denote by $\hat{q}_t$ and $\hat{A}_t(\hat{q}_t)$ the tracked parameters and the estimated appearances [2]. For the sake of clarity, we henceforth assume that $A_t(q_t)$ and $A_t$ are equivalent, and use either depending on the level of detail required. The estimated average appearance is obtained by applying a recursive filtering technique. Thus, $\mu$ and $\sigma^2$ are updated for the next frame with respect to previous adaptations and a learning coefficient $\lambda$:
$\mu_{t+1} = \lambda \mu_t + (1 - \lambda) \hat{A}_t$ and $\sigma^2_{t+1} = \lambda \sigma^2_t + (1 - \lambda)(\hat{A}_t - \mu_t)^2$,   (3)
where $\mu$ and $\sigma$ are initialized with the first appearance $A_0$.
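The recursive update of Eq. (3) and the pixel-wise likelihood of Eq. (2) can be sketched as follows. This is an illustrative sketch under stated assumptions: appearances are flat lists of pixel values, the pixels are treated as independent so that the likelihood factorizes, the likelihood is evaluated in the log domain for numerical convenience, and the variance floor `eps` as well as both function names are hypothetical.

```python
import math


def update_gaussian(mu, var, appearance, lam):
    """Eq. (3): per-pixel recursive update of mean and variance.

    mu, var    -- current per-pixel mean and variance (lists)
    appearance -- the adapted appearance of the current frame (list)
    lam        -- learning coefficient lambda
    """
    new_mu = [lam * m + (1.0 - lam) * a for m, a in zip(mu, appearance)]
    new_var = [lam * v + (1.0 - lam) * (a - m) ** 2
               for v, m, a in zip(var, mu, appearance)]
    return new_mu, new_var


def log_likelihood(appearance, mu, var, eps=1e-6):
    """Logarithm of Eq. (2): sum over pixels of log N(a_i; mu_i, sigma_i).

    eps is a hypothetical floor that keeps the variance strictly positive.
    """
    total = 0.0
    for a, m, v in zip(appearance, mu, var):
        v = max(v, eps)
        total += -0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
    return total
```

Note that the variance update uses the mean from before the update, as written in Eq. (3), so both statistics are computed from the same previous state.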
Fig. 2. The 3D mesh (a) and the appearance (b) for the eyelid tracker. The 3D mesh (c) and the appearance (d) for the iris tracker.
2.3 Deterministic Appearance Transition
In order to estimate the vector $q_t$ for the next frame, we adopt an adaptive velocity model, in which a deterministic function predicts the transition state from the previous prediction $\hat{q}_{t-1}$:
$\hat{q}_t = \hat{q}_{t-1} + \Delta\hat{q}_t$,   (4)
where $\Delta\hat{q}_t$ is the shift of the mesh configuration.
Consequently, for each $\hat{q}_t$ we construct the corresponding appearance, which is compared with the likelihood average appearance by the Mahalanobis distance. Therefore, given Eq. (4), the appearance becomes $A_t \approx \mu_t$, which can be approximated via a first-order Taylor series expansion around $\hat{q}_t$ using the vanilla gradient descent method [8]:
$A_t(q_t) \approx \Psi(F_t, \hat{q}_{t-1}) + \frac{\partial(\hat{A}_t, \hat{q}_t)}{\partial \hat{q}_t} (\hat{q}_t - \hat{q}_{t-1})$.   (5)
As a result, the estimation of the vector $\hat{q}_t$ depends on both the previous adapted and the current average appearance, as well as on the minimization distance. The gradient is computed by partial differences with specific descent steps, due to the saccade movements and spontaneous blinking. Thus, tracking is enhanced by quickly retrieving the best adaptation while avoiding drifting problems. Illumination changes, occlusions and fast movements are treated as outliers by constraining the gradient descent step for each component of the shift vector $\Delta q$ with Huber's function [5]. Huber's function $\hat{\xi}$ is defined through:
$\eta(y) = \frac{1}{y} \frac{d\hat{\xi}(y)}{dy} = \begin{cases} 1 & \text{if } |y| \le c \\ \frac{c}{|y|} & \text{if } |y| > c \end{cases}$   (6)
where $y$ is the value of a pixel in the appearance $A_t$ normalized by the appearance statistics $\bar{\mu}$ and $\bar{\sigma}$ according to the Gaussian assumption for the appearance. The constant $c$ is set to $3 * \bar{\sigma}$. Thus, we constrain the appearance registration on the probabilistic model.
2.4 Eyelid and Iris Tracking
Eyelids and irises perform fast movements, which are difficult to predict due to their spontaneous motion and the low resolution of images from monocular cameras. Therefore, a probabilistic model for them would be quite uncertain. Instead, we propose two models, one for each tracker, which avoid those pixels that add uncertainty to the appearance predictions, see Fig. 2. The eyelid tracker uses an appearance model which excludes the sclera and the iris regions by warping these pixels like eyelid skin. Subsequently, the iris tracker uses an appearance that includes the sclera and iris region. Thus, we construct two Appearance-Based Trackers (ABT), which are combined sequentially, as described next.
Eyelid Tracker $T_w$: estimates the eyelid position independently from the irises, see Fig. 2(a). The tracking vector is $w = [\rho, \gamma_0]$, and the appearance model $A(w)$ excludes sclera and iris, see Fig. 2(b):
1. Obtain $A_t(w_{t-1})$ by applying the warping function $\Psi(F_t, w_{t-1})$.
2. Estimate the Gaussian parameters¹ $\mu_t(w)$ and $\sigma^2_t(w)$ by using Eq. (3). We try five different learning coefficients $\lambda_w$; high values allow fast movements to be learned, while low values keep more information to handle occlusions.
3. Compute the eyelid gradient based on the previous adaptation $w_{t-1}$ by using Eq. (5), while testing the whole FACS range [−1, 1].
4. Obtain the best estimation using Eq. (4), by comparing the average and likelihood appearances through a Mahalanobis distance in an iterative Gauss-Newton process. The search involves exploitation more than exploration to avoid local minima and to estimate the spontaneous movements.
Iris Tracker $T_q$: estimates yaw and pitch orientations for the irises, see Fig. 2(c). The tracking vector is $q = [\rho, \gamma_0, \gamma_1, \gamma_2]$, which yields the appearance $A(q)$:
1. Obtain $A_t(q_{t-1})$ by applying the warping function $\Psi(F_t, q_{t-1})$.
2. Estimate the Gaussian parameters $\mu_t(q)$ and $\sigma^2_t(q)$ by applying Eq. (3). The learning coefficient $\lambda_q$ is lower than $\lambda_w$ to keep more information from previous frames, since the iris motion is slower than the eyelid blinking.
3. Compute the iris gradient based on the previous adaptation $q_{t-1}$ for $q = [\rho, \gamma_0, \gamma_1, \gamma_2]$ over the whole FACS range [−1, 1], using Eq. (5).
4. Finally, compute the best estimation using Eq. (4), by minimizing the Mahalanobis distance. The convergence is achieved by using a shorter exploration than the previous tracker.
Both trackers are connected through the error estimation, which is standardized according to the number of pixels of each appearance. Subsequently, the eyelid tracker provides the vector $w = [\rho, \gamma_0]$, while the iris tracker estimates the iris while correcting the previous mesh orientation, $q = [\rho, \gamma_0, \gamma_1, \gamma_2]$, see Fig. 3.
¹ Let $\mu(w)$ and $\sigma^2(w)$ be the Gaussian parameters corresponding to the eyelid tracker $T_w$. Similarly, $\mu(q)$ and $\sigma^2(q)$ are the Gaussian parameters for the iris tracker $T_q$.
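The Huber weighting of Eq. (6), which both trackers rely on to down-weight outlier pixels, can be sketched as follows. This is only an illustration under stated assumptions: each pixel is normalized as $y = (a - \mu)/\sigma$, so the threshold c = 3 is expressed in units of the standard deviation, and the weighted distance shown is one plausible way of combining the weights with a Mahalanobis-style distance under independent pixels. Neither function name nor the exact use of the weights is taken from the paper.

```python
def huber_weight(y, c):
    """Eq. (6): weight 1 inside [-c, c], c / |y| outside."""
    return 1.0 if abs(y) <= c else c / abs(y)


def robust_distance(appearance, mu, sigma, c=3.0):
    """Huber-weighted sum of squared, normalized pixel residuals.

    appearance, mu, sigma -- flat lists of per-pixel values, means and
    standard deviations. Pixels further than c standard deviations from
    the mean contribute linearly instead of quadratically.
    """
    total = 0.0
    for a, m, s in zip(appearance, mu, sigma):
        y = (a - m) / s  # residual normalized by the appearance statistics
        total += huber_weight(y, c) * y * y
    return total
```

With this weighting a pixel that is 10 standard deviations away contributes 30 rather than 100 to the distance, which limits the influence of occlusions and illumination changes on the registration.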
Fig. 3. The eyelid tracking handles movements and blinks as continuous values. Iris tracking for spontaneous movements and saccades. The eyelid position is corrected for the pitch variation effects.
3 Experimental Results
Experiments were run on a 3.2 GHz Pentium PC, in ANSI C code. Three image sequences of 250 frames were used for testing both trackers; they originally correspond to facial image sequences with the cropped eye region. They were recorded with monocular and photographic cameras without specific illumination conditions. We do not use high image resolution, because the reference texture size is 14x18 pixels. We computed the eyelid estimation by using $T_w$ while dealing with any kind of blinking, e.g. open, closed and fluttering. In addition, the iris estimation is done by using $T_q$, which estimates yaw and pitch motion, saccade movements and eyelid occlusions.
On the one hand, the eyelid tracker adapts the eyelid position for slow movements and blinks while estimating the mesh orientation, see Fig. 4. This tracker does not depend on iris estimations even when the pitch movement affects the eyelid position. However, this tracker warps the sclera and iris pixels as eyelid skin; otherwise, those pixels are considered outliers when the eyelid is occluding the inner eye region. On the other hand, the iris tracker adapts well to the iris position while dealing with slow yaw and pitch motion, and retrieves the correct adaptation of the mesh after either saccade motion or iris movements while the eyelid occludes the iris.
Fig. 4. Eyelid and Iris trackers are applied sequentially while learning each appearance texture on-line, which enhances the robustness and accuracy
Both trackers use the FACS codification, which is expressed as continuous values between -1.0 and 1.0, see Fig. 3. The eyelid and iris pitch have similar plots because they commonly perform a synchronized and spontaneous movement. This correlation is well estimated even though both are tracked independently.
For an appearance model size of 14x18 pixels, we obtained a performance of 21 fps and 96% correct adaptations. The error is principally due to saccade motion or eyelid occlusions. Each sequence was also tested using an appearance size of 5x11 pixels, obtaining 52 fps for the iris tracker and 67 fps for the eyelid tracker, with an average adaptation accuracy of 85%. Iris tracking is a challenge for the small reference texture, where the iris size is 2x3 pixels instead of 5x6 at the larger resolution.
Learning the texture on-line and handling illumination changes are important capabilities for obtaining good adaptations when analysing low-resolution images and occluded eye regions. Fig. 5 shows a sequence of 400 images in which the subject is wearing sunglasses; the input eye region is 42x82 pixels. The tracker achieved 91% correct adaptations with a reference texture of 14x18 pixels. We obtain 82% correct adaptations when analysing small images with low resolution, where the input eye region is 10x18, see Fig. 6. The whole input image is 112x160, which is appropriate for video conference software.
Fig. 5. Due to tracking capability to handle illumination changes and occlusions, the tracker has a good performance with eyes wearing glasses
Fig. 6. The eyelid and iris tracking deals with images of small and low-resolution. These images are 64x20 pixels where each input eye is 18x12 pixels.
4 Conclusions
This work makes three main contributions. First, the information and dimensionality of the input image are reduced by constructing appearance models, which are appropriate for statistical modelling. Second, the stochastic observation model provides an accurate likelihood function, which is conditioned by previous estimations and accumulated appearance information. Third, the deterministic transition model allows an appearance space to be generated by applying first-order Taylor approximations around the previous adaptation.
The experimental results have shown that, by sequentially combining two ABT, the system is able to accurately estimate the eyelid and iris position in 3D. The system does not require high quality images or specific illumination conditions, since we do not use colour information, edge detectors, or motion extraction algorithms. We have demonstrated in this framework that eyelid and iris motion can be estimated reliably for HCI applications and psychological systems, which demand real-time performance with accurate results. Our system provides a robust and accurate gaze description, able to handle occlusions, illumination changes, and small and low-resolution images in 3D.
Acknowledgement. This work is supported by EC grants IST-027110 for the HERMES project, IST-045547 for the VIDI-Video project, by the Spanish MEC under projects TIN2006-14606 and DPI-2004-5414. Jordi Gonzàlez also acknowledges the support of a Juan de la Cierva Postdoctoral fellowship from the Spanish MEC.
Integration of Multiple Temporal and Spatial Scales for Robust Optic Flow Estimation in a Biologically Inspired Algorithm

Cornelia Beck, Thomas Gottbehuet, and Heiko Neumann

Inst. for Neural Information Processing, University of Ulm, Germany
{cornelia.beck,thomas.gottbehuet,heiko.neumann}@uni-ulm.de
Abstract. We present a biologically inspired iterative algorithm for motion estimation that combines the integration of multiple temporal and spatial scales. This work extends a previously developed algorithm that is based on mechanisms of motion processing in the human brain [1]. The temporal integration approach realizes motion detection using one reference frame and multiple past and/or future frames leading to correct motion estimates at positions that are temporarily occluded. In addition, this mechanism enables the detection of subpixel movements and therefore achieves smoother and more precise flow fields. We combine the temporal integration with a recently proposed spatial multi scale approach [2]. The combination further improves the optic flow estimates when the image contains regions of different spatial frequencies and represents a very robust and efficient algorithm for optic flow estimation, both on artificial and real-world sequences.
1 Introduction
The correct and complete detection of optic flow in image sequences remains a difficult task (see [3] for an overview of existing technical approaches). Exact knowledge about the movements of the surroundings is needed in many technical applications, e.g., in the context of autonomous robot navigation, but the need for real-time computation in combination with reliable motion estimates further complicates the task. Common problems of optic flow detection are the generation of smooth optic flow fields and the detection of independently moving objects at various speeds. A further difficulty is motion detection in temporarily occluded regions. Humans and animals solve these problems in everyday vision very accurately and fast. Neurophysiological and psychophysical research has revealed some basic processing principles of the highly efficient mechanisms in the brain [4,5]. We present here extensions of an algorithm derived from a neural model [6,1] based on these research results. The integration of different temporal scales improves the quality of the detected optic flow in case of temporal occlusions and leads to a more precise representation. In addition, a combination with a multi-scale approach makes the algorithm more robust in image sequences containing regions with different spatial frequencies.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 53–60, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Biologically Inspired Algorithm
For optic flow estimation different approaches like regularization, Bayesian models or spatiotemporal energy models have been developed to compute globally consistent flow fields [7,8,9]. Another way is to build a model that simulates the neural processing in the primate visual system. We previously presented such a neural model for optic flow detection based on the first stages of motion processing in the brain [6], namely the areas V1 and MT, including feedforward as well as feedback connections [10]. In model area V1 raw motion is initially detected, model area MT estimates the optic flow of larger regions. To reduce both computing and memory requirements we derived an efficient algorithm from the neural model (see Fig. 1(a), details for implementation in [1]). For the fast extraction of motion correspondences the algorithm uses a similarity measure of the class of rank-order approaches: A variation of the Census Transform [11] provides an abstract representation for directional derivatives of the luminance function. Accordingly, correspondences between two frames of an image sequence can be extracted at locations with the same Census values. Each motion correspondence (called hypothesis) includes a weight which indicates the likelihood of a particular velocity at a certain position. The recurrent signal modulates the likelihood of predicted hypotheses in the first module, enhancing existing motion estimates by (1). To improve the estimations, the hypotheses are integrated in space and velocity (2). In this feedforward integration step hypotheses that are supported by neurons adjacent in position and velocity have an advantage in comparison to isolated motion hypotheses. The likelihoods are modified by a normalization via lateral shunting inhibition (3). The computation of a likelihood representation of motion estimation in module MT utilizes a homologue architecture on a coarser spatial scale (V1:MT ratio is 1:5).

\[ \mathrm{likelihood}^{V1}_{1} = \mathrm{Input} \cdot \left(1 + C \cdot \mathrm{likelihood}^{MT}_{3}\right) \tag{1} \]

\[ \mathrm{likelihood}^{V1}_{2} = \left(\mathrm{likelihood}^{V1}_{1}\right)^{2} * G^{(\mathrm{space})} * G^{(\mathrm{vel})} \tag{2} \]

\[ \mathrm{likelihood}^{V1}_{3} = \mathrm{likelihood}^{V1}_{2} \Big/ \Big(0.01 + \sum_{\mathrm{vel}} \mathrm{likelihood}^{V1}_{2}\Big) \tag{3} \]
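The three processing steps of Eqs. (1)-(3) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy grid indexed as [position][velocity] with one spatial and one velocity dimension, a three-tap binomial kernel standing in for the Gaussians, and a hypothetical feedback gain C = 100. The function names are chosen for illustration only.

```python
def smooth(values, kernel=(0.25, 0.5, 0.25)):
    """Convolve a 1-D list with a small kernel (zero padding at the borders)."""
    r = len(kernel) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(values):
                acc += w * values[j]
        out.append(acc)
    return out


def v1_update(inp, feedback, C=100.0, eps=0.01):
    """One pass of Eqs. (1)-(3) on a grid indexed as [position][velocity].

    C and the smoothing kernel are illustrative assumptions.
    """
    n_pos, n_vel = len(inp), len(inp[0])
    # Eq. (1): feedback modulates existing input activity but cannot create it.
    l1 = [[inp[p][v] * (1.0 + C * feedback[p][v]) for v in range(n_vel)]
          for p in range(n_pos)]
    # Eq. (2): square, then integrate over velocity and over space.
    squared = [[x * x for x in row] for row in l1]
    over_vel = [smooth(row) for row in squared]
    cols = [smooth([over_vel[p][v] for p in range(n_pos)]) for v in range(n_vel)]
    l2 = [[cols[v][p] for v in range(n_vel)] for p in range(n_pos)]
    # Eq. (3): shunting normalisation over all velocities at each position.
    return [[x / (eps + sum(row)) for x in row] for row in l2]
```

With an input that is ambiguous between two velocities, a small feedback signal for one of them is enough to make that velocity dominate after normalisation, which is the disambiguating effect described above.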
3 Temporal Integration
In the algorithm presented in the last section, the initial motion detection is only calculated for two successive frames (the reference $t_0$ and its previous frame $t_{-1}$). If an image sequence contains temporal occlusions, e.g., an object moving in front of a background, this procedure will fail to calculate the correct optic flow for some of the image regions. The previous frame $t_{-1}$ contains areas where parts of the background are occluded, while they are visible in frame $t_0$. In the standard algorithm this leads to wrong or missing motion estimates in such occluded regions. This problem can be solved by using motion cues of additional frames for the initial motion detection (see Fig. 1(c) and Fig. 3). A similar mechanism was proposed by [12] for ordinal depth segmentation of moving objects.

Fig. 1. (a) Iterative model of optic flow estimation: V1 and MT-Fine represent the standard algorithm. The dashed box marks the extension added for the integration of multiple spatial scales where a coarse initial guess of the optic flow is calculated in V1 and MT-Coarse that supports the subsequent creation of hypotheses in V1 and MT-Fine. (b) Temporal integration of multiple frames. When using more than two input frames for the motion detection, the motion cues from the different combinations are calculated and the most reliable subset of these is used as input to V1. As the temporal distances between the frames vary, the velocities need to be scaled to the largest temporal distance. (c) Motion detection in sequences with occlusion can be solved using an additional temporally forward-looking step (future step).

The integration of additional frames for motion detection does not only provide an advantage in the case of occlusions. Consider subpixel movement, e.g., in the center of an expanding flow field or for slowly moving objects, where common correlation based approaches are not able to resolve the motion. Utilizing additional frames with larger temporal distance rescales the small velocities to a detectable speed. This leads to a higher resolution of direction and speed and hence provides smoother flow fields (see Fig. 4). The mechanism of integrating motion correspondences of multiple frames used in our algorithm is depicted in Fig. 1(b). One future step is sufficient to solve the temporal occlusion problem whereas multiple past steps increase the accuracy of motion estimates. All hypotheses are scaled according to the maximum temporal distance and a subset with the largest likelihoods at each position is selected as input for V1.
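The scaling and selection step can be sketched as follows. This is a hedged illustration rather than the published implementation: hypotheses are assumed to be tuples (position, displacement, frame distance, likelihood) with a scalar displacement already oriented in the direction of motion, and the subset size `max_per_position` is a hypothetical parameter.

```python
from collections import defaultdict


def integrate_hypotheses(hypotheses, max_per_position=2):
    """Rescale displacements to the largest frame distance and keep the
    most likely hypotheses at each position.

    hypotheses: iterable of (position, displacement, frame_distance, likelihood)
    returns: dict position -> list of (scaled_displacement, likelihood)
    """
    hypotheses = list(hypotheses)
    k_max = max(k for _, _, k, _ in hypotheses)
    by_position = defaultdict(list)
    for pos, disp, k, weight in hypotheses:
        # A displacement measured over k frames corresponds to
        # disp * k_max / k over the maximum temporal distance.
        by_position[pos].append((disp * (k_max / k), weight))
    selected = {}
    for pos, candidates in by_position.items():
        candidates.sort(key=lambda c: c[1], reverse=True)
        selected[pos] = candidates[:max_per_position]
    return selected
```

In this representation a movement of one third of a pixel per frame, which is invisible between two successive frames, shows up as a one-pixel displacement at a frame distance of three.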
4 Integration of Multiple Spatial Scales
A limitation of the standard algorithm is that movements in spatially low-frequency areas may not be detected on a fine scale due to the ambiguity of motion cues (background of example in Fig. 2). In contrast, using the same algorithm on a coarse version of the input images will detect this movement.
However, small moving objects are overlooked on a coarse scale, as they disappear in the subsampling and the motion integration process. In our algorithm, we use a coarse and a fine scale in a way that combines the advantages of both [2] (see Fig. 1(a)). In general, algorithms using multiple scales are realized with image pyramids where the motion estimation of coarser scales influences the estimation at finer scales as an initial guess [13,14]. While the processing of the input image in resolutions of different spatial frequencies provides more information for the motion estimation, combining the different scales is a problem: How can the estimations of a coarse scale be used without erasing small regions detected in a finer scale? We only consider the motion estimation of the coarse scale if the estimation of the fine scale is highly ambiguous. In addition, the coarse estimation has to be compatible with one of the motion correspondences of the fine scale. This prevents small objects from being overlooked.
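The combination rule described above can be written down as a small decision function. The sketch below is only one possible reading: velocities are scalars for brevity, and the ambiguity test (second-best likelihood within a ratio of the best) as well as the compatibility tolerance are assumptions that the text does not specify.

```python
def combine_scales(fine, coarse, ambiguity_ratio=0.8, tolerance=1.0):
    """Choose a velocity at one position from fine-scale hypotheses and an
    optional coarse-scale estimate.

    fine: list of (velocity, likelihood); coarse: velocity or None.
    ambiguity_ratio and tolerance are hypothetical parameters.
    """
    if not fine:
        return None  # the coarse estimate must agree with a fine hypothesis
    ranked = sorted(fine, key=lambda h: h[1], reverse=True)
    best = ranked[0]
    ambiguous = len(ranked) > 1 and ranked[1][1] >= ambiguity_ratio * best[1]
    if not ambiguous or coarse is None:
        return best[0]  # a clear fine-scale estimate is never overwritten
    compatible = [h for h in ranked if abs(h[0] - coarse) <= tolerance]
    return compatible[0][0] if compatible else best[0]
```

Because the coarse estimate only selects among existing fine-scale hypotheses, a small object with a clear fine-scale response keeps its own velocity.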
Fig. 2. If the input image contains regions with a spatially low-frequency structure (background), an algorithm working on a single scale that can detect the movement of the object (small rectangle labeled by the black circle) will miss large parts of the movement in the background (fine scale). Using an additional processing scale solves this problem (two scales). Black positions are positions where no movement was detected, white positions indicate movement to the left, gray positions to the right.
5 Integration of Multiple Temporal and Spatial Scales
To take advantage of both temporal integration and processing at multiple scales, we implemented a version of the algorithm that comprises these two extensions in one approach. To keep the computing time as low as possible, the motion within the coarse scale is calculated only from the initial estimate of two frames, whereas the fine scale uses multiple frames for temporal integration. Census Transformations of the image sequence are kept for the following steps, and the size of the chosen subsets of hypotheses used as input to V1 is limited to ensure constant computational time in further processing.
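The Census signatures that are cached between steps can be illustrated with a simple ternary 3x3 variant. The sketch is an assumption-laden stand-in for the variation of the Census Transform used in the algorithm: the neighbourhood, the similarity threshold `eps`, the search radius and the rule of skipping flat signatures are all chosen for illustration.

```python
FLAT = 3280  # signature of a uniform patch: eight ternary digits equal to 1


def census_signature(img, y, x, eps=2):
    """Ternary 3x3 signature: each neighbour is darker (0), similar (1)
    or brighter (2) than the centre pixel."""
    centre = img[y][x]
    sig = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            diff = img[y + dy][x + dx] - centre
            digit = 1 if abs(diff) <= eps else (2 if diff > 0 else 0)
            sig = sig * 3 + digit
    return sig


def census_map(img, eps=2):
    """Signatures of all interior pixels; this map can be stored per frame."""
    h, w = len(img), len(img[0])
    return {(y, x): census_signature(img, y, x, eps)
            for y in range(1, h - 1) for x in range(1, w - 1)}


def match(prev_map, cur_map, radius=2):
    """Motion hypotheses (position, (dy, dx)) at equal, non-flat signatures."""
    hyps = []
    for (y, x), sig in cur_map.items():
        if sig == FLAT:
            continue  # unstructured patches give no usable correspondence
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                if prev_map.get((y - dy, x - dx)) == sig:
                    hyps.append(((y, x), (dy, dx)))
    return hyps
```

Since `census_map` depends on a single frame only, its result can be reused when the same frame is paired with several other frames during temporal integration.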
6 Results
In the first experiment, we tested the advantages of the algorithm with temporal integration in the case of occlusion. As shown in Fig. 3, the sequence contains a rectangular object moving in front of a non-static background. Using only two frames $t_0$ and $t_{-1}$ for the initial motion detection, the resulting optic flow contains many wrong velocities in the area temporarily occluded by the object. In contrast, the extended algorithm with one additional frame $t_{+1}$ successfully integrates the motion at all positions of the background as well as of the object.

Fig. 3. A rectangular object is moving to the right in front of a textured background moving to the left (see Input and Ground truth). Without temporal integration, the temporarily occluded region behind the object contains many wrong motion cues as no corresponding regions can be found (motion bleeding in third image). If an additional future time step is added to the initial motion detection, the correct corresponding regions can be matched for every position (fourth image). The correct position of the object is indicated by the black box.

Fig. 4. Results for the Yosemite Sequence after the first iteration (http://www.cs.brown.edu/people/black/Sequences/yosFAQ.html). (a) Exemplary input image, (b) Ground truth. For the motion in the sky we assume horizontal motion to the right as proposed in [3]. (c) Median angular error (degree) of module MT for the three different versions of the algorithm. After only one iteration, the results of the algorithm with temporal integration have very low error rates. (d) Standard algorithm: Many positions in the sky do not contain a motion hypothesis after the first iteration (black positions), and there are some wrong motion hypotheses in the lower left. (e) Temporal integration: The flow field contains fewer errors and appears smoother, but there are still some void positions in the region of the sky. (f) The combination of temporal integration and the multi-scale algorithm achieves motion hypotheses at every position after the first iteration; even in the coarse structure of the sky the movement to the right is correctly detected. The flow field contains only few errors and is very smooth.
Fig. 5. Optic flow estimation for a traffic sequence (from project LFS Baden-Wuerttemberg, no. 23-7532.24-12-12). (a) Camera movement, (b) Optic flow calculated with the algorithm using temporal integration and multiple scales: The car moving from left to right can clearly be segmented, in the lower right the gray colour indicates correctly an expanding flow field whereas the left part of the image is rather influenced by the rotational movement (white colour encodes movement to the left). We calculated the correctly detected movement in the regions of the car and in the surrounding region using masks shown in (c) and (d). The standard and the extended version of the algorithm correctly detect the car movement for about 90% (not shown). (e) The standard algorithm propagates the motion of the car to the surrounding area (black regions) and thus overestimates the size of the car considerably. (f) The extended version can significantly limit the overestimation. (g) Error plot representing the percentage of pixels overestimating the car size (in relation to its true size).
In a second experiment, improvements by temporal integration and one additional spatial scale are investigated using the Yosemite Sequence. Since in an expanding flow field direction and speed of neighbouring positions change continuously, the higher resolution of velocities leads to a smoother flow field and fewer outliers (see Fig. 4). Multiple spatial scales further improve the results: after only one iteration every position of the image contains motion estimates, while it takes 3 frames to complete this in single scale conditions. The median angular error of the algorithm using temporal integration is considerably lower than that of the standard algorithm.

In a third experiment, the algorithm is tested with a real-world sequence acquired with a camera mounted in a moving car. This car is turning around a corner, while another car is crossing the street in front of it (Fig. 5). The optic flow calculated with the standard algorithm correctly detects the combination of rotational and expanding flow field as well as the car approaching from the left. Nevertheless, due to the coarse structure of the street and the occlusions generated by the moving car, the region behind the car shows wrong or no velocities and the car movement is propagated into the surrounding area. Using the combined version of the algorithm the results are improved: In the standard algorithm, the region representing car movements is overestimated by 48% of the original size of the car versus 22% in the extended version (mean of 6 successive frames, Fig. 5(g)).
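The two evaluation measures used in these experiments can be sketched as follows. The exact definitions are not given in the text, so the sketch makes two assumptions: the angular error is the common angle between the space-time directions (u, v, 1) of estimated and true flow, and the overestimation counts pixels labelled as car motion outside a ground-truth car mask relative to the mask size. All function names are hypothetical.

```python
import math
from statistics import median


def angular_error_deg(est, truth):
    """Angle in degrees between (u, v, 1) directions of two flow vectors."""
    (u1, v1), (u2, v2) = est, truth
    dot = u1 * u2 + v1 * v2 + 1.0
    norm = math.sqrt(u1 * u1 + v1 * v1 + 1.0) * math.sqrt(u2 * u2 + v2 * v2 + 1.0)
    # Clamp against rounding so that acos stays inside its domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def median_angular_error(est_field, truth_field):
    """Median of the per-position angular errors of two flow fields."""
    return median(angular_error_deg(e, t) for e, t in zip(est_field, truth_field))


def overestimation_percent(detected, car_mask):
    """Pixels detected as car motion outside the true car region, in percent
    of the true car size. Both arguments are sets of (row, column) pixels."""
    outside = sum(1 for p in detected if p not in car_mask)
    return 100.0 * outside / len(car_mask)
```

With these definitions a detected region that covers the car and an equally large surrounding area yields an overestimation of 100%.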
7 Discussion and Conclusion
We presented the advantages of the integration of multiple temporal and spatial scales in a correlation based algorithm for motion estimation. The extension to initial motion cues from more than two frames enables the detection of optic flow in regions of the scene that are temporarily occluded, as in many everyday scenarios (e.g., a pedestrian or car moving on a street). Only one additional temporally forward-looking frame is sufficient to fill these areas with correct motion cues, as shown in the first experiment in the results section. The idea is similar to the way occlusions are handled in [12]. Nevertheless, while the processing in their algorithm is the basis for the correct segmentation of moving objects by using occlusion information in the context of different disparities, we aim at a complete and correct optic flow field. Furthermore, the temporal integration of motion cues leads to a higher resolution of the velocity. This enables the calculation of a smoother flow field in case of continuous changes in the optic flow, as in self-motion sequences. Very slow velocities are only detected if more than two successive frames are employed for the initial motion detection. A comparison of experiment 3 against the most recently proposed optic flow model by Ogale and Aloimonos [15] showed that our extended algorithm has less propagation of the car movement to the surrounding area (mean of 22% versus 29%), while the correct estimation at the positions of the car only reaches 60% for their model (versus 90% for the extended version). One drawback of the integration of more frames might be a slower response to motion direction changes. We tested our algorithm with temporal integration over multiple time steps while changing the object's direction. The resulting optic flow was slightly disturbed at time steps with strong direction changes, but recovered immediately.
Furthermore, the approach could be improved by weighting the likelihood of the estimates of various time steps according to their temporal distance to the reference frame, e.g., with a Gaussian function. We combined the advantages of temporal scales with multiple spatial scales as presented in [2] in one common algorithm. The results for image sequences containing object-background movement and structures of different spatial frequencies were improved. Considering multiple spatial scales allows motion detection in low-spatial frequency areas like a street. The temporal integration adds the optic flow for partially occluded areas and very slow movements. This helps to prevent the propagation of wrong motion cues to surrounding areas. Concerning the efficiency of the algorithm, the computing time for the initial detection of motion hypotheses increases linearly with the number of frames used. Limiting the number of hypotheses at each position in V1 ensures that the processing time in V1 and MT remains the same as for the standard version. In conclusion, our algorithm for optic flow estimation combines temporal information of more than two frames and multiple spatial scales efficiently and achieves high quality results in real-world sequences with occlusions, slow movements and continuous direction changes.

Acknowledgements. This research has been supported in part by a grant from the European Union (EU FP6 IST Cognitive Systems Integrated project: Neural Decision-Making in Motion; project number 027198).
References
1. Bayerl, P., Neumann, H.: A fast biologically inspired algorithm for recurrent motion estimation. IEEE Transactions on PAMI 29(2), 246–260 (2007)
2. Beck, C., Bayerl, P., Neumann, H.: Optic Flow Integration at Multiple Spatial Frequencies - Neural Mechanism and Algorithm. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4291, pp. 741–750. Springer, Heidelberg (2006)
3. Beauchemin, S.S., Barron, J.L.: The Computation of Optical Flow. ACM Computing Surveys 27, 433–467 (1995)
4. Ungerleider, L.G., Haxby, J.V.: What and where in the human brain. Current Opinion in Neurobiology 4, 157–165 (1994)
5. Albright, T.D.: Direction and orientation selectivity of neurons in visual area MT of the macaque. J. Neurophys 52, 1106–1130 (1984)
6. Bayerl, P., Neumann, H.: Disambiguating Visual Motion through Contextual Feedback Modulation. Neural Computation 16, 2041–2066 (2004)
7. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
8. Weiss, Y., Fleet, D.J.: Velocity likelihoods in biological and machine vision. In: Probabilistic models of the brain: Perception and neural function, pp. 81–100. MIT Press, Cambridge, MA (2001)
9. Adelson, E., Bergen, J.: Spatiotemporal energy models for the perception of motion. Optical Society of America A 2(2), 284–299 (1985)
10. Hupé, J.M., James, A.C., Girard, P., Lomber, S.G., Payne, B.R., Bullier, J.: Feedback Connections Act on the Early Part of the Responses in Monkey Visual Cortex. J. Neurophys 85, 134–145 (2001)
11. Stein, F.: Efficient Computation of Optical Flow Using the Census Transform. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) Pattern Recognition. LNCS, vol. 3175, pp. 79–86. Springer, Heidelberg (2004)
12. Ogale, A.S., Fermueller, C., Aloimonos, Y.: Motion Segmentation Using Occlusions. IEEE Transactions on PAMI 27(6), 988–992 (2005)
13. Simoncelli, E.: Coarse-to-fine Estimation of Visual Motion. IEEE Eighth Workshop on Image and Multidimensional Signal Processing, Cannes, France (September 1993)
14. Burt, P.J., Adelson, E.H.: The Laplacian Pyramid as a Compact Image Code. IEEE Transactions on Communications 31(4), 532–540 (1983)
15. Ogale, A.S., Aloimonos, Y.: A roadmap to the integration of early visual modules. IJCV 72(1), 9–25 (2007)
Classification of Optical Flow by Constraints

Yusuke Kameda¹ and Atsushi Imiya²

¹ School of Science and Technology, Chiba University, Japan
² Institute of Media and Information Technology, Chiba University, Japan
Abstract. In this paper, we analyse mathematical properties of a spatial optical-flow computation algorithm. First, by numerical analysis, we derive the convergence property of the variational optical-flow computation method used for cardiac motion detection. From the convergence property of the algorithm, we clarify the condition for the scheduling of the regularisation parameters. This condition shows that, for accurate and stable computation with scheduling of the regularisation coefficients, we are required to control the sampling interval for numerical computation.
1 Introduction
In this paper, we analyse mathematical properties of a three-dimensional optical-flow computation algorithm, since three-dimensional optical flow is a fundamental method for non-invasive cardiac motion analysis [1,2]. Furthermore, classification and separation of the regions on the heart wall using optical flow yields features for cardiac diagnosis [10]. Optical flow is a well-established method in computer vision [3,4,5,6,7]. For variational cardiac optical-flow computation, additional consistencies and constraints for image motion analysis are used [1]. The gradient consistency, and the thin plate deformation constraint and the divergence-free constraint, are typical additional consistency and constraints, respectively. The gradient consistency is used for enhancement of the region boundary. The deformable constraint, which is equivalent to the thin plate deformation constraint [8], is used as a physical model of heart-wall deformation. The divergence-free constraint is based on the mass-preservation requirement for physical objects. Therefore, for accurate and stable computation, the scheduling of the regularisation parameters [9] is a fundamental problem. There are two types of evaluation methods for inverse problems in non-invasive diagnosis. The first is analysis of the accuracy of the solution using normalised phantoms, that is, evaluation of the difference between the phantom, which is used as the ground truth, and the solution derived by the algorithm. The second is mathematics-based evaluation, that is, clarification of the convergence and stability of the algorithm employing numerical analysis. In this paper, from the viewpoint of mathematics-based evaluation, we first derive the convergence property of the variational optical-flow computation method used for cardiac motion detection. Secondly, we prove that the divergence-free constraint is dependent on the first order smoothness constraint, which is called the Horn-Schunck regulariser [4], and that the vector spline constraint [16,13,14,15] is dependent on the second order smoothness constraint, which is equivalent to the thin plate deformation constraint [11,12]. Thirdly, from the convergence property of the algorithm, we clarify the condition for the scheduling of the regularisation parameters. This condition shows that, for accurate and stable computation with scheduling of the regularisation coefficients, we are required to control the sampling interval for numerical computation. Furthermore, using the first and second order derivatives, we show that it is possible to classify the motions of points in an image using the orders of the differential constraints [10].

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 61–68, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Vector Spline Constraints
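For matching of the deformable boundary between a pair of images [11], the minimisation of

\[ J_{TP}(u, v) = \int_{\mathbf{R}^2} \left(f(x, y) - f(x - u, y - v)\right)^2 d\boldsymbol{x} + \lambda \int_{\mathbf{R}^2} \left(u_{xx}^2 + 2u_{xy}^2 + u_{yy}^2 + v_{xx}^2 + 2v_{xy}^2 + v_{yy}^2\right) dx\,dy \tag{1} \]

produces a promising result, assuming that each element of the vector $(u, v)$ is four-times differentiable. For the interpolation of the vector field on a plane, the solutions $\boldsymbol{u}$ which minimise

\[ J_{1st}(\boldsymbol{u}) = \sum_{i=1,j=1}^{m,n} |\boldsymbol{u}(\boldsymbol{x}_{ij}) - \boldsymbol{x}_{ij}|^2 + \int_{\mathbf{R}^2} \left(\gamma_1 |\mathrm{div}\,\boldsymbol{u}|^2 + \gamma_2 |\mathrm{rot}\,\boldsymbol{u}|^2\right) d\boldsymbol{x} \tag{2} \]

for twice differentiable function $\boldsymbol{u}$, and

\[ J_{2nd}(\boldsymbol{u}) = \sum_{i=1,j=1}^{m,n} |\boldsymbol{u}(\boldsymbol{x}_{ij}) - \boldsymbol{x}_{ij}|^2 + \int_{\mathbf{R}^2} \left(\gamma_1 |\nabla \mathrm{div}\,\boldsymbol{u}|^2 + \gamma_2 |\nabla \mathrm{rot}\,\boldsymbol{u}|^2\right) d\boldsymbol{x}, \tag{3} \]

for four-times differentiable function $\boldsymbol{u}$, respectively, defined over the points $\{\boldsymbol{x}_{ij}\}_{i=1,j=1}^{m,n}$, are called the first-order vector-spline and the second-order vector-spline, respectively [14,15]. Here, the gradient operation in the second term of the regulariser in eq. (3) is computed as the vector gradient for the higher dimensional problems. Equations (1) and (3) are typical minimisation problems with the second order constraints which are used in computer-vision problems.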
3 Optical Flow Computation
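For a spatio-temporal image $f(x, y, z, t)$, the total derivative is given as

\[ \frac{d}{dt} f = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial z}\frac{dz}{dt} + \frac{\partial f}{\partial t}\frac{dt}{dt} \tag{4} \]

where $\boldsymbol{u} = (\dot{x}, \dot{y}, \dot{z})^{\top} = \left(\frac{dx}{dt}, \frac{dy}{dt}, \frac{dz}{dt}\right)^{\top}$ is the motion of each point $\boldsymbol{x} = (x, y, z)^{\top}$. Optical flow consistency [4,5,6]

\[ \frac{d}{dt} f = 0 \tag{5} \]

implies that the motion of the point $\boldsymbol{u} = (u, v, w)^{\top} = (\dot{x}, \dot{y}, \dot{z})^{\top}$ is the solution of the singular equation,

\[ f_x u + f_y v + f_z w + f_t = 0. \tag{6} \]

The optical-flow computation with the second order constraint is

\[ J_{\alpha\beta}(\boldsymbol{u}) = \int_{\mathbf{R}^3} \left\{ |\nabla f^{\top}\boldsymbol{u} + f_t|^2 + \alpha\,\mathrm{tr}\,\nabla\boldsymbol{u}\nabla\boldsymbol{u}^{\top} + \beta\,\mathrm{tr}\,\boldsymbol{H}_{\boldsymbol{u}}\boldsymbol{H}_{\boldsymbol{u}}^{\top} \right\} d\boldsymbol{x}, \tag{7} \]

for

\[ \mathrm{tr}\,\boldsymbol{H}_{\boldsymbol{u}}\boldsymbol{H}_{\boldsymbol{u}}^{\top} = \mathrm{tr}\,\boldsymbol{H}_{u}\boldsymbol{H}_{u}^{\top} + \mathrm{tr}\,\boldsymbol{H}_{v}\boldsymbol{H}_{v}^{\top} + \mathrm{tr}\,\boldsymbol{H}_{w}\boldsymbol{H}_{w}^{\top}, \tag{8} \]

where $\boldsymbol{H}_f$ is the Hessian matrix of the function $f$. The Euler-Lagrange equation of the variational problem is

\[ \Delta\boldsymbol{u} - \frac{\beta}{\alpha}\Delta^2\boldsymbol{u} = \frac{1}{\alpha}\left(\nabla f^{\top}\boldsymbol{u} + f_t\right)\nabla f. \tag{9} \]

Let $(x_1, x_2, x_3)^{\top} = (x, y, z)^{\top}$. Setting

\[ \mathrm{sh}^2\boldsymbol{u} = \sum_{i=1}^{n} \left\{ (a_i^{-})^2 + (a_i^{+})^2 \right\} \tag{10} \]

where

\[ a_i^{-} = \frac{\partial}{\partial x_i} u_i - \frac{\partial}{\partial x_{i+1}} u_{(i+1)}, \quad a_i^{+} = \frac{\partial}{\partial x_i} u_i + \frac{\partial}{\partial x_{i+1}} u_{(i+1)}, \tag{11} \]

for $u_4 = u_1$ and $x_4 = x_1$, there is a relation,

\[ \mathrm{tr}(\nabla\boldsymbol{u}\nabla\boldsymbol{u}^{\top}) = \frac{1}{2}\left(\mathrm{div}^2\boldsymbol{u} + |\mathrm{rot}\,\boldsymbol{u}|^2 + \mathrm{sh}^2\boldsymbol{u}\right). \tag{12} \]

Furthermore, for $\mathrm{tr}\,\boldsymbol{H}\boldsymbol{H}^{\top}$, we have the relation

\[ \mathrm{tr}\,\boldsymbol{H}\boldsymbol{H}^{\top} = |\nabla\mathrm{div}\,\boldsymbol{u}|^2 + |\nabla\mathrm{rot}\,\boldsymbol{u}|^2 + h^2(\boldsymbol{u}) + k^2(\boldsymbol{u}) \tag{13} \]

for

\[ h^2(\boldsymbol{u}) = \begin{vmatrix} v_{xy} & u_{xy} \\ \Delta_{xy} v & \Delta_{xy} u \end{vmatrix} + \begin{vmatrix} w_{xy} & v_{xy} \\ \Delta_{yz} w & \Delta_{yz} v \end{vmatrix} + \begin{vmatrix} u_{xy} & w_{xy} \\ \Delta_{zx} u & \Delta_{zx} w \end{vmatrix} \tag{14} \]

\[ k^2(\boldsymbol{u}) = \mathrm{div}\,\boldsymbol{s}, \quad \boldsymbol{s} = \left( \begin{vmatrix} v_y & w_y \\ v_z & w_z \end{vmatrix}, \begin{vmatrix} w_z & u_z \\ w_x & u_x \end{vmatrix}, \begin{vmatrix} u_x & v_x \\ u_y & v_y \end{vmatrix} \right)^{\top}, \tag{15} \]

where

\[ \Delta_{\alpha\beta} = \frac{\partial^2}{\partial\alpha^2} + \frac{\partial^2}{\partial\beta^2}, \quad \overline{\Delta}_{\alpha\beta} = \frac{\partial^2}{\partial\alpha^2} - \frac{\partial^2}{\partial\beta^2} \tag{16} \]

for $\alpha\beta \in \{xy, yz, zx\}$. These relations imply the next properties.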
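Assertion 1. The first order smoothness constraint $\mathrm{tr}\,\nabla\boldsymbol{u}\nabla\boldsymbol{u}^{\top}$ involves the divergence-free constraint $\mathrm{div}\,\boldsymbol{u}$. Therefore, the first order smoothness constraint involves the divergence-free condition.

Assertion 2. The second order minimum-deformation constraint $\mathrm{tr}\,\boldsymbol{H}_{\boldsymbol{u}}\boldsymbol{H}_{\boldsymbol{u}}^{\top}$ involves the second order vector spline constraint $\gamma_1|\nabla\mathrm{div}\,\boldsymbol{u}|^2 + \gamma_2|\nabla\mathrm{rot}\,\boldsymbol{u}|^2$. Therefore, considering the regularisation term $\mathrm{tr}\,\boldsymbol{H}_{\boldsymbol{u}}\boldsymbol{H}_{\boldsymbol{u}}^{\top}$ is equivalent to solving the vector spline minimisation [12,13,14,15,16] for optical-flow computation.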
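The optical-flow consistency of eqs. (5) and (6), on which this section builds, can be checked numerically on a synthetic example. The sketch below is illustrative only: the test image `f` is a hypothetical smooth pattern translating with a constant velocity, and the partial derivatives are approximated by central differences. It also shows why eq. (6) is called singular: adding a vector orthogonal to the spatial gradient does not change the residual.

```python
import math


def f(x, y, z, t, vel=(0.3, -0.2, 0.1)):
    """Hypothetical spatio-temporal image: a pattern translating with vel."""
    u, v, w = vel
    return math.sin(x - u * t) * math.cos(y - v * t) + 0.5 * math.sin(z - w * t)


def partial(fun, point, axis, h=1e-5):
    """Central-difference approximation of the partial derivative along axis."""
    lo, hi = list(point), list(point)
    lo[axis] -= h
    hi[axis] += h
    return (fun(*hi) - fun(*lo)) / (2 * h)


def flow_residual(point, flow):
    """Left-hand side of eq. (6): f_x u + f_y v + f_z w + f_t at one point."""
    fx, fy, fz, ft = (partial(f, point, axis) for axis in range(4))
    u, v, w = flow
    return fx * u + fy * v + fz * w + ft
```

For the true velocity the residual vanishes up to discretisation error, while a zero flow leaves the temporal derivative as the residual. A single linear equation for three unknowns cannot determine the flow, which is why regularisers such as those in eq. (7) are needed.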
4 Numerical Scheme and Its Stability
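In this paper, by embedding images in a rectangular region which encloses the images, we adopt the Dirichlet condition $f = 0$ on the boundary. Furthermore, for the sampled optical flow vector $\boldsymbol{u}_{ijk} = (u_{ijk}, v_{ijk}, w_{ijk})^{\top}$, we set the vectorisation of the sampled function as

\[ \boldsymbol{u} := \begin{pmatrix} \boldsymbol{u}_{111} \\ \boldsymbol{u}_{112} \\ \vdots \\ \boldsymbol{u}_{MMM} \end{pmatrix}, \quad \boldsymbol{u}_{ijk} = \begin{pmatrix} u_{ijk} \\ v_{ijk} \\ w_{ijk} \end{pmatrix}. \tag{17} \]

Then, we have the discrete version of the Euler-Lagrange equation

\[ \boldsymbol{L}\boldsymbol{u} - \frac{\beta}{\alpha}\boldsymbol{L}^2\boldsymbol{u} = \frac{1}{\alpha}(\boldsymbol{S}\boldsymbol{u} + \boldsymbol{s}), \quad \boldsymbol{s} = f_t\nabla f \tag{18} \]

where $\boldsymbol{L}$ is the discrete Laplacian operation. In eqs. (9) and (18), the operations on $\boldsymbol{u}$ are split between both sides of the equation. Therefore, if both operations are non-singular and the spectrum of one operation is less than one while that of the other is larger than one, it is possible to derive a converging numerical scheme. However, the operation $\boldsymbol{S}$ is singular. In this section, we derive a converging iteration scheme from the equation

\[ \boldsymbol{u} + \boldsymbol{L}\boldsymbol{u} - \frac{\beta}{\alpha}\boldsymbol{L}^2\boldsymbol{u} = \boldsymbol{u} + \frac{1}{\alpha}\boldsymbol{S}\boldsymbol{u} + \frac{1}{\alpha}\boldsymbol{s}, \quad \boldsymbol{S} = \nabla f\nabla f^{\top} \tag{19} \]

since the operations on both sides of the equation are non-singular. The second order numerical derivative is

\[ \partial^2 u = \frac{u(i + h) - 2u(i) + u(i - h)}{h^2}. \tag{20} \]

Setting $\boldsymbol{D}_2$ to be the second order discrete differential operator derived by eq. (20), that is,

\[ \boldsymbol{D}_2 = \begin{pmatrix} -2 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 1 & -2 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 & 1 & -2 \end{pmatrix}, \tag{21} \]

for the Dirichlet boundary condition, the fourth order operator is $\boldsymbol{D}_4 = \boldsymbol{D}_2\boldsymbol{D}_2$. Furthermore, the numerical operations $\boldsymbol{L}$ and $\boldsymbol{L}_2$ which stand for $\Delta$ and $\Delta^2$ in $\mathbf{R}^3$, respectively, are given as

\[ \boldsymbol{L} = \boldsymbol{D}_2 \otimes \boldsymbol{I} \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{D}_2 \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{I} \otimes \boldsymbol{D}_2, \tag{22} \]

\[ \boldsymbol{L}_2 = \boldsymbol{D}_4 \otimes \boldsymbol{I} \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{D}_4 \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{I} \otimes \boldsymbol{D}_4 + 2(\boldsymbol{D}_2 \otimes \boldsymbol{D}_2 \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{D}_2 \otimes \boldsymbol{D}_2 + \boldsymbol{D}_2 \otimes \boldsymbol{I} \otimes \boldsymbol{D}_2), \tag{23} \]

where $\boldsymbol{A} \otimes \boldsymbol{B}$ is the Kronecker product of matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, and $\boldsymbol{I}$ is the identity matrix. For an appropriate positive number $\Delta\tau$, we have a splitting expression,

\[ \mathrm{Diag}\left(\boldsymbol{I} + \frac{1}{\alpha}\Delta\tau\,\boldsymbol{S}_{ijk}\right)\boldsymbol{u} = \boldsymbol{P}^{\top}\,\mathrm{Diag}\left(\boldsymbol{I} + \Delta\tau\boldsymbol{L} - \frac{\beta\Delta\tau}{\alpha}\boldsymbol{L}^2\right)\boldsymbol{P}\boldsymbol{u} - \frac{\Delta\tau}{\alpha}\boldsymbol{s}, \tag{24} \]

for the permutation matrix $\boldsymbol{P}$ which transforms $\boldsymbol{u}$ of eq. (17) to

\[ \boldsymbol{v} := \mathrm{vec}\begin{pmatrix} \boldsymbol{u}_{111}^{\top} \\ \boldsymbol{u}_{112}^{\top} \\ \vdots \\ \boldsymbol{u}_{MMM}^{\top} \end{pmatrix} = \boldsymbol{P}\boldsymbol{u}. \tag{25} \]

From this, we have the iterative form

\[ \boldsymbol{A}\boldsymbol{u}^{n+1} = \boldsymbol{B}\boldsymbol{u}^{n} + \boldsymbol{c} \tag{26} \]

for

\[ \boldsymbol{A} = \mathrm{Diag}\left(\boldsymbol{I} + \frac{\Delta\tau}{\alpha}\boldsymbol{S}_{ijk}\right), \quad \boldsymbol{S}_{ijk} = \nabla f(i, j, k)\nabla f(i, j, k)^{\top}, \tag{27} \]

\[ \boldsymbol{B} = \boldsymbol{P}^{\top}\left(\boldsymbol{I} + \Delta\tau\boldsymbol{L} - \frac{\beta}{\alpha}\Delta\tau\boldsymbol{L}^2\right)\boldsymbol{P}, \quad \boldsymbol{c} = -\frac{\Delta\tau}{\alpha}\boldsymbol{s}, \quad \boldsymbol{s} = f_t\nabla f(i, j, k). \tag{28} \]

Both $\boldsymbol{A}$ and $\boldsymbol{B}$ are non-singular and $\rho(\boldsymbol{A}) > 1$, where $\rho(\boldsymbol{A})$ is the maximum of the absolute values of the eigenvalues of the matrix $\boldsymbol{A}$. Therefore, if $\rho(\boldsymbol{B}) < 1$ then the iterative form eq. (26) is stable and converges to the solution of the original equation. We derive the condition for $\alpha$, $\beta$, and $\Delta\tau$ to guarantee the condition $\rho(\boldsymbol{B}) < 1$. Setting $\boldsymbol{U}$ and $\boldsymbol{\Lambda}$ to be the discrete Fourier transform matrix and the diagonal matrix respectively, we have the relations

\[ \boldsymbol{L} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top}, \quad \boldsymbol{L}^2 = \boldsymbol{U}\boldsymbol{\Lambda}^2\boldsymbol{U}^{\top}, \tag{29} \]

where

\[ \boldsymbol{\Lambda} = \mathrm{Diag}(\lambda_{ijk}), \quad \lambda_{ijk} = \mu_i + \mu_j + \mu_k, \quad \mu_k = -\left(1 - \cos\frac{\pi}{M + 1}k\right). \tag{30} \]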
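Substituting these relations into

$$\rho\left(I + \Delta\tau L - \frac{\beta}{\alpha}\Delta\tau L^2\right) < 1, \tag{31}$$

we finally have the relation

$$\rho\left(I + \Delta\tau \Lambda - \frac{\beta}{\alpha}\Delta\tau \Lambda^2\right) < \left|1 - 4 \times 3\,\frac{\Delta\tau}{h^2} - 16 \times 3^2\,\frac{\beta}{\alpha}\frac{\Delta\tau}{h^4}\right|. \tag{32}$$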
From eq. (32), we have the following theorem.

Theorem 1. If the inequality

$$\left|1 - 4 \times 3\,\frac{\Delta\tau}{h^2} - 16 \times 3^2\,\frac{\beta\,\Delta\tau}{\alpha h^4}\right| < 1 \tag{33}$$

is satisfied, the iterative form is stable and converges to the solution of the original PDE.

The iteration is terminated if

$$\max_{x \in \mathbb{R}^3} |u^{n+1}(x) - u^{n}(x)| < |u^{n}(x)| \times 10^{-m}, \tag{34}$$

where $|u|$ is the Euclidean norm in $\mathbb{R}^3$.
5 Numerical Examples
Let $u^*_{(\alpha\beta\gamma)}$ be the optical-flow vector computed using the α-th, β-th, and γ-th constraints. For each point x, we can have many optical-flow vectors $u^*_{(\alpha)}$, where α is a string of positive integers, for example, 1, 2, 12, 123, and so on. The vector $u^*_{(2)}$ is the deformation vector of the deformable boundary. Therefore, if $|u^*_{(2)}|$ is sufficiently small at point x, and $u^*_{(1)} \gg u^*_{(2)}$, the motion in the neighbourhood of this point x is homogeneous. This geometrical property of optical flow computed by the variational method implies that the operation

$$D_{(i)} = \{x \mid |u^*_{(i)}| \gg |u^*_{(j)}|,\ i \neq j\} \tag{35}$$

derives segments in $\mathbb{R}^n$ using the orders of the constraints for the computation of variational problems. Generally, it is possible to adopt many constraints for the classification of moving points in a space with optical-flow vectors. For example, the two operations

$$D_h = \{x \mid |u^*_{(1)}| \gg |u^*_{(12)}|\}, \qquad D_d = \{x \mid |u^*_{(12)}| \gg |u^*_{(1)}|\} \tag{36}$$

classify a scene into a deforming part and a homogeneously moving part, since the regularisers $\mathrm{tr}\,\nabla u \nabla u^{\top}$ and $\mathrm{tr}\,D^2 u\, D^2 u^{\top}$ minimise the smoothness and the elastic energy of the solution u, respectively. Furthermore, the operation

$$u = \begin{cases} u^*_{(1)}, & \text{if } |u^*_{(12)}| \ll 1 \text{ on } x, \\ u^*_{(12)}, & \text{otherwise}, \end{cases} \tag{37}$$
[Figure 1: three-dimensional optical-flow fields plotted over the x, y and z axes. Panels: (a) optical flow for (α, β) = (10^5, 0), (b) optical flow for (α, β) = (10^5, 10^5), (c) optical flow for (α, β) = (10^5, 10^4), each for frames 2 and 3 of 20 frames.]

Fig. 1. Optical Flow Classification in Three-Dimensional Euclidean Space. Panels (a), (b), and (c) are the optical-flow field images of the beating heart for the regularisation parameters (α, β) = (10^5, 0), (10^5, 10^5), and (10^5, 10^4), respectively. The Horn-Schunck constraint extracts the smooth motion u(1) of the whole heart. The second-order constraint extracts the deformable motion u(12). The combination of the Horn-Schunck constraint and the deformable constraint allows us to separate and classify the motions of points in images into smooth motion and deformable motion.
allows us to express multi-modal motion simultaneously, using variational computation methods. These properties imply that the orders of the constraints in variational problems act as the scale in scale-space analysis. Figure 1 shows the cardiac optical flows u(1) and u(12) extracted for three combinations of the regularisation parameters: panels (a), (b), and (c) are the optical-flow field images of the beating heart for (α, β) = (10^5, 0), (10^5, 10^5), and (10^5, 10^4), respectively. These sequences show that, as expected, the flow obtained with the Horn-Schunck constraint captures the smooth motion of the whole heart. Furthermore, with the second-order constraint, the deformable motion on the wall of the beating heart is extracted. The dominant deformable part extracted is the atria. These results indicate that the combination of the Horn-Schunck constraint and the deformable constraint allows us to separate and classify the motions of points in images into the smooth motion u(1) and the deformable motion u(12).
6 Conclusions
In this paper, we first derived the convergence property of the variational optical-flow computation method used for volumetric cardiac motion detection. From this convergence property of the algorithm, we clarified the condition for the scheduling of the regularisation parameters. Images for our experiment were provided by the Robarts Research Institute at the University of Western Ontario through Professor John Barron. We thank Professor John Barron for allowing us to use his data set. The research was supported by Grant-in-Aid for Scientific Research No. 19500139 of the Japan Society for the Promotion of Science.
References
1. Zhou, Z., Synolakis, C.E., Leahy, R.M., Song, S.M.: Calculation of 3D internal displacement fields from 3D X-ray computer tomographic images. In: Proceedings of Royal Society: Mathematical and Physical Sciences, vol. 449, pp. 537–554 (1995)
2. Song, S.M., Leahy, R.M.: Computation of 3-D velocity fields from 3-D cine images of a human heart. IEEE Transactions on Medical Imaging 10, 295–306 (1991)
3. Aubert, G., Kornprobst, P.: Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations. Springer, New York (2002)
4. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–204 (1981)
5. Nagel, H.-H.: On the estimation of optical flow: Relations between different approaches and some new results. Artificial Intelligence 33, 299–324 (1987)
6. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. International Journal of Computer Vision 12, 43–77 (1994)
7. Weickert, J., Schnörr, C.: Variational optic flow computation with a spatiotemporal smoothness constraint. Journal of Mathematical Imaging and Vision 14, 245–255 (2001)
8. Timoshenko, S.P.: History of Strength of Materials. Dover, Mineola, NY (1983)
9. Chang, H.-H.: Variational approach to cardiac motion estimation for small animals in tagged magnetic resonance imaging, 2006. IEEE Pacific-Rim Image and Video Technology, 363–372 (2006)
10. Chang, H.-H., Moura, J.M.F., Yijen, L., Wu, Y.L., Ho, C.: Early detection of rejection in cardiac MRI: A spectral graph approach, 2006. IEEE International Symposium on Biomedical Imaging, Arlington, 113–116 (2006)
11. Grenander, U., Miller, M.: Computational anatomy: An emerging discipline. Quarterly of Applied Mathematics 4, 617–694 (1998)
12. Sorzano, C.Ó.S., Thévenaz, P., Unser, M.: Elastic registration of biological images using vector-spline regularization. IEEE Tr. Biomedical Engineering 52, 652–663 (2005)
13. Wahba, G., Wendelberger, J.: Some new mathematical methods for variational objective analysis using cross-validation. Monthly Weather Review 108, 36–57 (1980)
14. Amodei, L., Benbourhim, M.N.: A vector spline approximation. Journal of Approximation Theory 67, 51–79 (1991)
15. Benbourhim, M.N., Bouhamidi, A.: Approximation of vectors fields by thin plate splines with tension. Journal of Approximation Theory 136, 198–229 (2005)
16. Suter, D.: Motion estimation and vector spline. In: Proceedings of CVPR'94, pp. 939–942 (1994)
Target Positioning with Dominant Feature Elements

Zhuan Qing Huang and Zhuhan Jiang

School of Computing and Mathematics, University of Western Sydney, NSW, Australia

Abstract. We propose a dominant-feature based matching method for capturing a target in a video sequence through the dynamic decomposition of the target template. The target template is segmented via intensity bands to better distinguish itself from the local background. Dominant feature elements are extracted from such segments to measure the matching degree of a candidate target via a sum of similarity probabilities. In addition, spatial filtering and contour adaptation are applied to further refine the object location and shape. The implementation of the proposed method has shown its effectiveness in capturing the target in a moving background and with non-rigid object motion.

Keywords: Object detection, object tracking.
1 Introduction

Searching for a target region that matches the object template of the previous frame in a video is somewhat different from object recognition in its traditional sense. The main task in this process is to locate the position of the candidate object in the current frame. Sometimes it may additionally require a detailed description of the object, such as its shape deformation. With the increasing variety of applications in video-based surveillance, monitoring, robotics and medical treatment, a great deal of research has been conducted in this regard.

The main approaches considered so far in the literature for object tracking are template matching, the snake approach, and the use of kernel density estimation, to name a few. Template matching [1-3] matches the whole object region directly at different positions in the current frame to locate the new object position that leads to minimum errors. The snake approach [4-6, 11], on the other hand, emphasises the extraction of the object contour based on certain deforming criteria such as the minimisation of an energy function. More recently, kernel functions [7-10] have been utilised to estimate the minimum errors between the candidate and target regions, in the form of density distributions. Other methods may incorporate additional features such as the specification of the selected edges being tracked [12]. Statistical frameworks are also widely integrated into these algorithms. The multi-hypothesis method, for instance, approaches the problem by deriving the highest probability of certain hypotheses [13-14] with the Bayesian rule. In [4], as an example, the contour of the tracked object is propagated in the framework of a posterior probability.
The main purpose of this work is to approach the tracking problem with fewer of the limitations exhibited by the above-mentioned main techniques, so as to accommodate, at least to a certain extent, non-rigid movement, a moving background, object shape deformation and object colour changes, while remaining computationally practical. Our proposed approach operates on a simple yet sufficiently distinguishing object representation on the local background within a statistical framework. The simplified object model results in a lower computational load and allows for non-rigid movement of an object in a moving background. The probabilistic similarity of object pixels subsequently improves the accuracy of the object detection in a different frame. The tracking of an object in a video sequence is then carried out on the basis of the dynamic segmentation of the object with respect to its local background in terms of the dominant elements of colours.

This paper is organised as follows. In section 2, we propose a new form of representation for the target template within the environment of a local background, and analyse the extraction of the colour intensity bands for the object. In section 3, we then develop a matching scheme in a probabilistic framework, based on the dominant elements of intensity bands, and follow it up with a refining finalisation. Experimental results are then reported in section 4, with a final conclusion made in section 5.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 69–76, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Dynamic Dominant Elements

Among the object features such as colour, edge and shape, the colour feature is often the most significant in matching the object. We here propose to model an object through a simplified representation of the significant colour bands, and segment the object colours into suitable intensity bands determined by maximising the segment difference between the object and the local background.

2.1 Band Selection

Let $I_s$ be the intensity corresponding to the peak value of the object density $\rho_o$, which acts as the centre of the most significant band. The intensity bands $h_i$ are thus set to

$$h_i = [\,I_s + (i - k_b - 3/2)\,w,\ I_s + (i - k_b - 1/2)\,w\,], \qquad i = 1, \ldots, n, \tag{1}$$

where $k_b = \lfloor I_s/w + 1/2 \rfloor$, $n = k_b + \lfloor k_e \rfloor + k_p$, $k_e = (1 - I_s)/w + 1/2$, $k_p = 0$ or 1, and w is the bandwidth. If the RGB colour space is used, then each colour component is dealt with independently.

Fig. 2.1. (a) Typical local neighborhoods along the boundary. (b) Segment patterns in a sample circle.

The main strategy for the band selection is to represent the object with the bands that most distinguish it from its local background. The technical method we introduce here is to calculate the "difference" of the neighbouring regions, one on the object side and one on the background side, along the object boundary. To carry this out, a set of boundary pixels $s_{i'}$, $1 \le i' \le n'$, typically equally spaced along the boundary of the object, is first selected, and a circle or a square is used to trap a local neighbourhood where, for instance, the circle centres at a selected boundary pixel s and the radius r is half the distance of the successively
selected boundary pixels, as illustrated in figure 2.1 (a). For notational simplicity, such squares of width 2r will also be referred to as a "circle" of radius r. If we denote the segment membership as a bit pattern θ, with each single bit indicating the presence of a particular segment element, then we can compare the segment membership θ+ on the object side and the membership θ− on the background side, as shown in figure 2.1 (b). The task thus becomes searching for a w that leads to the largest difference between θ+ and θ−. To calculate the segmental difference along a boundary section, we count up such differences while ignoring their magnitude (the difference of segment indices, or the distance of colour), via

$$d_\theta = \sum_{i'=1}^{n'} \left[ \sum_i (\theta_i^+ \oplus \theta_i^-) \Big/ \Big(\sum_i \theta_i^+\Big)^2 \right], \tag{2}$$
where the total segment disparity $d_\theta$ indicates the extent to which the object separates itself from the background, n' is the total number of circles selected along the boundary path, ⊕ represents the bit XOR operation, and the denominator is a normalising factor. We note that the w having the largest $d_\theta$ is treated as the most desirable bandwidth, although w may also be selected in a range corresponding to a higher $d_\theta$.

2.2 Extraction of Dominant Elements

We now seek to represent an object with a few feature elements that distinguish the object from the local background and facilitate tracking the object. We first let the object colours and the local background colours be binned based on the corresponding bands. We then define the difference $\Delta_k$ and the ratio $\delta_k$ by

$$\Delta_k = Q_{ok} - Q_{bk}, \qquad \delta_k = Q_{ok} / Q_{Bk}, \tag{3}$$
where Qok and Qbk are respectively the number of object pixels and the number of local background pixels that fall into the kth bin, and QBk is the number of pixels in the background of the full frame that fall into the kth bin. To proceed with the band selection, we set thresholds μ for all Δk and ν>0 for all δk respectively. The kth band is selected, and thus considered dominant, if both Δk>μ and δk>ν. Together, the two conditions ensure that the kth band highlights the object against the background. For the selected m bands h'k, we denote by xk the corresponding centre. These xk thus serve as the dominant elements representing the object.

We note that the bands obtained above are selected according to the intensity distribution without considering the actual spatial information. Once the h'k are derived, however, we can choose a small window for every pixel within a selected segment, keep within the window only the pixels in the same segment, and then calculate there the local mean, or another feature value by a different model, to substitute the pixel with the calculated value. Amongst such new pixel values, the one closest to the middle of the band is chosen as the element representing the segment. The advantage of this second method is that it reduces the effect of colour noise in the background when only sporadic background pixels exhibit intensities very similar to those in the object.
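The band construction of eq. (1) and the dominance test built on eq. (3) can be sketched for a single colour channel as follows. This is a minimal illustration, assuming intensities normalised to [0, 1]; the function names are hypothetical, and treating an empty full-frame bin as one pixel (to avoid division by zero) is an assumption that the text does not specify.

```python
import math

import numpy as np


def intensity_bands(i_s, w, k_p=0):
    """Bands h_i of eq. (1) for peak intensity i_s in [0, 1] and bandwidth w."""
    k_b = math.floor(i_s / w + 0.5)
    k_e = (1.0 - i_s) / w + 0.5
    n = k_b + math.floor(k_e) + k_p
    return [(i_s + (i - k_b - 1.5) * w, i_s + (i - k_b - 0.5) * w)
            for i in range(1, n + 1)]


def dominant_elements(obj, local_bg, full_bg, bands, mu=0.0, nu=0.5):
    """Centres x_k of the bands with Delta_k > mu and delta_k > nu."""
    # Bin edges: the lower edge of every band plus the upper edge of the last.
    edges = np.array([lo for lo, _ in bands] + [bands[-1][1]])
    q_o, _ = np.histogram(obj, bins=edges)       # Q_ok
    q_b, _ = np.histogram(local_bg, bins=edges)  # Q_bk
    q_big, _ = np.histogram(full_bg, bins=edges) # Q_Bk
    diff = q_o - q_b
    # Assumption: an empty full-frame bin is counted as one pixel.
    ratio = q_o / np.maximum(q_big, 1)
    centres = np.array([(lo + hi) / 2.0 for lo, hi in bands])
    return centres[(diff > mu) & (ratio > nu)]
```

By construction the band with index i = k_b + 1 is centred on the peak intensity, so an object concentrated around I_s yields that centre as a dominant element.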
3 Improving Target Matching

We have proposed in the previous section an efficient model to represent an object so that searching for it in a different frame incurs a lower computational cost. In order to achieve better matching capability, we establish a probabilistic framework that also accommodates well the object representation in terms of dominant feature elements.

3.1 Probability Measurement and Synthesis

We here aim to locate pixels in the current frame that are similar to any dominant elements in the previous frame, and build up the object from such pixel patches. Let {yi} denote all the pixels in the current frame, or in the part of the frame corresponding to the local window in the previous frame. Then the probability of the pixel yi belonging to the kth segment of the object in the previous frame is proportional to

$$p_k(y_i) = \varphi(|y_i - x_k|), \tag{4}$$
where φ is a similarity function such as a triangle or Gaussian function. Since the object to be tracked is characterised by its dominant elements, we sum up the probabilities of a pixel yi belonging to each such element to reflect its overall probability of being within the object region in the current frame:

$$p(y_i) = \sum_{j=1}^{3} \sum_{k=1}^{m} \alpha_k^{(j)}\, p_k^{(j)}(y_i), \tag{5}$$
where j indexes the three RGB colours, $\alpha_k^{(j)} \ge 0$ are chosen to reflect the importance of the object elements derived from the selected bands, m is the total number of the elements, and $\sum_{j,k} \alpha_k^{(j)} = 1$. In this work, we typically distribute the $\alpha_k^{(j)}$ evenly over the selected bands. This sum will be referred to as the probability map later on. To yield a better object probability representation, we reduce the effect of the pixels that may accumulate sufficient probabilities but are not really close to any of the dominant elements, by limiting the contribution to colours within the distance χ:

$$p_k(y_i) = \begin{cases} \varphi(|y_i - x_k|) & \text{if } |y_i - x_k| \le \chi, \\ 0 & \text{if } |y_i - x_k| > \chi. \end{cases} \tag{6}$$
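As a rough single-channel illustration of eqs. (4) to (6), the probability map can be computed as below. The function name is hypothetical, the weights are taken as even, and the triangular similarity is assumed to have half-width w, that is φ(t) = 1 − t/w clipped to [0, 1]; the exact form of φ is a choice left open by the text.

```python
import numpy as np


def probability_map(frame, elements, w, chi=None):
    """Sum of truncated triangular similarities to the dominant elements x_k."""
    frame = np.asarray(frame, dtype=float)
    chi = w / 2.0 if chi is None else chi
    alpha = 1.0 / len(elements)  # even weights that sum to one
    p = np.zeros_like(frame)
    for x_k in elements:
        dist = np.abs(frame - x_k)
        # Assumed similarity: phi(t) = 1 - t / w, clipped to [0, 1].
        phi = np.clip(1.0 - dist / w, 0.0, 1.0)
        # Eq. (6): contributions beyond the distance chi are dropped.
        p += alpha * np.where(dist <= chi, phi, 0.0)
    return p
```

For an RGB frame the same function could be applied per channel and the three maps combined with the weights of eq. (5).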
Setting χ=w/2, for instance, would exclude the pixels whose distance to the kth element is larger than half of the bandwidth. Another technique is to set a single sufficient threshold on all the probabilities pk(y) corresponding to each element before they are summed up in (5). Still another is to allow a different threshold for each pk(y) and apply a final threshold to the total sum p(y).

As the dominant bands are selected mainly based on the nearby background in our approach, it is possible for pixels in areas distant from the object to also exhibit a certain similarity to some segments. To reduce this interference, we introduce a spatial filter to filter out the distant pixels. Hence we first find the rough centre of the highest-probability region (candidate region) nearest to the previous object
location; then a masking function such as a Gaussian is applied. This subsequently results in the following modified probability distribution for a better object extraction:

$$p'(y_i) = g(d(y_i))\, p(y_i), \tag{7}$$

where d(yi) is the distance from yi to the centre, and g(d) is a decreasing function.

3.2 Object Extraction and Refinement
We first observe that the band selection process discriminates against those bands that are similar to the background, though the presence of different RGB colours may compensate for it to a good extent. We may thus refine the extracted object p′(yi) with the help of its shape in the previous frame. More precisely, the former shape is first projected onto the current probability map, and is then adapted so as to maximise the total probability of the current shape. This is done by shifting the shape centre to the new position (Cx, Cy) determined by

$$(C_x, C_y) = \arg\max \sum_{y \in T_{a,b}\Omega_o} p'(y), \tag{8}$$
where Ωo is the projected object region, and Ta,b is a spatial translation by a and b pixels horizontally and vertically respectively. We may also render the probabilities with the neighbours, similar to the Weighted Region Consolidation method in [15]. Based on the probability map, a threshold τ is chosen to identify the candidate object. In fact the shape refinement can be achieved by classifying contour pixels y'(s) according to y'(s) ∈ Ωo if and only if p(y'(s)) > τ. In other words, we expand the contour when the probability at the contour point is larger than the given threshold τ and shrink the contour when it is less than τ. We can thus exclude the nearby independent background from the candidate object. Of course this method may lead to ragged boundaries as it does not ensure the smoothness of the contour. Alternatively, we can propagate the contour in the framework of an energy function. Let the mapping E: [0,1]→R2 be a parameterized closed curve; the energy function can then be expressed as

$$\phi(s) = \int_0^1 g(|\nabla p(s)|)\, |\dot{E}(s)|\, ds, \tag{9}$$

where ∇p(s) is the probability change at the pixel on the contour, g(.) is a monotonically decreasing function, and Ė(s) is the vector derivative of the curve with respect to its parameter s. A similar approach can be found in [11]. For better algorithmic stability, we may set a maximally allowed pixel shift for the contour in a single frame step so as to prevent the pixels from moving an unsuitably large distance in the case of substantial background interference.
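The shape shift of eq. (8) can be illustrated by an exhaustive search over integer translations. The sketch below is only one way to realise it: the function name, the bounded search range max_shift and the handling of shifted pixels that fall outside the frame (they are ignored) are assumptions made for the example.

```python
import numpy as np


def best_shift(prob, mask, max_shift):
    """Translation (a, b) of the mask that maximises the summed probability."""
    h, w = prob.shape
    ys, xs = np.nonzero(mask)
    best_score, best_ab = -np.inf, (0, 0)
    for b in range(-max_shift, max_shift + 1):      # vertical shift
        for a in range(-max_shift, max_shift + 1):  # horizontal shift
            yy, xx = ys + b, xs + a
            # Assumption: shifted pixels outside the frame are ignored.
            inside = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
            score = prob[yy[inside], xx[inside]].sum()
            if score > best_score:
                best_score, best_ab = score, (a, b)
    return best_ab
```

The new shape centre is then the previous centre displaced by the returned (a, b); in practice max_shift would play the role of the maximally allowed pixel shift mentioned above.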
4 Experiments

We now experiment on a video sequence of 240x320 pixels as shown in figure 4.1 (a) and (b). The object at the beginning is extracted from the static background as in figure 4.1 (c), and is then on the moving background in the subsequent frames. On the frame within which the object is already known, we create a local background by including the object within the 115x135 area. The disparities dθ along the object
border for different radii are calculated and illustrated in figure 4.2 (a) and (b). The curves in figure 4.2 (a), from top to bottom, illustrate the effect of the circle radius changing from 2 to 5 pixels. We see from figure 4.2 (a) that some bandwidths achieve larger local differences between the object and the background than others, and that a suitable bandwidth would be 25 to 35. With Is=0.12, the bands for a given bandwidth w can be obtained via (1), as in figure 4.2 (c) and (d), where the dot-dash line denotes the centre of the bands. The corresponding segmentation is then shown in figure 4.2 (e) and (f). Figures 4.2 (c) to (f) are obtained for w=30; the contours in figure 4.2 (e) and (f) are drawn for better illustration, and a brighter colour indicates a larger number of pixels in a band within the object. For each band identified in figure 4.2 (c) and (d), we calculate Qok and Qbk respectively. For the band selection, we set μ=0. If the full frame is considered for the matching, we also set ν=0.5; see section 2.2 for the definition of μ and ν.
[Figure 4.1: three image panels (a), (b), (c).]

Fig. 4.1. (a) First frame. (b) Second frame. (c) Extracted object in the first frame.

[Figure 4.2: six panels (a) to (f).]

Fig. 4.2. (a) Total segmental disparity against bandwidth for radius= 2,3,4,5. (b) Radius=5. (c) Bands chosen for the red color. (d) Bands chosen for the green color. (e) Segmentation with bands in (c). (f) Segmentation with bands in (d).
Let χ=w/2 for (6), and define ϕ in (4) as the triangular function ϕ=1–|t| for |t| […]

[…] if a > 0, sgn(a) = 1; if a < 0, sgn(a) = −1; and if a = 0, sgn(a) = 0. The estimate is increased by one at every frame when it is smaller than the sample and decreased by one when it is greater than the sample. The absolute difference between It and Mt is the first differential estimation Δt. Unlike in [7], in [6] the absolute difference Δt is used to compute the time-variance of the pixels, representing their motion activity measure, which is employed to decide whether a pixel is moving or static. Vt is used as the dimension of a temporal standard deviation. It is computed as a Σ−Δ filter of the difference sequence. Finally, in order to select pixels that have a significant variation rate over their temporal activity, the Σ−Δ filter is applied N = 4 times.

After the first estimation of the background and the moving pixels, further spatiotemporal processing is necessary to: eliminate the non-significant pixels
J.A. Boluda and F. Pardo

Table 1. The Σ−Δ background estimation

Initialization, for each pixel x: $M_0(x) = I_0(x)$
For each frame t, for each pixel x: $M_t(x) = M_{t-1}(x) + \mathrm{sgn}(I_t(x) - M_{t-1}(x))$
For each frame t, for each pixel x: $\Delta_t(x) = |M_t(x) - I_t(x)|$
Initialization, for each pixel x: $V_0(x) = \Delta_0(x)$
For each frame t, for each pixel x such that $\Delta_t(x) \neq 0$: $V_t(x) = V_{t-1}(x) + \mathrm{sgn}(N \times \Delta_t(x) - V_{t-1}(x))$
For each frame t, for each pixel x: if $\Delta_t(x) < V_t(x)$ then $D_t(x) = 0$ else $D_t(x) = 1$
from the detection (noise, false detection), enhance the segmentation of the moving objects, reduce the ghost effect, which produces false detections where a moving object leaves after a long stay, and reduce the aperture effect, which leads to poor detection when the projected motion is weak. A Markovian model for spatiotemporal processing was proposed in [8]. Instead, several morphological reconstruction processes are proposed in [6]. A simple common-edges hybrid reconstruction has been performed to enhance Δt, as shown in equation (1). The inputs are the original image It and the Σ−Δ difference image Δt.

$$\Delta'_t = \mathrm{HRec}_{\alpha}^{\Delta_t}\big(\mathrm{Min}(\nabla(I_t), \nabla(\Delta_t))\big) \tag{1}$$
The gradient moduli of Δt and It are computed by estimating the first Sobel gradient and then computing the Euclidean norm. Min(∇(It), ∇(Δt)) acts as a logical conjunction, retaining only the edges that belong both to an object of Δt and to It. In order to recover the object in Δt, we reconstruct the common edges within Δt, with α as the elementary structuring element (a ball with radius=3). This has been done by performing a geodesic reconstruction of the common edges (marker image) with Δt as reference (or conditional image). The conditional dilation for the reconstruction has been limited to four iterations since no relevant changes appear with more iterations. Thus, after Δt has been reconstructed, Vt and Dt are computed.
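The per-frame update of Table 1 can be sketched in NumPy as follows. This illustration covers only the Σ−Δ estimation (without the hybrid reconstruction step); the function names are hypothetical, and integer images are assumed.

```python
import numpy as np


def sigma_delta_init(frame):
    """Initial background M_0 = I_0 and variance V_0 = Delta_0 = |M_0 - I_0|."""
    m = frame.astype(np.int32)
    v = np.abs(m - frame.astype(np.int32))
    return m, v


def sigma_delta_step(frame, m_prev, v_prev, n=4):
    """One frame of the Sigma-Delta estimation; returns (M_t, Delta_t, V_t, D_t)."""
    i_t = frame.astype(np.int32)
    # Background estimate moves by at most one grey level towards the sample.
    m_t = m_prev + np.sign(i_t - m_prev)
    delta_t = np.abs(m_t - i_t)
    # The variance estimate is updated only where Delta_t is non-zero.
    v_t = v_prev.copy()
    nz = delta_t != 0
    v_t[nz] = v_prev[nz] + np.sign(n * delta_t[nz] - v_prev[nz])
    # Detection label: 0 where Delta_t < V_t, 1 otherwise.
    d_t = (delta_t >= v_t).astype(np.uint8)
    return m_t, delta_t, v_t, d_t
```

Each call consumes the state returned by the previous one, so a sequence is processed by initialising on the first frame and looping over the remaining frames.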
Speeding-Up Differential Motion Detection Algorithms
3.2 Change-Driven Data Flow Algorithm
The original algorithm has been modified using the change-driven data flow processing strategy. With this procedure not all the pixels are processed: only the pixels that undergo a change greater than or equal to the CST fire the related instructions. Fig. 1 shows the data flow and the intermediate stored images.
[Figure 1: data-flow diagram. The test |I_{t+1}(x) − I_t(x)| > CST fires the update of the stored intermediate images M_t, G_x(I_t), G_y(I_t), Δ_t, G_x(Δ_t), G_y(Δ_t), ∇(I_t), ∇(Δ_t), Min(∇(Δ_t), ∇(I_t)), Δ'_t = Rec_α^{Δ_t}(Min(∇(Δ_t), ∇(I_t))), V_t and D_t.]

Fig. 1. Modified algorithm data flow and intermediate images
An initial image It is stored in the computer as the current image. Any absolute difference between a pixel It+1(x) of the incoming image and the stored image that reaches the CST fires the computation of the intermediate images. Moreover, these updates must take into account that the change of an input pixel may modify several output variables. For example, if a pixel is modified (by a value greater than or equal to the CST), then its contribution to 6 pixels of the Sobel gradient image Gx and to 6 pixels of the image Gy must be updated. It may appear that a single pixel modification can produce an uncontrolled explosion of operations, but these operations are a simple addition (and sometimes a multiplication by two) per pixel, and sometimes they can be reused. In the original algorithm the Sobel gradient images Gx and Gy are computed for every pixel in any case, with six additions per pixel and the corresponding multiplications.
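The change-driven update of the stored image and of one Sobel gradient image can be sketched as below. This is an illustration rather than the authors' implementation: the function names are hypothetical, zero padding at the image border is an assumption, and only Gx is shown (Gy would be handled the same way with the transposed kernel).

```python
import numpy as np

# Sobel kernel for the horizontal gradient G_x (correlation form).
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.int64)


def sobel_gx(img):
    """Full G_x image, computed for every pixel with zero padding."""
    h, w = img.shape
    padded = np.pad(img.astype(np.int64), 1)
    gx = np.zeros((h, w), dtype=np.int64)
    for dy in range(3):
        for dx in range(3):
            gx += KX[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return gx


def change_driven_update(stored, current, gx, cst):
    """Update stored image and G_x in place where |current - stored| >= cst."""
    h, w = stored.shape
    diff = current.astype(np.int64) - stored.astype(np.int64)
    fired = np.argwhere(np.abs(diff) >= cst)
    for r, c in fired:
        d = diff[r, c]
        # A changed pixel contributes to the 6 outputs with non-zero weight.
        for dy in range(3):
            for dx in range(3):
                y, x = r - dy + 1, c - dx + 1
                if KX[dy, dx] != 0 and 0 <= y < h and 0 <= x < w:
                    gx[y, x] += KX[dy, dx] * d
        stored[r, c] = current[r, c]
    return len(fired)
```

Because the gradient is linear in the image, the incrementally updated Gx equals a full recomputation on the updated stored image, while pixels whose change stays below the CST cost nothing.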
4 Experimental Results
The original differential motion detection algorithm and the change-driven data flow version have been implemented. Both algorithms have been tested using several traffic sequences downloaded from the public ftp site of Professor H.-H. Nagel at the University of Karlsruhe: http://i21www.ira.uka.de/image_sequences/. Fig. 2(a) shows a sequence frame of 740 × 560 pixels. The stationary cars in this frame correctly appear as motionless, as part of the background. Fig. 2(b) shows the background image Mt. At this point only three cars in the top left of the image, a cyclist below the cars and a car partially occluded by a tree at the centre appear as moving objects, as the original version shows in Fig. 2(c). The results of the change-driven data flow algorithm are shown in Fig. 2(d).
Fig. 2. (a) Original sequence, (b) Background Mt image, (c) Dt in the original algorithm, (d) Dt in the modified algorithm with CST=3
There are no evident differences in terms of detected moving points between the original algorithm and the change-driven data flow version with CST = 3, as Fig. 2(c) and (d) show. The results are almost the same but the execution time decreases significantly. Fig. 3 shows the execution time speed-up achieved by the modified algorithm, with different CST values, over the original implementation. It must be noted that the change-driven data flow implementation with
CST = 1 gives the same image results as the original but has a lower computational cost. Likewise, as CST increases, the speed-up improves, and with CST ≥ 1 the change-driven data flow implementation improves on the execution times of the original algorithm (speed-up greater than 1).
[Plot: change-driven data flow speed-up (vertical axis, 1 to 2.4) against CST value (horizontal axis, 1 to 6).]

Fig. 3. Measured speed-up obtained with the change-driven data flow algorithm
The execution time speed-up improves because only pixels whose change is greater than or equal to CST are processed. In this way, the additional operations required by the data flow implementation are compensated by the decreasing number of pixels processed when CST ≥ 1. The immediate question is whether there is a limit to CST growth and therefore to the algorithm speed-up versus the original implementation. The answer is that if CST is too high, very few points are processed, so there is no systematic background update, and moreover, moving points with a gray value difference lower than CST are not detected. This property produces two effects that limit the change-driven data flow strategy: as CST increases, fewer correct moving points are detected and, at the same time, more false positives are detected. The effect of false positives due to the non-constant background computation is more relevant than the decrease in correct positives that contribute to the computation of the moving object centre. For CST ≥ 6 the approach becomes unfeasible: further pepper-noise filtering of Dt is required, and consequently the change-driven data flow speed-up disappears. Therefore, practical values of CST in the tested sequences are 1 ≤ CST ≤ 5. The number of processed points is sequence and frame dependent but decreases as the CST value increases. Practical CST values reduce the number of pixels to be processed by a factor of five with CST ≥ 5, maintaining the same accuracy for the detection of moving objects but giving a speed-up of almost two for this CST value.
5 Conclusion
A procedure for speeding up differential motion detection algorithms has been presented. The change-driven data flow strategy is based on processing only the pixels whose changes are above a threshold. The implementation of this methodology requires extra storage to keep track of the intermediate results of the preceding computing stages. A recent differential motion detection algorithm has been chosen to test the change-driven data flow policy. The change-driven data flow implementation shows a significant speed-up for a CST range that gives image results similar to those of the original implementation. Thus, we have demonstrated the feasibility of using the change-driven data flow strategy in differential motion detection algorithms.
Acknowledgments

This work has been supported by the project TEC2006-08130/MIC of the Spanish Ministerio de Educación y Ciencia.
References

1. Sandini, G., Questa, P., Scheffer, D., Diericks, B., Mannucci, A.: A retina-like CMOS sensor and its applications. Sensor Array and Multichannel Signal Processing Workshop, 514–519 (2000)
2. Boluda, J.A., Domingo, J.: On the advantages of combining differential algorithms and log-polar vision for detection of self-motion from a mobile robot. Robotics and Autonomous Systems 37(4), 283–296 (2001)
3. Özalevli, E., Higgins, C.M.: Reconfigurable Biologically Inspired Visual Motion Systems Using Modular Neuromorphic VLSI Chips. IEEE Transactions on Circuits and Systems 52(1), 79–92 (2005)
4. Silc, J., Robic, B., Ungerer, T.: Processor architecture: from dataflow to superscalar and beyond. Springer, Heidelberg (1999)
5. Pardo, F., Boluda, J.A., Sosa, J.C.: Circle detection and tracking speed-up based on change-driven image processing. In: Proc. ICGST International Conference on Graphics, Vision and Image Processing GVIP-05 (2005)
6. Manzanera, A., Richefeu, J.C.: A new motion detection algorithm based on Σ-Δ background estimation. Pattern Recognition Letters 28, 320–328 (2007)
7. McFarlane, N., Schofield, C.: Segmentation and tracking of piglets in images. Machine Vision and Applications 8, 187–193 (1995)
8. Manzanera, A., Richefeu, J.C.: A robust and computationally efficient motion detection algorithm based on Σ-Δ background estimation. In: Proc. ICGVIP'04, pp. 46–51 (2004)
Foreground and Shadow Detection Based on Conditional Random Field

Yang Wang
National ICT Australia, Kensington, NSW 2032, Australia

Abstract. This paper presents a conditional random field (CRF) approach to integrate spatial and temporal constraints for moving object detection and cast shadow removal in image sequences. Interactions among both detection (foreground/background/shadow) labels and observed data are unified by a probabilistic framework based on the conditional random field, where the interaction strength can be adaptively adjusted in terms of data similarity of neighboring sites. Experimental results show that the proposed approach effectively fuses contextual dependencies in video sequences and significantly improves the accuracy of object detection.

Keywords: Conditional random field, contextual constraint, object detection, shadow removal.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 85–92, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Moving object detection in image sequences is very important to application areas such as human-computer interaction, content-based video compression, and automated visual surveillance. In recent years, background modeling has become a commonly used technique to identify moving objects with a fixed camera. However, accurate detection can be difficult because of sources of variability such as shadows cast by moving objects, nonstationary background processes (e.g. illumination variations), and camouflage (i.e. similarity between the appearances of moving objects and the background) [7]. Comprehensive modeling of spatiotemporal information within the video sequence is a key issue in robustly segmenting moving objects in the scene.

Temporal information is fundamental to handling nonstationary background processes. Linear processes or probability distributions can be used to characterize background changes from recent observations [1] [11]. In [10], the recent history of pixel intensity is modeled by a mixture of Gaussians, and the Gaussian mixture is adaptively updated for each site to deal with dynamics in background processes. On the other hand, spatial information is important for understanding the structure of the scene. Gradient (or edge) features help improve the reliability of object and shadow detection [2] [12]. In [9], spatial cooccurrence of image variations at neighboring blocks is employed to enhance the detection sensitivity.

Moreover, the contextual constraint is an essential element for effectively fusing spatial and temporal information throughout the detection process. For instance, contiguous sites are likely to belong to the same object in the scene, and one site tends to remain in the same object at consecutive time instants. Markov random field (MRF) and hidden Markov model (HMM) have been extensively employed to formulate contextual constraints [6] [8]. However, conditional independence of observations is usually assumed in the previous work, which is too restrictive for visual scene modeling. Compared to generative models including MRF and HMM, the conditional random field (CRF) relaxes the strong independence assumption and captures contextual dependencies among observations [4]. Originally proposed for text labeling, CRF has recently been applied to image processing [3].

In this paper, a CRF based approach is proposed to fuse contextual dependencies in image sequences for moving object detection. Spatial and temporal constraints during the detection process are unified in a probabilistic model that permits neighborhood interactions in both labels and observations. The method discriminates moving cast shadows and handles nonstationary background processes. Experimental results show that the proposed approach effectively integrates contextual constraints in video sequences and significantly improves the detection accuracy.

The rest of the paper is arranged as follows: Section 2 presents the CRF based model to formulate contextual dependencies in video sequences. Section 3 proposes the object detection method and describes the implementation details. Section 4 discusses the experimental results. Finally, Section 5 concludes the paper.
2 Model

Given an image sequence, the observation and label of a point x at time instant t are denoted by $g_x^t$ and $l_x^t$, respectively. The observation $g_x^t$ consists of intensity information at the site x. The label $l_x^t$ assigns the point x to one of K classes (K equals 3 in this work): $l_x^t = e_k$ if point x belongs to the kth class, where $e_k$ is a K-dimensional unit vector with its kth component equal to one. Here $t \in \mathbb{N}$, $x \in X$, and X is the spatial domain of the video scene. The entire label field and observed image over the scene at time t are compactly expressed as $L^t$ and $G^t$, respectively.

Given the observations, there are two ways to estimate the labels. In a generative framework, both the prior model of the label field and the likelihood model of the observed data are formulated to estimate the joint distribution over observations and labels. Alternatively, in a discriminative framework the posterior distribution of the label field is directly formulated. Based on the conditional random field, in this work spatial and temporal constraints within the video scene are integrated through a discriminative framework of contextual dependencies among neighboring sites.

The definition of conditional random field (CRF) is given by Lafferty et al. in [4]. For random variables S and observed data D over the video scene, (S, D) is a conditional random field if, when conditioned on D, the random field S obeys the Markov property $p(s_x \mid D, s_y, y \neq x) = p(s_x \mid D, s_y, y \in N_x)$, where the set $N_x$ denotes the neighboring sites of point x. In this work, the probability distribution of the labels is modeled by a conditional random field to formulate contextual dependencies. At each time t, the label field $L^t$ obeys the Markov property when the image $G^t$ is given. Hence $L^t$ is a random field globally conditioned on the observed data. Using the Hammersley-Clifford theorem and considering only up to pairwise clique potentials, the posterior probability of the label field is given by a Gibbs distribution of the following form:

$$p(L^t \mid G^t) \propto \exp\Big\{ -\sum_{x \in X} \Big[ V_x(l_x^t \mid G^t) + \sum_{y \in N_x} V_{x,y}(l_x^t, l_y^t \mid G^t) \Big] \Big\}. \tag{1}$$
The one-pixel potential $V_x(l_x^t \mid G^t)$ reflects the local constraint for a single site. Meanwhile, the two-pixel potential $V_{x,y}(l_x^t, l_y^t \mid G^t)$ models the contextual information (or pairwise constraint) between neighboring sites. The strength of the constraints depends on the observed data. The potential functions are further expressed as

$$V_x(l_x^t \mid G^t) = V_x^{l}(l_x^t) + V_x^{l|g}(l_x^t \mid G^t), \tag{2a}$$

$$V_{x,y}(l_x^t, l_y^t \mid G^t) = V_{x,y}^{l}(l_x^t, l_y^t) + V_{x,y}^{l|g}(l_x^t, l_y^t \mid G^t). \tag{2b}$$

In the one-pixel potential, the first term reflects the prior knowledge for different label classes, and the second term reflects the observed information for individual sites. In the two-pixel potential, the first term imposes the connectivity constraint independent of the observations, and the second term imposes the neighborhood interaction dependent on the observed data. Thus, during the detection process both the data-independent and data-dependent contextual dependencies are unified in a probabilistic discriminative framework based on the CRF.

The proposed approach degenerates into an MRF approach when the data-dependent pairwise potential is set to zero, which is equivalent to the conditional independence assumption on the observed data (i.e., ignoring interactions among observations when labels are given) used in previous generative approaches [6] [8]. The approach in [12] also ignores the data-dependent pairwise potential although its model is based on CRF, and should therefore be categorized as an MRF approach from this point of view. Therefore, the proposed approach theoretically models contextual constraints in video sequences more comprehensively than previous approaches. At each time t, the MAP (maximum a posteriori) estimate of the label field is computed as $\hat{L}^t = \arg\max_{L^t} p(L^t \mid G^t)$ given the potential functions.
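The role of the two kinds of potential in Eq. (1) can be seen on a toy problem. The following is a minimal sketch, not the method of the paper: the four-site chain, the two classes (the paper uses K = 3), the potential values, and the names `energy`, `map_by_enumeration`, and `potts` are all assumptions made for illustration. It finds the MAP labeling by exhaustive search, which is feasible only for a handful of sites.

```python
import itertools


def energy(labels, unary, pairwise, neighbors):
    """Exponent of Eq. (1) without the minus sign: sum of one- and two-pixel potentials."""
    total = 0.0
    for x, lx in enumerate(labels):
        total += unary[x][lx]
        for y in neighbors[x]:
            total += pairwise(lx, labels[y])
    return total


def map_by_enumeration(unary, pairwise, neighbors, n_classes):
    """Exhaustive MAP search: the labeling with minimum energy has maximum posterior."""
    candidates = itertools.product(range(n_classes), repeat=len(unary))
    return list(min(candidates, key=lambda lab: energy(lab, unary, pairwise, neighbors)))


def potts(beta):
    """Data-independent pairwise potential: -beta when labels agree, 0 otherwise."""
    return lambda lx, ly: -beta if lx == ly else 0.0


# Hypothetical 4-site chain; unary[x][k] is the one-pixel potential of class k at site x.
unary = [[0.1, 2.0], [0.2, 1.5], [1.0, 0.9], [0.3, 1.8]]
neighbors = [[1], [0, 2], [1, 3], [2]]

print(map_by_enumeration(unary, potts(0.0), neighbors, 2))  # [0, 0, 1, 0]
print(map_by_enumeration(unary, potts(0.5), neighbors, 2))  # [0, 0, 0, 0]
```

Without the pairwise term, site 2 takes the label that is marginally cheaper locally; with it, the weak local preference is overridden by agreement with the neighbors.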
3 Method

Given the video sequence, each pixel in the scene is to be classified as moving object, background, or shadow. For a site x at time t, the label $l_x^t$ equals $e_1$ for a background pixel, $e_2$ for cast shadow, and $e_3$ for a moving object. Here static shadows are considered to be part of the background. In order to segment foreground objects, the system should first model the background and shadow information.

3.1 Observation Model

For each point x, the pixel intensity $g_x^t$ has three (R, G, and B) components for color images or one value for grayscale images. Grayscale images are considered in this section, while the formulation for color images can be derived similarly. Assume that Gaussian noise corrupts each pixel in the scene, so that the observation model for a background point at time t becomes

$$g_x^t = b_x^t + n_x^t, \quad \text{if } l_x^t = e_1, \tag{3}$$
where $b_x^t$ is the intensity mean for a pixel x within the background, and $n_x^t$ is independent zero-mean Gaussian noise with variance $(\sigma_x^t)^2$. Intensity means and variances in the background can be estimated from previous images. For stationary background scenes, the intensity mean and variance of each background point can be estimated from a sequence of background images recorded at the beginning. For nonstationary background scenes, the background updating process is based on the idea of Stauffer and Grimson [10]. The recent history of each pixel is modeled by a mixture of Gaussians. As the parameters of the mixture model change, the Gaussian distribution that has the highest ratio of weight over variance is chosen as the background model. Given the intensity of a background point, a linear model is used to describe the change of intensity for the same point when shadowed in the video frame:

$$g_x^t = r_x^t b_x^t + n_x^t, \quad \text{if } l_x^t = e_2. \tag{4}$$
Considering the contiguity of the video image, the coefficient can be estimated from the neighborhood of the point as

$$r_x^t \approx \frac{\sum_{y \in N_x} g_y^t}{\sum_{y \in N_x} b_y^t}$$

if the point is under shadow.
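The neighborhood estimate of the shadow coefficient can be sketched as follows. This is an illustrative sketch rather than the author's code: the function name `shadow_coefficient`, the list-of-lists image layout, the square window, and the fallback value for an all-zero background are assumptions. With `radius=2` the window is 5×5 without its center, which matches the 24-pixel neighborhood reported in the experiments.

```python
def shadow_coefficient(frame, background, row, col, radius=2):
    """Estimate r at (row, col) as sum(g_y) / sum(b_y) over the neighbors y of the site."""
    n_rows, n_cols = len(frame), len(frame[0])
    num = den = 0.0
    for r in range(max(0, row - radius), min(n_rows, row + radius + 1)):
        for c in range(max(0, col - radius), min(n_cols, col + radius + 1)):
            if (r, c) == (row, col):
                continue  # the neighborhood N_x excludes the site itself
            num += frame[r][c]
            den += background[r][c]
    # Assumption: fall back to r = 1 (no attenuation) for an all-zero background.
    return num / den if den > 0 else 1.0


# Hypothetical data: a uniform background of 100 darkened to 60 by a shadow.
background = [[100] * 5 for _ in range(5)]
frame = [[60] * 5 for _ in range(5)]
print(shadow_coefficient(frame, background, 2, 2))  # 0.6
```

Summing over the neighborhood before dividing, instead of using the single ratio at the site, makes the estimate less sensitive to the noise term in Eq. (4).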
In order to achieve maximum application independence, it is assumed in this work that the model of moving objects is unknown. In the absence of any prior knowledge about the foreground, a uniform distribution is used for the foreground pixel intensity. From the above discussion, the intensity likelihood for each point x at time t becomes

$$p(g_x^t \mid l_x^t) = \begin{cases} N(g_x^t;\, b_x^t,\, (\sigma_x^t)^2), & \text{if } l_x^t = e_1 \\ N(g_x^t;\, r_x^t b_x^t,\, (\sigma_x^t)^2), & \text{if } l_x^t = e_2 \\ c, & \text{if } l_x^t = e_3 \end{cases} \tag{5}$$

where $N(z; \mu, \sigma^2)$ is a Gaussian distribution with argument z, mean $\mu$, and variance $\sigma^2$, and c is a small positive constant ($c = 1/256$ for grayscale images).

3.2 Spatiotemporal Constraint
The one-pixel potential function is set as $V_x(l_x^t \mid G^t) = -\ln p(l_x^t \mid g_x^t)$, so that the posterior distribution $p(L^t \mid G^t)$ becomes the product of local posteriors $\prod_{x \in X} p(l_x^t \mid g_x^t)$ if the two-pixel potential is ignored. Since $p(l_x^t \mid g_x^t) \propto p(l_x^t, g_x^t) = p(l_x^t)\, p(g_x^t \mid l_x^t)$, the two terms in the one-pixel potential become $V_x^{l}(l_x^t) = -\ln p(l_x^t)$ and $V_x^{l|g}(l_x^t \mid G^t) = -\ln p(g_x^t \mid l_x^t)$. The first term reflects the prior information of the label without the current image. Given the label estimate at the previous time instant, the prediction of the label for the current time instant can be expressed by the following data-independent one-pixel potential:

$$V_x^{l}(l_x^t) = -\ln p(l_x^t) = \begin{cases} \alpha_1, & \text{if } l_x^t = \hat{l}_x^{t-1} \\ \alpha_2, & \text{if } l_x^t \neq \hat{l}_x^{t-1} \end{cases} \tag{6}$$
where $\alpha_1$ is smaller than $\alpha_2$, so that the site is likely to remain in the same class as at the previous time instant. The potential function imposes the temporal continuity constraint to encourage a point to have the same label in successive video frames. The data-dependent one-pixel potential $V_x^{l|g}(l_x^t \mid G^t) = -\ln p(g_x^t \mid l_x^t)$ reflects the constraint from the local observation. This potential function can be computed using the local intensity likelihood derived in the previous section.

Since foreground objects usually consist of contiguous regions, the connectivity constraint is imposed by the following data-independent pairwise potential:

$$V_{x,y}^{l}(l_x^t, l_y^t) = -\beta_1\, l_x^t \cdot l_y^t, \tag{7}$$

where $\cdot$ denotes the inner product, and the positive $\beta_1$ weights the importance of data-independent spatial connectivity. The potential is negative only when the two neighboring sites belong to the same class. Thus two neighboring pixels are more likely to belong to the same class than to different classes. The term imposes the smoothness constraint independent of the observed data.

The data-dependent pairwise potential can be expressed as $V_{x,y}^{l|g}(l_x^t, l_y^t \mid G^t) \propto \|g_x^t - g_y^t\|\, l_x^t \cdot l_y^t$, so that the potential function encourages data similarity when neighboring sites belong to the same class. However, under heavy noise neighboring sites may become quite different even though they belong to the same class. To prevent this problem when detecting moving objects in noisy environments, the data-dependent pairwise potential is replaced by the following normalized term:

$$V_{x,y}^{l|g}(l_x^t, l_y^t \mid G^t) = \frac{\beta_2\, \|g_x^t - g_y^t\|^2\; l_x^t \cdot l_y^t}{\dfrac{1}{|N_x|} \sum_{i \in N_x} \|g_x^t - g_i^t\|^2 + \dfrac{1}{|N_y|} \sum_{i \in N_y} \|g_y^t - g_i^t\|^2}, \tag{8}$$

where $\|\cdot\|$ is the Euclidean distance, and the positive $\beta_2$ weights the importance of data-dependent spatial interaction. The potential function thus imposes an adaptive contextual constraint that adjusts the interaction strength according to the similarity between neighboring observations. The term models the neighborhood interaction dependent on the observed data.

3.3 Parameter and Optimization
The potential functions for the posterior distribution of the label field at time t are expressed as follows:

$$V_x(l_x^t \mid G^t) = -\alpha\, l_x^t \cdot \hat{l}_x^{t-1} - \ln p(g_x^t \mid l_x^t), \tag{9a}$$

$$V_{x,y}(l_x^t, l_y^t \mid G^t) = -\beta_1\, l_x^t \cdot l_y^t + \frac{\beta_2\, \|g_x^t - g_y^t\|^2\; l_x^t \cdot l_y^t}{\dfrac{1}{|N_x|} \sum_{i \in N_x} \|g_x^t - g_i^t\|^2 + \dfrac{1}{|N_y|} \sum_{i \in N_y} \|g_y^t - g_i^t\|^2}, \tag{9b}$$
where the positive $\alpha = \alpha_2 - \alpha_1$ weights the importance of temporal continuity. To balance the influence of the pairwise potential terms, it is assumed that

$$\sum_{x \in X} |N_x|\, \beta_1 = \sum_{x \in X} |N_x|\, \beta_2 = |X|\, \beta. \tag{10}$$
The parameters α and β are manually determined to reflect the influence of temporal constraint and spatial constraint respectively. At each time instant, the posterior distribution of label field is optimized by mean field approximation [5]. The mean field algorithm suggests that when estimating the label mean at a single site, the influence from neighboring sites can be approximated by that of their means. Thus the label field at each time instant can be computed from an iterative procedure. Initially, uniform distributions are assumed for all the labels in L0.
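The interplay of Eqs. (5), (9a), and (9b) with the mean field iteration can be sketched on a one-dimensional row of pixels. This is a generic mean field update written under several assumptions, not a reconstruction of the author's C program: the row layout with left/right neighbors, the fixed shadow coefficient, the parameter values, the synchronous update order, and all function names are hypothetical. Class indices 0, 1, and 2 stand for $e_1$ (background), $e_2$ (shadow), and $e_3$ (foreground).

```python
import math

C_FOREGROUND = 1.0 / 256.0  # uniform foreground likelihood c from Eq. (5)


def gaussian(z, mean, var):
    """Gaussian density N(z; mean, var)."""
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def likelihoods(g, b, r, var):
    """Eq. (5): likelihood of intensity g under background, shadow, foreground."""
    return [gaussian(g, b, var), gaussian(g, r * b, var), C_FOREGROUND]


def pair_weight(g, neighbors, x, y, beta1, beta2):
    """Coefficient multiplying the inner product l_x . l_y in Eq. (9b)."""
    def mean_sq_diff(s):
        return sum((g[s] - g[i]) ** 2 for i in neighbors[s]) / len(neighbors[s])

    denom = mean_sq_diff(x) + mean_sq_diff(y)
    # Assumption: drop the data-dependent term when both neighborhoods are flat.
    data_term = beta2 * (g[x] - g[y]) ** 2 / denom if denom > 0 else 0.0
    return -beta1 + data_term


def mean_field_labels(g, b, r, var, prev, neighbors, alpha, beta1, beta2, n_iter=10):
    """Return the most probable class per site (0=background, 1=shadow, 2=foreground)."""
    n_sites, n_classes = len(g), 3
    q = [[1.0 / n_classes] * n_classes for _ in range(n_sites)]  # uniform start
    for _ in range(n_iter):
        new_q = []
        for x in range(n_sites):
            like = likelihoods(g[x], b[x], r[x], var[x])
            scores = []
            for k in range(n_classes):
                # One-pixel potential, Eq. (9a); the floor avoids log(0).
                v = -alpha * (prev[x] == k) - math.log(max(like[k], 1e-300))
                # Expected pairwise potential; each pair occurs twice in Eq. (1).
                for y in neighbors[x]:
                    v += 2.0 * pair_weight(g, neighbors, x, y, beta1, beta2) * q[y][k]
                scores.append(-v)
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            new_q.append([w / total for w in weights])
        q = new_q
    return [qx.index(max(qx)) for qx in q]


# Hypothetical row: background near 100, shadow near 0.6 * 100, bright object.
g = [101, 99, 60, 62, 180, 178]
b, r, var = [100] * 6, [0.6] * 6, [25.0] * 6
prev = [None] * 6  # no previous label estimate available
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
labels = mean_field_labels(g, b, r, var, prev, neighbors, alpha=1.0, beta1=0.5, beta2=0.5)
print(labels)
```

Because the label vectors are unit vectors, the expected pairwise potential at a site reduces to the pair coefficient times the neighbor's current probability for the same class, which is why only `q[y][k]` appears in the inner loop.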
4 Results

The proposed approach has been tested on video sequences captured in different environments and compared with other detection methods. The 24-pixel neighborhood is used in the algorithm. Our C program can process about three 160×120 frames per second on a 2.8 GHz Pentium 4 PC. Three moving object detection algorithms are studied in our experiments: the Gaussian mixture (GM) approach [10], the proposed approach without the data-dependent pairwise potential, and the proposed approach. The same initialization and neighborhood are used in these algorithms (when applicable). GM is a pixel based method without contextual constraints among neighboring sites, while the last two can be viewed as methods that impose contextual constraints based on Markov random field (MRF) and conditional random field (CRF), respectively. Figure 1 shows the detection results of the Gaussian mixture approach and the proposed method for the “laboratory” sequence with background illumination change.
Fig. 1. (a) One frame of a sequence. (b) Detection result by the GM approach. (c) Detection result by the proposed approach.

Fig. 2. (a) One frame of a sequence. (b) Detection result by the MRF approach. (c) Detection result by the proposed approach.
The gray regions in Figure 1c represent moving cast shadows. Compared to Figure 1b, the proposed approach greatly improves the accuracy of foreground detection by imposing contextual constraints. It can be seen that the noisy pixels in the scene are correctly classified by the proposed method. Meanwhile, moving cast shadows are accurately removed from the foreground in Figure 1c. Figure 2 shows the detection results of the Markov random field approach and the proposed method for the “aerobic” sequence. The MRF approach produces smooth detection results. However, it may sometimes smooth in the wrong way since the interactions among observations are neglected. The camouflage regions in the human body are erroneously detected in Figure 2b, while this problem is overcome by the proposed approach with data-dependent neighborhood interactions.

Table 1. Error rates of detection results

Method   GM     MRF    CRF
Error    6.4%   4.3%   3.0%
The detection results are also evaluated quantitatively by comparison with manually labeled ground-truth images. Table 1 shows the average error rate (proportion of misclassified points in the entire image) for ten representative frames of the sequences shown in Figures 1 and 2 (five frames of each sequence). The MRF approach outperforms the GM approach by introducing contextual constraints. Compared to the MRF approach, the CRF approach takes advantage of data-dependent neighborhood interactions. In our experiments, the CRF approach reduces the error rate of the other two approaches (GM and MRF) by 53% and 30% on average, respectively. The substantial increase in accuracy indicates that the proposed approach effectively fuses spatial and temporal dependencies for object detection in video sequences.
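As a check on these figures, the relative reductions can be recomputed from the rounded error rates in Table 1 (the published percentages may have been derived from unrounded values):

```latex
\[
\frac{6.4 - 3.0}{6.4} \approx 0.53, \qquad
\frac{4.3 - 3.0}{4.3} \approx 0.30 .
\]
```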
5 Conclusion

There are two main contributions in this paper. First, based on the conditional random field, a probabilistic discriminative framework has been proposed to fuse contextual dependencies in image sequences. Second, under the proposed framework, spatial and temporal constraints have been formulated for moving object and shadow detection. Experimental results show that the method significantly improves the performance of object detection and shadow removal in video sequences. Our future work is to determine the algorithm parameters automatically through learning.
References

1. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: Real-time surveillance of people and their activities. IEEE Trans. Patt. Anal. Mach. Intel. 22, 809–830 (2000)
2. Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A.: Detection and location of people in video images using adaptive fusion of color and edge information. In: Proc. Int'l. Conf. Pattern Recognition, vol. 4, pp. 627–630 (2000)
3. Kumar, S., Hebert, M.: Discriminative fields for modeling spatial dependencies in natural images. Advances in Neural Information Processing Systems, 1351–1358 (2004)
4. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. Int'l. Conf. Machine Learning, pp. 282–289 (2001)
5. Li, S.Z.: Markov Random Field Modeling in Image Analysis. Springer, Heidelberg (2001)
6. Paragios, N., Ramesh, V.: A MRF-based approach for real-time subway monitoring. Proc. IEEE Conf. Computer Vision and Pattern Recognition 1, 1034–1040 (2001)
7. Prati, A., Mikic, I., Trivedi, M.M., Cucchiara, R.: Detecting moving shadows: Algorithms and evaluation. IEEE Trans. Patt. Anal. Mach. Intel. 25, 918–923 (2003)
8. Rittscher, J., Kato, J., Joga, S., Blake, A.: A probabilistic background model for tracking. Proc. European Conf. Computer Vision 2, 336–350 (2000)
9. Seki, M., Wada, T., Fujiwara, H., Sumi, K.: Background subtraction based on cooccurrence of image variations. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition 2, 65–72 (2003)
10. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Patt. Anal. Mach. Intel. 22, 747–757 (2000)
11. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and practice of background maintenance. Proc. Int'l. Conf. Computer Vision 1, 255–261 (1999)
12. Wang, Y., Loe, K.-F., Wu, J.-K.: A dynamic conditional random field model for foreground and shadow segmentation. IEEE Trans. Patt. Anal. Mach. Intel. 28, 279–289 (2006)
n-Grams of Action Primitives for Recognizing Human Behavior

Christian Thurau and Václav Hlaváč
Czech Technical University, Faculty of Electrical Engineering, Department for Cybernetics, Center for Machine Perception, 121 35 Prague 2, Karlovo náměstí, Czech Republic

Abstract. This paper presents a novel approach for behavior recognition from video data. A biologically inspired action representation is derived by applying a clustering algorithm to sequences of motion images. To obey the temporal context, we express behaviors as sequences of n-grams of basic actions. Novel video sequences are classified by comparing histograms of action n-grams to stored histograms of known behaviors. Experimental validation shows a high accuracy in behavior recognition.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 93–100, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Motivation and Problem Formulation

Human behavior analysis is an important component of surveillance, video annotation, and sports video interpretation. Basically, three main problems have to be addressed: (a) how to track humans in videos, (b) how to adequately represent actions, and (c) how to classify a sequence of action representations into behavioral classes. In this work, we address the problem of action representation and classification using action primitives. We validate the approach in a human tracking scenario where we assume, for simplicity, a static camera and a well-working background subtraction algorithm, so that silhouette images of moving persons can be extracted. Based on the extracted silhouette images, we classify action sequences into a set of predefined behavioral prototypes, e.g. ‘walking’ or ‘running’.

2 Related Work

The activity representation proposed in this paper is inspired by the concept of movement primitives in biological organisms. Recent results in behavioral biology indicate that, instead of directly controlling motor activation commands, movements are the outcome of a combination of movement primitives [1]. Furthermore, recent experiments in cognitive psychology explain the representation of motor skills in human long-term memory by sequences of basic-action concepts [2]. Basic-action concepts are similar to movement primitives and assume an internal tree-like structuring of basic action units. A more detailed review of the problems of motor control can be found in [3,1,4]. Note that, although different disciplines have chosen different terms to describe movement control, we can find certain common ideas. In this paper, we use the term action primitive or basic action unit to refer to single prototypical actions. Behavior is interpreted as the outcome of appropriate sequencing of action primitives.

The idea of movement primitives has found its way into various robotic and motion control applications. In this context, a movement primitive is often referred to as a computational element in a sensorimotor map that transforms desired limb trajectories into actual motor commands [5]. Related to our approach is the work of Fod et al. [4], where the concept of movement primitives is applied in a computational approach by linear superposition of clustered primitive motor commands that finally results in complex motion in simulated agents. For the imitation of human activities in simulated game worlds, we used a probabilistic sequencing of action primitives [6]. Despite using different features, the action primitives were derived in a similar way as in [4], or as proposed in this paper.

For recognition tasks, the idea of representing actions by a set of basic building blocks is not new; for a brief overview we recommend [7]. Recently, Moeslund et al. [8] recognized actions by comparing strings of manually found motion primitives, where the motion primitives are extracted from four features in a motion image. In [9], silhouette keyframe images are extracted using optical flow data. The keyframes are grouped in pairs and used to construct a grammar for action recognition. In [10], the idea of recognizing human behavior using an activity language is further expanded. Moreover, numerous other papers deal with behavior recognition from motion. Related to our approach are the works of Hamid et al. [11] and Goldenberg et al. [12]. Hamid et al. [11] applied a sequence comparison using bags of event n-grams to recognize higher-level behaviors, where n-grams are built using a set of basic hand-crafted events. Goldenberg et al. [12] extracted eigenshapes from periodic motions for behavior classification.
3 Proposed Behavior Representation and Recognition

The purpose of this paper is twofold: first, we want to advance automatic behavioral analysis from video data; second, we want to propose psychologically/biologically inspired approaches for recognizing human activities. For action representation, we map each frame of a video sequence to a certain action primitive. Since behavior is a temporal phenomenon, we express the sequential observation of basic action units using n-grams, a popular representation used in speech analysis and text mining. The number of occurrences of certain n-gram patterns holds valuable information about the underlying behavior. Consequently, we classify video sequences by comparing extracted histograms of n-gram patterns to already known behavioral histograms. Experimental results on a publicly available dataset, kindly provided by the Weizmann Institute [13], illustrate the applicability of the proposed approach.

A sequence of silhouette images is one of the simplest and most accessible representations of human motion. Silhouettes of humans have frequently been used for classifying activities from video data [12,13]. Given a sequence of silhouette images, the task is to recognize a specific behavior. However, we first have to extract a suitable action representation before behavior recognition can take place. In the following, we (a) explain how to derive an action representation using the concept of movement primitives, (b) extend the action representation using n-grams to incorporate the temporal context, and (c) introduce a simple classifier by means of histogram comparison of sequences of action n-grams.

Fig. 1. 10 exemplary action primitives. Sequencing of action primitives results in more complex motions and can thus be interpreted as behavior.

(a) Following ideas discussed in [4,12], in order to make the problem of action primitive extraction feasible, we reduce the dimensionality of the input images using Principal Component Analysis (PCA). Applying PCA to the image training data results in a set of eigenvectors or, since we are dealing with silhouette images, eigenshapes (the term eigenshapes for shape representation was introduced in [12], where the eigendecomposition alone was used for activity recognition). We are given K different behaviors, represented by training sequences $B = s_n^k$, $k = 1, \ldots, K$, of length n, where each frame in a sequence consists of an aligned binary silhouette image. The eigenshapes $M_i$, $i = 1, \ldots, m$, correspond to the m largest eigenvalues of the covariance matrix $BB^{\top}$. A lower dimensional representation of silhouette images can thus be achieved by projecting the data onto these first m eigenvectors. Given a mapping to a lower dimensional description of human silhouettes, we can express a sequence of silhouettes $S_i = [s_1, \ldots, s_n]$, $s_k \in \mathbb{R}^m$, in the reduced space $\mathbb{R}^m$. The idea of action primitives assumes that complex motions can be reconstructed using a set of basic action units. In order to derive a meaningful set of action primitives, we apply Ward's clustering method [14] to all silhouette vectors s occurring in any of the n training sequences $[S_1, \ldots, S_n]$. We used the Euclidean distance as a similarity measure.
Applying Ward's clustering minimizes the inner-cluster variance and results in a set of almost equally sized clusters. This is important, since a representation of activities using a discrete set of actions requires (a) a sufficient coverage of all possible actions, and (b) a maximum variability in cluster center assignments. This leads to a greater variability in sequences of action primitives and was found to increase behavior classification rates. Figure 1 shows a few resulting clusters using Ward's method. Other clustering approaches, for instance k-means, single-linkage clustering, or the Neural Gas algorithm (for a review of clustering approaches see [15]), showed a performance decrease (see footnote 1). Clustering leads to a set A of k cluster centres or, in our case, action primitives, where $A = [a_1, \ldots, a_k]$, $a_i \in \mathbb{R}^m$. Each silhouette vector can be expressed by its most similar action prototype. Thus, we can express a sequence of silhouettes $S_i = [s_1, \ldots, s_n]$ by the corresponding sequence of action primitives $A_{S_i} = [a_{\arg\min_i |a_i - s_1|}, \ldots, a_{\arg\min_j |a_j - s_n|}]$.

Footnote 1: For some initial k-means or Neural Gas cluster configurations the results were similar; however, due to the random initialization we were not able to reproduce these results. Generally, we experienced a decrease in classification rates ranging from 3% to 5%.

(b) The temporal context of specific action prototypes is crucial for activity recognition. Therefore, instead of analyzing videos based on individual action primitives, we propose a representation by means of n-grams of action primitives. n-grams provide a sub-sequencing of length n of a given sequence and are widely used for sequence analysis in language processing and bioinformatics. For example, a sequence of action primitives s = [a, a, b, b, c, c] could simply be expressed as a sequence of bi-grams s = [(aa), (ab), (bb), (bc), (cc)] or tri-grams s = [(aab), (abb), (bbc), (bcc)]. Figure 2 shows an example of how a sequence of frames can be represented using a sequence of n-grams of action primitives. It should be noted that with increasing n, the number k of possible instances of n-grams increases exponentially, since $k = j^n$, where j denotes the number of action primitives. Fortunately, behavior and motion follow certain constraints in the sequencing of action primitives. Thus, only a small fraction of all possible combinations of action primitives is observed and needs to be considered for action recognition.

Fig. 2. Sample sequence showing the original shapes (upper row), their reconstruction using action primitives (middle), and their bi-gram representation (lower row). Action primitive labels: a, b, b, c, d, d, e, f. Bi-grams: (a,b), (b,b), (b,c), (c,d), (d,d), (d,e), (e,f).

(c) In order to compare different sequences of action n-grams, we have to measure the similarity of two sequential n-gram streams. Fortunately, sequence similarity measures are a well studied topic; numerous applications require appropriate sequence analysis and similarity measures, e.g. the analysis of DNA sequences in bioinformatics, but also computer security. Interestingly, they have mostly been ignored so far for the recognition of actions.
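Steps (a) and (b) above, assigning each silhouette vector to its nearest action primitive and then forming n-grams, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `nearest_primitive` and `ngrams` and the two-dimensional toy vectors are hypothetical, and the PCA projection and Ward clustering that would produce the primitives are not shown.

```python
def nearest_primitive(s, primitives):
    """Index of the action primitive closest to silhouette vector s (Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(primitives)), key=lambda i: sq_dist(primitives[i], s))


def ngrams(sequence, n):
    """All contiguous sub-sequences of length n, in order of appearance."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Hypothetical primitives and silhouette vectors in a reduced 2-D space.
primitives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
silhouettes = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 1.2)]
print([nearest_primitive(s, primitives) for s in silhouettes])  # [0, 1, 1, 2]

# The example sequence from the text.
print(ngrams(list("aabbcc"), 2))
print(ngrams(list("aabbcc"), 3))
```

The last two calls reproduce the bi-gram and tri-gram sequences given in the text for s = [a, a, b, b, c, c].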
For a recent review of sequence similarity measures we recommend [16]; the following notation is partly based on that publication. For sequence comparison we use the Manhattan distance function applied to the histograms of action primitive sequences. Given a sequence $s$ of n-grams of action primitives, a histogram $\phi(s)$ over all action primitives $[a_1, \ldots, a_n]$ that were extracted during the training phase is defined as a mapping into $\mathbb{N}^+ \cup \{0\}$,
$$\phi_{a_i}(s) := \mathrm{occ}(a_i, s), \qquad (1)$$
Fig. 3. Four exemplary behaviors; from left to right one can see one frame taken from a walking, running, jumping, and waving behavior
where $i = 1, \ldots, n$, $n$ denotes the total number of action primitives, and $\mathrm{occ}(a_i, s)$ denotes the number of occurrences of $a_i$ in $s$. The Manhattan distance $d(\phi(s_1), \phi(s_2))$ between two histograms $\phi(s_1)$ and $\phi(s_2)$ is defined as
$$d(\phi(s_1), \phi(s_2)) = \sum_{i=1}^{n} \left| \phi_{a_i}(s_1) - \phi_{a_i}(s_2) \right| \qquad (2)$$
Since we compare sequences of different lengths, the histograms are length normalized, i.e. $\phi(s)_n = \phi(s)\,|s|^{-1}$, for values $\phi_{a_i}(s) > 0$. For every training sequence in $S_t = [s_{t_1}, \ldots, s_{t_n}]$, a separate histogram $\phi(s_i)$, $i = 1, \ldots, n$, is computed. For each training histogram, the associated class or behavior label is known. For a novel sequence $s_k$, the task now is to select a suitable class label based on histogram comparisons to already labeled histograms. For the experiments, we selected a class using the maximum aggregation rule, i.e. by finding the training histogram $\phi(s_j)$ closest to $\phi(s_k)$, with $j = \operatorname{argmin}_i d(\phi(s_i), \phi(s_k))$, and assigning its class label. Among others, the average aggregation rule $\operatorname{argmin}_l \, k_l^{-1} \sum_{i=1}^{k_l} d(\phi(s_{l,i}), \phi(s_k))$, where $s_{l,i}$ is the $i$-th of the $k_l$ training sequences of class $l$, performed worse, which indicates that it might not be a good idea to classify activity based on the average of all observed activities of one class of behavior.
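The histogram comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy vocabulary and the labels "walking" and "running" are assumptions made for the example, and the classifier simply returns the label of the closest training histogram.

```python
from collections import Counter


def ngram_histogram(grams, vocabulary):
    """Length-normalized occurrence histogram of n-grams over a fixed vocabulary."""
    counts = Counter(grams)
    total = len(grams)
    return [counts[g] / total if total else 0.0 for g in vocabulary]


def manhattan(h1, h2):
    """Manhattan (L1) distance between two histograms of equal length."""
    return sum(abs(x - y) for x, y in zip(h1, h2))


def classify(query_hist, train_hists, train_labels):
    """Return the label of the training histogram closest to the query."""
    distances = [manhattan(h, query_hist) for h in train_hists]
    best = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[best]


# Hypothetical toy data: a bi-gram vocabulary and two labeled training sequences.
vocabulary = [("a", "a"), ("a", "b"), ("b", "b")]
train_hists = [
    ngram_histogram([("a", "a"), ("a", "b"), ("b", "b"), ("b", "b")], vocabulary),
    ngram_histogram([("a", "a"), ("a", "a"), ("a", "a"), ("a", "b")], vocabulary),
]
train_labels = ["walking", "running"]

query = ngram_histogram([("a", "a"), ("a", "b"), ("b", "b")], vocabulary)
predicted = classify(query, train_hists, train_labels)
print(predicted)
```

Because the histograms are divided by the sequence length, the query of three bi-grams can be compared directly with training sequences of four bi-grams.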
4 Experimental Results
To verify the presented approach we carried out a series of experiments. The data already contained extracted silhouettes, therefore we could skip the otherwise obligatory person tracking. We resized each video frame to a 40x40 binary image. The data contained 10 different behaviors (walking, running, jumping, waving, . . . ) performed by 9 subjects; Figure 3 shows some exemplary frames (for further details see [13]). For the experiments, we applied temporal smoothing of classification results (see e.g. [17]), i.e. we classify based on a majority vote over the last 10 classified frames. Thereby, we could filter out occasional errors and make behavior recognition more robust. However, we also report results without temporal smoothing, which were slightly worse. To select optimal parameters, we varied the length n of the n-grams, the histogram sequence length, and the number of action primitives. Figure 4 illustrates the effects of these variations. In a leave-one-out cross validation test series, we tested each configuration on videos collected from one subject. Table 1 shows the average classification rates,
[Figure 4 plot: average classification rate (y-axis, 0.8 to 1) versus number of action primitives (x-axis, 10 to 90), one curve each for 1-grams, 2-grams, 3-grams, 4-grams and 5-grams]
Fig. 4. Average classification rates for varying n-gram length and number of action primitives. While the rates for all n-gram configurations are sufficient, tri-grams overall showed the most robust and generally best performance. n-grams with n > 3 suffered from overfitting when using too many action primitives; this became significant only when using more than 150 action primitives (sequence length was set to 25, action primitives were clustered using the first 30 eigenvectors). The best classification rate of 99.62% was achieved using tri-grams constructed from 29 action primitives.
Table 1. Average classification rates for leave-one-out cross validation of 10 different behaviors performed by 9 subjects (excluding all behaviors of one subject for each run). The sequence length corresponds to the number of frames used for calculating n-gram histograms. Here, we used tri-grams built from 50 instances of 30-dimensional action primitives.
Sequence length   Classification rate   Majority voting   Abs. correct   Abs. error
5                 86.17                 92.55             3704           298
10                93.80                 96.17             3416           136
15                96.53                 97.74             3032           70
20                97.80                 98.64             2620           36
25                98.58                 99.10             2193           20
and absolute numbers of classified frames for varying lengths of the sequence s. Setting the sequence length to contain the whole movie sequence usually resulted in a perfect classification of all test movies. Using a sequence length of 25 frames (corresponding to 1 second) we still achieved sufficient classification rates, usually above 95% under various configurations. The best configuration classified 99.62% of all frames correctly. Thereby, we reach classification rates similar to those of [13]. Note, however, that we cannot directly compare the approaches, since [13] classified sequences of 10 frames with an overlap of 5 frames. Moreover, [13] excluded one behavior of one subject for testing. Using the same number of frames as in [13], we reach slightly worse classification results. In a second experimental run we used the n-gram histograms of only one subject and classified behaviors performed by the remaining 8 subjects. A summary
Table 2. Average classification rates when using the behavioral data provided by one subject for training, and testing against the other 8 subjects (using 50 action primitives and a sequence length of 25 frames). Generally, the videos are classified surprisingly well, indicating that the proposed approach also works for one-shot learning.
Subject                    1      2      3      4      5      6      7      8      9
Avg. classification rate   87.70  88.59  83.54  86.81  85.43  87.85  80.44  75.41  57.33
Majority voting            89.57  88.17  84.18  90.43  85.75  87.64  82.61  75.48  57.39
of the results can be seen in Table 2. Interestingly, we reached reasonable classification rates as well. This leads to the conclusion that the approach is also usable for one-shot learning, or in settings where we cannot access a sufficient amount of training data.
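The temporal smoothing used in the experiments (a majority vote over the last 10 classified frames) can be sketched as below. This is an assumption-laden illustration: the function name `smooth_labels` and the example labels are hypothetical, and the way ties are resolved is not specified in the text.

```python
from collections import Counter, deque


def smooth_labels(frame_labels, window=10):
    """Replace each per-frame label by the majority label of the last `window` frames."""
    recent = deque(maxlen=window)  # sliding window over the most recent labels
    smoothed = []
    for label in frame_labels:
        recent.append(label)
        # most_common(1) returns the label with the highest count in the window
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


# A single misclassified frame inside a run of "walk" frames is filtered out.
raw = ["walk"] * 5 + ["run"] + ["walk"] * 4
smoothed = smooth_labels(raw, window=10)
print(smoothed)
```

A longer window suppresses more isolated errors but reacts more slowly when the behavior actually changes.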
5 Conclusion
In this paper, we proposed a novel approach to behavior recognition that represents actions as bags of n-grams of action primitives. Action representations are based on biological findings and thus motivate a biologically inspired approach. Further, we expressed activity as n-grams and thereby could rely on conventional methodology for sequence analysis often used in speech processing or text mining. In a series of experiments, we showed the applicability to behavior recognition, where we relied on binary silhouette images of human motion. Varied experimental settings also indicate usability for scenarios where only a few training samples are available. In the future, we plan to use more sophisticated features for describing prototypical actions instead of the silhouette images used so far (e.g. the recently introduced Histograms of Oriented Gradients for finding humans in images [18]). Moreover, similar to [9], we are currently investigating action primitives derived from different views in order to make the approach less sensitive to view changes.
Acknowledgments We would like to thank Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri for making their behavior dataset publicly available [13]. Christian Thurau is supported by a grant from the European Community under the EST Marie-Curie Project VISIONTRAIN MRTN-CT-2004-005439. The second author was supported by the project MSM6840770013.
References
1. Ghahramani, Z.: Building blocks of movement. Nature 407, 682–683 (2000)
2. Schack, T., Mechsner, F.: Representation of motor skills in human long-term memory. Neuroscience Letters 391, 77–81 (2006)
3. Wolpert, D.M., Ghahramani, Z., Flanagan, J.R.: Perspectives and problems in motor learning. TRENDS in Cognitive Sciences 5(11), 487–494 (2001)
4. Fod, A., Matarić, M., Jenkins, O.: Automated Derivation of Primitives for Movement Classification. Autonomous Robots 12(1), 39–54 (2002)
5. Thoroughman, K., Shadmehr, R.: Learning of action through adaptive combination of motor primitives. Nature 407, 742–747 (2000)
6. Thurau, C., Bauckhage, C., Sagerer, G.: Synthesizing Movements for Computer Game Characters. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) Pattern Recognition. LNCS, vol. 3175, pp. 179–186. Springer, Heidelberg (2004)
7. Moeslund, T., Reng, L., Granum, E.: Finding Motion Primitives in Human Body Gestures. In: Gibet, S., Courty, N., Kamp, J.-F. (eds.) GW 2005. LNCS (LNAI), vol. 3881, pp. 133–144. Springer, Heidelberg (2006)
8. Moeslund, T., Fihl, P., Holte, M.: Action Recognition using Motion Primitives. In: The 15th Danish conference on pattern recognition and image analysis (2006)
9. Ogale, A.S., Karapurkar, A., Aloimonos, Y.: View-invariant modeling and recognition of human actions using grammars. In: ICCV Workshop on Dynamical Vision (2005)
10. Guerra-Filho, G., Aloimonos, Y.: A Sensory-Motor Language for Human Activity Understanding. In: 6th IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS'06), pp. 69–75. IEEE Computer Society Press, Los Alamitos (2006)
11. Hamid, R., Johnson, A., Batta, S., Bobick, A., Isbell, C., Coleman, G.: Detection and Explanation of Anomalous Activities: Representing Activities as Bags of Event n-Grams. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2005 (CVPR'05), IEEE, NJ, New York (2005)
12. Goldenberg, R., Kimmel, R., Rivlin, E., Rudzsky, M.: Behavior classification by eigendecomposition of periodic motions. Pattern Recognition 38, 1033–1043 (2005)
13. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as Space-Time Shapes. In: Tenth IEEE International Conference on Computer Vision, IEEE Computer Society Press, Los Alamitos (2005)
14. Ward, J.H.J.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963)
15. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31(3), 264–323 (1999)
16. Rieck, K., Laskov, P., Sonnenburg, S.: Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees. In: Advances in Neural Information Processing Systems, vol. 19 (2007)
17. Thurau, C., Hettenhausen, T., Bauckhage, C.: Classification of Team Behaviors in Sports Video Games. In: Proc. Int. Conf. on Pattern Recognition, IEEE, pp. 1188–1191. IEEE Computer Society Press, Los Alamitos (2006)
18. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2005 (CVPR'05), IEEE Computer Society Press, Los Alamitos (2005)
Human Action Recognition in Table-Top Scenarios: An HMM-Based Analysis to Optimize the Performance
Pradeep Reddy Raamana, Daniel Grest, and Volker Krueger
Computer Vision and Machine Intelligence, Copenhagen Institute of Technology, Aalborg University, Denmark
{pradeep,dag,vok}@cvmi.aau.dk
Abstract. Hidden Markov models have been extensively and successfully used for the recognition of human actions. Though there exist well-established algorithms to optimize the transition and output probabilities, the type of features to use, and specifically the number of states and Gaussians, have to be chosen manually. Here we present a quantitative study on selecting the optimal feature set for the recognition of the simple object manipulation actions pointing, rotating and grasping in a table-top scenario. This study has resulted in a recognition rate higher than 90%. In addition, three parameters, namely the number of states and Gaussians of the HMM and the number of training iterations, are considered for optimization of the recognition rate with 5 different feature sets on our motion capture dataset from 10 persons.
Keywords: Hidden Markov model, Action Recognition, Optimization.
1 Introduction
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 101–108, 2007. © Springer-Verlag Berlin Heidelberg 2007
Extensive research has been carried out in the past to understand and recognize human actions. Typical applications of human activity recognition are surveillance, human-robot interaction, and imitation learning [1,11]. In surveillance, it is important to detect abnormal and suspicious actions. In the robotics community, a large body of research addresses imitating humans through demonstration, imitation and learning. Research on imitation learning shows that actions can be clearly segmented into atomic action units [7,13]. This gives a way to recognize human actions through the interpretation of their atomic units [8,9], such as pointing, grasping, and rotating in the case of table-top scenarios. Though the scenario is simple, the recognition task is difficult due to the similarity of the different actions. Because the scenario is simple, studying action recognition in it lets us gain an understanding of human motion and gives insights into how to tackle the problem of action recognition in complex scenarios. Considerable research in the computer vision community has focused on cyclic motions, such as walking or running [1], and Jenkins et al. [8] came up with
a system to extract behavior vocabularies for the problem of complex task learning. [4] presents a learning system for one- and two-hand motions that computes the hand motion trajectory optimizing the imitation performance given the constraints of the body of the robot.
Here we present a quantitative study to optimize the recognition performance for simple human actions, such as pointing, grasping, and rotating, performed in a table-top scenario. Our action dataset consists of 10 persons, in contrast to many other studies with one or two persons [1,3,10]. We use the Hidden Markov Model (HMM) to learn the motion model. It defines the joint probability distribution over the observations and the states of the model. An HMM implicitly performs time-warping to some extent for motion sequences that differ slightly in their execution speed. Hence it is well suited to our action dataset consisting of 10 persons with differing execution paces. Our study is based on about 890 sequences, each with an average of 115 poses, totaling 102350 poses.
Though there exist well-established algorithms, such as the Baum-Welch algorithm, to optimize the transition and output probabilities of an HMM, the structure of the HMM and its inputs have to be chosen manually. The feature set used to train the HMM is important, as an inappropriate feature set could lead to low performance and leave the system unable to capture the discriminative features among different actions. Apart from the feature set, the number of states for the HMM and the number of Gaussians for each state have to be given by hand. Moreover, the number of training iterations for the HMM also needs to be set beforehand. These parameters are usually determined empirically, but this task is often laborious and time consuming for big datasets. Hence this study saves effort and time for subsequent researchers working in similar scenarios.
Moreover, this scenario is gaining importance in imitation learning. Perhaps the work most similar to ours is that of Guenter et al. [6]. They have presented simple algorithms to optimize the number of states, training iterations and Gaussians for the HMM in the context of a handwritten word recognition task. Their work only finds the parameters with optimal performance, without studying different feature sets. Here we perform an extensive evaluation of different feature sets and present results on our action dataset, pointing out the feature set with the best classification rate. This evaluation is necessary to build a system with the best performance. En route to finding the optimal feature set, the optimal values for different parameters of the HMM will also be determined. Recent research [5] shows that one can do motion tracking even from a single view. So this study could be combined with motion capture to recognize the actions in real time. The paper is organized as follows: Section 2 briefly describes action recognition with Hidden Markov models and Section 3 describes the motion capture setup and the collection of training data. In Section 4, we describe the different feature sets used for the evaluation. Section 5 describes the quantitative evaluation of the different feature sets and we conclude in Section 6.
2 HMM Based Human Action Recognition
L. Rabiner [14] gives an excellent tutorial on how to use HMMs. Given an observation sequence $O = O_1 O_2 \ldots O_T$ and a model $\lambda = (A, B, \Pi)$, the probability of the observation sequence given the model, $P(O \mid \lambda)$, is calculated by enumerating all possible state sequences of length $T$, $Q = q_1 q_2 \ldots q_T$. The probability of the observation sequence given the model, irrespective of the state sequence, is then obtained by summing over all possible state sequences:
$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(O_t \mid q_t, \lambda), \qquad P(O \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda) \qquad (1)$$
We then follow the forward-backward algorithm [14] to compute the likelihood of the observation sequence given the HMM. We train one fully-connected HMM for each class of sequences. To recognize an observation sequence from a dataset, we calculate the likelihood of this sequence under all the trained HMMs. The HMM that gives the maximum likelihood represents the observation sequence. If this HMM and the observation sequence belong to the same action class, we say that the sequence was recognized correctly.
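The likelihood computation and the maximum-likelihood decision can be sketched as follows. This is a simplified illustration, not the setup used in the paper: it assumes discrete output symbols instead of Gaussian mixtures, uses only the (scaled) forward pass, and the two toy models with their parameter values are hypothetical.

```python
import numpy as np


def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(O | lambda) for a discrete-output HMM.

    obs: sequence of observation symbol indices
    pi:  initial state distribution, shape (N,)
    A:   transition matrix, A[i, j] = P(q_{t+1} = j | q_t = i), shape (N, N)
    B:   output probabilities, B[i, k] = P(O_t = k | q_t = i), shape (N, M)
    """
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # propagate one step and weight by the output probability
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale  # rescale to avoid numerical underflow
    return log_p


def classify(obs, models):
    """Return the name of the model with the highest likelihood for obs."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Two hypothetical 2-state models, one per action class.
A = np.array([[0.7, 0.3], [0.0, 1.0]])
pi = np.array([1.0, 0.0])
models = {
    "pointing": (pi, A, np.array([[0.9, 0.1], [0.2, 0.8]])),
    "grasping": (pi, A, np.array([[0.1, 0.9], [0.8, 0.2]])),
}

obs = [0, 0, 1, 1]
best = classify(obs, models)
print(best, log_likelihood(obs, *models[best]))
```

The forward recursion gives the same value as the explicit sum over all state sequences in Eq. (1), but with a cost that is linear rather than exponential in the sequence length T.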
3 Our Action Dataset
Our dataset consists of four actions: Pointing (P), Grasping (G), Rotating (R) and Displacing (D), performed by 10 persons. Each action is performed
1. in three different directions (to the right, to the front and to the left),
2. with two different heights, and
3. with two different distances.
Fig. 1. Locations of the electromagnetic sensors on the body of the person and the setup used to record both markers and video data
Fig. 2. Sequence of Images of a Person Performing the action Rotating
Each person repeated each action 5 times for all combinations of the above settings. With 10 persons, this gives rise to a good number of training examples with respect to variability in execution speed, direction, height, scale and arm length. This is essential in order to capture the key features of a particular action that are invariant to the aforementioned properties. Each person wears 7 electromagnetic sensors. They are located at the upper back of the torso, shoulder, elbow, wrist, hand, thumb and index finger, as depicted in Fig. 1. The measurements were acquired with the MotionStar system from Ascension [2]. For each sequence, the 3D coordinates of the 7 sensors were recorded at 25 fps, starting from a nearly vertical position (rest pose, Fig. 2). The sequence also ends in the rest pose. Figure 2 shows a person performing the action Rotating. The scenario was also video-recorded by 4 calibrated cameras simultaneously with the marker data. The video data could be used, for example, for motion tracking. The whole setup is shown in Fig. 1. In this work we consider only three actions, P, G and R, because the action class Displacing is a combination of pointing, grasping and moving and hence differs from the primitive actions P, G and R.
4 Feature Sets
Each feature set is a sequence of features for all poses. For all the sequences, the 3D position of the torso is considered to be the origin for each pose. It is subtracted from the positions of all the sensors, so that the data used for recognition is independent of the position of the person performing the action. In order to determine the best feature set, we have extracted the following feature sets appropriate for one-arm movements in a table-top scenario:
1. Joint Angles (JA): This consists of the angles at the shoulder and at the elbow between the lower arm and the upper arm. Knowing the problem of singularity, i.e. 0 and π are the same, we use the up and direction vectors in the shoulder coordinate frame, obtained from the kinematic chain built for the right arm for each pose. This is invariant to the arm length and can handle directions easily.
2. Direction vector from Torso to Wrist (T2W)
3. Direction vector from Shoulder to Wrist (S2W)
4. T2W + Grasp Distance (GD): Here we define the Grasp Distance to be the distance between the sensors on the thumb and the index finger. Intuitively, GD is a discriminative feature between Grasping and the other actions. This is confirmed by the experiments.
5. S2W + GD: Direction vector from Shoulder to Wrist and the grasp distance.
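As an illustration of feature set 5, the following sketch computes an S2W + GD feature vector for a single pose from 3D sensor positions. It is only a sketch under stated assumptions: the function name and argument names are hypothetical, and normalizing the direction vector to unit length is an assumption that the text does not spell out.

```python
import numpy as np


def s2w_gd_features(torso, shoulder, wrist, thumb, index_finger):
    """Return a 4-vector: shoulder-to-wrist direction (3 values) and grasp distance."""
    torso = np.asarray(torso, dtype=float)
    # Express all sensor positions relative to the torso, as described in the text.
    shoulder, wrist, thumb, index_finger = (
        np.asarray(p, dtype=float) - torso
        for p in (shoulder, wrist, thumb, index_finger)
    )
    direction = wrist - shoulder
    norm = np.linalg.norm(direction)
    if norm > 0:
        direction = direction / norm  # assumption: unit-length direction vector
    grasp_distance = np.linalg.norm(thumb - index_finger)
    return np.concatenate([direction, [grasp_distance]])


# Hypothetical sensor positions for one pose.
features = s2w_gd_features(
    torso=(1.0, 1.0, 1.0),
    shoulder=(1.0, 1.0, 2.0),
    wrist=(1.0, 4.0, 6.0),
    thumb=(2.0, 1.0, 1.0),
    index_finger=(2.0, 1.0, 1.5),
)
print(features)
```

A sequence of such vectors, one per pose, would then form the observation sequence given to the HMM.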
5 Quantitative Analysis
In order to learn and recognize different actions, we trained one HMM for each class of sequences. In this work, both training and testing are independent of the individual persons. Each time, three quarters of the sequences from each class are used for training and the fourth quarter is used to test the performance of the trained HMMs. The recognition is done using a maximum likelihood approach as follows. The likelihood of a test sequence under each of the trained HMMs is calculated, and the HMM with the maximum likelihood represents the test sequence. We use the Hidden Markov Model Toolbox for Matlab, developed by Kevin Murphy [12], for training and testing. Having chosen the different feature sets, it is important to set a suitable architecture of the HMM for each of these feature sets to maximise the classification rate. This means that the optimal values for the different parameters of the HMM have to be determined. From preliminary trials on the dataset, we learned that the following parameter ranges produce recognition rates (RR) within an acceptable range:
– Number of States (Q): from 5 to 40 in steps of 5
– Number of Gaussians (G) for each state: 10, 20 and 30
– Number of Training Iterations (I): 10, 15 and 20
With these ranges for the different parameters, we use a simple pick-the-best algorithm to find the optimal parameter values for each feature set. For each feature set, we enumerated the values of each parameter in the above ranges and noted the performance. The set of parameters corresponding to the maximum recognition rate is noted for each feature set. Though the algorithm is straightforward, it is important to make an extensive evaluation over the above ranges to make sure that we do not miss the best combination. Each run of the algorithm with one set of parameters takes about 25 minutes on average on a 2 GHz computer. The best recognition rates and the optimal values for the HMM parameters are given in Table 1.
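The pick-the-best search over the parameter ranges listed above can be sketched as an exhaustive enumeration. The sketch assumes a callable `evaluate(q, g, i)` that trains and tests the HMMs for one configuration and returns the recognition rate; that callable and the toy scoring function used here are hypothetical stand-ins, not part of the paper.

```python
from itertools import product

STATES = range(5, 45, 5)      # Q: 5, 10, ..., 40
GAUSSIANS = (10, 20, 30)      # G per state
ITERATIONS = (10, 15, 20)     # I training iterations


def pick_the_best(evaluate):
    """Try every (Q, G, I) combination and return (best_rate, best_params)."""
    best_rate, best_params = float("-inf"), None
    for params in product(STATES, GAUSSIANS, ITERATIONS):
        rate = evaluate(*params)
        if rate > best_rate:
            best_rate, best_params = rate, params
    return best_rate, best_params


# Size of the search space and the resulting run time at about 25 minutes per run.
n_configs = len(STATES) * len(GAUSSIANS) * len(ITERATIONS)
hours_per_feature_set = n_configs * 25 / 60


def toy_score(q, g, i):
    """Toy stand-in for a real train-and-test run; it peaks at (35, 10, 15)."""
    return -abs(q - 35) - abs(g - 10) - abs(i - 15)


best_rate, best_params = pick_the_best(toy_score)
print(n_configs, hours_per_feature_set, best_params)
```

With 8 values for Q and 3 each for G and I, the grid contains 72 configurations, i.e. roughly 30 hours of computation per feature set at the stated 25 minutes per run.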
From Table 1, it is clear that GD improved the performance of the system. S2W+GD has the best performance. One reason is that the motion of the shoulder joint for the same action differs from person to person. So it helped the HMM to learn
[Figure 3 plot: "Variation of Recognition Rate, with 10 Gaussians and 20 Iterations"; recognition rate (y-axis, 65 to 95) versus number of states (x-axis, 5 to 40), one curve each for JA, T2W, S2W, T2W+GD and S2W+GD]
Fig. 3. Variation of RR over number of states, with 10 Gaussians and 20 Iterations
Table 1. Best Recognition Rates for Different Feature Sets
Feature Set   Best Recognition Rate   Optimal Q, G, I
JA            80.09 %                 40, 20, 20
T2W           88.68 %                 25, 10, 20
S2W           87.78 %                 30, 10, 15
T2W+GD        90.49 %                 35, 10, 20
S2W+GD        94.12 %                 35, 10, 15
Table 2. Statistics of the Recognition Rates for All Feature Sets
Feature Set   Mean RR   Max. RR   Min. RR   Std. Dev.
JA            71.23 %   80.09 %   57.92 %   5.02
T2W           79.27 %   88.68 %   56.56 %   6.37
S2W           76.99 %   87.78 %   33.48 %   9.37
T2W+GD        82.13 %   90.49 %   58.37 %   6.13
S2W+GD        84.16 %   94.12 %   45.70 %   9.54
the key features common to the action, to discard the variant features, and to obtain better discrimination among different actions. This study also shows that the inclusion of elbow motion (in feature set JA) degraded the classification rate rather than improving it. This was also observed by Vicente et al. [15]. From Table 1, we can also see that the combinations of 10 Gaussians with 15 iterations and of 10 Gaussians with 20 iterations have the best performance. Figures 3 and 4 show the variation of the recognition rate over the number of states for these combinations. We can see that the performance increases gradually with the number of states for many of the feature
[Figure 4 plot: "Variation of Recognition Rate, with 10 Gaussians and 15 Iterations"; recognition rate (y-axis, 60 to 100) versus number of states (x-axis, 5 to 40), one curve each for JA, T2W, S2W, T2W+GD and S2W+GD]
Fig. 4. Variation of RR over number of states, with 10 Gaussian and 15 Iterations
sets. The non-monotonicity of the variation is due to the random initialization of the HMMs during training. Apart from the best performance, the average performance of a particular feature set is important. This is because a feature set with a higher average performance and a low sensitivity to the HMM parameters can be expected to perform equally well on other datasets in a similar scenario. The mean, maximum, minimum and standard deviation of the recognition rate over all the trials for the five feature sets are given in Table 2. The standard deviation of the feature set S2W+GD is higher than that of the feature set T2W+GD. But S2W+GD clearly dominates T2W+GD in the mean and best recognition rates. So we can say that the feature set S2W+GD is optimal and is suitable for movements in table-top scenarios. The optimal set of parameters is determined to be 35 states for the HMM, 10 Gaussians for each state, and 15 iterations for the training.
6 Conclusions
A quantitative study on optimizing the performance of a human action recognition system in table-top scenarios is presented. A set of simple object manipulation actions is used to find the best feature set and the optimal values for the HMM parameters. This set of actions could be classified most accurately with the direction vector from shoulder to wrist and the grasp distance. The high average performance of this feature set shows that it is less sensitive to the HMM parameters. So it is expected to perform equally well on different datasets in a similar scenario. We have also observed that the inclusion of elbow motion degraded the classification rate rather than improving it. We are building an online task recognition system that integrates this approach with a motion capture system.
Acknowledgment. This work has been supported by PACO-PLUS, FP6-2004-IST-4-27657.
References
1. Aggarwal, J.K., Cai, Q.: Human motion analysis: A review. Computer Vision and Image Understanding: CVIU 73(3), 428–440 (1999)
2. Ascension. Motion Star Real-Time Motion Capture. http://www.ascension-tech.com/products/motionstar 10 04.pdf
3. Billard, A., Calinon, S., Guenter, F.: Discriminative and adaptative imitation in uni-manual and bi-manual tasks. Robotics and Autonomous Systems 54, 370–384 (2006)
4. Calinon, S., Billard, A., Guenter, F.: Discriminative and adaptative imitation in uni-manual and bi-manual tasks. Robotics and Autonomous Systems 54 (2005)
5. Grest, D., Koch, R., Krueger, V.: Single view motion tracking by depth and silhouette information. In: Scandinavian Conference on Image Analysis (2007)
6. Guenter, S., Bunke, H.: Optimizing the number of states and training iterations and gaussians in an hmm-based handwritten word recognizer. In: Seventh International Conference on Document Analysis and Recognition, vol. 1, pp. 472–476 (August 2003)
7. Mataric, M.J.: Sensory-motor primitives as a basis for imitation: linking perception to action and biology to robotics. In: Dautenhahn, K., Nehaniv, C.L. (eds.) Imitation in Animals and Artifacts, pp. 391–422. MIT Press, Cambridge (2002)
8. Jenkins, O.C., Mataric, M.J.: Performance-derived behavior vocabularies: Data-driven acquisition of skills from motion. International Journal of Humanoid Robotics 1(2), 237–288 (2004)
9. Krueger, V., Grest, D.: Using hidden markov models for recognizing action primitives in complex actions. In: Scandinavian Conference on Image Analysis (2007)
10. Campbell, L., Bobick, A.: Recognition of human body motion using phase space constraints. In: International Conference in Computer Vision, pp. 624–630 (1995)
11. Moeslund, T., Hilton, A., Krueger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104, 90–126 (2006)
12. Murphy, K.: Hidden Markov Model (HMM) Toolbox for Matlab (1998) http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
13. Newtson, D., Engquist, G., Bois, J.: The objective basis of behavior unit. Journal of Personality and Social Psychology, 847–862 (1977)
14. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
15. Vicente, I.S., Kragic, D.: Learning and recognition of object manipulation actions using linear and nonlinear dimensionality reduction. In: 15th IEEE Int. Symp. on Robot and Human Interactive Communication (RO-MAN) (Submitted 2007)
Grouping of Articulated Objects with Common Axis
Levente Hajder
Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17., H-1111 Budapest, Hungary
[email protected]
Abstract. We address the problem of nonrigid Structure from Motion (SfM). Several methods have been published recently which try to solve the task of tracking, segmenting, or reconstructing nonrigid 3D objects in motion. Most of these papers focus on deformable objects. We deal with the segmentation of articulated objects, that is, nonrigid objects composed of several moving rigid objects. We consider two moving objects and assume that the rigid SfM problem has been solved for each of them separately. We propose a method which helps to decide whether an object is rotating around an axis defined by another moving object. The theory of the proposed method is discussed in detail. Experimental results for synthetic and real data are presented.
1 Introduction
3D motion based nonrigid object reconstruction is a popular topic in three-dimensional computer vision. It has many important applications such as registration of medical images, robotic vision and face reconstruction. Most of the published rigid SfM methods are based on various extensions of the well-known factorization method by Tomasi and Kanade [1]. The original method assumes orthographic projection, while its extensions can cope with weak perspective [2], para-perspective [3] and real perspective [4] cases as well. 3D motion segmentation algorithms for rigid objects can also be based on factorization methods [5, 6], but efficient fundamental matrix [7] or trifocal tensor [8] based segmentation procedures also exist. Most segmentation algorithms, such as [8–10], cannot cope with nonrigid (e.g., articulated) objects. Recently, researchers have begun to deal with the reconstruction of nonrigid moving 3D objects [11–14]. These studies assume that the structure of the nonrigid object can be written as a weighted sum of K so-called key objects:
$$S = \sum_{i=1}^{K} l_i S_i \qquad (1)$$
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 109–116, 2007. © Springer-Verlag Berlin Heidelberg 2007
Unfortunately, this assumption is unfavorable for the following reasons:
1. The motion of many real nonrigid objects, such as articulated objects, cannot be written in the above form. For instance, the motion of a human arm, or that of a windmill blade, cannot be expressed by eq. (1).
2. Every key object has a different weight in every frame. The calculation of the weights is computationally expensive. The large number of parameters to be optimized reduces the quality of the result.
Our approach differs from the factorization techniques [11–14]. We focus on the reconstruction of an articulated object by grouping the rigid parts. We assume that the 3D motion segmentation problem for the parts has been solved by some of the existing techniques, and that the motion and the structural data of the parts are available. Our goal is to group the parts of a potential articulated object. There are different types of articulated objects. In this paper, two rigid parts are considered. (This assumption is not prohibitive: If more than two parts have been found, a brute-force solution is to apply the proposed methods to each pair of parts.) The following problem is addressed in our study: How to determine if a part is rotating around an axis defined by the other part?
2
SfM Under Weak Perspective
Given P feature points of a rigid object tracked across F frames, $x_{fp} = (u_{fp}, v_{fp})^T$, $f = 1, \dots, F$, $p = 1, \dots, P$, the goal of SfM is to recover the structure of the object. If the origin of the 2D coordinate system is placed at the centroid, that is, the centroid is subtracted from the trajectories in every image, the 2D coordinates are calculated as
$$x_{fp} = q_f R_f s_p, \tag{2}$$
where $R_f = [r_{f1}, r_{f2}]^T$ is the orthonormal rotation matrix, $s_p$ the 3D coordinates of the point, and $q_f$ the nonzero scale factor of weak perspective. For all points in the f-th image, the above equations can be rewritten as
$$\underbrace{W_f}_{2 \times P} = (x_{f1} \dots x_{fP}) = \underbrace{M_f}_{2 \times 3} \cdot \underbrace{S}_{3 \times P} \tag{3}$$
where $M_f$ is called the motion matrix and $S = (s_1, \dots, s_P)$ the structure matrix. Under orthography $M_f = R_f$, under weak perspective $M_f = q_f R_f$. For all frames, the equations (3) form
$$\underbrace{W}_{2F \times P} = \underbrace{M}_{2F \times 3} \cdot \underbrace{S}_{3 \times P}, \tag{4}$$
where $W^T = [W_1^T, W_2^T, \dots, W_F^T]$ and $M^T = [M_1^T, M_2^T, \dots, M_F^T]$. The task is to factorize the measurement matrix W and obtain the structural information S. This can be done in two steps. In the first step, the rank of W is reduced to three by the singular value decomposition (SVD), since the rank of W is at most three: $W_{2F \times P} = \hat{M}_{2F \times 3} \cdot \hat{S}_{3 \times P}$. This factorization is determined only up to an affine transformation because an arbitrary 3×3 non-singular matrix Q can be inserted so that $W = \hat{M} Q Q^{-1} \hat{S}$. Therefore $\hat{M}$ contains the base vectors of the frames deformed by an affine transformation. In the second step, the matrix Q is determined optimally by least squares optimization, both for the orthographic [1] and the weak-perspective [2] cases, by imposing the corresponding constraints on the frame base vectors. The estimated motion can be written as $M = \hat{M} Q$, the estimated structure as $S = Q^{-1} \hat{S}$.
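The first step of the factorization can be illustrated with a minimal NumPy sketch. It is not the implementation used in the paper: the function name `rank3_factorization`, the synthetic data and the way the singular values are split between the two factors are assumptions made for the illustration, and the second step (estimating Q from the metric constraints) is left out.

```python
import numpy as np

def rank3_factorization(W):
    """Factor a centred 2F x P measurement matrix as W ~ M_hat @ S_hat (rank 3)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:3])              # split the singular values evenly (a convention)
    M_hat = U[:, :3] * root            # 2F x 3 affine motion
    S_hat = root[:, None] * Vt[:3]     # 3 x P affine structure
    return M_hat, S_hat

# Hypothetical noise-free weak-perspective data, for illustration only.
rng = np.random.default_rng(0)
F, P = 6, 20
S = rng.normal(size=(3, P))
S -= S.mean(axis=1, keepdims=True)     # centroid at the origin

rows = []
for _ in range(F):
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # orthonormal frame
    q = rng.uniform(0.5, 2.0)                     # weak-perspective scale q_f
    rows.append(q * R[:2])                        # M_f = q_f R_f (2 x 3)
M = np.vstack(rows)
W = M @ S

M_hat, S_hat = rank3_factorization(W)

# Any non-singular Q yields another valid factorization of the same W.
Q = rng.normal(size=(3, 3))
W_again = (M_hat @ Q) @ (np.linalg.inv(Q) @ S_hat)
```

The last two lines reproduce the affine ambiguity: the product is unchanged when Q and its inverse are inserted between the factors.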
3
The Proposed Grouping Method
In this section, we consider two moving rigid objects (parts) and assume that they have been segmented by any of the existing 3D motion segmentation methods [7, 9, 10]. The motion and the structure data of the objects are thus provided by factorization for each frame i, 1 ≤ i ≤ F:
$$W_1^i = M_1^i S_1 \tag{5}$$
$$W_2^i = M_2^i S_2. \tag{6}$$
The goal of the two proposed methods is to group the segmented parts into an articulated object.
3.1
Relative Motion of Two Objects
To examine the motion of an object with respect to another object, one has to determine the relative motion of the two objects. Recall that the factorization is assumed to have been computed: $W_1^i = M_1^i S_1$ and $W_2^i = M_2^i S_2$. The relative motion of the objects can be written as either $M_{12}^i = (M_1^i)^{-1} M_2^i$ or $M_{21}^i = (M_2^i)^{-1} M_1^i$. It should be noted that $M_1^i$ and $M_2^i$ are 2×3 matrices and hence non-invertible. Each of them can be inverted if it is completed by the third base vector. Its direction is that of the cross-product of the first and the second base vectors, while its length is either unit (orthography) or the average length of the other two base vectors (weak perspective). As shown in [10], the factorization is ambiguous. If $W_1^i = M_1^i S_1$ is a valid SfM factorization, then $W_1^i = (M_1^i A_1)(A_1^T S_1)$ is also a valid factorization if and only if $A_1^T A_1 = I$. Likewise, $W_2^i = (M_2^i A_2)(A_2^T S_2)$ is a valid factorization if $A_2^T A_2 = I$. (Here $A_1$ and $A_2$ are transformation matrices common to all frames.) The ambiguity modifies the relative motions:
$$M_{12}^i \mapsto A_1^T (M_1^i)^{-1} M_2^i A_2 = A_1^T M_{12}^i A_2 \tag{7}$$
$$M_{21}^i \mapsto A_2^T (M_2^i)^{-1} M_1^i A_1 = A_2^T M_{21}^i A_1 \tag{8}$$
Since the matrices A1 and A2 are orthogonal, the following conclusion is drawn: Due to the factorization ambiguity, the obtained relative motion is the true relative motion transformed by an unknown Euclidean transformation.
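The completion of the motion matrices and the effect of the ambiguity can be checked numerically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the relative motion is taken as $M_{12} = M_1^{-1} M_2$ on the completed 3×3 matrices, and $A_1$, $A_2$ are drawn as proper rotations (determinant +1), since a reflection would flip the sign of the cross-product used for the third base vector.

```python
import numpy as np

def complete_motion(M, weak_perspective=True):
    """Append the third base vector to a 2 x 3 motion matrix, giving a 3 x 3 matrix."""
    r1, r2 = np.asarray(M, dtype=float)
    r3 = np.cross(r1, r2)
    r3 = r3 / np.linalg.norm(r3)                   # unit length (orthography)
    if weak_perspective:                           # average length of the other two
        r3 = r3 * 0.5 * (np.linalg.norm(r1) + np.linalg.norm(r2))
    return np.vstack([r1, r2, r3])

def relative_motion(M1, M2, weak_perspective=True):
    """Relative motion M12 = M1^{-1} M2 for one frame, using the completed matrices."""
    C1 = complete_motion(M1, weak_perspective)
    C2 = complete_motion(M2, weak_perspective)
    return np.linalg.solve(C1, C2)                 # solves C1 X = C2

def random_rotation(rng):
    """Random proper rotation (determinant +1), for the demonstration only."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

rng = np.random.default_rng(1)
M1 = 1.3 * random_rotation(rng)[:2]                # weak-perspective motion, q = 1.3
M2 = 0.8 * random_rotation(rng)[:2]                # weak-perspective motion, q = 0.8
A1, A2 = random_rotation(rng), random_rotation(rng)

M12 = relative_motion(M1, M2)
M12_ambiguous = relative_motion(M1 @ A1, M2 @ A2)  # equals A1^T M12 A2
```

In this sketch the relative motion is a rotation scaled by the ratio of the two weak-perspective scale factors, and the ambiguous version differs from it only by the two orthogonal matrices.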
3.2
Rotation Around an Axis Defined by Another Object
Without loss of generality, let us assume that the coordinate system of the second object is such that the hypothetical axis is parallel to the vector $[1, 0, 0]^T$. Select a base vector of the first object; let its coordinates be $[x_1, y_1, z_1]^T$. If the first object is rotating around the axis and the rotation angle in the i-th frame is $\alpha_i$, then the coordinates of the base vector in the i-th frame are $[x_1,\ y_1 \cos(\alpha_i) + z_1 \sin(\alpha_i),\ z_1 \cos(\alpha_i) - y_1 \sin(\alpha_i)]^T$. Similarly, for a second base vector $[x_2, y_2, z_2]^T$, these coordinates are $[x_2,\ y_2 \cos(\alpha_i) + z_2 \sin(\alpha_i),\ z_2 \cos(\alpha_i) - y_2 \sin(\alpha_i)]^T$. Considering the base vectors as points in 3D space, we observe that the points of each base vector form a circle, the two circles lie in parallel planes, and their axes are common. For simplicity, we will speak of 'coaxial circles'. The calculated relative motion is a Euclidean transformation of the true relative motion. Two coaxial circles transformed by a Euclidean transformation are also coaxial circles. Therefore, the problem of detecting the rotation of an object w.r.t. another object is reduced to fitting coaxial circles to the calculated motion data. A large number of circle and ellipse fitting algorithms are available in both 2D and 3D, such as [15–17]. These methods are optimized to fit a single circle to data points. To the best of our knowledge, the problem of simultaneously fitting coaxial circles has not been addressed. In this section, we give a simple but efficient algorithm to solve this task. An error metric is also defined here. Finally, a threshold must be set to determine whether the object is rotating around an axis defined by the other object.
3.3
Fitting Coaxial Circles to 3D Points
Given two 3D data sets with N and M points, the goal is to fit coaxial circles to the two point sets. The solution is divided into two parts: two parallel planes are fitted to the points first, then the circles in those planes are estimated.
Fitting Parallel Planes to Points. The i-th 3D point of the first set is represented by $(X_i, z_i)$ and the j-th point of the second set by $(U_j, w_j)$, where $X_i = [x_i\ y_i]^T$ and $U_j = [u_j\ v_j]^T$. The equations of the parallel planes can be written as
$$z = a^T X + b_1 \tag{9}$$
$$w = a^T U + b_2 \tag{10}$$
where a is a vector with two elements: $a = [a_1\ a_2]^T$. The plane fitting is based on the error function
$$J = \sum_{i=1}^{N} (z_i - a^T X_i - b_1)^2 + \sum_{j=1}^{M} (w_j - a^T U_j - b_2)^2 \tag{11}$$
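Setting its derivatives to zero gives the optimal parameters of the parallel planes:
$$\frac{1}{2}\frac{\partial J}{\partial a} = \sum_{i=1}^{N} \left[ (a^T X_i) X_i - z_i X_i + b_1 X_i \right] + \sum_{j=1}^{M} \left[ (a^T U_j) U_j - w_j U_j + b_2 U_j \right] = 0 \tag{12}$$
$$\frac{1}{2}\frac{\partial J}{\partial b_1} = \sum_{i=1}^{N} \left( b_1 - z_i + a^T X_i \right) = 0 \tag{13}$$
$$\frac{1}{2}\frac{\partial J}{\partial b_2} = \sum_{j=1}^{M} \left( b_2 - w_j + a^T U_j \right) = 0 \tag{14}$$
$b_1$ and $b_2$ can be expressed as
$$b_1 = \frac{1}{N} \sum_{i=1}^{N} \left( z_i - a^T X_i \right) \tag{15}$$
$$b_2 = \frac{1}{M} \sum_{j=1}^{M} \left( w_j - a^T U_j \right) \tag{16}$$
The optimal value of a can be calculated by substituting (15) and (16) into (12) and solving the resulting linear equation:
$$\left[ \sum_{i=1}^{N} X_i X_i^T + \sum_{j=1}^{M} U_j U_j^T - \frac{1}{N} \Big( \sum_{i=1}^{N} X_i \Big) \Big( \sum_{i=1}^{N} X_i \Big)^T - \frac{1}{M} \Big( \sum_{j=1}^{M} U_j \Big) \Big( \sum_{j=1}^{M} U_j \Big)^T \right] a = \sum_{i=1}^{N} z_i X_i + \sum_{j=1}^{M} w_j U_j - \frac{1}{N} \Big( \sum_{i=1}^{N} z_i \Big) \Big( \sum_{i=1}^{N} X_i \Big) - \frac{1}{M} \Big( \sum_{j=1}^{M} w_j \Big) \Big( \sum_{j=1}^{M} U_j \Big)$$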
With a known, the offset parameters $b_1$ and $b_2$ follow from eqs. (15) and (16).
Determination of the Center and of the Radius of the Circles. The center of each circle is estimated as the point of the corresponding plane that is closest to the origin. Finally, the radii of the circles are estimated by averaging the distances between the points and the corresponding circle center.
Definition of the Error Metric. The definition of the fitting error is simple: let the error of a point be the distance between the original point and the closest point on the corresponding circle. The fitting error is the average of the errors of all points.
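The three steps of this section (parallel planes, centers and radii, error metric) can be sketched together in NumPy. The function names and the synthetic test data are assumptions made for the illustration, not the author's code. Note that the plane parameterization z = aᵀX + b cannot represent planes parallel to the z axis, so the sketch assumes a rotation axis that does not lie in the xy-plane.

```python
import numpy as np

def fit_parallel_planes(P1, P2):
    """Least-squares fit of z = a.X + b1 and w = a.U + b2 with a shared vector a."""
    X, z = P1[:, :2], P1[:, 2]
    U, w = P2[:, :2], P2[:, 2]
    N, M = len(P1), len(P2)
    sX, sU = X.sum(axis=0), U.sum(axis=0)
    A = X.T @ X + U.T @ U - np.outer(sX, sX) / N - np.outer(sU, sU) / M
    rhs = X.T @ z + U.T @ w - z.sum() * sX / N - w.sum() * sU / M
    a = np.linalg.solve(A, rhs)
    return a, (z - X @ a).mean(), (w - U @ a).mean()    # a, b1 (15), b2 (16)

def fit_coaxial_circles(P1, P2):
    """Return the unit plane normal and a (center, radius) pair for each point set."""
    a, b1, b2 = fit_parallel_planes(P1, P2)
    n = np.array([a[0], a[1], -1.0])                    # common plane normal
    nn = n @ n
    circles = []
    for P, b in ((P1, b1), (P2, b2)):
        c = -b * n / nn                                 # plane point closest to the origin
        r = np.linalg.norm(P - c, axis=1).mean()        # mean distance to the center
        circles.append((c, r))
    return n / np.sqrt(nn), circles

def fitting_error(P1, P2, normal, circles):
    """Average distance between each point and the closest point on its circle."""
    dists = []
    for P, (c, r) in zip((P1, P2), circles):
        d = P - c
        h = d @ normal                                  # out-of-plane offset
        rho = np.linalg.norm(d - np.outer(h, normal), axis=1)  # in-plane distance to axis
        dists.append(np.hypot(h, rho - r))
    return np.concatenate(dists).mean()

def rotate_about_axis(v, axis, angles):
    """Rodrigues' formula: rotate v about an axis through the origin by each angle."""
    k = axis / np.linalg.norm(axis)
    c, s = np.cos(angles)[:, None], np.sin(angles)[:, None]
    return v * c + np.cross(k, v) * s + k * (k @ v) * (1 - c)

# Hypothetical noise-free data: two base vectors rotating about one axis.
axis = np.array([0.3, -0.2, 1.0])
v1, v2 = np.array([1.0, 0.2, 0.5]), np.array([-0.4, 0.9, -0.3])
angles = np.linspace(0.0, 1.5, 12)
P1 = rotate_about_axis(v1, axis, angles)
P2 = rotate_about_axis(v2, axis, angles)

normal, circles = fit_coaxial_circles(P1, P2)
err = fitting_error(P1, P2, normal, circles)
```

For this noise-free input the recovered normal is parallel to the rotation axis and the fitting error vanishes up to rounding. With noisy motion data the error grows, and it would be compared against a threshold.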
4
Experiments on Synthetic Data
For every test, two objects are generated as two point clouds by a Gaussian random number generator with zero mean and standard deviation $\sigma_{obj}$. The objects undergo the same translational motion, while their rotations are connected by an axis. The free angles of the motion are randomized by the same random number generator. Then the 3D points of the objects are projected onto the image plane. Finally, 2D noise is added to every 2D point. The 2D noise is generated by a zero-mean Gaussian random number generator with standard deviation $\sigma_{noi}$. Three tests were done:
– Fitting error versus the noise: The test result presented in the left plot of Figure 1 shows that the error grows approximately linearly with the level of noise. The method seems efficient up to a noise level of 7–8%. According to the test, an error value of 0.02 seems to be a good threshold for the method.
– Fitting error versus the number of frames: The error decreases as the number of frames increases, as demonstrated in the central plot of Figure 1, because more frames supply more 3D points to the circle fitting.
– Fitting error versus the number of points: Adding more points to the 3D objects improves the quality of the result, because the quality of the motion data produced by the factorization improves with more points. Better motion data yield more precise 3D points for the circle fitting algorithm. The test is shown in the right plot of Figure 1.
Fig. 1. Left: Fitting error versus the noise level. Center: versus the number of frames. Right: versus the number of points.
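The synthetic test data described in this section can be generated along the following lines. This is only a sketch under several assumptions that the text does not fix: the hinge axis is taken as the x axis of the part coordinate system, the free hinge angle is drawn from a unit Gaussian, the weak-perspective scale is drawn uniformly from [0.8, 1.2], and the common translation is omitted because it is removed when the centroid is subtracted. All names are hypothetical.

```python
import numpy as np

def rot_x(angle):
    """Rotation about the x axis, used here as the common (hinge) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def random_rotation(rng):
    """Random proper rotation obtained from a QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

def synthetic_sequence(F=20, P=30, sigma_obj=1.0, sigma_noise=0.0, seed=0):
    """Return centred 2F x P measurement matrices of two parts joined by an axis."""
    rng = np.random.default_rng(seed)
    S1 = rng.normal(0.0, sigma_obj, size=(3, P))
    S2 = rng.normal(0.0, sigma_obj, size=(3, P))
    S1 -= S1.mean(axis=1, keepdims=True)       # centroid subtraction
    S2 -= S2.mean(axis=1, keepdims=True)
    W1, W2 = [], []
    for _ in range(F):
        R = random_rotation(rng)               # motion of the first part
        hinge = rot_x(rng.normal())            # free angle about the common axis
        q = rng.uniform(0.8, 1.2)              # weak-perspective scale (assumed range)
        W1.append(q * R[:2] @ S1)
        W2.append(q * (R @ hinge)[:2] @ S2)
    W1, W2 = np.vstack(W1), np.vstack(W2)
    W1 = W1 + rng.normal(0.0, sigma_noise, W1.shape)   # 2D measurement noise
    W2 = W2 + rng.normal(0.0, sigma_noise, W2.shape)
    return W1, W2

W1, W2 = synthetic_sequence(F=20, P=30)
W1_noisy, W2_noisy = synthetic_sequence(F=20, P=30, sigma_noise=0.01)
```

Without noise each measurement matrix has rank three, as the factorization requires. The noise level, the number of frames and the number of points are the three quantities varied in the tests.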
5
Experiments on Real Data
Our methods were also tested on a real video sequence consisting of eleven frames. Three frames of the image sequence are shown in Figure 2. There are three moving objects in the video: a CD box, a juice box and a painted plastic bear. The bear is connected to the CD box by an axis. Hundreds of feature points have been tracked in the images and these points have been segmented by our region-based 3D segmentation method [10]. Then the motion and structure matrices were calculated by the factorization method of Tomasi and Kanade [1]. We have examined whether the relative motion of the bear and the CD box is a rotation around an axis connected to the CD box. The fitted coaxial circles and the relative motion data are visualized in Figure 3. The fitting error is small (0.109), so the conclusion is that the relative motion between the bear and the CD box is a rotation.
Fig. 2. Real video sequence with a bear, a CD box and a juice box. Left: first frame. Middle: 6th frame. Right: 11th (last) frame.
Fig. 3. Coaxial circles fitted to the relative motion between the bear and the CD box. The points representing the relative motion are shown as small octahedra.
6
Conclusion and Future Work
In this paper, we addressed the problem of grouping moving rigid parts of articulated objects. We formulated the notion of axis-connected articulated objects and presented a new method that determines whether the motions of two rigid objects are connected by an axis. The method and the corresponding theory were discussed in the paper, and the resulting error values were examined versus noise in the image space, versus the number of frames and versus the number of points. The proposed method was also tested on data points coming from a real video sequence. In the future, we plan to deal with the grouping of different moving parts of human bodies, because a human skeleton is an articulated object, and its moving parts are the bones.
References
1. Tomasi, C., Kanade, T.: Shape and Motion from Image Streams under Orthography: A Factorization Approach. Intl. Journal Computer Vision 9, 137–154 (1992)
2. Weinshall, D., Tomasi, C.: Linear and Incremental Acquisition of Invariant Shape Models From Image Sequences. IEEE Trans. on PAMI 17(5), 512–517 (1995)
3. Poelman, C.J., Kanade, T.: A Paraperspective Factorization Method for Shape and Motion Recovery. IEEE Trans. on PAMI 19(3), 312–322 (1997)
4. Sturm, P., Triggs, B.: A Factorization Based Algorithm for Multi-Image Projective Structure and Motion. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 709–720. Springer, Heidelberg (1996)
5. Trajković, M., Hedley, M.: Robust Recursive Structure and Motion Recovery under Affine Projection. In: Proc. British Machine Vision Conference (September 1997)
6. Kurata, T., Fujiki, J., Kourogi, M., Sakaue, K.: A Robust Recursive Factorization Method for Recovering Structure and Motion from Live Video Frames. In: IEEE ICCV Frame-Rate Workshop (September 1999)
7. Torr, P.H., Murray, D.W.: Outlier detection and motion segmentation. In: Arcelli, C., Cordella, L.P., Sanniti di Baja, G. (eds.) Visual Form 2001. LNCS, vol. 2059, pp. 432–443. Springer, Heidelberg (2001)
8. Torr, P.H.S., Zisserman, A., Murray, D.W.: Motion clustering using the trilinear constraint over three views. In: Europe-China Workshop on Geometrical Modelling and Invariants for Computer Vision, pp. 118–125 (1995)
9. Kanatani, K.: Motion Segmentation by Subspace Separation and Model Selection. In: ICCV, pp. 586–591 (2001)
10. Hajder, L., Chetverikov, D.: Robust 3D Segmentation of Multiple Moving Objects Under Weak Perspective. In: ICCV Workshop on Dynamical Vision, CD ROM (2005)
11. Brand, M., Bhotika, R.: Flexible Flow for 3D Nonrigid Tracking and Shape Recovery. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 312–322 (December 2001)
12. Torresani, L., Yang, D., Alexander, E., Bregler, C.: Tracking and Modelling Nonrigid Objects with Rank Constraints. In: IEEE Conf. on Computer Vision and Pattern Recognition (2001)
13. Xiao, J., Chai, J.X., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: ECCV, vol. 4, pp. 573–587 (2004)
14. Lladó, X., Del Bue, A., Agapito, L.: Non-rigid factorization for projective reconstruction. In: British Machine Vision Conference, pp. 169–178 (2005)
15. Gander, W., Golub, G.H., Strebel, R.: Least-squares fitting of circles and ellipses. In: Numerical analysis (in honour of Jean Meinguet), pp. 63–84 (1996)
16. Taubin, G.: Estimation of planar curves, surfaces, and nonplanar space curves defined by implicit equations with applications to edge and range image segmentation. IEEE Trans. on PAMI 13(11), 1115–1138 (1991)
17. Fitzgibbon, A.W., Pilu, M., Fisher, R.B.: Direct least square fitting of ellipses. IEEE Trans. on PAMI 21(5), 476–480 (1999)
Decision Level Multiple Cameras Fusion Using Dezert-Smarandache Theory
Esteban Garcia and Leopoldo Altamirano
National Institute of Astrophysics, Optics and Electronics, Luis Enrique Erro No. 1, Tonantzintla, Puebla, Mexico
{eomargr, robles}@inaoep.mx
Abstract. This paper presents a model for multiple cameras fusion, which is based on the Dezert-Smarandache theory of evidence. We have developed a fusion model which works at the decision fusion level to track objects on a ground plane using geographically distributed cameras. As we are fusing at decision level, tracking is done based on predefined zones. We present early results of our model tested on CGI animated simulations, applying a perspective-based basic belief assignment function. Our experiments suggest that the proposed technique yields a good improvement in tracking accuracy when spatial regions are used to track.
Keywords: Multiple Cameras Fusion, Tracking, Dezert-Smarandache Theory, Decision Level Fusion.
1
Introduction
Using multiple cameras has proven to increase the capabilities of vision systems, mainly by extending the visual field or by complementing information from cameras to reduce uncertainty and handle issues such as occlusion. Information coming from sensors might be fused at several levels [1], depending on the type of information provided by the cameras, whether it is just images or other information that has been processed inside the camera (known for that reason as smart sensors). In practice, surveillance research aims for autonomous systems, or at least at making surveillance a less tedious task. While there are emerging approaches [2,3,4,5] which take into consideration information related to the objects in the scene (such as where an object is standing and how long it has been in that particular zone), it is still not common to find work on the use of multiple cameras to improve the information useful for such surveillance systems. In [5] a tracking system using predefined regions is used to analyze behavioral patterns, where uncertainty might be clearly reduced using more than one camera. In [4] a Hierarchical Hidden Markov Model is used to identify activities, based on tracking people in a room divided into cells; although two static cameras cover the scene, their purpose is to focus on different zones, not to refine information.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 117–124, 2007. © Springer-Verlag Berlin Heidelberg 2007
In this work we propose a decision-level fusion model that takes advantage of several geographically distributed cameras and reduces the uncertainty related to the 3D position of an object with respect to the camera view. We consider previously defined zones to track the position of objects, so the stage of processing at which data integration takes place allows an interpretation of information that better describes the position of the objects being observed and at the same time is useful for high-level surveillance systems, such as those focused on behavior recognition. In our model, individual decisions are taken by means of an axis-projection-based generalized basic belief assignment (gbba) function and finally fused using the Dezert-Smarandache (DSm) hybrid rule. The recent Dezert-Smarandache Theory (DSmT) [6] is especially useful for the fusion of information arising from several independent sources, and provides an effective combination tool when sources become conflicting, compared to other theories such as Dempster-Shafer [7]. We also propose a way to reduce and dynamically manage the frame of discernment to optimize computer resources. This paper is organized as follows: in section 2, the Dezert-Smarandache theory is briefly described as the mathematical framework. In section 3, we define our model, explain how zones are dynamically used to create the frame of discernment, and describe the gbba function we used. The performance of the proposed fusion model is illustrated in section 4. The paper ends with a conclusion in section 5.
2
DSm Hybrid Model
The DSmT considers two mathematical models used to represent and combine information [8]. First we describe the free DSm model and then the DSm hybrid model, which is defined from the free one but introduces integrity constraints, which we found useful to avoid wasting computer resources.
The free DSm model, denoted $\mathcal{M}^f(\Theta)$, defines $\Theta = \{\theta_1, \dots, \theta_n\}$ as a set or frame of n non-exclusive elements and a hyper-power set $D^\Theta$ as the set of all composite possibilities obtained from Θ in the following way:
(1) $\emptyset, \theta_1, \dots, \theta_n \in D^\Theta$.
(2) $\forall A \in D^\Theta, B \in D^\Theta$: $(A \cup B) \in D^\Theta$, $(A \cap B) \in D^\Theta$.
(3) $D^\Theta$ is formed only by elements obtained by rules 1 or 2.
The function m(A) is called the generalized basic belief assignment or mass for A, defined as $m(\cdot) : D^\Theta \to [0, 1]$, and is associated with a source of evidence.
A DSm hybrid model introduces some integrity constraints on elements $A \in D^\Theta$ when there are known facts related to those elements in the problem under consideration. The restricted elements are forced to be empty in the hybrid model $\mathcal{M}(\Theta) \neq \mathcal{M}^f(\Theta)$ and their mass is transferred to the non-restricted elements. When the DSm hybrid model is used, the combination rule for two or more sources is defined for $A \in D^\Theta$ with these functions:
$$m_{\mathcal{M}(\Theta)}(A) = \phi(A) \left[ S_1(A) + S_2(A) + S_3(A) \right] \tag{1}$$
$$S_1(A) = \sum_{\substack{X_1, X_2, \dots, X_k \in D^\Theta \\ X_1 \cap X_2 \cap \dots \cap X_k = A}} \ \prod_{i=1}^{k} m_i(X_i) \tag{2}$$
$$S_2(A) = \sum_{\substack{X_1, X_2, \dots, X_k \in \boldsymbol{\emptyset} \\ [\mathcal{U} = A] \vee [(\mathcal{U} \in \boldsymbol{\emptyset}) \wedge (A = I_t)]}} \ \prod_{i=1}^{k} m_i(X_i) \tag{3}$$
$$S_3(A) = \sum_{\substack{X_1, X_2, \dots, X_k \in D^\Theta \\ X_1 \cup X_2 \cup \dots \cup X_k = A \\ X_1 \cap X_2 \cap \dots \cap X_k \in \boldsymbol{\emptyset}}} \ \prod_{i=1}^{k} m_i(X_i) \tag{4}$$
where $\phi(A)$ is called the characteristic non-emptiness function of a set A ($\phi(A) = 1$ if $A \notin \boldsymbol{\emptyset}$ and $\phi(A) = 0$ otherwise). $\boldsymbol{\emptyset} = \{\emptyset_M, \emptyset\}$, where $\emptyset_M$ is the set of all elements of $D^\Theta$ forced to be empty. $\mathcal{U}$ is defined as $\mathcal{U} = u(X_1) \cup u(X_2) \cup \dots \cup u(X_k)$, where u(X) is the union of all singletons $\theta_i$ that compose X, while $I_t = \theta_1 \cup \theta_2 \cup \dots \cup \theta_n$.
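To make the combination rule concrete, the following sketch combines two sources on a small frame. It is an illustration under stated assumptions rather than a full implementation: each element of the hyper-power set is encoded as the set of Venn-diagram regions it covers (an encoding chosen here, not taken from the paper), integrity constraints are limited to declaring pairs of elements exclusive, and only the $S_1$ and $S_3$ terms are computed, which suffices when no source assigns mass to an element that is forced to be empty. The mass values are made up.

```python
from itertools import combinations, product

def venn_regions(n, exclusive_pairs=()):
    """Venn regions of n overlapping sets, minus those ruled out by constraints.

    A region is the frozenset of indices of the theta_i that contain it.
    Declaring (i, j) exclusive removes every region lying in both theta_i and theta_j.
    """
    regions = (frozenset(c) for k in range(1, n + 1)
               for c in combinations(range(n), k))
    return frozenset(r for r in regions
                     if not any(set(p) <= r for p in exclusive_pairs))

def theta(i, regions):
    """Element theta_i, encoded as the set of regions it covers."""
    return frozenset(r for r in regions if i in r)

def combine(m1, m2):
    """Two-source hybrid DSm combination restricted to the S1 and S3 terms."""
    fused = {}
    for (X1, w1), (X2, w2) in product(m1.items(), m2.items()):
        A = X1 & X2                      # S1: mass goes to the intersection
        if not A:                        # intersection is empty under the constraints
            A = X1 | X2                  # S3: mass is transferred to the union
        if not A:
            raise ValueError("mass on an empty element: the S2 term would be needed")
        fused[A] = fused.get(A, 0.0) + w1 * w2
    return fused

# Free model: two overlapping zones, so theta_1 ∩ theta_2 (the border) is allowed.
regions = venn_regions(2)
t1, t2 = theta(0, regions), theta(1, regions)
cam1 = {t1: 0.6, t1 & t2: 0.3, t1 | t2: 0.1}
cam2 = {t1: 0.5, t2: 0.3, t1 | t2: 0.2}
free = combine(cam1, cam2)

# Hybrid model: the two zones are declared exclusive, theta_1 ∩ theta_2 = ∅.
regions_h = venn_regions(2, exclusive_pairs=[(0, 1)])
h1, h2 = theta(0, regions_h), theta(1, regions_h)
cam1_h = {h1: 0.9, h1 | h2: 0.1}
cam2_h = {h1: 0.5, h2: 0.3, h1 | h2: 0.2}
hybrid = combine(cam1_h, cam2_h)
```

In the free model the conflicting product $m_1(\theta_1) m_2(\theta_2)$ ends up on $\theta_1 \cap \theta_2$, whereas under the exclusivity constraint it is moved to $\theta_1 \cup \theta_2$. In both cases the fused masses still sum to one.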
3
Multiple Cameras Fusion
We use a set of cameras to get information from the scene. Motion detection and tracking are performed separately for each camera; tracked objects are then matched in case there is more than one in the scene. These are prerequisite tasks, which might be performed in many ways: motion detection can be performed by background subtraction, while tracking is commonly solved with filters such as the Kalman filter. A homography is necessary to relate the information coming from the cameras; it can be recovered from a set of static points on the ground plane [9] or from dynamic information in the scene [10]. Correspondence between objects detected in the cameras might be achieved by feature matching techniques [11] or geometric ones [12,13]. Once the homography has been recovered and correspondence is achieved, it is possible to create hypotheses, evaluate the gbba function and fuse the information coming from the cameras. We propose in this paper that each camera becomes an expert for each object in the scene, providing beliefs about the object's position, so what we expect from each camera is a set of hypotheses about people's locations and their assigned beliefs.
3.1
Dynamic Frame of Discernment
Let $\Gamma = \{\gamma_1, \dots, \gamma_n\}$ denote the ground plane partition, where each $\gamma_x$ is a predefined zone of the ground plane, which might be a zone of special interest, such as a corridor or parking area, whether or not it is a regular polygon. For each moving object i, a frame $\Theta_i = \{\theta_1, \dots, \theta_k\}$ is created. Each element $\theta_x$ represents a zone $\gamma_y$ where the object i might be situated, according to information from the cameras. This means that $\Theta_i$ is built dynamically, considering only the zones for which some belief is provided by at least one camera. This is especially helpful to reduce $|D^\Theta|$ and thus avoid wasting computer resources. Each camera behaves as an expert, assigning mass to each of the unrestricted elements of $D^\Theta$. The assignment function is simple, and its main purpose is to account for the influence of perspective on uncertainty. This is achieved by measuring the intersection area between $\gamma_x$ and the object's vertical axis projected on the ground plane, centered on the object's feet. The length of the axis projected on the ground plane is determined by the angle of the camera with respect to the ground plane, taking the object's ground point as the vertex to measure the angle. So if the camera were just above the object, its axis projection would be just one pixel long, meaning no uncertainty at all. We establish the axis projection length as $\lambda = l \cos(\alpha)$, where l is the maximum length that the projection can take, measured in pixels, and α is the measured angle.
Fig. 1. Projected axis, used to evaluate the gbba function: (a) camera perspective, (b) object's vertical axis projection.
An expert (camera) can provide belief to elements $\theta_x \cap \theta_y \in D^\Theta$ by considering couples $\gamma_i$ and $\gamma_j$ (represented by $\theta_x$ and $\theta_y$ respectively) crossed by the axis projection. This is used as information indicating that the object is on the border between $\gamma_i$ and $\gamma_j$. Elements $\theta_x \cup \dots \cup \theta_y$ can have an associated gbba value, which represents local or global ignorance. We also restrict the elements $\theta_x \cap \dots \cap \theta_y \in D^\Theta$ for which no direct basic assignment has been made by any of the cameras; they are thus included in $\emptyset_M$, and calculations are simplified. This is possible because of the hybrid DSm model definition. Decision fusion is used to combine the outcomes from the cameras, making a final decision. We apply the hybrid DSm rule of combination over $D^\Theta$ in order to achieve a final decision. To transfer the masses of the integrity constraints of the hybrid DSm model, constraints are considered and it is not necessary to compute S1(A), S2(A) and S3(A) in equation 2.
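The geometric part of the assignment can be sketched as follows. The projection length follows λ = l cos(α) from the text; everything else is an assumption made for the illustration: zones are taken as square cells of a regular grid, the projected axis is a straight segment centered on the object's ground point, and its overlap with each zone is approximated by sampling points along the segment. How the resulting fractions are turned into masses on single zones, borders and unions is not specified here and is left out.

```python
import math

def axis_projection_length(l_max, alpha_deg):
    """lambda = l * cos(alpha), with alpha the camera angle in degrees."""
    return l_max * math.cos(math.radians(alpha_deg))

def zone_fractions(foot, direction, length, cell, samples=200):
    """Fraction of the projected axis that falls into each square grid zone.

    foot: ground point (x, y) of the object, the segment is centered on it.
    direction: direction of the projected axis on the ground plane.
    cell: side length of a zone (hypothetical regular grid).
    """
    norm = math.hypot(direction[0], direction[1])
    dx, dy = direction[0] / norm, direction[1] / norm
    counts = {}
    for k in range(samples):
        t = ((k + 0.5) / samples - 0.5) * length    # offset along the segment
        zone = (math.floor((foot[0] + t * dx) / cell),
                math.floor((foot[1] + t * dy) / cell))
        counts[zone] = counts.get(zone, 0) + 1
    return {zone: c / samples for zone, c in counts.items()}

# Projection lengths for three camera angles, with a hypothetical l of 100 pixels.
lengths = [axis_projection_length(100.0, a) for a in (28.32, 18.03, 9.86)]

# An object standing near a zone border: the axis crosses two zones.
fractions = zone_fractions(foot=(95.0, 50.0), direction=(1.0, 0.0),
                           length=20.0, cell=100.0)
```

A camera seen under a larger angle yields a shorter projection, so its evidence is spread over fewer zones. In the second example three quarters of the axis lie in one zone and one quarter in its neighbor, the situation in which a camera could also support the border element of the two zones.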
4
Experimental Results
As results of our approach, we present tests made using a computer-generated imagery (CGI) animated sequence rendered from three points of view, which simulates a three-camera surveillance system. We considered a simple square scenario with a grid of sixteen regular predefined zones, introducing the object's shadow as a noisy element. Furthermore, the object's color is sensitive to its position, because the illumination is located directly above the scene. Moving objects are detected by background subtraction [14]. Principal axis correspondence [15] was implemented, and the homography is recovered from a set of static correspondence points [9]. The cameras are elevated 28.32 (camera 1), 18.03 (camera 2) and 9.86 (camera 3) degrees with respect to the center of the ground plane. Camera 1 is located at a corner and the other two are orthogonally located on the sides of the ground plane. All of them are static and aimed at the center of the scene. 3D modeling was done using Blender with Yafray as the rendering engine. All generated images of the sequence have a resolution of 800x600 pixels. Examples of rendered images are shown in figure 2, where division lines were outlined on the ground plane to give a visual reference of the zones, but they are not required for any other task.
Fig. 2. Images from cameras at frame number 40: (a) camera 1, (b) camera 2, (c) camera 3.
The belief to be assigned by the cameras is obtained from the object's axis projected on the ground plane according to the camera perspective, where λ is applied as a coefficient to compute the length of the projection. After calculating the length, the homography is used to transform it to the ground plane reference and then its intersection with the zones is evaluated. All elements $\theta_x \cap \dots \cap \theta_y$ are included in $\emptyset_M$, except for those of the form $\theta_i \cap \theta_j$ to which a belief has been directly assigned by one of the cameras. Before fusing, the assignment is normalized to unity to comply with the DSm hybrid model definition. Results from tracking and fusion of a moving cube with proportions similar to a human body are shown in figures 3 and 4, corresponding to a sequence of 220 frames. The object goes from the zone in the lower left corner of the ground plane to the upper right one. Zones $\gamma_i \in \Gamma$ were enumerated from left to right and from top to bottom to be referred to in the figures. The generalized basic belief assignment value is represented by gray tones, going from white (high uncertainty) to black (zero uncertainty), while the object's position is plotted as a block line in row.
Fig. 3. Original positions compared with decisions in cameras: (a) original positions, (b) decisions in camera 1, (c) decisions in camera 2, (d) decisions in camera 3.
It is common for a camera to assign mass to more than one hypothesis because of the perspective consideration. If we take frame number 100 as an example, camera 1 assigns mass to zones 6, 10 and 11, which are adjacent to each other, as shown in figure 3(b). Each camera proposes hypotheses and the corresponding masses. Combining evidence from the cameras reduces uncertainty and refines the hypotheses compared to a single camera, as can be seen in figure 4, where the original positions are compared to the ones obtained by fusion. The fusion module improves the accuracy of the decisions from the cameras, even if the target is missed in some frames (camera 3 lost the object several times, see figure 3(d)).
Fig. 4. Original positions compared with fusion result: (a) original positions, (b) positions obtained by fusion.
The tracking system operates at a rate of about 5 frames per second on a notebook with the following characteristics: Turion64, 2.0GHz, 1GB RAM. All processing is done within the same computer, but we suppose the frame rate might be considerably higher if smart cameras or parallel processing were used.
5
Conclusions
In this paper an approach to fuse position information coming from distributed cameras was proposed and results were shown. We propose the use of cameras as experts, working at decision level and using Dezert-Smarandache Theory to combine beliefs. It has been shown that our model can reduce the uncertainty of the cameras by fusing zone-based positions. Further work is needed to examine the influence of other possible parameters to be included in the gbba function, such as the distance from the object to the camera, or even to try other belief functions. Although execution time was not an objective of this work, tests showed that it is possible to improve processing to reach real-time video surveillance requirements.
Acknowledgments. This work is partially supported by CONACYT under grant number 201562.
References
1. Dasarathy, B.V.: Sensor fusion potential exploitation-innovative architectures and illustrative applications. In: Proceedings of the IEEE, vol. 85, pp. 24–38 (January 1997)
2. Lv, F., Kang, J., Nevatia, R., Cohen, I., Medioni, G.: Automatic tracking and labeling of human activities in a video sequence. In: IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, in conjunction with ECCV'04 (2004)
3. Green, R., Guan, L.: Quantifying and recognizing human movement patterns from monocular video images - part i: A new framework for modeling human motion (2003)
4. Nguyen, N.T., Phung, D.Q., Venkatesh, S., Bui, H.: Learning and detecting activities from movement trajectories using the hierarchical hidden markov models. In: CVPR '05. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Washington, DC, USA, vol. 2, pp. 955–960. IEEE Computer Society, Los Alamitos, CA, USA (2005)
5. Yan, W., Forsyth, D.A.: Learning the behavior of users in a public space through video tracking. In: WACV-MOTION '05. Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION'05), Washington, DC, USA, vol. 1, pp. 370–377. IEEE Computer Society, Los Alamitos, CA, USA (2005)
6. Dezert, J.: Foundations for a new theory of plausible and paradoxical reasoning. Information and Security Journal 9 (2002)
7. Shafer, G.: A Mathematical Theory of Evidence. Princeton University, Princeton (1976)
8. Smarandache, F., Dezert, J.: An introduction to the DSm theory for the combination of paradoxical, uncertain, and imprecise sources of information (2006)
9. Stein, G., Lee, L., Romano, R.: Monitoring activities from multiple video streams: Establishing a common coordinate frame. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 758–767 (2000)
10. Bradshaw, K.J., Reid, I.D., Murray, D.W.: The active recovery of 3d motion trajectories and their use in prediction. IEEE Trans. Pattern Anal. Mach. Intell. 19(3), 219–234 (1997)
11. Krumm, J., Harris, S., Meyers, B., Brumitt, B., Hale, M., Shafer, S.: Multi-camera multiperson tracking for easyliving. In: Proceedings of the Third IEEE International Workshop on Visual Surveillance, pp. 3–10 (July 2000)
12. Khan, S., Shah, M.: Consistent labeling of tracked objects in multiple cameras with overlapping fields of view. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1355–1360 (2003)
13. Black, J., Ellis, T.: Multi camera image tracking. Image Vision Comput. 24(11), 1256–1267 (2006)
14. Heikkila, J., Silven, O.: A real-time system for monitoring of cyclists and pedestrians. In: VS '99. Proceedings of the Second IEEE Workshop on Visual Surveillance, Washington, DC, USA, p. 74. IEEE Computer Society, Los Alamitos, CA, USA (1999)
15. Hu, W., Hu, M., Zhou, X., Lou, J.: Principal axis-based correspondence between multiple cameras for people tracking. In: Tan, F.-T., Maybank, M.-S. (eds.) IEEE Trans. Pattern Anal. Mach. Intell., vol. 28(4), pp. 663 (2006)
Occlusion Removal in Video Microscopy
Brian Eastwood and Russell M. Taylor II
University of North Carolina at Chapel Hill, Department of Computer Science, Chapel Hill, NC, 27599-3175, USA
{beastwoo,taylorr}@cs.unc.edu
Abstract. Video microscopy offers researchers a method to observe small-scale dynamic processes. It is often useful to remove unchanging portions of image sequences to improve the perception and analysis of moving features. We propose two processing methods (a local method and a global method) for detecting and removing partial stationary occlusions from video microscopy data using the bright-field microscope image model. In both techniques, we compute the relative light transmission across the image plane due to fixed, partially-transparent objects. The resulting transmission map enables reconstruction of a video in which the occlusions have been removed. We present experimental results that compare the effectiveness and applicability of our two approaches.
Keywords: video processing, image enhancement, occlusion removal, light microscopy.
1 Introduction
Video microscopy figures prominently in many fields of research, such as biology, pathology, materials science, and physics. In some situations, an experiment imposes constraints that introduce undesirable artifacts into the captured images. For example, using long working distance lenses enables focusing further into a specimen, but also increases the depth of field, consequently including more of the specimen in the final image. Figure 1 shows frames of a video in which a microbead is attached to the beating cilia of epithelial lung cells. In this experiment, the microscope focuses through the substrate and cells, and these components modulate the image of the moving cilia. Other artifacts, such as debris on the image sensor (also seen in Figure 1), slide, or cover slip, may contribute to the final image.

A typical microscopy video consists of one or more moving “foreground” specimen objects and static “background” impurities. (The use of foreground and background does not imply which object is physically in front of another; we choose the convention that assigns the object of interest, the specimen, to the foreground.) The background impurities will obscure the foreground specimen as the specimen moves across the image plane. An occluding object and specimen become darker where they overlap, but rarely does an object absorb all light.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 125–132, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Ciliated epithelial lung cells move a bead in bright-field microscopy using a long working distance lens. The spot circled in the first frame is from dirt on the image sensor. The right-most image is the mean of 61 frames from this video.
In this paper, we present two methods for repairing video microscopy data that suffers from stationary partial occlusions. Our approaches use the image model of a bright-field microscope, and rely on motion to enable separation of the video into moving and stationary components. In addition to improving the visual clarity of microscopy videos, our technique is a useful pre-process for other image analysis tasks, such as optical flow computation. Our approaches have the added benefit—compared to background subtraction—of preserving the measured intensity relationships between image sensor locations, which is necessary for photometry. We compare the effectiveness of these methods, and describe which method suits different types of data.
2 Related Work
The details of image formation, such as noise and sensor calibration, are concerns in any sensitive imaging application [1]. Although sensor calibration considers image formation from the image sensor onwards, our work concentrates on image formation in the bright-field microscope before the image sensor. These methods, therefore, are complementary. Specifically, the linear response of the sensor should be validated before applying our techniques. Typical research microscope image sensors have such a calibrated response.

Background removal techniques for microscopy include background subtraction and flat-fielding [2]. These approaches use specimen-free calibration images to remove artifacts due to illumination variations by frame subtraction or division. Our method depends on the presence of a moving specimen and works on existing microscopy data without calibration. Hill et al. use a moving average image for background subtraction to improve the visual clarity of microscopy images [3]. Our methods perform this function and at the same time preserve the image intensity relationships required for photometry.

Schechner and Nayar compute the transmission map of a filter with spatially-varying transmittance [4]. Their initial estimate is based on image statistics, and we demonstrate in Section 3.2 how to improve this estimate. Their robust estimation relies on image registration with a rigid-body constraint, a condition that is rarely satisfied in biological microscopy. Shizawa’s multiple optical flow estimation technique also relies on rigid-body assumptions [5].
A significant body of computer vision research focuses on macroscopic occlusion detection and processing [6,7]. In the macroscopic case, occlusion is typically complete. Complete occlusion constitutes a nonlinearity in image formation that is manifested by a complete loss of signal in the captured image. In microscopy, however, occluding objects do not block light completely. In the multiplicative image model, each object absorbs a constant proportion of light, and the signal from the occluded object transmits through the occlusion. Throughout this paper, when we use the term “occlusion” we mean a partial occlusion as encountered in bright-field microscopy, unless otherwise specified.
3 Image Model
In a bright-field microscope, a light source is tuned to uniformly illuminate a specimen across the field of view. A series of lens assemblies (e.g. condenser, objective, and ocular) focus the light transmitted through the specimen onto an image sensor [2]. Light interacting with any object in the optical path is either absorbed, transmitted, or diffracted. Our first-order model ignores diffraction effects; we discuss the consequences of this simplification below. The image formed on the image sensor is therefore

\[ I(x) = L(x)\,T(x), \tag{1} \]

\[ T(x) = \{ T(x) : 0 \le T(x) \le 1,\ \forall x \}, \tag{2} \]

where $L(x)$ is the (ideally uniform) illumination and $T(x)$ defines the total light transmission at each image location $x$. This constitutes a multiplicative light model for the bright-field microscope. Considered over time, the transmission map, $T(x, t)$, is a composite of transmission coefficients from stationary and moving objects. For a stationary transmission map, $T_c(x)$, and time-varying specimen transmission, $T_s(x, t)$, the image formed on the sensor is

\[ I(x, t) = L(x)\,T_c(x)\,T_s(x, t). \tag{3} \]

Given the stationary transmission map, a repaired video, without the presence of stationary occlusions, is obtained by

\[ I_r(x, t) = \frac{I(x, t)}{T_c(x)} = L(x)\,T_s(x, t). \tag{4} \]

The image repair problem reduces to finding an accurate estimate of the stationary transmission map, $T_c(x)$; we present two estimation methods in the following sections. Note that Equation 4 amplifies the noise as well as the signal present in the image sequence $I(x, t)$. The transmission map, therefore, provides a natural measure of relative error in the repaired intensity values.
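As an illustration of Equation 4, the following minimal Python sketch divides a frame by a given stationary transmission map. It is not the C++/ITK implementation described later; the function names and the `min_transmission` guard against division by very small values are assumptions introduced here.

```python
def repair_frame(frame, transmission, min_transmission=1e-3):
    """Divide a frame by the stationary transmission map (Equation 4).

    min_transmission is a hypothetical guard, not part of the paper's model.
    """
    return [
        [i / max(t, min_transmission) for i, t in zip(row_i, row_t)]
        for row_i, row_t in zip(frame, transmission)
    ]


def noise_gain(transmission, min_transmission=1e-3):
    """Per-pixel amplification 1/Tc(x) applied to both signal and noise."""
    return [[1.0 / max(t, min_transmission) for t in row] for row in transmission]


# Toy 2x2 example built from the multiplicative model I = L * Tc * Ts.
L = 200.0
Tc = [[1.0, 0.5], [0.8, 1.0]]   # stationary occlusion map
Ts = [[1.0, 0.9], [0.9, 0.5]]   # specimen transmission in this frame
frame = [[L * Tc[r][c] * Ts[r][c] for c in range(2)] for r in range(2)]

repaired = repair_frame(frame, Tc)   # approximately L * Ts
gain = noise_gain(Tc)                # the pixel with Tc = 0.5 has gain 2
```

In this toy case the repaired frame equals the illumination times the specimen transmission, and the gain map makes the relative error noted above explicit: a pixel seen through half transmission has its noise doubled.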
As a consequence of ignoring diffraction effects, occlusions on the image sensor are modeled better than occlusions elsewhere in the optical path. Debris on lens elements is often completely out of focus, and its effects are distributed across the image plane. Light diffracted from partially-focused, stationary debris, however, may add intensity to a local image region. Experimental results in Section 5.2 demonstrate that our simplification yields useful information even in certain situations that exhibit diffraction effects.

3.1 Approach 1: A Local Transmission Metric
Within the mean intensity image from all frames in a microscopy video, as seen in Figure 1, the contribution of moving foreground objects is approximately zero-mean within a local neighborhood. The predominant spatial variance is due to stationary background objects. This observation leads to a method for estimating the stationary transmission map. We assume that for a sequence of $N$ images the mean intensity, $\bar{I}(x)$, is locally constant except at the boundaries of stationary background objects. Comparing the mean intensity of an occluded pixel to that of its neighbors determines the amplification necessary to bring the occluded pixel in line with its neighbors. One tested comparison operator is the median of the pixels in a neighborhood $\Omega$,

\[ I_l(x) = \mathrm{Median}\left[ \bar{I}(x_i);\ x_i \in \Omega \right]. \tag{5} \]

The per-pixel mean intensity provides a measure of how much light was transmitted to an image location while the neighborhood median provides an estimate of how much light was transmitted to the pixels in the same neighborhood. Careful choice of the neighborhood size, based on the expected size of occlusions, leads to an estimate of the stationary transmission map,

\[ T_c(x) = \frac{\bar{I}(x)}{I_l(x)}. \tag{6} \]
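The local estimate of Equations 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the neighborhood is taken to be a square window of half-width `r` clipped at the image border, and all function names are hypothetical.

```python
from statistics import mean, median


def mean_image(frames):
    """Per-pixel temporal mean over all frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[mean(f[y][x] for f in frames) for x in range(cols)] for y in range(rows)]


def local_median(image, r):
    """Median over a square window of half-width r, clipped at the border (Equation 5)."""
    rows, cols = len(image), len(image[0])
    out = []
    for y in range(rows):
        row = []
        for x in range(cols):
            window = [
                image[j][i]
                for j in range(max(0, y - r), min(rows, y + r + 1))
                for i in range(max(0, x - r), min(cols, x + r + 1))
            ]
            row.append(median(window))
        out.append(row)
    return out


def local_transmission(frames, r):
    """Ratio of the per-pixel mean to its neighborhood median (Equation 6)."""
    m = mean_image(frames)
    ref = local_median(m, r)
    return [[m[y][x] / ref[y][x] for x in range(len(m[0]))] for y in range(len(m))]


def make_frame(t):
    """Toy 5x5 frame: uniform light, one fixed occlusion, one moving specimen pixel."""
    frame = [[100.0] * 5 for _ in range(5)]
    frame[2][2] *= 0.5   # stationary occlusion with transmission 0.5
    frame[0][t] *= 0.8   # specimen moving along the top row
    return frame


frames = [make_frame(t) for t in range(5)]
tc = local_transmission(frames, r=1)   # tc[2][2] recovers the occlusion's 0.5
```

The window must be larger than the occlusion for the median to see unobstructed pixels, which is why the choice of neighborhood size matters. Pixels along the specimen's path come out slightly below one here, reflecting that the moving contribution is only approximately zero-mean.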
We refer to this technique as the Local Transmission Metric (LTM) method.

3.2 Approach 2: A Global Transmission Metric
We next describe a framework for image intensity comparisons and use it to formulate an alternative estimate of the stationary transmission map. For a single image formed by a multiplicative light model, as in Equation 1, the ratio of image intensities at two image locations measures relative light transmission. The logarithm transforms this ratio to a linear difference, as noted by Shizawa [5]. The 1D central ratio of intensity, analogous to the discrete central difference, is

\[ R(x) = \left[ \frac{I(x + \Delta x)}{I(x - \Delta x)} \right]^{\frac{1}{2\Delta x}}, \tag{7} \]

\[ \log R(x) = \frac{\log[I(x + \Delta x)] - \log[I(x - \Delta x)]}{2\Delta x} \approx \frac{d \log[I(x)]}{dx}, \tag{8} \]

which is the derivative of the logarithm of image intensity. In 2D, the gradient replaces the derivative. We cannot obtain an accurate measure of the local stationary transmission gradient using a single frame from a video. Therefore, we use all available frames to compute a robust estimate,

\[ \log R(x) = \mathrm{Est}\left[ \nabla \log[I(x, t_i)];\ i = 1..N \right]. \tag{9} \]
Here, Est is any estimation of the gradient of the logarithm of intensity over all $N$ frames, for example the mean or median. Other robust estimators may improve this method, depending on the nature of the motion in the video [8]. A successful estimator distinguishes between elements from stationary occlusions and the moving specimen. To compute the logarithm of transmission, we integrate this local estimate over the image plane and convert back to intensity space,

\[ \log T_c(x) = \int_{\forall x} \mathrm{Est}\left[ \nabla \log[I(x, t_i)];\ i = 1..N \right] dx\,dy, \tag{10} \]

\[ T_c(x) = \exp\left( \log T_c(x) \right). \tag{11} \]

We refer to this core method as the Gradient Logarithm Transmission (GLT) method and its variants as GLT-mean and GLT-median. Note that if the robust estimation, Est, is linear, the estimation and gradient commute, simplifying Equations 10 and 11 to

\[ T_c(x) = \exp\left( \mathrm{Est}\left[ \log I(x, t_i);\ i = 1..N \right] \right). \tag{12} \]
For the mean estimator, Equation 12 reduces to the geometric mean of all frames. No such simplification is possible for the median estimator.
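To make the gradient-logarithm idea concrete, here is a minimal one-dimensional Python sketch. It rests on several assumptions that are not from the paper: forward differences stand in for the central difference, a cumulative sum stands in for the 2D integration, and the result is scaled so that the maximum transmission is one. The toy data and all names are hypothetical.

```python
import math
from statistics import mean, median


def glt_transmission_1d(frames, estimator=median):
    """1D sketch: estimate log-gradients over time, integrate, exponentiate."""
    log_frames = [[math.log(v) for v in frame] for frame in frames]
    n = len(log_frames[0])
    # Per-location estimate over all frames of the forward log-difference.
    gradient = [
        estimator([lf[x + 1] - lf[x] for lf in log_frames]) for x in range(n - 1)
    ]
    # In 1D, integration is a cumulative sum; the constant of integration is free.
    log_t = [0.0]
    for g in gradient:
        log_t.append(log_t[-1] + g)
    peak = max(log_t)   # assumed scale choice: maximum transmission equals one
    return [math.exp(v - peak) for v in log_t]


ILLUMINATION = 100.0
TRUE_TC = [1.0, 1.0, 0.5, 0.5, 1.0, 1.0]   # stationary occlusion over pixels 2-3
POSITIONS = [0, 1, 2, 3, 4, 4, 5]          # specimen lingers on pixel 4


def make_frame(position):
    """One frame of I = L * Tc * Ts with a specimen of transmission 0.7."""
    return [
        ILLUMINATION * tc * (0.7 if x == position else 1.0)
        for x, tc in enumerate(TRUE_TC)
    ]


frames = [make_frame(p) for p in POSITIONS]
tc_median = glt_transmission_1d(frames, median)   # matches TRUE_TC
tc_mean = glt_transmission_1d(frames, mean)       # dips at pixel 4
```

In this toy sequence each gradient is disturbed in fewer than half of the frames, so the median estimate recovers the stationary map, while the mean estimate carries a faint imprint of the specimen where it lingered.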
4 Implementation
We implemented both of the techniques outlined in Section 3 as well as mean-image background subtraction in C++ using the Insight Toolkit [9] (ITK) as a framework. Some implementation details deserve mention. Frankot and Chellappa demonstrate how to enforce integrability of derivative estimates, as in Equation 10 [10]. This method computes the periodic surface with integrable derivatives closest to the derivative estimates. To handle occlusions with non-periodic transmission maps, we pad our image data with a Hanning window [11]. The integration and subsequent exponentiation in Equation 11 mean that we only recover the transmission map up to a scale factor. One choice for this scale factor enforces our assumption that objects only absorb light by setting the maximum transmission to one. Mean-image background subtraction involves a shifting of intensity values, and consequently intensity ratios are not preserved. In our background subtraction implementation, we maintain the mean image intensity between input and output image sequences.
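Two of these details are simple enough to sketch in Python. The code below is illustrative only and is not the ITK implementation: the function names are hypothetical, and adding back the global mean of the background image is one assumed way of maintaining the mean intensity between input and output.

```python
from statistics import mean


def normalise_transmission(t_map):
    """Fix the free scale factor by setting the maximum transmission to one."""
    peak = max(max(row) for row in t_map)
    return [[v / peak for v in row] for row in t_map]


def subtract_background(frames):
    """Mean-image background subtraction that keeps the overall mean intensity."""
    rows, cols = len(frames[0]), len(frames[0][0])
    background = [
        [mean(f[r][c] for f in frames) for c in range(cols)] for r in range(rows)
    ]
    # Assumed choice: restore the global mean of the background image.
    offset = mean(v for row in background for v in row)
    return [
        [[f[r][c] - background[r][c] + offset for c in range(cols)] for r in range(rows)]
        for f in frames
    ]


t_scaled = normalise_transmission([[0.2, 0.4], [0.1, 0.3]])

frames = [[[10.0, 40.0]], [[30.0, 40.0]]]   # two frames of one row each
shifted = subtract_background(frames)       # [[[20.0, 30.0]], [[40.0, 30.0]]]
```

The toy output shows the point made above: the overall mean stays at 30, but the ratio between the two pixels of the first frame changes from 10/40 to 20/30, so intensity ratios are not preserved by subtraction.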
5 Evaluation
We evaluated the microscopy video repair algorithms using both synthetic and real data. Synthetic data provides a known ground truth by which to quantitatively assess the techniques. Real microscopy data must be qualitatively assessed.

5.1 Simulated Data
Our simulated data is generated from a normally-distributed noise field undergoing rigid transformations. Each test video consists of 180 frames of 256 × 256 pixels. To simulate microscopy image formation, we modulate these image sequences with fixed patterns as seen in Figure 2. We apply each of our video repair algorithms and compute the mean peak signal-to-noise ratio (PSNR) over all frames [12]. Higher PSNR values indicate a more faithful reconstruction of an image, and PSNR values between 30 and 40 dB are typical in the image processing literature [13].
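For reference, the mean PSNR over a sequence can be computed as in the following Python sketch, which uses the standard definition 10 log10(peak² / MSE). The peak value of 255 assumes 8-bit intensities, which the text does not state, and the function names are hypothetical.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error = 0.0
    count = 0
    for row_r, row_t in zip(reference, test):
        for a, b in zip(row_r, row_t):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf   # identical images
    return 10.0 * math.log10(peak * peak / mse)


def mean_psnr(reference_frames, test_frames, peak=255.0):
    """Average the per-frame PSNR over a whole sequence."""
    values = [psnr(r, t, peak) for r, t in zip(reference_frames, test_frames)]
    return sum(values) / len(values)


reference = [[100.0, 100.0]]
reconstructed = [[101.0, 99.0]]            # mean squared error of 1
value = psnr(reference, reconstructed)     # about 48.13 dB for peak = 255
```

Because the error enters through a squared difference, a uniform gain error in the repaired frames lowers the PSNR even when the spatial pattern is recovered perfectly.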
Fig. 2. Sample test data of transforming noise fields. Left to right: no occlusion, small occlusions with varying transmission, large occlusion, step function occlusion.
Table 1 shows the reconstruction results for a subset of test cases. Because PSNR is sensitive to gain, we choose to scale the stationary transmission maps estimated by the GLT methods to obtain the highest PSNR. Using the case of a rotating noise field as an example, the data set with no occlusion indicates the quality of reconstruction obtainable by each method. All methods match performance in the unoccluded case when small occlusions of varying modulation are added to the video. The LTM method performs best in this example, followed closely by the GLT-mean. The presence of a large occlusion exposes the weakness in the LTM method, as large occlusions require large neighborhoods to ensure an accurate estimate of the unobstructed intensity at each image location. A neighborhood with a radius of 100 pixels was required to repair the test case with large occlusion. This took over an hour on a modern desktop computer, an unacceptable penalty in most situations. The GLT methods do not suffer from this scale dependency. The shape and non-periodic nature of the transmission map in the step-function modulated test case cannot be recovered by the LTM and non-windowed GLT-median methods. Padding with the Hanning window, as discussed in Section 4, resolves this issue for the GLT-median method.
Table 1. Mean PSNR (dB) measurements for repairing simulated microscopy data; r is half the width of the square neighborhood used in the LTM method; window refers to padding with a Hanning window. The best result for each data set is marked with an asterisk.

Data set              Noise
Transform type        Rotation                            Translation
Occlusion type        none     small    large    step     none     small    large    step
Background subtract   42.12    38.98    16.04    12.20    38.98    18.32    14.51    11.77
LTM, r=15             46.12*   46.16*   13.04    10.37    43.36*   43.44*   13.33    10.66
LTM, r=100            -        -        41.65    10.38    -        -        38.47    10.68
GLT-mean              41.83    41.83    41.83*   41.79*   39.87    39.88    39.85*   39.61*
GLT-median            39.66    39.65    39.66    12.10    36.62    36.61    36.62    12.70
GLT-median, window    35.19    35.20    35.15    35.20    33.39    33.39    33.37    33.28

5.2 Microscopy Data
Figure 3 shows the result of repairing the video of beating cilia seen in Figure 1. The LTM method succeeds in removing most of the occlusion in the upper left of the frame. A halo remains from diffraction around this occlusion, as our light model only handles occlusions that absorb light. The GLT methods seem not to suffer from this limitation because they form a complete model of all non-moving components of the video. This diffraction, however, may affect the intensity scaling of the final image; here, we preserve the mean image intensity. Diffraction effects are seen in the moving bead as well, but this foreground component has no effect on our repair methods. Background subtraction and the GLT methods also remove the larger stationary background components from the video, which accentuates the motion of the cilia. The GLT-median is the only method that does not suffer from ghosting of the moving bead in the bottom-center of the images. This example highlights the strength of the GLT-median: when a foreground object covers an image location for fewer than half the frames in a video, its effect does not perturb the background of the repaired video.
Fig. 3. Repair of cilia microscopy video, 170 × 170 pixels, 61 frames. Left to right: original, frame repaired with background subtraction, LTM (r = 10), GLT-mean, GLT-median.
6 Discussion
We have demonstrated two methods for repairing microscopy videos that exhibit advantages over background subtraction. Each method and variation has strengths which lead it to outperform the others in different scenarios. The LTM method performs best on videos with small, bounded occlusions, but handles large occlusions and diffraction effects poorly. The GLT methods work equally well on any size occlusion and at least visually improve some diffraction effects. The GLT-median estimation outperforms the GLT-mean estimation when a moving scene object covers an image region for less than half of the video. Research is ongoing to combine our repair method with optical flow to improve tracking in microscopy and provide more reliable reconstruction under total occlusion.

Acknowledgments. This work is supported by NIH NIBIB grant P41EB002025-23A1. The authors thank David Hill for the use of the microscopy images.
References

1. Tsin, Y., Ramesh, V., Kanade, T.: Statistical calibration of CCD imaging process. IEEE ICCV 01, 480 (2001)
2. Inoué, S., Spring, K.R.: Video Microscopy: The Fundamentals, 2nd edn. Springer, Heidelberg (1997)
3. Hill, D.B., Plaza, M.J., Bonin, K., Holzwarth, G.: Fast vesicle transport in PC12 neurites: velocities and forces. European Biophysics Journal 33(7), 623–632 (2004)
4. Schechner, Y.Y., Nayar, S.K.: Generalized mosaicing: High dynamic range in a wide field of view. IJCV 53(3), 245–267 (2003)
5. Shizawa, M., Mase, K.: Simultaneous multiple optical flow estimation. ICPR 1, 274–278 (1990)
6. Sun, J., Li, Y., Kang, S.B., Shum, H.Y.: Symmetric stereo matching for occlusion handling. IEEE CVPR 2, 399–406 (2005)
7. Jia, J., Wu, T.P., Tai, Y.W., Tang, C.K.: Video repairing: Inference of foreground and background under severe occlusion. IEEE CVPR 01, 364–371 (2004)
8. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)
9. Ibanez, L., Schroeder, W., Ng, L., Cates, J.: The ITK Software Guide, 2nd edn. Kitware, Inc. (2005), ISBN 1-930934-15-7, http://www.itk.org/ItkSoftwareGuide.pdf
10. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE PAMI 10(4), 439–451 (1988)
11. Harris, F.J.: On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE 66(1), 51–83 (1978)
12. Bovik, A.: Handbook of image and video processing, 2nd edn. Academic Press, San Diego (2005)
13. Antonini, M., Barlaud, M., Mathieu, P., Daubechies, I.: Image coding using wavelet transform. IEEE TIP 1, 205–220 (1992)
A Modular Approach for Automating Video Analysis

Gayathri Nadarajan¹ and Arnaud Renouf²

¹ CISA, School of Informatics, University of Edinburgh, Scotland
[email protected]
² GREYC Laboratory (CNRS UMR 6072), Caen cedex France
[email protected]

Abstract. Automating the steps involved in video processing has yet to be tackled with much success by vision developers and knowledge engineers. This is due to the difficulty in formulating vision problems and their solutions in a generalised manner. In this collaborative work, we introduce a modular approach that utilises ontologies to capture the goals, domain description and capabilities for performing video analysis. This modularisation is tested on real-world videos from an ecological source and proves useful in conceptualising and generalising video processing tasks. On a more significant note, this could be used in a framework for automatic video analysis in emerging infrastructures such as the Grid.

Keywords: Knowledge-Based Vision, Ontological Engineering, Automatic Video Analysis, Ontology-Based Systems.
1 Introduction
The field of video analysis is becoming more and more important with the fast advancement in vision technologies and the increasing size of real-time data that need to be processed efficiently. Although the majority of vision developers focus on improving low-level techniques and algorithms that perform with extreme accuracy, there is a lack of effort in conceptualising the tasks involved in the process of video analysis itself, even though much of image processing involves sequences of repeated sub-tasks. Ontological engineering [1], on the other hand, is concerned with providing formal conceptualisations of entities that are relevant to a particular problem domain so that these representations and the relationships between them are made explicit. Applying higher level knowledge or semantics would allow for better reasoning on the concepts by the sharing and reuse of existing knowledge and also lead to the discovery of new knowledge.

To date, the problem of conceptualising image processing solutions has been tackled by the vision community in limited ways [2,3,4,5,6] and less addressed by the knowledge engineering community. However, this poses difficulty to vision-based researchers as they lack the expertise to modularise image processing problems using semantic-based approaches (e.g. ontologies). Thus a collaborative effort between ontology engineers and vision developers could bridge this current research gap that both fields have yet to tackle with much success. The main challenges lie in the fact that the most recent knowledge representation and reasoning methods (e.g. Semantic Web technologies, such as OWL [7]) lack maturity and the image processing formulation is hard to generalise [8,9].

In this work we tackle the problem of providing a modularised approach for video analysis using an example test case from the EcoGrid project [10]. This effort improves and extends existing work by the authors [9,11] by adding process modelling aspects and richer modularisation of ontologies while making use of extensive video processing terminologies. The long-term aim is to provide a flexible, semantic-based automatic video analysis framework that would prove useful for emerging infrastructures such as the Grid [12].

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 133–140, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Related Work
Attempts to solve image processing problems automatically were conducted within knowledge-based systems such as LLVE [8], CONNY [13], OCAPI [14], MVP [15] and BORG [16]. However, these systems remain limited to a list of restricted and well-known goals. A priori knowledge on the application context (domain-specific concepts such as sensor type, noise, lighting, etc.) and on the goal to achieve was therefore implicitly encoded in the knowledge base. This implicit knowledge restricts the range of application domains for these systems and it is one of the reasons for their failure [17]. More recent approaches bring more explicit modelling [2,3,4,5,6] but they are all restricted to the modelling of business object descriptions for tasks such as detection, segmentation, image retrieval, image annotation or recognition applications. They use ontologies that provide the concepts needed for this description (a visual concept ontology for object recognition in [3,5], a visual descriptor ontology for semantic annotation of images and videos in [18] or image processing primitives in [2]) or they capture the business knowledge through meetings with the specialists (e.g. use of the NIAM/ORM method in [4] to collect and map the business knowledge to the vision knowledge). But they do not completely tackle the problem of the application context description (just briefly in [3,5]) and the effects of this context on the images (environment, lighting, sensor, image format). Moreover, they do not define the means to describe the image content when objects are a priori unknown or unusable, for instance in robotics, image retrieval or restoration applications. They also suppose that the objectives are well known (to detect, to extract or to recognise an object with a restricted set of constraints) and therefore they do not address their specification. To overcome these limitations, we aim at designing a modular approach for automating video and image processing.

In such an approach, we have to make explicit the formulation of the problem to be solved and the knowledge used by image processing experts during the design of the solution. Vision experts design applications through trial-and-error cycles by implementing new solutions from scratch each time instead of reusing already developed ones. The lack of application formulation and modelling is a reason for this. Image processing experts do not realise a complete and rigorous formulation of the applications. Therefore the reusability of the applications is very poor and modularisation is needed to improve this situation.
3 Modularisation
Our modularisation of the image processing field clearly separates the formulation of the problems and the solutions to be produced. The user, represented by domain experts, will be involved in providing the problem descriptions in high level terms while the system (utilising a Planner) will provide the solutions in low level vision terms. Domain knowledge has to be acquired from the domain expert and formalised in order to be able to propose a way of processing the images taken under consideration. Our first work [9] studied the formulation of image processing applications in order to propose an ontology that is used to express the objective of the domain expert (the goal part of the ontology) and define the image class to be processed (the input images and their variability description using the domain part of the ontology). It is used in the Hermes Project [19] which proposes a human-machine interface dedicated to domain experts inexperienced in the image processing field. Using this interface, they are able to formulate their goals and the description of their images using their domain knowledge. The goal, coupled with the domain description, will then be used to derive a set of image processing tasks that will solve the user request. Each task could be further decomposed into sub-tasks. We associate an image processing tool as possessing the capability to achieve a sub-task. In our current approach, we have opted to incorporate three ontologies to separate the goals from the capabilities and to provide meaning for the process within a semantically integrated system. Each ontology holds a vocabulary of classes of things that it represents and the relationships between them. The use of ontologies is beneficial because they provide a formal and explicit means to represent concepts, relationships and properties in a domain. They play an important role in fulfilling semantic interoperability, as highlighted in Section 1. 
A system with full ontological integration has several advantages. It allows for cross-checking between ontologies, addition of new concepts into the system and discovery of new knowledge within the system. A framework incorporating the ontologies with a Planning mechanism is proposed in [11].

3.1 Goal Ontology
The goal ontology contains the high level goals and constraints that the user will communicate to the system. These are represented by the concepts Goal, Constraint Category, Constraint Descriptor and Constraint Qualifier in Fig. 1. The concept Task/Process links the goal to the processes that are associated with it. The instances of processes are contained within a process library [11] and will be selected based on task decomposition and performance criteria. These tasks will also be linked with the capability ontology (Section 3.3).

Fig. 1. Goal Ontology

3.2 Domain Ontology
The domain ontology (Fig. 2) describes the concepts and relationships of the application area, such as the lighting conditions, colour information, position, orientation as well as spatial and temporal aspects. The user will input the domain description along with the goal of the problem. The system will use this ontology to build the user request based on the domain description before feeding it into the Planner which will be responsible for the solution generation.
Fig. 2. Domain Ontology
3.3 Capability Ontology
The capability ontology (Fig. 3) contains the classes of video and image processing techniques. Each technique (or capability) is associated with one or more tools. A tool is a software component that can perform a video or image processing task independently, or a technique within an integrated vision library that may be invoked with given parameters. This ontology will be used directly by the Planner in order to identify the tools that will be used to solve the problem.
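The association between capabilities and tools can be pictured with a small Python sketch. It is purely illustrative: the `Tool` record, the numeric `score` used as a performance measure, and the tool names are hypothetical and do not come from the ontology or the Planner described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str          # hypothetical tool identifier
    capability: str    # the technique this tool can perform
    score: float       # assumed performance measure; higher is better


def select_tool(capability, tools):
    """Return the best-scoring tool that offers the requested capability."""
    candidates = [tool for tool in tools if tool.capability == capability]
    if not candidates:
        raise LookupError(f"no tool offers capability {capability!r}")
    return max(candidates, key=lambda tool: tool.score)


# Hypothetical registry: one capability may be offered by several tools.
TOOLS = [
    Tool("frame_averager", "background model construction", 0.9),
    Tool("median_modeller", "background model construction", 0.7),
    Tool("frame_differencer", "image differencing", 0.8),
]

chosen = select_tool("background model construction", TOOLS)
```

Keeping the capability as the lookup key means a planner never refers to a tool by name, so new tools can be registered without changing how problems are formulated.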
Fig. 3. Capability Ontology
4 Test Case
The ontologies formulated above have been applied to a real-world scenario in the EcoGrid [10] that utilises state-of-the-art technologies to establish a cyberinfrastructure for ecological research. This includes the integration of geographically distributed sensors, computing power and storage resources into a uniform and secure platform. Scientists can conduct data acquisition, data analysis and data sharing on this platform. One of the challenges is that a vast amount of raw data of varying qualities needs to be analysed efficiently and effectively (Fig. 4). Manual processing by ecologists, however, would be too time-consuming as a minute’s video clip will take 15 minutes’ analysis time. This would not be feasible for data that amount to 1.86 Terabytes per year. Thus these videos must be processed automatically and in the most efficient manner. A full set of requirements for providing such a mechanism has been outlined in [11]. This includes an adaptive, flexible and generic architecture to allow for various video processing tasks to be conducted in an intelligent and optimal manner.

Fig. 4. Sample images extracted from EcoGrid video clips. From left to right: cluttered, partial object present, whole object present and non-presence of activity.

Walkthrough. Based on the devised ontologies, a walkthrough on how they are used to provide different levels of vocabulary for the users, vision tools and processes in a seamless and related manner is outlined here. The user, who is an ecologist, may have a high level goal such as “Detect fishes in the videos” in mind. This is represented and selected via the following selection-value pairs:

(Goal: Detection)
[Detail Level: Occurrence = all occurrences]
[Performance Criteria: Processing Time = real time]

As a first step, the user interacts with the system by providing input values for the goal and constraints. The goal can be one of the goals specified in the goal ontology (Fig. 1), while the constraints are additional parameters that specify rules or restrictions that apply to the goal. These include qualifiers, e.g. Acceptable Errors, Optimisation Criteria, Detail Level and Quality Criteria, contained within the goal ontology. Then the user describes the images to be processed. This description is given using the system, which proposes descriptors contained in the domain ontology (Fig. 2). In our scenario, a description of the acquisition context and of the semantic content of the images is obtained: the lighting conditions change very slowly, the camera is fixed (and so is the background), images are degraded by a blocking effect due to the compression, and the fishes are regions whose colours are different from the background and are bigger than a minimal area (because when they are too small they are unusable).
Thus, this goal is interpreted as “The detection of all the occurrences of non-background regions on a fixed background.” As soon as the formulation of the user’s problem is made, a sequence of processes for execution using task decomposition will be sought by a Planning mechanism. Fig. 5 contains a breakdown of high-level tasks for the goal detection.
Fig. 5. High-Level Sequence of Process for Goal “Detect Fishes in the Videos”
Each task within this sequence could be further decomposed into sub-tasks. For instance, the 'Segmentation' task involves 'Background Subtraction' (under Task/Process in the goal ontology), which is done in three steps: background model construction, differencing of the current frame and the background model, and background model update. All these sub-tasks are point arithmetic operations in the capability ontology. When a sub-task can no longer be decomposed, the tool or technique identified for performing it can be applied directly on the video clip.
The tool or technique available within the system is represented in the capability ontology. In the case where more than one tool is available to perform the sub-task, the tool with the best performing capability is selected. In our case, the background model construction is done by averaging a series of successive images without any fish. Next, the difference between the current image and the background model is obtained. The model is updated using this current image to avoid future false detections due to the small changes of lighting conditions. False detections elimination (sub-task of the classification step) is achieved using a threshold on the minimal area given by the user. Fig. 6 shows two sample results obtained using the solution proposed by the modularised approach. As the detection is accurate, the same principles could be extended to achieve more complicated video processing tasks (such as motion analysis) as long as the ontologies capture the goals, domain descriptions and capabilities.
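The three background subtraction steps and the minimal-area filter described above can be sketched as follows. This is only an illustration under stated assumptions, not the tools used by the authors: frames are plain nested lists of grey levels, the names `build_background`, `detect` and `update_background` are hypothetical, and the running-average update with weight `alpha` (applied only to pixels not flagged as foreground) is an assumed update rule, since the text does not say how the model is updated.

```python
from collections import deque

def build_background(frames):
    """Average a series of fish-free frames pixel by pixel."""
    n = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(w)] for r in range(h)]

def detect(frame, background, diff_threshold, min_area):
    """Difference the frame against the model and keep regions of at least min_area pixels."""
    h, w = len(frame), len(frame[0])
    mask = [[abs(frame[r][c] - background[r][c]) > diff_threshold
             for c in range(w)] for r in range(h)]
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Flood fill (4-connectivity) collects one candidate region.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                pr, pc = queue.popleft()
                region.append((pr, pc))
                for nr, nc in ((pr + 1, pc), (pr - 1, pc), (pr, pc + 1), (pr, pc - 1)):
                    if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # False detection elimination: drop regions below the minimal area.
            if len(region) >= min_area:
                regions.append(region)
    return regions, mask

def update_background(background, frame, mask, alpha=0.05):
    """Assumed update rule: blend the frame into the model where nothing was detected."""
    return [[bg if fg else (1 - alpha) * bg + alpha * px
             for bg, px, fg in zip(bg_row, fr_row, mask_row)]
            for bg_row, fr_row, mask_row in zip(background, frame, mask)]

# Toy data: a flat background, a 4-pixel "fish" and a 1-pixel false detection.
empty = [[10] * 6 for _ in range(5)]
background = build_background([empty, empty, empty])
frame = [row[:] for row in empty]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    frame[r][c] = 200
frame[4][5] = 200
regions, mask = detect(frame, background, diff_threshold=30, min_area=3)
background = update_background(background, frame, mask)
```

With these toy values only the 4-pixel region survives the area threshold; the isolated pixel is discarded as a false detection.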
Fig. 6. Two Sample Results. From Left to Right for Each Row: Original Image, Background Model, Classified Result.
5 Conclusion
The formulation of the video analysis problem description and solution can be tackled by modularising them using separate but inter-related ontologies. The three ontologies presented (goal, domain and capability) are extensive and describe the different aspects of video analysis in meaningful ways. This is useful because it provides a means to formalise the video analysis process and promotes reusability of applications. We have demonstrated the application of this modularised approach using an example from an ecological domain. The ontologies are high-level and general enough to be tailored by vision experts towards building application ontologies for solving tasks more specific to their problem domains. In a broader context, this effort is a significant step towards providing a semantically-enhanced framework in emerging infrastructures such as the Grid, which would allow for real-time distributed image processing.
G. Nadarajan and A. Renouf
References
1. Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontological Engineering: With Examples from the Areas of Knowledge Management, E-Commerce and the Semantic Web, 1st edn. (2004)
2. Nouvel, A., Dalle, P.: An Interactive Approach For Image Ontology Definition. In: 13ème Congrès de Reconnaissance des Formes et Intelligence Artificielle, Angers, France, pp. 1023–1031 (2002)
3. Maillot, N., Thonnat, M., Boucher, A.: Towards Ontology Based Cognitive Vision (Long Version). Machine Vision and Applications 16(1), 33–40 (2004)
4. Bombardier, V., Lhoste, P., Mazaud, C.: Modélisation et intégration de connaissances métier pour l'identification de défauts par règles linguistiques floues. Traitement du Signal 21(3), 227–247 (2004)
5. Hudelot, C.: Towards a Cognitive Vision Platform for Semantic Image Interpretation. Application to the Recognition of Biological Organisms. PhD thesis, Nice-Sophia Antipolis University (2005)
6. Town, C.: Ontological Inference for Image and Video Analysis. Mach. Vision Appl. 17(2), 94–115 (2006)
7. McGuinness, D., van Harmelen, F.: OWL Web Ontology Language. World Wide Web Consortium (W3C) (2004), http://www.w3.org/TR/owl-features/
8. Matsuyama, T.: Expert Systems for Image Processing: Knowledge-Based Composition of Image Analysis Processes. CVGIP 48(1), 22–49 (1989)
9. Renouf, A., Clouard, R., Revenu, M.: How to Formulate Image Processing Applications? In: Proceedings of the International Conference on Computer Vision Systems, Bielefeld, Germany (2007)
10. EcoGrid National Center for High Performance Computing, Taiwan: http://ecogrid.nchc.org.tw/
11. Nadarajan, G., Chen-Burger, Y.H., Malone, J.: Semantic-Based Workflow Composition for Video Processing in the Grid. In: IEEE/WIC/ACM International Conference on Web Intelligence, pp. 161–165 (2006)
12. Foster, I.: The Grid 2 – Blueprint for a New Computing Infrastructure, 2nd edn. Morgan Kaufmann, San Francisco (2004)
13. Liedtke, C., Blömer, A.: Architecture of the Knowledge Based Configuration System for Image Analysis "Conny". In: ICPR'92, pp. 375–378 (1992)
14. Clément, V., Thonnat, M.: A Knowledge-Based Approach to Integration of Image Procedures Processing. CVGIP: Image Understanding 57(2), 166–184 (1993)
15. Chien, S., Mortensen, H.: Automating Image Processing for Scientific Data Analysis of a large Image Database. IEEE PAMI 18(8), 854–859 (1996)
16. Clouard, R., Elmoataz, A., Porquet, C., Revenu, M.: Borg: A Knowledge-Based System for Automatic Generation of Image Processing Programs. IEEE PAMI 21(2), 128–144 (1999)
17. Draper, B., Hanson, A., Riseman, E.: Knowledge-directed vision: Control, learning, and integration. Proc. of IEEE 84, 1625–1681 (1996)
18. Bloehdorn, S., Petridis, K., Saathoff, C., Simou, N., Tzouvaras, V., Avrithis, Y., Handschuh, S., Kompatsiaris, Y., Staab, S., Strintzis, M.G.: Semantic Annotation of Images and Videos for Multimedia Analysis. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 592–607. Springer, Heidelberg (2005)
19. Renouf, A.: Hermès - a human-machine interface for the formulation of image processing applications, http://www.greyc.ensicaen.fr/~arenouf/Hermes
Rectified Reconstruction from Stereo Pairs and Robot Mapping
Antonio Javier Gallego, Rafael Molina, Patricia Compañ, and Carlos Villagrá
Grupo de Informática Industrial e Inteligencia Artificial, Universidad de Alicante, Ap.99, E-03080, Alicante, Spain
{ajgallego, rmolina, company, villagra}@dccia.ua.es
Abstract. The reconstruction and mapping of real scenes is a crucial element in several fields such as robot navigation. Stereo vision can be a powerful solution. However the perspective effect arises, as well as other problems, when the reconstruction is tackled using depth maps obtained from stereo images. A new approach is proposed to avoid the perspective effect, based on a geometrical rectification using the vanishing point of the image. It also uses sub-pixel precision to solve the lack of information for distant objects. Finally, the method is applied to map a whole scene, introducing a cubic filter.
1 Introduction
Nowadays, a central aspect of artificial intelligence research is the perception of the environment by artificial systems. It is considered one of the most important problems for effective autonomous robotic navigation and reconstruction [1]. Most research on robot mapping makes use of powerful computers and expensive sensors, such as scanning laser rangefinders. Nevertheless, other sensors, such as stereoscopic sensors, can be used. Stereo cameras are cheap and provide both range and appearance information. Several authors use stereo vision and disparity images to solve the 3D reconstruction or mapping problems. For instance, a first solution to three-dimensional reconstruction with stereo technology explores the possibility of composing several 3D views from the camera transforms to build the so-called "3D evidence grid" [2]. Other approaches infer 3D grids from stereo vision because range finders do not provide appearance information; hence, they add an additional camera to their mobile robots [3,4]. Moreover, a 3D recognition module could be added to identify some objects. This technique is not exclusive to robotics; it could also be used in other applications such as automatic machine guidance or the detection and estimation of vehicle movement [5]. Any image taken by a camera is deformed by the conical perspective effect, so direct reconstruction generates scenes with an unrealistic appearance. Very few works focus on creating a good reconstruction and on obtaining a real appearance of the scene. Some interesting works can be found [6,7], but none of them makes any type of perspective rectification. Specific objects are reconstructed instead of the whole scene, so the real structure of the environment is not recovered. Some naïve perspective rectifications have already been used in other fields, to rectify roads and to obtain their real appearance [8]. This work is centred on the reconstruction of the structure of the scene showing its real aspect, using the information provided by the stereo images and the disparity maps (in fact depth maps, their duals, are used). Our proposal makes no assumptions about the scene or the object structure, it does not segment objects trying to identify known shapes, only a stereo pair is needed, and it is not correspondence-dependent. The perspective rectification method eliminates the conical perspective of the scene and removes the camera orientation. In this way the algorithm recovers the structure of the scene and some crucial information such as object geometry, volume and depth.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 141–148, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Process Scheme
The reconstruction and mapping methods follow the process scheme shown in figure 1. The upper part of the figure presents the reconstruction method (section 3). Given a pair of stereo images ($LI_k$ and $RI_k$, where $k$ is an index indicating the current image within a sequence), a depth map $D_k$ and the vanishing point position $VP_k$ are calculated. Then, a rectification process is applied in order to remove the effect of the conical perspective. The result of this stage is a rectified 3D matrix of voxels $R_k$ representing the reconstructed scene. The lower part of the figure represents the mapping method (section 4). Every reconstruction $R_k$ is accumulated to finally obtain a whole mapping representation of the environment $M$. The camera position is needed to carry out this final step.
Fig. 1. Structure of both 3D reconstruction and mapping
3 3D Reconstruction Method
3.1 Perspective Rectification Using a Vanishing Point
Let $LI$ and $RI$ (the $k$ subindex is removed for simplicity) be a pair of stereo images (left and right image) of $m \times n$ pixels, and $D$ a dense depth map obtained by any method (in this approach, the depth image is computed using multi-resolution and energy function [9]). Each value $D(x, y)$ of the depth map contains the depth associated to the pixel $(x, y)$ of the reference image (left image) [10,11,12,13]. The reconstruction of the scene can be done by labelling a 3D matrix of voxels $R$ representing the spatial occupation. However, due to the conical perspective, it would produce an unwanted effect in the reconstruction (figure 2(a) and 2(b)). So, to correctly perform the 3D reconstruction the perspective must be rectified by making a correction to the pixel's coordinates. This way the obtained result will show the same aspect as the real scene (figure 2(c)).
Fig. 2. Rectification scheme: (a) shows the effect of the conical perspective, (b) shows a non-rectified scene seen from above (with only x and z coordinates shown), and (c) shows the result which is desired after rectification (with parallel walls). This image also illustrates the compass angle of the camera in the case of a lateral view.
Figure 2(b) shows the scheme of the process, in which the point $Q$ (the current pixel being processed, obtained from the input depth map) with coordinates $(x, y, D(x, y))$ is rectified to obtain $Q'$. The position of the scene vanishing point $VP$ is calculated using the method proposed in [14]. It uses a Bayesian model which combines knowledge of the 3D geometry of the world with statistical knowledge of edges in images. The method returns an angle (called $\Psi$) which defines the orientation of the camera. This angle is transformed to Cartesian coordinates to obtain the position $(x, y)$ of the scene vanishing point $VP$. For the depth of point $VP$, the maximum depth value ($D_{max}$) of the whole depth map is used. Once $VP$ is obtained, a line $L$ is traced through $VP$ and $Q$. Next, the intersection of the line $L$ with the x-y plane is calculated, obtaining in this way the point $P$. Starting from $P$, the line $L$ is rotated to be perpendicular to the x-y plane. After this process, the new point $Q' = (x', y', z')$ is calculated as follows:

$$
\begin{cases}
x_{Q'} = x_{VP} + z_{VP}\,\dfrac{x_{VP} - x_Q}{z_Q - z_{VP}} \\
y_{Q'} = y_{VP} + z_{VP}\,\dfrac{y_{VP} - y_Q}{z_Q - z_{VP}} \\
z_{Q'} = z_Q
\end{cases}
\tag{1}
$$
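Equation (1) can be checked numerically with a short sketch. The function name `rectify` and the sample coordinates are hypothetical; the code only evaluates the formula as written and raises an error when $z_Q = z_{VP}$, a case the formula leaves undefined.

```python
def rectify(q, vp):
    """Evaluate equation (1): move Q to the foot P of the line VP-Q on the
    x-y plane (z = 0) and keep the original depth of Q."""
    xq, yq, zq = q
    xv, yv, zv = vp
    if zq == zv:
        raise ValueError("equation (1) is undefined when Q is at the vanishing-point depth")
    scale = zv / (zq - zv)
    return (xv + scale * (xv - xq), yv + scale * (yv - yq), zq)

# Hypothetical vanishing point (x, y, D_max) and a wall point on the x-y plane.
vp = (160.0, 120.0, 255.0)
wall_near = (40.0, 120.0, 0.0)
# A second point halfway along the same line towards VP, i.e. on the same converging wall.
wall_far = tuple(a + 0.5 * (b - a) for a, b in zip(wall_near, vp))

near_rectified = rectify(wall_near, vp)
far_rectified = rectify(wall_far, vp)
```

Both points end up with the same $x'$, which is the behaviour the rectification aims for: a wall that converges towards the vanishing point in the image becomes parallel to the depth axis.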
3.2 3D Reconstruction Using Sub-pixel Precision
Once the depth map ($D$) and the vanishing point ($VP$) are calculated (they can be obtained in parallel), the reconstruction process is performed, including the perspective rectification (section 3.1). The reconstruction method returns a 3D matrix ($R$) containing the space occupation of the final result. $R$ is initialized to zero and then filled as follows: $R(x', y', D(x, y)) = 1$, where $(x', y')$ are the rectified coordinates of $(x, y)$, $\forall x, y \in \mathbb{R} / \{0 \le x \le m - 1,\ 0 \le y \le n - 1\}$. The final step is obtaining the real units. This is a direct calculation if the camera parameters are known. The most important drawback is that when the perspective rectification corrects the pixels' coordinates, the voxels are separated in the 3D representation (figure 3(b)). This is due to the discreteness of depth maps. In fact, pixels corresponding to a distant object are split, leaving a hole whose dimensions increase as the distance to the object increases. To minimize these problems a sub-pixel precision technique is proposed to calculate the position of n fictitious pixels between two consecutive pixels. The precision used for the reconstruction is calculated using the equation $1 - (z/D_{max})$, which returns the minimum value when the pixel is in the foreground and the maximum one when it is in the background. All the steps of the sub-pixel reconstruction method can be summarized as follows:
1. D := CalculateDepthMap(LI, RI)
2. VP := CalculateVanishingPoint(LI)
3. while ( x ≤ m − 1 )
   (a) while ( y ≤ n − 1 )
       i. (x', y') := Rectify(x, y, VP)
       ii. R(x', y', D(x, y)) = 1
       iii. y = y + 1 − (D(x, y)/Dmax)
   (b) x = x + 1 − (D(x, y)/Dmax)
4. Display(ObtainRealUnits(R))
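The loop in steps 3(a) and 3(b) can be sketched in Python under several assumptions that the text does not state: larger values in the depth map mean farther points, fractional coordinates read the depth of the pixel they fall in, the step is floored at a hypothetical `min_step` so that the loop terminates when $D(x, y) = D_{max}$, and the outer step uses the smallest step seen in the current column. The depth map and vanishing point are taken as given, and the voxel matrix is represented as a set of occupied cells.

```python
import math

def rectify(x, y, z, vp):
    """Rectified (x', y') of equation (1) for a point at depth z."""
    xv, yv, zv = vp
    scale = zv / (z - zv)
    return xv + scale * (xv - x), yv + scale * (yv - y)

def reconstruct(depth, vp, min_step=0.25):
    """Label occupied voxels, sampling fictitious pixels between real ones."""
    m, n = len(depth), len(depth[0])
    d_max = vp[2]
    occupied = set()
    x = 0.0
    while x <= m - 1:
        y = 0.0
        x_step = 1.0
        while y <= n - 1:
            z = depth[int(x)][int(y)]             # depth of the enclosing pixel
            step = max(min_step, 1.0 - z / d_max)  # finer sampling for distant points
            x_step = min(x_step, step)
            if 0 < z < d_max:                      # skip null pixels and the VP depth
                xr, yr = rectify(x, y, z, vp)
                occupied.add((math.floor(xr + 0.5), math.floor(yr + 0.5), z))
            y += step
        x += x_step
    return occupied

# A flat 4 x 4 surface at depth 100 with a hypothetical vanishing point.
flat = [[100] * 4 for _ in range(4)]
vp = (1.5, 1.5, 200.0)
coarse = reconstruct(flat, vp, min_step=1.0)  # whole-pixel steps only
fine = reconstruct(flat, vp)                  # sub-pixel steps of 0.5 here
```

With whole-pixel steps the rectified surface contains gaps between neighbouring voxels; the sub-pixel pass labels the cells in between, which is the hole-filling effect described above.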
Fig. 3. Example of reconstruction using the depth map of figure (a). Figure (b) shows the reconstruction without using sub-pixel precision, in which the voxels are separated due to the perspective rectification. In figure (c) the sub-pixel precision has been used. The result has a more realistic appearance because holes have been filled.
4 Mapping Algorithm
The mapping algorithm is an application of the reconstruction method. It demonstrates the utility of the perspective rectification and the advantages of its application to this kind of problem. A significant example is shown in figure 4, where (a) shows two consecutive images of a corridor (there is also a column) and the result of their intersection. Since the scene is not rectified, the columns and walls do not coincide in the intersection. The rectification of the images in (a) is presented in (b). In this case, the intersection coincides perfectly, so the mapping algorithm is significantly improved.
Fig. 4. Mapping comparison with and without rectification
In order to do the 3D mapping of the scene, $N$ stereo pairs $(LI_0, RI_0), (LI_1, RI_1), \ldots, (LI_{N-1}, RI_{N-1})$ of the environment are taken. Each of these images is captured at a fixed distance. Once a stereo pair $(LI_k, RI_k)$ is obtained, its corresponding depth map $D_k$ is calculated and added to the $\Sigma$ list which stores all the depth maps. Next, the algorithm of perspective rectification is used in order to compute the rectified matrix $R_k$ of each depth map. For each matrix $R_k$ its intersection with the previous matrix is calculated ($R_{k-1} \cap R_k$), and its result is added to the main matrix $M_{map}$ which represents the mapping of the scene. In this approach the position of the frames is obtained from robot odometry. The system only needs the relative position of the next frame to do the reconstruction from the sequence of images. In order to reduce the effect of possible odometry errors the algorithm uses a cubic filter. This filter $F$ (explained below) is applied to the whole matrix $M_{map}$; it discretizes the three-dimensional matrix and transforms it into a grid of rectangular cubes. Lastly, the result ($M_{map}$) is represented according to the space occupation of this matrix, calculating its equivalence in real units (metres). The cubic filter $F$ applies the equation $g(x, y, z) := \sum_{(i,j,k) \in S} f(i, j, k)$ to each cube of the matrix, where $S$ represents the set of point coordinates which are located in the neighbourhood of $g(x, y, z)$, including the point in question. In this way the space occupation of each cube is in the centre, and each cell contains the set of readings of that portion of the space. The use of these cells instead of a unique sample lets the system avoid possible odometry errors. The number of readings is referred to as "votes", and represents the probability of space occupation.
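A minimal sketch of the cubic filter, assuming that the accumulated map is available as a list of occupied voxel readings and that the neighbourhood $S$ of a cube is the block of `cube_size` voxels per side that contains the reading. The function name `cubic_filter` and the parameter `min_votes` are hypothetical; the defaults follow the 3 × 3 × 3 filter and 5 votes reported in the experiments.

```python
from collections import Counter

def cubic_filter(readings, cube_size=3, min_votes=5):
    """Sum the readings that fall in each cube and keep cubes with enough votes."""
    votes = Counter((x // cube_size, y // cube_size, z // cube_size)
                    for x, y, z in readings)
    return {cube: count for cube, count in votes.items() if count >= min_votes}

# Six readings of the same wall portion, slightly displaced (e.g. by odometry
# error), plus one isolated noisy reading far away.
readings = [(0, 0, 0), (1, 0, 0), (2, 1, 1), (0, 2, 2), (1, 1, 1), (2, 2, 2),
            (10, 10, 10)]
occupied_cubes = cubic_filter(readings)
```

The displaced readings still vote for the same cube, while the isolated reading does not reach the vote threshold and is dropped.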
5 Experimentation and Results
In this section the experimental results are shown. Figure 5 shows a reconstruction comparison using a synthetic disparity map (a) which simulates a corridor. This example clearly shows the effect of the perspective rectification. Figure (b) shows the segmentation used to calculate the vanishing point, which is estimated to be −4°. In figures (c), (d) and (e) the rectification effect is compared: (c) shows a non-rectified reconstruction (seen from above), and (d) and (e) show a top and an oblique view of the correct result after the rectification. As can be seen, the walls are perfectly rectified, becoming parallel as expected.
Fig. 5. Perspective rectification comparison using a corridor depth map
For the mapping experiments, two sequences of 30 images obtained from two different corridors have been used. All the images were taken using the Pioneer P3-AT robot and the Bumblebee HICOL-60 stereo camera (figure 6(a)). Figure 6(b) shows four images of one of these sequences as well as their depth maps. The main objective is that the walls, floor and roof appear without slope in the reconstruction, and that there should not be any obstacle (noise) in the corridor. It is also important that the columns (represented by circles in the plan (c)) and the coffee machine (represented by a rectangle in the plan (d)) are detected correctly. Figures (c) and (d) show the mapping results of the two sequences and their respective corridor maps. For these examples a cubic filter size of 3 × 3 × 3 and a number of votes of 5 have been used. These results show a good definition of the corridors because the walls are limited and the in-between area can be seen. Moreover, the columns and the coffee machine can be distinguished on the right-hand side of each of the results.
Fig. 6. (a) Pioneer Robot P3-AT with stereo camera. (b) Sequence of images for the mapping. (c,d) Mapping results of the corridors.
To conduct the experiments, a Pentium IV 3.20 GHz with 2 GB of RAM and a 512 MB graphics card have been used. The reconstruction of the maps has been made using a 320 × 240 × 256 voxel matrix and depth maps with a size of 320 × 240 pixels. Moreover, it is important to note that only non-null pixels (finite depth) in the depth map are processed. The computational cost depends linearly on the size of the input images and on the precision of the reconstruction. The algorithm therefore performs well: processing one sequence of 30 images (each image has a level of 70% of processed data) takes approximately 9 seconds. This time depends on the precision of the final 3D reconstruction. So, the process time of an individual reconstruction is less than 0.3 seconds.
6 Conclusions
A new method to reconstruct 3D scenes from stereo images has been presented, as well as an algorithm for environment mapping. These methods use geometrical rectification to eliminate the effect of conical perspective. Because the vanishing point position is calculated, rectification can remove the camera orientation and obtain a front view of the scene. It is important to notice that the final quality of the reconstructed image depends on the quality of the depth map. In future experiments, better depth images will probably improve the final result. Moreover, other reconstruction algorithms with different geometric primitives will be implemented in order to compare the results. The results show how this process corrects the perspective effect and how it helps to improve the matching in the mapping algorithm. An advantage of this method is that it is not correspondence-dependent. In addition, it could probably be used for real-time applications due to its low computational burden and good performance. The cubic filter is very useful for solving odometry problems in the mapping. This is a statistical approach used to compute the matrix of the occupancy evidence. The mapping algorithm will be improved in future experiments in order to consider moving objects, robot drift and other kinds of problems. The main interest of the method is that some crucial information from the scene, such as object geometry, volume and depth, is retrieved. For these reasons the proposed methods are useful in a wide range of applications, such as Augmented Reality (AR) and autonomous robot navigation. As future work we want to use the obtained results in these types of applications.
Acknowledgments. This work has been done with the support of the Spanish Generalitat Valenciana, Project GV06/158.
References
1. Pérez, J., Castellanos, J., Montiel, J., Neira, J., Tardós, J.: Continuous mobile robot localization: Vision vs. laser. ICRA (1999)
2. Moravec, H.P.: Robot spatial perception by stereoscopic vision and 3D evidence grids. The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (1996)
3. Se, S., Lowe, D., Little, J.: Vision-based mobile robot localization and mapping using scale-invariant features. ICRA (2001)
4. Martin, C., Thrun, S.: Real-time acquisition of compact volumetric maps with mobile robots. ICRA (2002)
5. Martinsanz, G.P., de la Cruz García, J.M.: Visión por computador: imágenes digitales y aplicaciones. Ra-Ma, D.L. (ed.), Madrid (2001)
6. Vogiatzis, G., Torr, P.H.S., Cipolla, R.: Multi-view stereo via Volumetric Graph-cuts. CVPR, pp. 391–398 (2005)
7. Sinha, S., Pollefeys, M.: Multi-view Reconstruction using Photo-consistency and Exact Silhouette Constraints: A Maximum-Flow Formulation. ICCV (2005)
8. Broggi, A.: Robust Real-Time Lane and Road Detection in Critical Shadow Conditions. In: Proceedings IEEE International Symposium on Computer Vision, Coral Gables, Florida. IEEE Computer Society, Los Alamitos, CA, USA (1995)
9. Compañ, P., Satorre, R., Rizo, R.: Disparity estimation in stereoscopic vision by simulated annealing. In: Artificial Intelligence research and development, pp. 160–167. IOS Press, Amsterdam, Trento, Italy (2003)
10. Trucco, E., Verri, A.: Introductory techniques for 3-D Computer Vision. Prentice-Hall, Englewood Cliffs (1998)
11. Cox, I., Ignoran, S., Rao, S.: A maximum likelihood stereo algorithm. Computer Vision and Image Understanding 63 (1996)
12. Faugeras, O.: Three-dimensional computer vision: a geometric viewpoint. The MIT Press, Cambridge, Massachusetts (1993)
13. Gallego Sánchez, A.J., Molina Carmona, R., Villagrá Arnedo, C.: Three-Dimensional Mapping from Stereo Images with Geometrical Rectification. In: Perales, F.J., Fisher, R.B. (eds.) AMDO 2006. LNCS, vol. 4069, pp. 213–222. Springer, Heidelberg (2006)
14. Coughlan, J., Yuille, A.L.: Manhattan World: Orientation and Outlier Detection by Bayesian Inference. Neural Computation 15(5), 1063–1088 (2003)
Estimation Track–Before–Detect Motion Capture Systems State Space Spatial Component Przemyslaw Mazurek Szczecin University of Technology, Chair of Signal Processing and Multimedia Engineering, 26. Kwietnia 10, 71126 Szczecin, Poland {przemyslaw.mazurek}@ps.pl http://www.media.ps.pl
Abstract. In the paper spatial component estimation for Track–Before–Detect (TBD) based motion capture systems is presented. Using the Likelihood Ratio TBD algorithm it is possible to track markers at a low Signal–to–Noise Ratio level, which is a typical case in motion capture systems. Three kinds of TBD systems are analyzed and compared: full frame processing, single camera optimized and multiple cameras optimized. In the article separate TBD processing for every camera is assumed.
Keywords: Estimation, Track–Before–Detect, Motion Capture.
1 Introduction
Motion capture systems are very important in 3D motion acquisition and are used in computer graphics animation (movies, games), biomechanics (sport, medical), ergonomics and many other applications. There are several kinds of motion capture methods; among them, marker-based optical tracking systems are the most popular due to their high reliability. A marker is a small ball reflecting infrared or visible light, or an emitting device such as an LED. Using a set of internally and externally calibrated cameras, markers can be detected and tracked, and after triangulation the 3D positions of all markers can be obtained. Due to acquisition constraints (camera properties, marker properties and the geometry of the scene) markers are usually quite large (e.g. table tennis balls) and only a small number of them is used. Increasing the number of small markers is important for improving the motion data in applications. Small markers are hard to detect and there are some human-related constraints: it is impossible to use high energy LEDs because of the safety of the actor's eyes. Solving this problem is hard when using classical tracking systems based on acquisition, detection, tracking and assignment processes performed separately. The small energy obtained from markers limits the possibilities of using threshold detection based techniques. Using another tracking scheme (Track–Before–Detect), based on tracking first and detection afterwards, it is possible to track markers even if the Signal–to–Noise Ratio (SNR) is near or smaller than one.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 149–156, 2007. © Springer-Verlag Berlin Heidelberg 2007
The most popular TBD algorithms are Spatio–Temporal Filters [1], Likelihood Ratio TBD [2] and Particle Filters [4,3]. Current author’s work [6] shows some methods for using Likelihood Ratio TBD for motion capture system with camera state space tracking.
2 Geometry of Circular Motion Capture System
The typical arrangement of motion capture cameras is a circular placement. A performance area located in the center of the scene can be used by the actor, and maximal marker visibility is obtained there. Camera internal and external parameters define how large this area is and define the 3D spatial resolution. Additionally, the room's minimal size can be calculated.
Fig. 1. Geometry of scene (top and side views). Performance area is shaded. C – center of scene; $R_{eff}$ – effective circle radius that guarantees maximal visibility of markers by all cameras; $R_c$ – cameras' circle radius; $h$ – height of performance area.
A camera's basic internal parameters can be defined by the Horizontal Field of View ($HFOV$), the Vertical Field of View ($VFOV$), and the horizontal and vertical pixel resolution. Wide FOV cameras give an attractive $R_{eff}$, but spatial resolution is reduced [5]. Preferably, cameras should be placed at a large distance and a narrow FOV should be used for optimal spatial resolution (almost constant in the performance area). This is one of the reasons why large rooms are used. Alternatively, the number of cameras can be increased, accepting that a marker may be visible to fewer of them. Two effective radii can be found in the system, but only the smaller one is valid; it describes the reduction of the effective area:

$$R_{eff\,TOP} = R_c \sin\frac{HFOV}{2} \tag{1}$$

$$R_{eff\,SIDE} = R_c - \frac{h}{2\tan\frac{VFOV}{2}} \tag{2}$$

$$R_{eff} = \min\{R_{eff\,TOP},\ R_{eff\,SIDE}\} \tag{3}$$
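Equations (1)–(3) can be evaluated directly. The following sketch uses hypothetical values (a camera circle of 5 m, a 60° by 45° field of view and a 2.5 m high performance area) purely for illustration; the function name `effective_radius` is not from the paper.

```python
import math

def effective_radius(r_c, hfov_deg, vfov_deg, h):
    """Return (R_eff_TOP, R_eff_SIDE, R_eff) from equations (1)-(3)."""
    top = r_c * math.sin(math.radians(hfov_deg) / 2)
    side = r_c - h / (2 * math.tan(math.radians(vfov_deg) / 2))
    return top, side, min(top, side)

# Hypothetical setup: lengths in metres, angles in degrees.
r_top, r_side, r_eff = effective_radius(r_c=5.0, hfov_deg=60.0, vfov_deg=45.0, h=2.5)
```

For these values the side view is the limiting case ($R_{eff\,SIDE} \approx 1.98$ m against $R_{eff\,TOP} = 2.5$ m), so the vertical field of view determines the usable performance area.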
Fig. 2. Two cases of performance area’s reduction (top and side views)
It should be noticed that the performance area's height $h$ is not equal to the actor's height, because if the actor raises his hands or jumps, some markers will be located higher than the actor's height. As the number of cameras increases, the performance area approaches a cylinder. Strictly, it is a cylinder with an additional cone on top if the room height imposes no limit, but for simplification only the cylinder is assumed (the cone is omitted).
3 Track–Before–Detect Algorithm for Motion Capture Systems
Detection in classical tracking systems is simple if the SNR is high enough, and most computations are related to tracking, assignment and triangulation only. In TBD systems the algorithm first tries to track markers in the state space without knowing whether a marker is located in a given state space cell or not. This process is computationally demanding. The Likelihood Ratio TBD algorithm, described in detail in the book [2], uses a recurrent method to reduce computations, and every step has a constant cost. The pseudoalgorithm can be written in the following form:
Start
// Initial Likelihood Ratio:
$$\Lambda(t_0, s) = \frac{p(k = 0, s)}{p(k = 0, \phi)} \quad \text{for } s \in S \tag{4}$$
For $k \ge 1$ and $s \in S$
// Motion Update:
$$\Lambda^{-}(k, s) = q_k(s|\phi) + \int_S q_k(s|s_{k-1})\,\Lambda(k-1, s_{k-1})\,ds_{k-1} \tag{5}$$
// Information Update:
$$\Lambda(k, s) = L(y_k|s)\,\Lambda^{-}(k, s) \tag{6}$$
EndFor
Stop

The most significant cost of the algorithm is the integral of the motion update. A Markov matrix describes the motion vectors, and this matrix is quite large if the assumed dynamics of the marker are complex. This is a serious limitation for currently available computers because the measurement likelihood is a matrix of camera frame size. Assuming a 1 megapixel camera, a 50 fps rate and only 100 motion vectors, there are 5 giga basic MAC (Multiply and Accumulate) operations per second, which is close to the limits of currently available PC computers (and this excludes assignment, deghosting, information update and acquisition costs). Multiplying this result by the number of cameras (e.g. 16), the roughly estimated cost is about 30–100 computers for a real–time motion capture system. The system maintenance cost also increases. The proposed cost reduction solution [6] is briefly described in this paper, and the cost reduction of two systems in comparison to full frame processing is derived.
3.1 TBD System Without Optimization (Full Frame Processing)
This is the most trivial system because the whole state space is processed, but it is important for showing how other knowledge-based methods can improve the overall system. The general cost formula, if the motion vectors do not depend on spatial position, is:
$$C_{tot} = f(C_J, C_V) \approx C_J C_V \tag{7}$$
where $C_J$ is the spatially related cost and $C_V$ is the velocity related cost for a fixed grid state space. If $X_r$ and $Y_r$ define the camera pixel resolution, then for $N$ cameras the cost is:
$$C_{tot} \sim N X_r Y_r C_V \tag{8}$$
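As a rough illustration of the recursion and of the full frame cost, the following Python sketch runs one motion update and one information update on a tiny one-dimensional position grid and counts the multiply-accumulate slots. The grid, the motion weights, the birth term and the likelihood values are illustrative assumptions, and the function names are hypothetical; this is not the implementation of [2] or [6].

```python
def motion_update(prev, motion, birth):
    """Discrete motion update: predicted[x] = birth + sum_v w_v * prev[x - v].

    prev   -- likelihood ratio per grid cell from the previous step
    motion -- list of (shift, weight) pairs, one per motion vector (assumed model)
    birth  -- constant birth term standing in for q_k(s | phi) (assumption)
    """
    n = len(prev)
    predicted, macs = [], 0
    for x in range(n):
        acc = birth
        for shift, weight in motion:
            src = x - shift
            if 0 <= src < n:           # sources outside the grid contribute zero
                acc += weight * prev[src]
            macs += 1                  # one MAC slot per cell and motion vector
        predicted.append(acc)
    return predicted, macs


def information_update(predicted, likelihood):
    """Information update: multiply each cell by its measurement likelihood."""
    return [l * p for l, p in zip(likelihood, predicted)]


def full_frame_macs_per_second(n_cameras, xr, yr, cv, fps):
    """Full frame cost N * Xr * Yr * CV, scaled by the frame rate."""
    return n_cameras * xr * yr * cv * fps


prev = [1.0] * 5
motion = [(0, 0.5), (1, 0.5)]          # two motion vectors, CV = 2
predicted, macs = motion_update(prev, motion, birth=0.0)
updated = information_update(predicted, [1.0, 2.0, 1.0, 1.0, 1.0])
one_camera = full_frame_macs_per_second(1, 1000, 1000, 100, 50)
```

With the figures quoted earlier (1 megapixel, 100 motion vectors, 50 fps) the last line reproduces the 5 giga MAC per second estimate for a single camera.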
Estimation TBD Motion Capture Systems State Space Spatial Component
Decreasing the number of cameras or the resolution is not desirable. Optimization of $C_V$ is an interesting research area but it is outside the scope of this paper. Using only knowledge about the importance of frame areas can also reduce the cost.

3.2 TBD System with Separate Optimization
This system uses additional knowledge related to the spatial part of the TBD state space. The idea of this algorithm, examples and requirements are considered in the author's paper [6] and are only sketched briefly here. In typical motion capture systems everything is black: the floor, the walls and the actors' clothes (black in visible light or infrared). The only bright elements are the markers. Using background estimation, the actor can be extracted from the acquired image if there is a difference between the background and the actor. Only pixels occupied by the actor, and the related important areas of the state space, should be considered for tracking by the TBD algorithm.

The worst case will be analyzed. An actor can be modeled by a cylinder or an ellipsoid, so that the side view contains the same number of pixels independently of the actor's orientation. The simplest model of the actor's volume is a cylinder. The number of pixels changes with the pose, but there is an upper limit that can be estimated by measurements and described by the actor's height and diameter. If the actor has hands up, the vertical size increases but the horizontal one decreases. If the actor is in e.g. a da Vinci pose, the vertical and horizontal sizes both increase, but the area (number of pixels) is always limited. Due to the processing of the actor's mask and noise, the cost will be slightly overestimated. The number of pixels in the camera is:
$$P_r = X_r Y_r \tag{9}$$
$$X_a = X_r \frac{\mathit{AHFOV}}{\mathit{HFOV}}, \qquad Y_a = Y_r \frac{\mathit{AVFOV}}{\mathit{VFOV}} \tag{10}$$
The number of the actor's pixels is:
$$P_a = X_a Y_a = X_r Y_r \frac{\mathit{AVFOV} \cdot \mathit{AHFOV}}{\mathit{VFOV} \cdot \mathit{HFOV}} \tag{11}$$
where approximately:
$$\mathit{AVFOV} = 2 \arctan \frac{h_a}{2 (R_c - R_{eff})}, \qquad \mathit{AHFOV} = 2 \arctan \frac{d_a}{2 (R_c - R_{eff})} \tag{12}$$
Independently of orientation and pose (upper boundary), the number of pixels is
$$P_a = X_r Y_r \frac{\mathit{AVFOV} \cdot \mathit{AHFOV}}{\mathit{VFOV} \cdot \mathit{HFOV}} \tag{13}$$
The maximal angular sizes of the actor, $\mathit{AVFOV}_{max}$ and $\mathit{AHFOV}_{max}$, are achieved if the actor to camera distance is $R_c - R_{eff}$.
$$C_{tot} \sim N X_r Y_r \frac{\mathit{AVFOV}_{max} \cdot \mathit{AHFOV}_{max}}{\mathit{VFOV} \cdot \mathit{HFOV}} C_V \tag{14}$$
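The worst-case fraction of the frame occupied by the actor can be sketched numerically as follows. The camera fields of view and the actor dimensions are taken from the example of Fig. 4; the values of the camera circle radius and the effective radius, the clamping of the angular size to the camera field of view, and the function names are assumptions made only for this illustration.

```python
import math


def angular_size_deg(size, distance):
    """Angular size 2 * arctan(size / (2 * distance)), in degrees."""
    return math.degrees(2.0 * math.atan(size / (2.0 * distance)))


def single_camera_cost_fraction(ha, da, rc, reff, vfov_deg, hfov_deg):
    """Fraction of the full frame cost: AVFOVmax * AHFOVmax / (VFOV * HFOV)."""
    distance = rc - reff               # closest actor-to-camera distance
    if distance <= 0:
        raise ValueError("rc must be larger than reff")
    # Assumption: the actor cannot occupy more than the camera field of view.
    avfov = min(angular_size_deg(ha, distance), vfov_deg)
    ahfov = min(angular_size_deg(da, distance), hfov_deg)
    return (avfov * ahfov) / (vfov_deg * hfov_deg)


# ha = 25, da = 5, VFOV = 60, HFOV = 80 as in Fig. 4; rc and reff are hypothetical.
fraction = single_camera_cost_fraction(ha=25, da=5, rc=100, reff=50,
                                       vfov_deg=60, hfov_deg=80)
```

For these hypothetical radii the fraction comes out at roughly 3% of the full frame cost.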
For example, if the actor occupies at most 20% of the scene, the TBD cost reduces to 20% of the maximal cost. This relation is linear. For the worst case it should be assumed that this maximal occupation is fixed for all cameras independently of the actor's position. Of course, if the actor is angularly smaller due to a larger distance to the camera, fewer computations are required by the system, but the amount of computation power is fixed by the worst case.

3.3 TBD System with Multiple Camera Optimization
If the position of the actor on the stage is available, it is possible to reduce the amount of computations. Considering two oppositely placed cameras at a distance of $2R_c$ and an actor moving from one to the other, the cumulative size of the actor (acquired from the two cameras) changes but is lower than in the previous method. The field of view occupied by the actor is position dependent, so instead of the fixed coefficients $\mathit{AVFOV}_{max}$ and $\mathit{AHFOV}_{max}$, functions dependent on the actor's position $(x_a, y_a)$ should be used.
$$C_{tot} \sim X_r Y_r C_V \sum_{i=1}^{N} \frac{\mathit{AVFOV}_i(x_a, y_a)}{\mathit{VFOV}} \cdot \frac{\mathit{AHFOV}_i(x_a, y_a)}{\mathit{HFOV}} \tag{15}$$
If $R_c$ is the largest distance and $\mathit{VFOV}$ and $\mathit{HFOV}$ are not extremely large, an actor moving from one camera towards the opposite one does not significantly change its angular size in the view of a camera placed perpendicularly to the actor's motion. The largest values of $\mathit{AVFOV}_i(x_a, y_a)$ and $\mathit{AHFOV}_i(x_a, y_a)$ are located on the $R_c$ circle.
Fig. 3. Top view of performance area. A - actor position; C - camera position (labels in the figure: Reff, A, O, Rc, C)
The distance between the camera and actor is:
$$d^{i}_{AC}(\phi_i) = \sqrt{R_{eff}^2 + R_c^2 - 2 R_{eff} R_c \cos(\phi_i)} \tag{16}$$
For the vertical direction:
$$\mathit{AVFOV}_i = 2 \arctan \frac{h_a}{2\, d^{i}_{AC}(\phi_i)} \tag{17}$$
where $i = 0..(N - 1)$ is the camera index.
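Because the cameras are equally spaced on the $R_c$ circle:
$$\phi_i = \alpha + \frac{2\pi i}{N} \tag{18}$$
where $\alpha$ is the phase that includes the basis axis. If this axis is on the OC line, then $\alpha = 0$ for camera number $i = 0$.
$$\mathit{AVFOV}_{totmax} = \sum_{i=0}^{N-1} 2 \arctan \left( \frac{h_a}{2 \sqrt{R_{eff}^2 + R_c^2 - 2 R_{eff} R_c \cos \frac{2\pi i}{N}}} \right) \tag{19}$$
The positions of the maxima of this function are:
$$\arg\max \{\mathit{AVFOV}_{tot}\} = \frac{2\pi i}{N} \tag{20}$$
For $\mathit{AHFOV}_{totmax}$ a similar formula can be derived:
$$\mathit{AHFOV}_{totmax} = \sum_{i=0}^{N-1} 2 \arctan \left( \frac{d_a}{2 \sqrt{R_{eff}^2 + R_c^2 - 2 R_{eff} R_c \cos \frac{2\pi i}{N}}} \right) \tag{21}$$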
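The summed angular sizes can be evaluated directly, as in the following Python sketch. The law-of-cosines distance and the per-camera sum follow the formulas above; the numerical radii and the function names are hypothetical and chosen only for illustration.

```python
import math


def camera_distance(reff, rc, phi):
    """Actor-to-camera distance from the law of cosines."""
    return math.sqrt(reff ** 2 + rc ** 2 - 2.0 * reff * rc * math.cos(phi))


def total_angular_size(size, reff, rc, n_cameras, alpha=0.0):
    """Sum over equally spaced cameras of 2 * arctan(size / (2 * d_i)), in radians."""
    total = 0.0
    for i in range(n_cameras):
        phi = alpha + 2.0 * math.pi * i / n_cameras
        total += 2.0 * math.atan(size / (2.0 * camera_distance(reff, rc, phi)))
    return total


# Hypothetical geometry: actor height 25, actor on the Reff = 50 circle, Rc = 100.
n = 4
total_v = total_angular_size(25, reff=50, rc=100, n_cameras=n)
mean_v = total_v / n                                   # mean actor FOV per camera
worst_single = 2.0 * math.atan(25 / (2.0 * (100 - 50)))  # closest camera only
```

In this example the mean angular size per camera is noticeably smaller than the worst-case single camera value, which is the effect the multiple camera optimization exploits.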
The cost of computation for the considered system is described by the formula:
$$C_{tot} \sim N X_r Y_r \frac{\mathit{AVFOV}_{totmax} \cdot \mathit{AHFOV}_{totmax}}{\mathit{VFOV} \cdot \mathit{HFOV}} C_V \tag{22}$$

4 Comparison Example
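Introducing the Mean of Actor FOV formulas as:
$$\mathit{MAVFOV}_{max} = \frac{1}{N} \sum_{i=0}^{N-1} \mathit{AVFOV}_i, \qquad \mathit{MAHFOV}_{max} = \frac{1}{N} \sum_{i=0}^{N-1} \mathit{AHFOV}_i \tag{23}$$
the total cost is:
$$C_{tot} \sim \frac{X_r Y_r C_V}{\mathit{VFOV} \cdot \mathit{HFOV}} \mathit{MAVFOV}_{max} \cdot \mathit{MAHFOV}_{max} \tag{24}$$
and because the first part of the formula is constant for all considered cases:
$$C_{tot} \sim \mathit{MAVFOV}_{max} \cdot \mathit{MAHFOV}_{max} \tag{25}$$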
The total cost relative to full frame processing is 5% for single camera optimization and about 1% for multiple camera optimization. The presented derivations of the cost formulas hold for a single actor; if there are more actors, the cost reduction is smaller. Only the first method (full frame processing) is independent of the number of actors. Assuming 5 actors and 100 computers for full frame processing, multiple camera optimization decreases the number of computers to 5 (about 1% of the full frame cost per actor).
[Figure 4, two plots: MAVFOVmax (angle) versus Reff and MAHFOVmax (angle) versus Reff. Curves in both plots: Reff limit; No optimisation and VFOV boundary; Single camera optimisation; Multiple optimisation (4xcam); Multiple optimisation (8xcam); Multiple optimisation (16xcam).]
Fig. 4. MAVFOV and MAHFOV costs for the analyzed example. VFOV = 60, HFOV = 80 - camera fields of view (4:3); ha = 25 - height of actor; da = 5 - diameter of actor; hp = 30 - height of performance area; Rc - camera circle radius; Reff - calculated effective radius.
5 Conclusions
This paper proposes a method for estimating the fixed computation cost of a motion capture system with circularly placed multiple cameras and a TBD algorithm. This is crucial for real-time systems, where the worst case should be known and an appropriate amount of computation power should be reserved by calculation rather than by tests. The presented example (which uses the derived formulas) shows a significant reduction of the computation cost, which is very important given the requirements of the considered TBD algorithm.

Acknowledgments. This work is supported by the MNiSW grant N514 004 32/0434 (Poland).
References
1. Barniv, Y.: Dynamic Programming Algorithm for Detecting Dim Moving Targets. In: Bar-Shalom, Y. (ed.): Multitarget-Multisensor Tracking. Artech House (1990)
2. Stone, L.D., Barlow, C.A., Corwin, T.L.: Bayesian Multiple Target Tracking. Artech House (1999)
3. Ristic, B., Arulampalam, S., Gordon, N.: Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House (2004)
4. Doucet, A., De Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2005)
5. Mazurek, P.: Optimization of multiple camera 3-D space locations for motion capture room environment. 28. Międzynarodowa Konferencja z Podstaw Elektrotechniki i Teorii Obwodow IC-SPETO'2005, Gliwice-Ustron, pp. 403-406 (2005)
6. Mazurek, P.: Likelihood Ratio Track-Before-Detect Algorithm for Motion Capture Systems, Advanced Computer Systems ACS'2007 Multi-Conference. Miedzyzdroje (submitted 2007)
Real-Time Active Shape Models for Segmentation of 3D Cardiac Ultrasound

Jøger Hansegård¹, Fredrik Orderud², and Stein I. Rabben³

¹ University of Oslo, Norway
² Norwegian University of Science and Technology, Norway
³ GE Vingmed Ultrasound, Norway

Abstract. We present a fully automatic real-time algorithm for robust and accurate left ventricular segmentation in three-dimensional (3D) cardiac ultrasound. Segmentation is performed in a sequential state estimation fashion using an extended Kalman filter to recursively predict and update the parameters of a 3D Active Shape Model (ASM) in real-time. The ASM was trained by tracing the left ventricle in 31 patients, and provided a compact and physiologically realistic shape space. The feasibility of the proposed algorithm was evaluated in 21 patients, and compared to manually verified segmentations from a custom-made semi-automatic segmentation algorithm. Successful segmentation was achieved in all cases. The limits of agreement (mean±1.96SD) for the point-to-surface distance were 2.2±1.1 mm. For volumes, the correlation coefficient was 0.95 and the limits of agreement were 3.4±20 ml. Real-time segmentation of 25 frames per second was achieved with a CPU load of 22%.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 157–164, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Left ventricular (LV) volumes and ejection fraction (EF) are among the most important parameters in diagnosis and prognosis of heart diseases. Recently, real-time three-dimensional (3D) echocardiography was introduced. Segmentation of the LV in 3D echocardiographic data has become feasible, but due to poor image quality, commercially available tools are based upon a semi-automatic approach [1,2]. Furthermore, most reported methods use iterative and computationally expensive fitting schemes. These factors make real-time segmentation in 3D cardiac ultrasound challenging.

Prior work by Blake et al. [3,4] and Jacob et al. [5,6] has shown that a state estimation approach is well suited for real-time segmentation in 2D imagery. They used a Kalman filter, which requires only a single iteration, to track the parameters of a trained deformable model based on principal component analysis (PCA), also known as Active Shape Models (ASMs) [7]. ASMs can be trained on manually traced LV contours, resulting in a sub-space of physiologically probable shapes, effectively exploiting expert knowledge of the LV anatomy and function. For segmentation of 3D cardiac data, Van Assen et al. [8] introduced the 3D ASM. However, there are to our knowledge no reports of real-time implementations of 3D ASMs.

Based on the work in [3,4,5,6], real-time LV segmentation of 3D cardiac ultrasound was recently introduced by Orderud [9]. He used an extended Kalman filter for robust tracking of a rigid ellipsoid LV model. This framework was later extended to use a flexible spline-based LV model coupled with a global pose transform to improve local segmentation accuracy [10]. However, expert knowledge of LV anatomy could not be modeled directly.

To utilize expert knowledge of LV anatomy during segmentation, we propose to use a 3D ASM for real-time segmentation of 3D echocardiograms, by extending the framework described in Orderud [9]. The 3D ASM, trained on LV shapes traced by an expert, gives a compact deformable model which is restricted to physiologically realistic shapes. This model is fitted to the target data in real time using a Kalman filter. The feasibility of the algorithm is demonstrated in 21 patients, where we achieve real-time segmentation of the LV shape, and instantaneous measurements of LV volumes and EF.
2 Shape Model

A set of 496 triangulated LV training meshes was obtained from 31 patients using a custom-made segmentation tool (GE Vingmed Ultrasound, Norway). The training tool provides manual editing capabilities. When necessary, the user manually edited the segmentation to make it equivalent to manual tracing. Building the ASM requires pair-wise point correspondence between shapes from different patients [7,8]. We developed a reparametrization algorithm for converting triangulated LV training shapes into quadrilateral meshes. This algorithm produced meshes with 15 longitudinal and 20 circumferential segments, with vertices approximately identifying unique anatomical positions. The meshes were aligned separately to remove trivial pose variations, such as scaling, translation and rotation.
Fig. 1. A point p(x) on the ASM is generated by first applying a local transformation Tl described by the ASM state vector xl on the mean shape, followed by a global pose transformation Tg to obtain a shape in real-world coordinates
From the aligned training set, the mean vertex position $\bar{q}_i$ was computed, and PCA was applied on the vertex distribution to obtain the $N_x$ most dominant eigenvectors. In normalized coordinates, the ASM can be written in the form
$$q_i(x_l) = \bar{q}_i + A_i x_l , \tag{1}$$
where the position of a vertex $q_i$ is expressed as a linear combination of the associated subspace of the $N_x$ most dominant eigenvectors, combined into the $3 \times N_x$ deformation matrix $A_i$. Here, $x_l$ is the local state vector of the ASM. The expression for the ASM can be optimized assuming that the deformation at vertex $q_i$ is primarily directed along the corresponding surface normal $\bar{n}_i$ of the average mesh. This is done by projecting the deformation matrix $A_i$ onto the surface normal, giving an $N_x$-dimensional vector of projected deformation modes $A_i^{\perp} = \bar{n}_i^T A_i$. The optimized expression for the ASM can now be written in the form
$$q_i(x_l) = \bar{q}_i + \bar{n}_i (A_i^{\perp} x_l) , \tag{2}$$
reducing the number of multiplications by a factor of three. Due to the quadrilateral mesh structure of the ASM, a continuous surface is obtained using a linear tensor product spline interpolant. An arbitrary point on the ASM in normalized coordinates can be expressed as $p_l(x_l) = T_l|_{(u,v)}$, where $(u, v)$ represents the parametric position on the surface, and the local transformation $T_l$ includes the deformation and interpolation applied to the mean mesh. By coupling this model with a global pose transformation $T_g$ with parameters $x_g$ including translation, rotation, and scaling, we obtain a surface
$$p(x) = T_g(p_l(x_l), x_g) \tag{3}$$
in real-world coordinates, with a composite state vector $x^T \equiv [x_g^T, x_l^T]$. An illustration showing the steps required to generate the ASM is shown in Fig. 1. In our experiments we used 20 eigenvectors, describing 98% of the total variation within the training set.
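The difference between the full and the normal-projected vertex expressions can be illustrated with a minimal Python sketch. The vertex data, the two-mode deformation matrix and the helper names are made up for illustration; the sketch only assumes, as stated above, that the deformation is directed along the mean-mesh normal.

```python
def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def vertex_full(mean, deform, xl):
    """Full model: mean + A x, with A given as 3 rows of Nx entries."""
    return [m + dot(row, xl) for m, row in zip(mean, deform)]


def project_modes(normal, deform):
    """Projected modes: n^T A, an Nx-dimensional vector."""
    nx = len(deform[0])
    return [sum(normal[r] * deform[r][c] for r in range(3)) for c in range(nx)]


def vertex_projected(mean, normal, modes_perp, xl):
    """Projected model: mean + n * (A_perp x), about Nx + 3 multiplications."""
    displacement = dot(modes_perp, xl)
    return [m + n * displacement for m, n in zip(mean, normal)]


# Hypothetical vertex whose deformation lies entirely along its unit normal.
mean = [1.0, 2.0, 3.0]
normal = [0.0, 0.0, 1.0]
deform = [[0.0, 0.0], [0.0, 0.0], [0.5, -0.2]]   # 3 x Nx with Nx = 2
xl = [2.0, 1.0]

full = vertex_full(mean, deform, xl)
fast = vertex_projected(mean, normal, project_modes(normal, deform), xl)
```

When the deformation really is along the normal, both expressions give the same vertex, while the projected form needs roughly a third of the multiplications for large Nx; the projection can be precomputed once after training.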
3 Tracking Algorithm

The tracking algorithm extends prior work by Orderud [9,10] to enable usage of 3D ASMs in the Kalman filter for real-time segmentation. This is accomplished by using the ASM shape parameters $x_l$ directly, in addition to the global pose parameters $x_g$, in the Kalman filter state vector.

3.1 Motion Model
Modeling of motion in addition to position can be accomplished in the prediction stage of the Kalman filter by augmenting the state vector to contain the last two successive state estimates. A motion model which predicts the state $\bar{x}$ at timestep $k + 1$ is then expressed as
$$\bar{x}_{k+1} - x_0 = A_1 (\hat{x}_k - x_0) + A_2 (\hat{x}_{k-1} - x_0) , \tag{4}$$
where $\hat{x}_k$ is the estimated state from timestep $k$. Tuning of properties, like damping and regularization towards the mean state $x_0$ for all deformation parameters, can then be accomplished by adjusting the coefficients in matrices $A_1$ and $A_2$. Prediction uncertainty can similarly be adjusted by manipulating the process noise covariance matrix $B_0$ used in the associated covariance update equation. The latter will then restrict the change rate of parameter values.

3.2 Measurement Processing
Edge-detection is based on normal displacement measurements $v_i$ [4], which are calculated by measuring the radial distance between detected edge-points $p_{obs,i}$ and the contour surface $p_i$ along selected search normals $n_i$. These displacements are coupled with associated measurement noise $r_i$ to weight the importance of each edge, based on a measure of edge confidence. Measurement vectors are calculated by taking the normal projection of the composite state-space Jacobian for the contour points
$$h_i^T = n_i^T \left[ \frac{\partial T_g(p_l, x_g)_i}{\partial x_g} , \; \frac{\partial T_g(p_l, x_g)_i}{\partial x_l} \right] , \tag{5}$$
which is the concatenation of a global and a local state-space Jacobi matrix. The global Jacobian is trivially the state-space derivative of the global pose transformation, while the local Jacobian has to be derived, using the chain-rule for multivariate calculus, to propagate surface points on the spline through mesh vertices, and finally to the ASM shape parameters:
$$\frac{\partial T_g(p_l, x_g)}{\partial x_l} = \sum_{n \in x,y,z} \frac{\partial T_g(p_l, x_g)}{\partial p_{l,n}} \sum_{j \in 1..N_q} \frac{\partial T_l(x_l)_n}{\partial q_j} \cdot \bar{n}_j A_j^{\perp} . \tag{6}$$
Here, $\partial T_g(p_l, x_g)/\partial p_l$ is the spatial derivative of the global transformation, and $\partial T_l(x_l)/\partial q_j$ is the spatial mesh vertex derivative of the spline interpolant.

3.3 Measurement Assimilation and State Update
All measurements are assimilated in information space prior to the state update step. Assumption of independent measurements leads to very efficient processing, allowing summation of all measurement information into an information vector and matrix of dimensions invariant to the number of measurements:
$$H^T R^{-1} v = \sum_i h_i r_i^{-1} v_i \tag{7}$$
$$H^T R^{-1} H = \sum_i h_i r_i^{-1} h_i^T . \tag{8}$$
The updated state estimate $\hat{x}$ at timestep $k$ can then be computed by using the information filter formula for measurement update [11], and the updated error covariance matrix $\hat{P}$ is calculated directly in information space:
$$\hat{x}_k = \bar{x}_k + \hat{P}_k H^T R^{-1} v \tag{9}$$
$$\hat{P}_k^{-1} = \bar{P}_k^{-1} + H^T R^{-1} H . \tag{10}$$
Using this form, we avoid inversion of matrices with dimensions larger than the state dimension.
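The prediction, assimilation and update steps of this section can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the matrices A1 and A2 are taken as diagonal, the predicted covariance is supplied directly (the process noise step is omitted), and all function names and numbers are hypothetical.

```python
def invert(m):
    """Gauss-Jordan inverse of a small square matrix (list of rows)."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def predict(x_k, x_km1, x0, a1, a2):
    """Motion model with diagonal A1, A2 given as lists (assumption)."""
    return [m + c1 * (xk - m) + c2 * (xp - m)
            for xk, xp, m, c1, c2 in zip(x_k, x_km1, x0, a1, a2)]


def assimilate(measurements, n):
    """Sum h r^-1 v and h r^-1 h^T over (h, r, v) measurement triples."""
    vec = [0.0] * n
    mat = [[0.0] * n for _ in range(n)]
    for h, r, v in measurements:
        for a in range(n):
            vec[a] += h[a] * v / r
            for b in range(n):
                mat[a][b] += h[a] * h[b] / r
    return vec, mat


def update(x_bar, p_bar, vec, mat):
    """Information-form update; only state-sized matrices are inverted."""
    n = len(x_bar)
    p_bar_inv = invert(p_bar)
    p_hat = invert([[p_bar_inv[a][b] + mat[a][b] for b in range(n)]
                    for a in range(n)])
    x_hat = [x_bar[a] + sum(p_hat[a][b] * vec[b] for b in range(n))
             for a in range(n)]
    return x_hat, p_hat


# One-parameter example: unit prior covariance, one measurement with unit noise.
x_bar = predict([2.0], [1.0], x0=[0.0], a1=[1.5], a2=[-0.5])
vec, mat = assimilate([([1.0], 1.0, 2.0)], n=1)
x_hat, p_hat = update(x_bar, [[1.0]], vec, mat)
```

The cost of `assimilate` grows linearly with the number of edge measurements, whereas the two inversions in `update` depend only on the state dimension.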
4 Evaluation

4.1 Data Material

For evaluation of the proposed algorithm, apical 3D echocardiograms of one cardiac cycle from 21 adult patients (11 diagnosed with heart disease) were recorded using a Vivid 7 scanner (GE Vingmed Ultrasound, Norway) with a 3D transducer (3V). In all patients, meshes corresponding to the endocardial boundary were determined using a custom-made semi-automatic segmentation tool (GE Vingmed Ultrasound, Norway). The segmentations were, if needed, manually adjusted by an expert to serve as independent references equivalent to manual tracing.

4.2 Experimental Setup and Analysis
Edge measurements were done perpendicular to the mesh surface within a distance of ±1.5 cm to the surface at approximately 450 locations, using a simple edge model based on the transition criterion [12]. The ASM was initialized to the mean shape, and positioned in the middle of the volume in the first frame. Segmentation was performed on the evaluation set by running the algorithm for a couple of heartbeats, to give the ASM enough time to lock on to the LV. The accuracy of the ASM was assessed using the mean of absolute point-to-mesh distances between the ASM and the reference, averaged over one cardiac cycle. Volume differences (bias) between the ASM and the reference were calculated for each frame. End-diastolic volume (EDV), end-systolic volume (ESV), and EF ((EDV − ESV)/EDV·100%) were compared to the manually verified reference (two-tailed t-test assuming zero difference), with 95% limits of agreement (1.96 standard deviations (SD)). EDV and ESV were computed as the maximum and minimum volume within the cardiac cycle, respectively.
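The volume-derived quantities used in this analysis are simple to compute; the following Python sketch shows EF from a volume curve and the bias with 95% limits of agreement. The volume values are invented for illustration and the function names are hypothetical.

```python
import statistics


def ejection_fraction(volumes_ml):
    """EF in percent, with EDV and ESV as the maximum and minimum volume."""
    edv = max(volumes_ml)
    esv = min(volumes_ml)
    return (edv - esv) / edv * 100.0


def limits_of_agreement(test, reference):
    """Bias (mean difference) and the 1.96 SD half-width of the differences."""
    diffs = [t - r for t, r in zip(test, reference)]
    bias = statistics.mean(diffs)
    half_width = 1.96 * statistics.stdev(diffs)
    return bias, half_width


# Invented volume curve for one cardiac cycle [ml].
ef = ejection_fraction([120.0, 100.0, 60.0, 80.0])

# Invented paired volumes: tested method versus reference [ml].
bias, half_width = limits_of_agreement([10.0, 12.0, 14.0], [9.0, 12.0, 12.0])
```

The pair `(bias, half_width)` corresponds to the "mean±1.96SD" notation used for the reported results.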
5 Results
We observed that common challenges with 3D cardiac ultrasound, such as dropouts, shadows, and speckle noise were handled remarkably well, and segmentation was successful in all of the 21 patients. Some examples are shown in Fig. 2(b-d).
[Figure 2, panel (a): "Volume comparison", VASM [ml] versus Vref [ml], both axes from 0 to 200 ml; panels (b)-(d): segmentation examples.]

Fig. 2. LV volumes obtained by the ASM (VASM) in all 21 patients are compared to the reference (Vref) and shown with the identity line (dashed) in (a). In (b-d), the end-diastolic segmentation (yellow) in three patients is compared to the reference (red) and shown along with the volume curve for one cardiac cycle.
The limits of agreement (mean±1.96SD) for the point-to-surface distance were 2.2±1.1 mm, indicating good overall agreement between the ASM and the reference. From Tab. 1, column 2, we see that the limits of agreement for volumes were 3.4±20 ml, with a strong correlation (r=0.95). The volume correspondence between the ASM and the reference is shown in Fig. 2(a).

Table 1. Segmentation results showing ASM versus reference

                               Volume [ml]
Difference (mean±1.96SD)       3.4∗ ±20
Correlation coeff. (r)         0.95

∗ Significantly different from 0, p

1 is found and the point Qi is introduced to the mesh as new vertex. Iterating this process results in construction of several new points along the edge. Applied to every edge in the current mesh, a large set of points is obtained.
3 Delaunay-Based Vector Segmentation

The proposed segmentation scheme is based on the DT described above. The image is partitioned into regions whose characteristics, such as intensity and texture, are similar, while the mesh is adapted to the underlying structure of the volumetric image data. Each image region rk consists of a set of tetrahedra t1, . . . , tn. Based on the introduced principles, the adaptive segmentation is proposed as follows:

1. 3D edge and corner detection - candidate vertices lying on region boundaries, meaningful edges and corners are located. Candidates can be found by various edge and corner detection algorithms extended to 3D space. In our experiments, the well-known Canny edge detector has been used, which was designed to be an optimal edge detector according to predefined criteria.
2. Initial Delaunay triangulation - a tetrahedral mesh is constructed from the preselected set of candidate vertices.
3. Iterative adaptation - the tessellation grid is adapted to the image structure.
4. Tetrahedra grouping - classification of tetrahedra into image regions.

3.1 Iterative Adaptation
The following three main steps are repeated until the triangulation satisfies some convergence criterion.

Isotropic Edge Splitting. In this phase, the isotropic meshing algorithm creating new points along existing edges and another well-known technique of tetrahedral mesh optimization, splitting of maximal edges, are combined. Instead of maximal edges, those edges crossing significant image edges are divided. A new vertex is inserted into the mesh at the point of intersection. This approach is partially similar to the CDT. The whole isotropic edge splitting process can be briefly formulated as follows:
Delaunay-Based Vector Segmentation of Volumetric Medical Images
Fig. 3. Initial mesh (a) and result of the iterative adaptation (b)
1. Sequentially process every edge AB in the current triangulation.
   – Find intersection points Pi of the edge and all image edges.
   – Introduce the sub-edges AP1, P1P2, . . . , PnB and divide them in the sense of the isotropic meshing algorithm (Section 2.1).
2. Filter the set of newly created points to discard vertices too close to any other point with respect to the control space metric.
3. Insert the points into the mesh and continue from step 1 until convergence.

Edges that are almost parallel with the image edge remain unchanged to prevent degradation of the mesh. The control space H can be defined as a distance function representing the proximity of an image edge at the point P. Because the length of any tetrahedron edge emanating from the point P must be smaller than or equal to this value, the edge cannot cross any image edge. The value of the sizing field increases with the distance from the closest image edge. Therefore, bigger tetrahedra appear in the center of large homogeneous regions. If the point P lies exactly on an image edge, the control space is equal to zero, h(P) = 0. Because it is not possible to create edges of zero length, a minimal edge length must be given.

Vertex Moving. Vertex moving attracts vertices close to an image edge directly onto this edge. An important effect of this process is that many vertices lie on image edges, while no vertices lie in the close vicinity of image edges. Such optimization results in a better approximation of region boundaries in the final triangulation.

Tetrahedral Mesh Optimization. The previous step of the iterative adaptation may produce thin tetrahedra that are ineffective for further processing. At the beginning, the quality of all tetrahedra is estimated. There are many measures of quality with respect to the ideal shape. The most general one is the ratio of the longest tetrahedron edge to the radius of its inscribed sphere [1]. Then, those tetrahedra having a normalized quality below a given threshold are optimized: a new point at the center of the circumsphere is added to the mesh.
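A quality measure of this kind can be sketched as follows. The normalization (dividing the regular-tetrahedron ratio 2√6 by the measured ratio so that the ideal shape scores 1) is an assumption made for this illustration, as are the function names; the paper does not specify its exact normalization.

```python
import math
from itertools import combinations


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tetra_quality(a, b, c, d):
    """Normalized quality in [0, 1]: 1 for a regular tetrahedron, near 0 for slivers."""
    verts = (a, b, c, d)
    longest = max(math.dist(p, q) for p, q in combinations(verts, 2))
    volume = abs(dot(sub(b, a), cross(sub(c, a), sub(d, a)))) / 6.0
    if volume == 0.0:
        return 0.0                      # degenerate (flat) tetrahedron
    area = sum(0.5 * math.sqrt(dot(n, n))
               for n in (cross(sub(q, p), sub(r, p))
                         for p, q, r in combinations(verts, 3)))
    inradius = 3.0 * volume / area      # radius of the inscribed sphere
    # Longest edge / inradius equals 2*sqrt(6) for the regular tetrahedron.
    return 2.0 * math.sqrt(6.0) * inradius / longest


regular = tetra_quality([1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1])
sliver = tetra_quality([0, 0, 0], [1, 0, 0], [0, 1, 0], [0.5, 0.5, 0.01])
```

Tetrahedra whose score falls below a chosen threshold would then be candidates for the circumsphere-center insertion described above.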
M. Španěl et al.

3.2 Tetrahedra Grouping
Every tetrahedron $t_i$ of the mesh is characterized by a feature vector $f_i$. Individual features describe the image structure of the tetrahedron and its close neighbourhood. The first two components are the mean value $\mu(t_i)$ and the intensity variance $\sigma(t_i)$ of the voxels inside the tetrahedron. Other features may cover image texture and the spatial configuration of adjacent tetrahedra. Feature vectors may be grouped with any conventional supervised or unsupervised clustering algorithm that classifies feature vectors into a certain number of classes.

Region Growing and Merging. The topology of the tetrahedral mesh is suitable for region-based image segmentation techniques such as region growing and merging. Instead of pixels and the traditional 4- and 8-pixel connectedness, tetrahedra and graph adjacency are used.

Definition 2. Let $f_i$ and $f_j$ be two feature vectors extracted for a non-empty group of adjacent tetrahedra. The similarity measure $S(f_i, f_j)$ is a function whose value is larger as the difference between the feature vectors decreases.

Basic similarity criteria are the mean intensity value $S_{mean}$ and a statistical test of similarity based on the voxel value variance $S_{var}$.
$$S_{mean} = \exp\left(-\frac{1}{2\rho^2} |\mu_{r_i} - \mu_{r_j}|^2\right), \qquad S_{var} = \frac{\sigma_{r_i} \sigma_{r_j}}{\sigma^2_{r_{i,j}}} \tag{1}$$
The parameter $\rho$ affects the sensitivity of the measure and $\sigma_{r_{i,j}}$ is the variance of the intensity in the joint region $r_i \cup r_j$. Region growing starts with seed tetrahedra and grows the regions from them. Adjacent tetrahedra are added to the region if they satisfy a chosen criterion of similarity. In region merging, neighboring regions are examined to determine whether they can be merged. All adjacent regions are examined and the similarity of both feature vectors is estimated. If the measure is greater than a given experimentally chosen threshold, both regions are merged and the feature vector for the newly formed region is calculated. The merging phase continues until no more regions are merged, or some stopping criterion is met.
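The two similarity criteria and a merge decision can be sketched in Python as below. Treating σ as a population standard deviation, returning 1 when the joint region has zero variance, and requiring both criteria to pass the same threshold are assumptions of this sketch; the voxel lists and function names are hypothetical.

```python
import math
import statistics


def s_mean(mu_i, mu_j, rho):
    """Mean-intensity similarity: exp(-|mu_i - mu_j|^2 / (2 rho^2))."""
    return math.exp(-((mu_i - mu_j) ** 2) / (2.0 * rho ** 2))


def s_var(voxels_i, voxels_j):
    """Variance similarity: sigma_i * sigma_j / variance of the joint region."""
    sigma_i = statistics.pstdev(voxels_i)
    sigma_j = statistics.pstdev(voxels_j)
    joint_var = statistics.pvariance(list(voxels_i) + list(voxels_j))
    if joint_var == 0.0:
        return 1.0                      # assumption: identical constant regions
    return sigma_i * sigma_j / joint_var


def should_merge(voxels_i, voxels_j, rho, threshold):
    """Merge two regions when both similarity measures exceed the threshold."""
    mean_sim = s_mean(statistics.mean(voxels_i), statistics.mean(voxels_j), rho)
    return mean_sim > threshold and s_var(voxels_i, voxels_j) > threshold


similar = should_merge([1.0, 3.0], [1.0, 3.0], rho=10.0, threshold=0.5)
different = should_merge([1.0, 3.0], [11.0, 13.0], rho=10.0, threshold=0.5)
```

A region merging loop would call `should_merge` for each pair of adjacent regions and recompute the statistics of the merged region after every successful merge.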
4 Experimental Results

Two classifiers were designed for the unsupervised clustering of feature vectors into image segments: first, the Fuzzy C-means (FCM) algorithm [9], and second, the Gaussian Mixture Model (GMM) optimized by the popular Expectation-Maximization (EM) algorithm [10]. In addition, region growing initially reduces the large number of tetrahedra in the mesh, and after the clustering phase, the region merging algorithm is applied to get the final segmentation. The segmentation algorithm was tested on a number of real CT imaging data sets with a resolution of mostly 512x512 pixels per slice. Initialization of the tetrahedral mesh, its iterative adaptation and classification take approximately 10-15 minutes on a standard PC with a P4 1.6GHz processor, depending on the number of image slices (approximately 120-150 slices were analyzed).

Fig. 4. Result of the vector segmentation. Polygonal surfaces of meaningful image segments extracted directly from the classified tetrahedral mesh: soft tissues (a,b), hard tissues (c), and background/air (d).
5 Conclusion

A vector segmentation algorithm based on adaptive Delaunay triangulation is proposed in this paper. A tetrahedral mesh is used to divide a volumetric image into several disjoint regions whose characteristics are similar. Certain methods for improving the quality of the mesh and its adaptation to the underlying image structure are also described. The direct vector representation of image regions makes it possible to eliminate the difficult process of raster data vectorization. The 3D geometrical model can be created directly from the segmented vector data. A more effective representation of the image structure is obtained. The tetrahedral mesh approximates the raster data. This representation should decrease the complexity of any classification algorithm, which processes a reduced number of tetrahedra instead of voxels. Another advantage of the vector representation, important in medical imaging, is comfortable manual correction of the final segmentation. Simple modifications of the tetrahedral mesh, such as adding new vertices, removing old ones, and manual reclassification of tetrahedra, allow easy correction of the segmentation. The geometrical model and manual corrections can be made in parallel. All changes in the mesh take effect on the model without long delay.
6 Future Work

Since the classification is performed within the local vicinity of the processed tetrahedron, improvements can be made by incorporating global principles. Viewing the mesh as an undirected graph with nodes and edges weighted according to feature vectors would allow the use of graph algorithms (graph cuts, path algorithms, etc.) for the segmentation.

Acknowledgements. The authors were supported by the Faculty of Information Technology, BUT under project MSM0021630528 CZ and CESNET association under project No. MSM6383917201 CZ.
References 1. Krˇsek, P.: Direct Creating of FEM Models from CT/MR Data for Biomechanics Applications. PhD thesis, Vutium, Brno University of Technology, Brno, Czech Republic (2001) 2. McInerney, T., Terzopoulos, D.: Deformable models in medical image analysis: A survey. Medical Image Analysis (1996) 3. Cohen, L., Cohen, I.: Finite element methods for active contour models and baloons for 2d and 3d images. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1131–1147 (1993) 4. Lachaud, J.O., Montanvert, A.: Volumic segmentation using hierarchical representation and triangulated surface. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 137–146. Springer, Heidelberg (1996) 5. Davoine, F., Chassery, J.M.: Adaptive Delaunay triangulation for attractor image coding. In: Proceedings of the 12th International Conference on Pattern Recognition, Jerusalem, Israel, pp. 801–803 (1994) 6. Gevers, T.: Adaptive image segmentation by combining photometric invariant region and edge information. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 7. Owen, S.: A survey of unstructured mesh generation technology. In: Proceedings of the Seventh International Meshing Roundtable, Dearborn, Michigan, Sandia National Laboratories (October 1998)
Delaunay-Based Vector Segmentation of Volumetric Medical Images
269
8. George, P.L., Borouchaki, H.: Delaunay Triangulation and Meshing: Application to Finite Elements. Editions HERMES, Paris, France (1998)
9. Pham, D.L., Prince, J.L.: Adaptive fuzzy segmentation of magnetic resonance images. In: IEEE Transactions on Medical Imaging 18 (September 1999)
10. Ng, S.K., McLachlan, G.J.: On some variants of the EM algorithm for fitting mixture models. Austrian Journal of Statistics 23, 143–161 (2003)
Non-uniform Resolution Recovery Using Median Priors in Tomographic Image Reconstruction Methods
Munir Ahmad and Andrew Todd-Pokropek
Medical Physics and Bioengineering, University College London, UK
Abstract. Penalized-likelihood (PL) image reconstruction methods produce better quality images than analytical methods. However, these methods produce images with non-uniform resolution properties. A number of prior functions have been used to compensate for this non-uniformity in reconstructed resolution. Quadratic priors (QPs) are preferred due to their simplicity, and their resolution characteristics have been studied extensively. Images reconstructed using QPs still exhibit non-uniform resolution properties. Here, we propose median priors (MPs) in place of QPs and evaluate their resolution characteristics. Although they also produce images with non-uniform reconstructed resolution, their recovered resolution is better than that of QPs. We have also implemented MPs in a modified penalty framework, originally proposed for QPs, and show that they produce images with almost uniform resolution. Due to their automatic edge preservation and better quantitative properties, MPs might be preferred over QPs for penalized-likelihood image reconstruction.
Keywords: Penalized-Likelihood, Quadratic Priors, Median Priors, Resolution, Edge preservation.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 270–277, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction
Statistical image reconstruction methods produce images with superior noise characteristics compared with analytical reconstruction methods [1]. However, images reconstructed by maximum-likelihood expectation-maximization (MLEM) based methods become increasingly noisy as the iterations proceed [2], and these reconstructed images also exhibit non-uniform resolution characteristics across the field of view. This non-uniform resolution makes image comparison difficult, produces shape distortions and consequently hinders true activity quantification [3]. The positional resolution non-uniformity is induced by a blend of physical and detector-intrinsic effects, such as depth-dependent non-uniform attenuation in SPECT, and non-uniform sampling and crystal penetration effects in PET [4]. These effects make the system response space-variant and induce resolution non-uniformity in the reconstructed images even in space-invariant systems. With iterative reconstruction methods it is easy to incorporate these image-degrading effects in the system model [5], and hence to use a more accurate representation of the real system. Different types of regularization methods have been used to overcome these effects produced by MLEM-based reconstruction methods [6], [7], [8]. PL methods use an additional penalty term added to the objective function, and they are considered better
than the other methods because they offer robust control of the local image characteristics [9]. A number of penalty functions have been used by different researchers [10]. These span both quadratic and non-quadratic functions; they generally assume a Gaussian distribution in a local neighborhood of pixels and a locally smooth image. Fessler and Rogers have studied the resolution properties of quadratic priors exhaustively [9]. Despite their simplicity, QPs penalize all image pixels with equal strength and produce anisotropic smoothing, resulting in non-uniform reconstructed resolution. They also smooth edges, a salient image feature, to the same degree, which is undesirable. Although non-quadratic priors can be used for edge preservation, they have more than one optimization parameter. MPs have been implemented and shown to have good quantitative image characteristics compared with QPs [3]; however, their resolution properties have not been fully explored. MPs automatically take care of edges because they assume that their root images are monotonic rather than smooth. Here we evaluate the resolution characteristics of MPs in PL image reconstruction methods as an extension of our previous work [11], [12]. We have also used MPs in a modified penalty framework to obtain images with almost uniform resolution properties.

1.1 Local Impulse Response
In emission computed tomography, the measured events are modeled as independent Poisson variables with means $\{Y_i \mid i = 1, 2, \ldots, N\}$, the expected numbers of counts in each of the $N$ lines of response (LORs), for emission rates $\{x_j \mid j = 1, 2, \ldots, M\}$ in the $M$ pixels:

$$ Y_i = \sum_{j=1}^{M} a_{ij} x_j + r_i, \qquad i = 1, 2, \ldots, N. \tag{1} $$
where $a_{ij}$ is the probability that an event from pixel $j$ is detected in detector pair (LOR) $i$, and $r_i$ is a term representing background or random events. Given the measurements $y_i$, the reconstruction problem reduces to estimating $x_j$. PL image reconstruction methods maximize an objective function of the form

$$ \psi = \arg\max_{x \in \lambda} \; L(x, y) - \beta P(x). \tag{2} $$
Here, $L(x, y)$ is the likelihood function, $\lambda$ is the set of feasible images, and $\beta$ is a hyperparameter that controls the influence of the penalty term. Assuming an independent Poisson measurement model, the log-likelihood becomes, up to a constant independent of $x$ [10],

$$ L(x, y) = \sum_{i=1}^{N} \left( y_i \log Y_i - Y_i \right). \tag{3} $$
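The forward model (1), the log-likelihood (3) and the penalized objective (2) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy system matrix and the numbers are hypothetical, and the penalty is passed in as an arbitrary callable.

```python
import math

def expected_counts(A, x, r):
    """Y_i = sum_j a_ij x_j + r_i for every LOR i, as in Eq. (1)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + r_i
            for row, r_i in zip(A, r)]

def poisson_log_likelihood(y, Y):
    """L(x, y) = sum_i (y_i log Y_i - Y_i), as in Eq. (3)."""
    return sum(y_i * math.log(Y_i) - Y_i for y_i, Y_i in zip(y, Y))

def penalized_objective(A, x, r, y, beta, penalty):
    """Objective of Eq. (2): L(x, y) - beta * P(x); `penalty` is any callable P."""
    return poisson_log_likelihood(y, expected_counts(A, x, r)) - beta * penalty(x)

# Toy system with hypothetical numbers: N = 3 LORs, M = 2 pixels.
A = [[0.5, 0.1], [0.2, 0.6], [0.3, 0.3]]
r = [0.5, 0.5, 0.5]
x_true = [4.0, 2.0]
Y_true = expected_counts(A, x_true, r)  # approximately [2.7, 2.5, 2.3]
```

For noiseless data (y equal to the expected counts), the log-likelihood is largest at the true image, which gives a quick sanity check of the sign conventions.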
Considering pairwise pixel differences for penalization, the roughness penalty is [9]

$$ P(x) = \sum_{j=1}^{M} \sum_{t=1}^{N_b} w_{jt}\, \phi(x_j - x_t). \tag{4} $$
Here, $N_b$ denotes a small neighborhood around pixel $j$, and $\phi$ is the potential function, generally assumed to be symmetric and convex, with $w_{jt} = 1$ for horizontal and vertical neighbors and $w_{jt} = 1/\sqrt{2}$ for diagonal neighbors. Using these definitions, Fessler and Rogers [9] developed a local impulse response function to evaluate the response of the system to an impulse. This impulse response is given by

$$ l^j(x) = \left[ A^t D A + \beta R''(x) \right]^{-1} A^t D A \, e^j. \tag{5} $$
where $A$ is the system matrix, $D[c_i^2 / y_i]$ is a diagonal matrix with $c_i$ as the attenuation correction terms, $e^j$ is the $j$-th unit vector of $M$ dimensions, and $R''$ is the Hessian of the penalty function $P(x)$.
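Equation (5) can be evaluated numerically by solving one linear system per pixel of interest. The sketch below assumes NumPy and a penalty Hessian R that is supplied by the caller; the function name and the 4-pixel toy system are hypothetical and only illustrate the algebra.

```python
import numpy as np

def local_impulse_response(A, d, R, beta, j):
    """l^j = [A^t D A + beta R]^{-1} A^t D A e^j, with D = diag(d), as in Eq. (5)."""
    F = A.T @ (d[:, None] * A)        # A^t D A
    e_j = np.zeros(A.shape[1])
    e_j[j] = 1.0                      # unit impulse at pixel j
    return np.linalg.solve(F + beta * R, F @ e_j)

# Toy example (hypothetical): identity system and a 1D chain of 4 pixels
# with a second-difference penalty Hessian.
A = np.eye(4)
d = np.ones(4)
R = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
l_sharp = local_impulse_response(A, d, R, beta=0.0, j=1)   # no penalty
l_smooth = local_impulse_response(A, d, R, beta=1.0, j=1)  # penalized
```

Without a penalty the impulse is recovered exactly; with a positive β the peak is lowered and spread over the neighbouring pixels, which is the loss of resolution that the FWHM measurements quantify.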
1.2 Quadratic Priors
The quadratic prior is the simplest form of prior function and can be expressed as $\phi(x) = \frac{1}{2} x^2$. Its most notable property is that its Hessian is independent of the object and can be evaluated before the reconstruction is carried out. In matrix form, the QP terms become

$$ R_{jt} = \begin{cases} \sum_{t' \in N_b} w_{jt'}, & t = j, \\ -w_{jt}, & t \neq j. \end{cases} \tag{6} $$
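The roughness penalty (4) with the quadratic potential can be evaluated directly on a small image. The sketch below is illustrative only: it assumes a 3 x 3 neighbourhood with weight 1 for horizontal and vertical neighbours and 1/sqrt(2) for diagonal neighbours, skips neighbours that fall outside the image, and uses hypothetical function names.

```python
import math

# (row offset, column offset, weight) for the 8 neighbours in a 3 x 3 window.
NEIGHBOURS = [(dr, dc, 1.0 if dr == 0 or dc == 0 else 1.0 / math.sqrt(2.0))
              for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def quadratic_potential(d):
    """phi(d) = d^2 / 2."""
    return 0.5 * d * d

def roughness_penalty(img, phi=quadratic_potential):
    """P(x) = sum_j sum_t w_jt phi(x_j - x_t) over in-image neighbours, as in Eq. (4)."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            for dr, dc, w in NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += w * phi(img[r][c] - img[rr][cc])
    return total

flat = [[2.0, 2.0], [2.0, 2.0]]
edge = [[0.0, 1.0], [0.0, 1.0]]
```

As in the double sum of (4), each neighbouring pair is visited twice, once from each pixel. A flat image has zero penalty, while the vertical edge is penalized, which illustrates why quadratic priors smooth edges.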
We evaluated impulse responses at given locations using this prior and compared the results with those of the MP.

1.3 Median Priors
Instead of using pairwise pixel differences, pixel values may be penalized with reference to the median of a local neighborhood. The resulting prior is called the median prior (MP) and has the form [3] given in (7). We implemented this prior function and evaluated impulse responses at the specified locations.

$$ R_{jt} = \begin{cases} \dfrac{1}{M_j}, & t = j, \\ 0, & t \neq j. \end{cases} \tag{7} $$
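The diagonal terms of (7) can be sketched as follows, assuming that $M_j$ denotes the median of the 3 x 3 neighbourhood of pixel $j$ (the text does not define $M_j$ explicitly, so this reading is an assumption). The function names are hypothetical, the window is clipped at the image border, and all medians are assumed to be positive.

```python
from statistics import median

def local_median(img, r, c):
    """Median of the 3 x 3 neighbourhood of pixel (r, c), clipped at the border."""
    rows, cols = len(img), len(img[0])
    window = [img[rr][cc]
              for rr in range(max(0, r - 1), min(rows, r + 2))
              for cc in range(max(0, c - 1), min(cols, c + 2))]
    return median(window)

def median_prior_diagonal(img):
    """Diagonal entries R_jj = 1 / M_j of Eq. (7); off-diagonal entries are zero."""
    return [[1.0 / local_median(img, r, c) for c in range(len(img[0]))]
            for r in range(len(img))]

# A single hot pixel does not change any local median.
img = [[2.0, 2.0, 2.0],
       [2.0, 9.0, 2.0],
       [2.0, 2.0, 2.0]]
diag = median_prior_diagonal(img)
```

In this toy image every local median equals 2, so every diagonal entry is 0.5 regardless of the outlier at the centre.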
2 Materials and Methods
We used a 64 × 64 pixel phantom image, shown in Figure 1, as our reference to compare the resolution properties of different prior functions. This phantom contains a
higher-activity disc and a lower-activity disc in a moderately hot background, with activity values of 1, 3 and 2. This very simple phantom was selected to evaluate the effect of different activity levels and varying attenuation. A pixel size of 1 mm was selected to incorporate varying attenuation. We selected a space-invariant PET system response to compare the priors. We selected our impulse response locations such that they may compensate for space-variant effects. A sinogram was generated by simulating 64 views at 64 equally spaced angles over 180°, and the detector width was assumed equal to the width of a single pixel. For quadratic and median priors we selected a 3 × 3 pixel first-order neighborhood centered at the pixel of interest.

2.1 Modified Penalty
A brief description of the modified penalty used by Fessler and Rogers [9] is given below. If the system's geometric response is approximately shift-invariant, as for PET near the centre, the diagonal of the Fisher information matrix (FIM) can be given as

$$ \mathrm{FIM}_{jj} = \left[ A^t D A \right]_{jj} = \sum_i a_{ij}^2 q_i. \tag{8} $$
Fig. 1. Phantom image with its sinogram used to evaluate resolution properties of different priors. Small crosses represent the points where we evaluated the impulse response.
They developed a modified penalty by multiplying the quadratic penalty by a term that reflects a normalized forward projection of the terms $q_i = c_i^2 / y_i$ of the matrix $D$, as follows:

$$ P(x) = \sum_{j=1}^{M} \sum_{t=1}^{N_b} w_{jt}\, \phi(x_j - x_t)\, \alpha_j, \tag{9} $$

where

$$ \alpha_j = \frac{\sum_i a_{ij}^2 q_i}{\sum_i a_{ij}^2}. $$
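The weights $\alpha_j$ above are a per-pixel weighted average of the certainties $q_i$, with weights $a_{ij}^2$. A minimal sketch with a hypothetical function name and toy numbers:

```python
def certainty_weights(A, q):
    """alpha_j = sum_i a_ij^2 q_i / sum_i a_ij^2 for every pixel j."""
    n_lors, n_pixels = len(A), len(A[0])
    alphas = []
    for j in range(n_pixels):
        num = sum(A[i][j] ** 2 * q[i] for i in range(n_lors))
        den = sum(A[i][j] ** 2 for i in range(n_lors))
        alphas.append(num / den)
    return alphas

# Toy system (hypothetical): 3 LORs, 2 pixels.
A = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 1.0]]
alphas = certainty_weights(A, [1.0, 2.0, 4.0])
uniform = certainty_weights(A, [3.0, 3.0, 3.0])
```

When all $q_i$ are equal, every $\alpha_j$ takes that common value and the modified penalty (9) reduces to a scaled version of the standard penalty (4).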
This modified penalty makes use of the local certainty and changes the impulse response depending on the local properties. It produces impulse responses of almost equal height and uniform resolution at different spatial locations. For MPs, the $q_i$ term may be replaced with the local median.
2.2 β Parameter Value
The hyperparameter β defines a compromise between resolution and noise, and its selection is critical for our study. To estimate a suitable value for both priors, resolution was plotted against β at the centre of the image (Figure 2). A value was selected at which the responses have approximately the same resolution for both priors. The plot shows that local resolution varies more with β for the standard QPs and the modified QPs, whereas our proposed MPs and modified MPs are less sensitive to the parameter value.
Fig. 2. Resolution (FWHM in pixels) plotted against the beta parameter for different priors. As beta increases, resolution gets worse with QPs and modified QPs.
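Resolution is reported as the full width at half maximum (FWHM) of an impulse response profile. One common way to measure it, sketched here as an assumption about the procedure rather than the authors' code, is to locate the two half-maximum crossings by linear interpolation between samples. The function name is hypothetical and the profile is assumed to have a single peak.

```python
def fwhm(profile):
    """Full width at half maximum of a single-peaked 1D profile, in samples."""
    peak = max(profile)
    k = profile.index(peak)
    half = peak / 2.0

    def crossing(step):
        # Walk away from the peak while the next sample is still above half
        # maximum, then interpolate linearly between the bracketing samples.
        i = k
        while 0 <= i + step < len(profile) and profile[i + step] > half:
            i += step
        if not 0 <= i + step < len(profile):
            raise ValueError("profile does not fall to half maximum")
        inner, outer = profile[i], profile[i + step]
        return i + step * (inner - half) / (inner - outer)

    return crossing(+1) - crossing(-1)

# Triangular profile with a base of 8 samples: the FWHM is 4 samples.
triangle = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
width = fwhm(triangle)
```

For a 2D impulse response the same function can be applied to a horizontal or vertical profile through the peak.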
3 Results and Discussion
Impulse responses were evaluated at three given points inside the phantom using (6) and are plotted together for relative comparison. Figure 3 presents impulse responses evaluated using the standard space-invariant QPs and the standard MPs. These responses are clearly non-uniform, with differing heights. To compare FWHMs, we plotted the resolution at the three locations for three different beta values; the results are shown in Figure 4. The resolution is clearly space-variant and takes different values at those locations. It is also clear that, for standard QPs, the variations are more sensitive to the beta value than for MPs. This might be due to the penalizing scheme of MPs, which penalize pixel values with reference to the local median and use a small, defined neighborhood support to calculate it. The mean of the underlying distribution therefore varies with the median, unlike the quadratic penalty, where the mean is always at the centre pixel of the neighborhood. These responses also appear heavily asymmetric, which indicates the non-isotropic smoothing provided by these priors.
Fig. 3. Impulse responses evaluated at three given locations using standard QPs and MPs. The responses are highly non-uniform owing to different activity levels and non-uniform attenuation, together with other space-variant effects. Impulse responses evaluated at the three specified points using median priors are more uniform in height and look more symmetric.
Fig. 4. Resolution (FWHM in pixels) evaluated at three locations for three different beta parameter values with QPs, MPs, ModQPs and ModMPs. A clearly non-uniform resolution response is observed for both standard priors. However, the recovered resolution is better with MPs than with QPs, and it depends less on the parameter value. For the modified priors, resolution non-uniformity has been recovered with both priors; however, greater smoothing is evident in the high-count region.
Impulse responses evaluated using modified QPs and modified MPs are presented in Figure 5. These responses appear to have approximately uniform resolution at the three locations. It is interesting to observe that the imposed smoothing seems to be the same in both cases; hence, the responses are apparently symmetric compared with the standard QPs and MPs. Figure 4 also presents the resolution (FWHM in pixels) evaluated at the given locations for three different parameter values with both modified QPs and modified MPs. We can observe that the resolution is much more uniform with the modified priors. There is also an improvement in recovered resolution. However, unsurprisingly, smoothing is higher in the high-count regions.
Fig. 5. Impulse responses evaluated using modified QPs and modified MPs at the given locations. These responses appear uniform, with almost uniform resolution and much more symmetric shapes.
4 Conclusions
Quadratic priors have been developed and evaluated extensively. They are preferred because their Hessian is independent of the object values. However, they produce images with non-uniform resolution characteristics across the image. In this study, we evaluated the resolution characteristics of median priors as an extension of our previous work and compared their response with that of quadratic priors. Our results show that median priors produce better impulse responses than quadratic priors in the sense of recovered resolution. However, resolution non-uniformity still persists across the field of view with median priors. A modified median prior has been introduced and evaluated. The results show a clear recovery from non-uniform resolution, and the responses are approximately the same at different locations within the field of view, although a little non-uniformity remains, which reflects more smoothing in the higher-activity region. With median priors we achieved about the same uniform responses as with quadratic priors; in addition, median priors have other advantages, such as edge preservation and good quantitative properties, and might be preferred over quadratic priors.
Acknowledgement. This work is funded under a Split PhD Scholarship from the Government of Pakistan, and author M. Ahmad is carrying out this work towards a PhD. The authors are thankful to J. A. Fessler for his kind discussions and guidance during this work.
References
1. Stayman, W.J., Fessler, J.A.: Compensation for non-uniform resolution using penalized-likelihood reconstruction in space-variant imaging systems. IEEE Trans. Med. Imag. 3, 269–280 (2004)
2. Snyder, D.L., Miller, M.I., Thomas, L.J., Politte, D.G.: Noise and edge artifacts in maximum-likelihood reconstructions for emission tomography. IEEE Trans. Med. Imag. 6, 228–238 (1987)
3. Alenius, S., Ruotsalainen, U., Astola, J.: Generalization of median root prior reconstruction. IEEE Trans. Med. Imag. 11, 1413–1420 (2002)
4. Karuta, B., Lecomte, R.: Effect of detector weighting functions on the point spread function of high-resolution PET tomographs. IEEE Trans. Med. Imag. 11, 379–385 (1992)
5. Muncuoglu, E. U., Richard M. L., Simon R. C., Hoffman E.: Accurate geometric and physical response modeling for statistical image reconstruction in high resolution PET, IEEE Trans. Med. Imag., pp. 1569–1573 (1997)
6. Veklerov, E., Llacer, J.: Stopping rule for the MLE algorithm based on statistical hypothesis testing. IEEE Trans. Med. Imag. 6, 313–319 (1987)
7. Hebert, T.J.: Statistical stopping criteria for iterative maximum likelihood reconstruction of emission images. Phys. Med. Biol. 35, 1221–1232 (1990)
8. Snyder, D.L., Miller, M.I.: The use of sieves to stabilize images produced with the EM algorithm for emission tomography. IEEE Trans. Nucl. Sci. 32, 3864–3871 (1985)
9. Fessler, J.A., Rogers, W.L.: Spatial resolution properties of penalized-likelihood image reconstruction methods: Space-invariant tomographs. IEEE Trans. Imag. Proc. 5, 1346–1358 (1996)
10. Lange, L.: Convergence of EM image reconstruction algorithms with Gibbs smoothing. IEEE Trans. Med. Imag. 9, 439–446 (1991)
11. Ahmad, M., Todd Pokropek, A.: Impulse response investigations of median and quadratic priors in penalized-likelihood image reconstruction methods. In: 11th Symposium on Radiation Measurements and Applications (SORMA) held at University of Michigan, Ann Arbor, USA (2006)
12. Ahmad M., Todd Pokropek, A.: Partial Volume Correction using median priors in penalized-likelihood image reconstruction methods. In: IEEE Nucl. Sci. Sym. and Med. Imag. Conference 2006 at San Diego, California, USA (2006)
Detection of Postmenopausal Alteration of Bone Structure in Digitized X-rays
Constantin Vertan¹, Ion Ştefan², and Laura Florea¹
¹ Image Processing and Analysis Laboratory, University “Politehnica” of Bucharest, Romania
² Baia Mare County Hospital, Romania
Abstract. The goal of this research is to investigate the effectiveness of trabecular bone characterization, by the joint use of fractal and statistical parameters, in X-ray images acquired from radiological films with consumer digital cameras. We propose the classification of patients into pre- and post-menopausal groups, based on the trabecular structure of the calcaneum bone. The bone structure is locally characterized in clinically significant regions of interest by the usual fractal dimension and a parametric model of the gray-level histogram. The classification yields an 8.33% miss-detection rate and a 16.66% false alarm rate.
1 Introduction
Trabecular architecture is accepted as one of the main factors defining bone quality [1]. During the accelerated phase of postmenopausal bone loss, women may lose up to 25% of trabecular bone. Trabecular architecture suffers dramatic changes in terms of a decrease in connectivity, thinning, perforation or loss of trabeculae [2]. In bone structure investigation, the amount of highly mineralized bone parts and the density of the trabeculae (the bars, rods, bundles of fibers, or septal membranes composing the bone structure) are of paramount importance in estimating the quality of the bone. Three-dimensional trabecular architecture can now be assessed using high resolution digital imaging techniques such as high resolution magnetic resonance (hrMR) and micro-magnetic resonance (microMR), high resolution computed tomography (hrCT), as well as micro-computed tomography (micro-CT). Alongside these highly sophisticated imaging techniques, the computer analysis of the radiographic patterns (i.e. the texture) of trabecular bone in clinically significant parts of the X-ray image still constitutes a valid field of research [3], [4]. The aim of the present work is to determine whether fractal analysis and basic statistical parameters of the textural appearance of the trabecular bone structure, as captured on the plain lateral radiograph of the calcaneus (shown in figure 1), can be used to identify the onset of postmenopausal alterations. Accordingly, the patients are classified into a pre- or postmenopausal group.
Part of this work was supported by the CEEX VIASAN grant 69/2006.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 278–284, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Typical calcaneum X-ray image (8 bpp gray-level digital photography of original X-ray film). Three square 150 x 150 pixel, clinically relevant regions of analysis are selected for bone analysis and pre/ post menopausal classification and subsequent osteoporosis degree grading: region 0 in the thalamic area, region 1 in the Ward triangle area, region 2 in the plantar and thalamic trabecular system; the analysis regions are white-marked.
Given the severe consequences of osteoporosis, and its high probability of occurrence after menopause, any automated method able to detect the subtle changes of the trabecular texture accompanying this process could be important for early recognition and improved therapy outcomes. If inexpensive radiographs, obtained during routine clinical practice for trauma of the foot or ankle or for complaints associated with other diseases, could prove to be a source of such information, this would provide medical staff with a more accessible and cheaper tool. The remainder of this contribution is organized as follows: Section 2 presents the simple texture description approaches used for the trabecular bone description, and Section 3 describes the experiments on calcaneum X-ray image classification according to the pre- or post-menopausal condition. The paper ends with some conclusions and guidelines for further work.
2 Trabecular Bone Textural Structure Description
As mentioned in the introduction, the texture-like description of the bone trabecular structure seems a natural choice [4], [5]. Texture [6] still lacks a specific, rigorous definition; it can be described either as a mostly regular spatial replication of a basis pattern (the “texon” or “texel”), which is obviously not the current case, or as an irregular (or random) distribution of pixel values. According to the assumed model (deterministic or stochastic), the texture attributes can be derived using either characteristic visual cues (periodicity, orientation, principal
axes, symmetry axes) or statistical measures, such as probability density functions (histograms, co-occurrence matrices, run-length matrices) or spectral energies (or many other statistically based models). More recently, fractal geometry has also been used to characterize bone topological features [5], under the assumption that any natural structure exhibits some scale-dependent self-similarity properties [7], which can be measured by fractal parameters (fractal dimension, fractal lacunarity). The specificity of the problem under investigation suggests describing the trabecular bone structure by fractal parameters (as a measure of the organization) and some first-order statistical measures of the gray-level values (as a measure of the mineralization).

2.1 The Fractal Texture Description
Fractal geometry was first introduced by Mandelbrot and has had a strong impact in environmental sciences and physics [8]. Fractals provide a mathematical background for the study of irregular and complex shapes (such as textures) present in nature. A bounded set $A \subseteq \mathbb{R}^n$ is called self-similar if it is the union of $N$ copies of itself, each down-scaled by a factor $r$. The fractal dimension $D$ of the set $A$ is then defined by the equation

$$ N r^D = 1. \tag{1} $$
One of the practical methods of computing the fractal dimension is the box-counting method [8]. The basic idea behind the algorithm is to count the number $N(S)$ of $n$-dimensional boxes (hyper-parallelepipeds) of size $S$ needed to completely cover the set $A$. If the maximal extent of the set $A$ (along the coordinate axes of $\mathbb{R}^n$) is $S_{max}$, the scaling factor of the boxes with respect to the set $A$ is $r = S / S_{max}$. Thus, the fractal dimension equation (1) can be put in logarithmic form as

$$ \log N(S) = \log r^{-D} = D \log S_{max} - D \log S. \tag{2} $$
The fractal dimension D is then computed from the slope of the regression line that approximates the values N(S) with respect to the various box sizes S in a log-log plot; by (2), this slope equals −D. In the case of gray-level images (such as X-ray images), the space dimension is n = 3 (two position coordinates of the pixels within the image spatial support and a value coordinate for the gray levels). An alternative to the box-counting method is the spectral estimation of the fractal dimension via the Hurst coefficient. The method is based on matching the image power spectral density, computed over angular sectors of its Fourier spectrum, to the theoretical form of the power spectral density of a general fractal with known fractal dimension [7]. This method takes advantage of the fast computation of the spectrum of the regions of interest via the fast Fourier transform.
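The box-counting estimate of (2) can be sketched as follows, assuming NumPy. The function name and the test sets are hypothetical; the set is given as a list of point coordinates, a box is counted when it contains at least one point, and D is taken as the negated slope of the log-log regression.

```python
import numpy as np

def box_counting_dimension(points, sizes):
    """Estimate D as the negated slope of log N(S) versus log S, as in Eq. (2).

    points: (K, n) array of coordinates of the set A.
    sizes:  box sizes S to test.
    """
    points = np.asarray(points, dtype=float)
    counts = []
    for s in sizes:
        # Each point falls in the box indexed by floor(coordinate / S);
        # the number of distinct indices is the number of occupied boxes N(S).
        boxes = np.unique(np.floor(points / s).astype(int), axis=0)
        counts.append(len(boxes))
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return -slope

sizes = [1, 2, 4, 8, 16]
# Sanity checks on sets of known dimension: a filled 64 x 64 square and a line.
square = [(r, c) for r in range(64) for c in range(64)]
line = [(i, i) for i in range(64)]
d_square = box_counting_dimension(square, sizes)
d_line = box_counting_dimension(line, sizes)
```

For a gray-level region of interest, the points would be the triples (row, column, gray level), which gives the n = 3 setting described above; the gray-level axis may need rescaling so that the boxes are meaningful along all three coordinates.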
Fig. 2. a) Fourier spectrum of a region of interest cropped from a calcaneum X-ray; b) log-log plot of energy (log E) vs. polar frequency (log f) (continuous line) and the regression line (dashed line). The Hurst coefficient is H = 1.465, and the fractal dimension is D = 2.535.
2.2 The Histogram Parametric Model of Texture Description
The distribution of gray levels within the digital image of the calcaneum bone is of particular importance to the characterization of bone mineralization, since the gray levels are directly related to the X-ray absorption in bone regions. The overall aspect of the gray-level histogram (the probability density function of the gray levels) changes significantly (see figure 3) between the different levels of mineralization determined by the pre- and post-menopausal stages. As can be seen from the plots in figure 3, the histograms exhibit different slopes and locations of the central mode. One mathematical model of practical interest for the characterization of a gray-level histogram in the defined regions of interest is the generalized exponential:

$$ h(x) = \exp(a x^2 + b x + c). \tag{3} $$
The most important parameters of the model (i.e. a and b) can be easily computed by the linear approximation (slope and intercept) of the derivative of the histogram's logarithm, since

$$ \frac{\partial \log h(x)}{\partial x} = 2 a x + b. \tag{4} $$
We shall use the a and b parameters from (4) as textural features describing the selected regions of interest.
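The estimation of a and b from (4) can be sketched as a finite-difference derivative of the log-histogram followed by a straight-line fit. This is an illustrative sketch assuming NumPy, not the authors' code: the function name is hypothetical, empty bins are skipped because their logarithm is undefined, and no smoothing is applied, although a noisy real histogram would probably need some.

```python
import numpy as np

def fit_exponential_histogram(hist):
    """Estimate a and b of h(x) = exp(a x^2 + b x + c) via Eq. (4)."""
    hist = np.asarray(hist, dtype=float)
    x = np.arange(len(hist), dtype=float)
    valid = hist > 0                       # log is undefined for empty bins
    xv, log_h = x[valid], np.log(hist[valid])
    # Numerical derivative of log h; second-order edge formulas are exact
    # when log h is a quadratic.
    d_log_h = np.gradient(log_h, xv, edge_order=2)
    two_a, b = np.polyfit(xv, d_log_h, 1)  # slope = 2a, intercept = b
    return two_a / 2.0, b

# Synthetic histogram with known (hypothetical) parameters.
gray = np.arange(256, dtype=float)
a_true, b_true, c_true = -0.001, 0.25, -20.0
synthetic = np.exp(a_true * gray ** 2 + b_true * gray + c_true)
a_hat, b_hat = fit_exponential_histogram(synthetic)
```

On this noise-free synthetic histogram the parameters are recovered to numerical precision; the constant c drops out of the derivative, which is why only a and b are used as features.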
3 Results
We analyzed a database of 24 cases, composed of two sets of 12 radiographs from two groups of women: a premenopausal group aged between 26 and 38 years (yrs)
Fig. 3. Regions of interest from the a) pre-menopausal and b) post-menopausal group; c), d) gray-level histograms (h(i) versus gray level i) of the regions in a) and b), respectively. The region of interest is region 0, located in the thalamic trabecular system. The appearance of the region of interest extracted from the post-menopausal group shows clear signs of bone demineralization, visible as an overall blurring of the image.
(mean age = 33 yrs, standard deviation = 3.8 yrs) and a postmenopausal group aged between 48 and 65 yrs (mean age = 56 yrs, standard deviation = 5.5 yrs). All images were acquired with a standard 2-megapixel consumer digital camera from the original X-ray films displayed on a negatoscope. For each digitized X-ray image, three 150 × 150 pixel regions of interest (ROIs) were manually selected in relevant regions of the main bone trabecular systems of the calcaneum (schematically shown in figure 4): ROI 0 corresponds to the thalamic region, ROI 1 to Ward's triangle, and ROI 2 to the region where the posterior plantar group of trabeculae intersects the thalamic group, as presented in figure 1. The complexity of human bone structure (as it appears in simple X-ray images) makes it impossible to use a single description technique (statistical or fractal) for accurate bone region discrimination and description. We thus propose the cooperative use of both fractal measures and first-order statistics for the characterization of the bone structure and the detection of postmenopausal bone modifications in the calcaneum by an automatic procedure. The decision about the presence or absence of osteoporosis is taken according to the feature vectors of the images, following a jackknife (leave-one-out) 3-nearest-neighbor (3NN) classification scheme. The decision regarding each
Fig. 4. Trabecular systems within the calcaneus bone: anterior apophysis system (1), plantar system (2, 4), achillean system (3), thalamic system (5) and Ward triangle (6)
calcaneum bone image is taken according to the known clinical decision associated with the majority of the images from the same database that are “closest” to the image under test in the feature space. Thus, the decision algorithm computes the distances between the feature vector of the image under test and those of all the other images in the database, selects the three images nearest to the given image, and identifies the clinical case with the most occurrences. Table 1 summarizes the most relevant results. The best identification (1 miss detection and 2 false alarms, giving an overall error of 12.50%) is obtained for a heuristic combination of fractal and statistical parameters (as suggested by previous studies [9] that aimed at the automatic detection of osteoporosis from calcaneal X-rays, but on a different collection of images). The robustness of the parameters was tested in a Monte Carlo style test: in each X-ray image, each of the regions of interest was shifted by random displacements of up to ±30 pixels. For the resulting ROIs, the histogram parameters and the fractal dimensions were computed, and the resulting feature vector was compared to the 24 basic feature vectors describing the images and classified according to a 3NN decision. The test was repeated 1000 times for each image. Overall there were 3 miss detections and 2 false alarms, which included the cases that generated miss detections and false alarms in the original classification (presented in Table 1). Thus, only 2 additional errors (8.33%) are induced by poor selection of the regions of interest.

Table 1. Decision results

Feature and region                                                  Miss detection ratio   False alarm ratio   Total error
Histogram parameters in ROI 0                                       16.66%                 33.33%              25.00%
Histogram parameters in ROI 1                                       25.00%                 25.00%              25.00%
Histogram parameters in ROI 2                                       50.00%                 33.33%              41.66%
Histogram parameters in all ROIs                                    16.66%                 25.00%              20.83%
Fractal dimension in ROI 2                                          41.66%                 16.66%              25.00%
Histogram parameters in ROI 0, 1 and fractal dimension in ROI 2     8.33%                  16.66%              12.50%
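The leave-one-out 3NN decision and the error ratios reported in Table 1 can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance in the feature space, the post-menopausal class taken as the positive class (so a miss detection is a post-menopausal case labeled pre-menopausal), and hypothetical function names and toy feature vectors.

```python
from collections import Counter
import math

def leave_one_out_knn(features, labels, k=3):
    """Classify each sample by majority vote among its k nearest other samples."""
    predictions = []
    for i, f in enumerate(features):
        others = [(math.dist(f, g), labels[j])
                  for j, g in enumerate(features) if j != i]
        nearest = sorted(others, key=lambda t: t[0])[:k]
        votes = Counter(label for _, label in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

def error_rates(labels, predictions, positive="post"):
    """Return (miss detection ratio, false alarm ratio, total error)."""
    pairs = list(zip(labels, predictions))
    n_pos = sum(1 for l, _ in pairs if l == positive)
    n_neg = len(pairs) - n_pos
    misses = sum(1 for l, p in pairs if l == positive and p != positive)
    false_alarms = sum(1 for l, p in pairs if l != positive and p == positive)
    return misses / n_pos, false_alarms / n_neg, (misses + false_alarms) / len(pairs)

# Toy feature vectors (hypothetical): two well-separated clusters.
features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["pre"] * 4 + ["post"] * 4
predicted = leave_one_out_knn(features, labels)

# Arithmetic of the best row of Table 1: 1 miss and 2 false alarms out of 12 + 12.
best_labels = ["post"] * 12 + ["pre"] * 12
best_predicted = ["pre"] + ["post"] * 11 + ["post"] * 2 + ["pre"] * 10
best = error_rates(best_labels, best_predicted)
```

With 12 cases per group, one miss gives 1/12 = 8.33%, two false alarms give 2/12 = 16.66%, and the total error is 3/24 = 12.50%, matching the last row of Table 1.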
4 Conclusions
We proposed a computer-vision-based approach to support osteoporosis screening by classification of the pre- or post-menopausal condition, based on the digital image processing of classical calcaneum X-ray images acquired by a simple consumer digital camera. The bone quality (and thus the degree of osteoporosis) is described by simple textural features. The inherent complexity of the bone structure makes it difficult to find a simple set of features that effectively describes the bone status at a single location. The cooperative (joint) use of fractal and statistical features computed within three clinically significant regions is shown to provide a fast, precise and robust classification. There are several directions for further work: the correlation of the pre-/post-menopausal group classification with the degree of osteoporosis, the use of more regions of interest for the computation of textural features, the testing of more textural features, and the automatic selection of the regions of interest (the automatic localization of the Ward triangle).
References
1. Bouxsein, M.L.: Mechanisms of osteoporosis therapy: a bone strength perspective. Clinical Cornerstone, pp. 13–21 (2003)
2. Pothuaud, L., Lespessailles, E., et al.: Fractal analysis of trabecular bone texture on radiographs: Discriminant value in postmenopausal osteoporosis. Osteoporosis Int. 8, 618–625 (1998)
3. Link, T.M., Majumdar, S.: Current diagnostic techniques in the evaluation of bone architecture. Current Osteoporosis Reports, pp. 47–52 (2004)
4. Apostol, L., Boudousq, V., et al.: Relevance of 2d radiographic texture analysis for the assessment of 3d bone micro-architecture. Medical Physics 33, 3546–3556 (2006)
5. Benhamou, C.L., Poupon, S., et al.: Fractal analysis of radiographic trabecular bone texture and bone mineral density: two complementary parameters related to osteoporotic fractures. J. Bone Mineral Res. 16, 697–704 (2001)
6. Haralick, R.M.: Statistical and structural approaches to texture. Proc. of IEEE 67, 786–804 (1979)
7. Pentland, A.: Fractal-based description of natural scenes. IEEE Trans. on Pattern Analysis and Machine Intelligence 6, 661–674 (1984)
8. Mandelbrot, B.B.: The Fractal Geometry of Nature. Freeman Publ., San Francisco CA (1983)
9. Vertan, C., Ionescu, B., Ştefan, I., Ciuc, M.: Osteoporosis Detection in Calcaneum X-Ray Images by Digital Image Processing: Fractal and Statistical Approaches. In: Interdisciplinary Applications of Fractal and Chaos Theory, pp. 264–273. Romanian Academy Publ, Bucharest, Romania (2004)
Blood Detection in IVUS Images for 3D Volume of Lumen Changes Measurement Due to Different Drugs Administration
David Rotger1,2, Petia Radeva1,2, Eduard Fernández-Nofrerías3, and Josepa Mauri3
1 Computer Science Department, Autonomous University of Barcelona, Spain
2 Computer Vision Center, Autonomous University of Barcelona, Spain
3 University Hospital Germans Trias i Pujol, Spain
Abstract. Lumen volume variation is of great interest to physicians, given that an increase in lumen volume reduces the probability of infarction. In this paper we present a fast and efficient method to detect the lumen borders in longitudinal cuts of IVUS sequences using an AdaBoost classifier trained with several local features whose stability is assured. We propose a criterion for feature selection based on stability under leave-one-out cross validation. Results on the segmentation of 18 IVUS pullbacks show that the proposed procedure is fast and robust, leading to a 90% time reduction with the same characterization performance.
1 Introduction
Intravascular Ultrasound (IVUS) images are an excellent tool for direct visualization of vascular pathologies and for evaluation of the lumen and plaque in coronary arteries. However, visual evaluation and characterization of plaque require the integration of complex information and suffer from substantial variability depending on the observer. This explains the difficulties of manual segmentation, whose final results are highly subjective. Automatic segmentation will save physicians time and provide objective vessel measurements [1]. Nowadays, the most common methods to separate the tissue from the lumen are based on gray levels and provide unsatisfactory segmentations. This leads to the use of more complex measures to discriminate lumen and plaque. One of the most widespread methods in medical imaging for this task is texture analysis. Texture analysis has played a prominent role in computer vision in solving problems of object segmentation and retrieval in numerous applications [2], [3]. This approach encodes the textural features of the image and provides a feature space in which a classification based on such primitives is easier to perform. In general, two approaches are used for texture analysis: supervised and unsupervised analysis. Our scheme uses supervised texture analysis. In both approaches, texture analysis has an important problem: the precise location of textured object boundaries. Previous works on the segmentation of IVUS images have shown different ways to segment the lumen and to classify tissues [4], [5], [6]. However, these approaches are usually semi-automatic and very sensitive to image artifacts.
The classification process is a critical step in any image segmentation problem. Arcing and boosting techniques have been applied successfully in different computer vision areas. In this paper we analyze the relevance of boosting techniques, and in particular AdaBoost, in Intravascular Ultrasound image analysis for a dual task: the creation of a strong classifier and feature selection. Moreover, we propose a fast and robust process to detect blood in IVUS image sequences. This process is integrated in an automatic framework for the discrimination of lumen and tissue in longitudinal cuts of the IVUS image sequence (figure 1). The method is divided into four steps: feature extraction, feature selection, classification, and higher-level organization of the data using deformable models. An objective evaluation of the different approaches is made and validated by physicians on patients with different pathologies and images with different topologies.
Fig. 1. IVUS images: (a) original Cartesian image, (b) longitudinal cut at 45° of the IVUS sequence
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 285–292, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Methods
We propose an algorithm for fast and accurate detection of the lumen borders on a longitudinal cut of the IVUS sequence. It consists of a pixel classification step followed by morphological postprocessing and a final segmentation. To detect the lumen borders in the IVUS sequences we use a learning approach, in particular the AdaBoost (Adaptive Boosting) classifier formulated by Freund and Schapire in [7]. This technique is a supervised learning and classification tool, created as a method for combining simple classifiers into a multiple classifier in order to obtain very high accuracy. Roughly, it is an iterative assembling process in which each classifier is devoted to finding a good division of the subset of samples that are most difficult to classify for the "weak" classifiers estimated up to that point. It is recognized as one of the most accurate classification processes.
2.1 AdaBoost Procedure
AdaBoost is an iterative method that allows the designer to keep adding "weak" classifiers until some desired low training error has been achieved [7], [8]. At each step of the process, a weight is assigned to each of the feature points. These weights measure how accurately the feature point is being classified at that stage. If it is accurately classified, its probability of being used in subsequent learners is reduced; otherwise it is increased. Thus, AdaBoost focuses on difficult training points at each stage. The classification result is a linear combination of the "weak" classifiers. The weight of each classifier grows with the amount of data that it classifies correctly.
2.2 AdaBoost as Feature Selection Process
As an additional feature, AdaBoost is capable of performing feature selection while training. To perform both tasks, feature selection and classification, a weak learning algorithm is designed to select the single features that best separate the different classes. That is, one classifier is trained for each feature, determining the optimal classification function (so that the minimum number of feature points is misclassified). Then the most accurate classifier-feature pair is stored at that stage of the process. If feature selection is not desired, the weak classifier considers all the features at once. The original training set consisted of vectors of 263 features: the gray level of the image; the gradient image; a mean of the gray level of the neighboring cuts; the standard deviation, the mean, and the quotient of both for local windows of different sizes surrounding the evaluated pixel; a bank of Gabor filters (a special case of wavelets, see [9]) for several frequencies and orientations; co-occurrence matrices (defined as the estimation of the joint probability density function of gray level pairs in an image [10]); Local Binary Patterns [11]; and the Fast Fourier Transform of local windows of different sizes surrounding the evaluated pixel.
2.3 Stability Criterion of the Selected Features
AdaBoost tries to assure the best performance given the training set and the set of features. However, several authors have argued that, even when the classifier performs well, the stability of the selected features is not guaranteed. Moreover, the authors of [12] state that "learning from examples" is a paradigm in which systems learn a functional relationship from a training set of examples. Within this paradigm, a learning algorithm is a map from the space of training sets to the hypothesis space of possible functional solutions. A central question for the theory is to determine conditions under which a learning algorithm will generalize from its finite training set to novel examples. Therefore, we need to characterize conditions on the hypothesis space that ensure generalization for the natural class of empirical risk minimization (ERM) learning algorithms, which are based on minimizing the error on the training set. In [13] the Feature Space Mapping model is proposed. It describes an adaptive system that measures the contribution of each feature to the final classifier, which is useful for reducing multidimensional searches to a series of one-dimensional searches. It is an exhaustive method to assure that each feature added to the system, combined with the rest, does not decrease the performance of the classifier.
To assure the stability of the features selected by the AdaBoost classifier and to reduce the variance of the classification error due to a poor choice of samples, we propose a method based on combining the features selected by several AdaBoost classifiers, taking into account their weight in each classifier. Each time, we left out all the samples from the IVUS pullbacks of one patient and trained a classifier with a randomized set of samples from the data of the remaining patients. Since we wanted to speed up blood detection to make it feasible in day-to-day clinical practice, we did a feature study similar to Feature Space Mapping [13]. We started with a classifier trained with only the feature of maximal accumulated weight and went up to a classifier trained with the 15 most relevant features according to their accumulated weights. The results were "surprising": as figure 2 shows, only 5 features were necessary to obtain an output similar to the one obtained with the 84 features selected by the original classifier.
Fig. 2. Output of the classifier trained with different features: (a) original image, (b) output with all the selected features, (c) output with the 5 most relevant features
2.4 Postprocessing and Segmentation of the Lumen
Once we have obtained the output of the classifier, we filter it using mathematical morphology to smooth the results and adapt a snake to them. The morphological filters used are a majority filter (which sets a pixel to 1 if five or more pixels in its 3-by-3 neighborhood are 1’s, and to 0 otherwise), to connect isolated points, and an opening (erosion followed by dilation) of the result with a cross as structuring element (see [14]). The last step to detect the lumen border is to adapt a B-Snake model to a distance map of the Canny edges of the previously obtained image (see [15], [16], [17]). Figure 3 shows the result of the filtering (a) and the adapted B-Snake model (b).
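The two morphological filters described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' MatLab code: the function names, the list-of-lists image representation, the toy input and the choice to treat pixels outside the image as 0 are all assumptions made for the example.

```python
from typing import List

Image = List[List[int]]

# Offsets (row, column) of the two neighbourhoods used below.
SQUARE_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
CROSS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]


def pixel(img: Image, r: int, c: int) -> int:
    """Return img[r][c]; pixels outside the image count as 0 (an assumption)."""
    if 0 <= r < len(img) and 0 <= c < len(img[0]):
        return img[r][c]
    return 0


def majority(img: Image) -> Image:
    """Set a pixel to 1 if five or more pixels of its 3x3 neighbourhood are 1."""
    return [[1 if sum(pixel(img, r + dr, c + dc) for dr, dc in SQUARE_3X3) >= 5 else 0
             for c in range(len(img[0]))]
            for r in range(len(img))]


def erode(img: Image) -> Image:
    """Keep a pixel only if the whole cross centred on it is 1."""
    return [[1 if all(pixel(img, r + dr, c + dc) for dr, dc in CROSS) else 0
             for c in range(len(img[0]))]
            for r in range(len(img))]


def dilate(img: Image) -> Image:
    """Set a pixel if any pixel of the cross centred on it is 1."""
    return [[1 if any(pixel(img, r + dr, c + dc) for dr, dc in CROSS) else 0
             for c in range(len(img[0]))]
            for r in range(len(img))]


def postprocess(classifier_output: Image) -> Image:
    """Majority filter followed by an opening (erosion, then dilation)."""
    return dilate(erode(majority(classifier_output)))


# Hypothetical classifier output: a 3x3 blob plus one isolated false positive.
toy = [[0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        toy[r][c] = 1
toy[0][0] = 1
cleaned = postprocess(toy)
```

On this toy input the isolated pixel disappears and the blob is reduced to a cross-shaped region, which is the kind of smoothed mask on which an edge detector and a snake could then be run.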
Fig. 3. B-Snake model adapted to the filtered classifier output: (a) filtered output of the classifier of figure 2(c), (b) detected edges (blue) and the adapted snake (red)
3 Results
A thorough validation process was performed to test the stability criterion, the performance of the final classifier and the speed-up of the process. First of all, we wanted to validate our feature stability criterion. We trained 9 different classifiers. As stated before, to assure the stability of the feature set, each time we left out all the samples from the IVUS pullbacks of one patient and trained a classifier with a randomized set of samples from the data of the remaining patients, performing a true leave-one-out cross validation. Each classifier reduced the dimensionality of the feature set from the 263 initial features to approximately 36 on average (see figure 4). However, if we accumulate the results of all the classifiers, we can see that 84 different features were selected by at least one classifier; that is, 179 features were not selected by any classifier (see figure 5). Moreover, the selected features with the highest weights are not the same in all cases, which shows the stability problem of feature selection using AdaBoost.
Fig. 4. Selected features by one AdaBoost classifier and their weights (plot of weight versus feature number)
Fig. 5. Addition of all the weights assigned by all the trained AdaBoost classifiers for each feature (plot of weight versus feature number)
In terms of performance, we trained new classifiers (always following the leave-one-out strategy for the cross validation) with a feature set increasing from 5 up to 15 features. Beginning with a classifier trained with the 5 features of highest weight in figure 5, at each step we added a new feature, trained a new AdaBoost classifier and tested its performance. Figure 6 shows a plot of the evolution of the mean error, the standard deviation and the median of the error of each multiple classifier.
Fig. 6. Evolution of the mean error (in blue, stars), standard deviation (in green, circles) and the median of the error (in red, crosses) of each multiple classifier (plotted against the number of features)
Fig. 7. Result of the lumen border detection with a multiple classifier trained with 5 features
In terms of speed, most of the features with the highest weights in figure 5 take almost the same time to compute, so we can speed up the process by choosing as few features as possible. We can see in figure 6 that a classifier trained with only 5 features achieves very good performance, and the processing time is reduced by more than 90% in comparison with the first approach of 84 features. We can now process a cut of 2400 images in a mean time of 1.32 minutes (written in MatLab, without code optimizations), taking into account that most of this time is devoted to the sequence decompression and the generation of the cut. Figure 7 shows the result of the segmentation of a longitudinal cut of 2400 images with only 5 features selected.
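The leave-one-patient-out accumulation of feature weights used in this section can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes that the weak classifiers are single-feature threshold rules (decision stumps), that a feature's "weight" is the sum of the AdaBoost coefficients (alpha) of the stumps that use it, and that all remaining samples are used for training instead of a randomized subset. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(row: Sequence[float], stump: Stump) -> int:
    """Weak classifier: threshold a single feature, output +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity


def best_stump(X, y, w) -> Tuple[float, Stump]:
    """Try every (feature, threshold, polarity) stump; keep the lowest weighted error."""
    best_err, best = float("inf"), (0, 0.0, 1)
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, stump) != yi)
                if err < best_err:
                    best_err, best = err, stump
    return best_err, best


def adaboost(X, y, rounds: int) -> List[Tuple[float, Stump]]:
    """Discrete AdaBoost with labels in {-1, +1}; returns (alpha, stump) pairs."""
    w = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(rounds):
        err, stump = best_stump(X, y, w)
        if err >= 0.5:          # no weak classifier better than chance
            break
        err = max(err, 1e-10)   # avoid dividing by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, stump))
        if err <= 1e-10:        # training error is already zero
            break
        # Increase the weight of misclassified samples, decrease the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, stump))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def accumulated_weights(X, y, patients, rounds: int = 10) -> Dict[int, float]:
    """Leave one patient out at a time and sum each feature's alphas."""
    totals: Dict[int, float] = defaultdict(float)
    for held_out in sorted(set(patients)):
        keep = [i for i, p in enumerate(patients) if p != held_out]
        model = adaboost([X[i] for i in keep], [y[i] for i in keep], rounds)
        for alpha, (feature, _, _) in model:
            totals[feature] += alpha
    return dict(totals)


def top_features(totals: Dict[int, float], k: int) -> List[int]:
    """Indices of the k features with the largest accumulated weight."""
    return sorted(totals, key=lambda f: -totals[f])[:k]


# Made-up data: 6 samples, 3 features, 3 patients; only feature 0 is informative.
X = [[0.1, 5.0, 1.0], [0.2, 1.0, 0.0], [0.9, 4.0, 1.0],
     [0.8, 2.0, 0.0], [0.15, 3.0, 0.0], [0.85, 3.5, 1.0]]
y = [-1, -1, 1, 1, -1, 1]
patients = [0, 0, 1, 1, 2, 2]
totals = accumulated_weights(X, y, patients)
selected = top_features(totals, 1)
```

In this toy case every held-out run selects the same feature, so the accumulated ranking is trivially stable. With real data the runs disagree, and summing the weights over the runs is what lets a small, consistently useful subset emerge.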
4 Conclusions and Future Work
We have proposed a novel stability criterion for feature selection with AdaBoost multiple classifiers and applied it to speed up blood detection in an IVUS sequence while maintaining performance. We have seen that with a multiple classifier trained with only 5 "well chosen" features we can detect the lumen borders of a longitudinal IVUS cut with a mean error of 5.24 pixels (0.105 mm) in 1.32 minutes. This time could be reduced by programming the feature calculation and the cut generation in a compiled language such as C++ instead of MatLab. The results are very promising for volume calculations to evaluate, for example, the effect of different drugs. In combination with the three-dimensional reconstruction of the vessel shape using angiography and the exact correspondence with the IVUS data proposed in [18], we can compute the volume of a vessel quickly and accurately.
Acknowledgments. This work was supported in part by a research grant from projects TIN2006-15308-C02 and FIS-PI061290.
References
1. Sonka, M., Zhang, X., Siebes, M., et al.: Segmentation of intravascular ultrasound images: A knowledge based approach. IEEE Trans. on Medical Imaging 14, 719–732 (1995)
2. Malik, J., Belongie, S., Leung, T., Shi, J.: Contour and texture analysis for image segmentation. International Journal of Computer Vision 43, 7–27 (2001)
3. Hofmann, T., Puzicha, J., Buhmann, J.M.: Unsupervised texture segmentation in a deterministic annealing framework. IEEE Trans. on Pattern Analysis and Machine Intelligence 20, 803–818 (1998)
4. Zhang, X., McKay, C.R., Sonka, M.: Tissue characterization in intravascular ultrasound images. IEEE Trans. on Medical Imaging 17, 889–899 (1998)
5. von Birgelen, C., van der Lugt, A., et al.: Computerized assessment of coronary lumen and atherosclerotic plaque dimensions in three-dimensional intravascular ultrasound correlated with histomorphometry. American Journal of Cardiology 78, 1202–1209 (1996)
6. Klingensmith, J., Shekhar, R., Vince, D.: Evaluation of three-dimensional segmentation algorithms for identification of luminal and medial-adventitial borders in intravascular ultrasound images. IEEE Trans. on Medical Imaging 19, 996–1011 (2000)
7. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory, pp. 23–37 (1995)
8. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features (2001)
9. Daugman, J.: Uncertainty relation for resolution in space, spatial frequency and orientation optimized by two dimensional visual cortical filters. Journal of the Optical Society of America 2, 1160–1169 (1985)
10. Dubes, R., Ohanian, P.: Performance evaluation for four classes of textural features. Pattern Recognition 25, 819–833 (1992)
11. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 971–987 (2002)
12. Poggio, T., Rifkin, R., Mukherjee, S., Niyogi, P.: General conditions for predictivity in learning theory. Nature 428, 419–422 (2004)
13. Duch, W., Diercksen, G.H.F.: Feature Space Mapping as a universal adaptive system. Computer Physics Communications 87, 341–371 (1995)
14. Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision, vol. I. Addison-Wesley Longman Publishing Co., Inc, Boston, MA, USA (1992)
15. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1, 321–331 (1988)
16. McInerney, T., Terzopoulos, D.: Deformable models in medical image analysis: a survey. Medical Image Analysis 1, 91–108 (1996)
17. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8, 679–698 (1986)
18. Rotger, D., Rosales, M., Garcia, J., Pujol, O., Mauri, J., Radeva, P.: Activevessel: A new multimedia workstation for intravascular ultrasound and angiography fusion. In: Proc. IEEE of Computers in Cardiology 30, 65–68 (2003)
Eigenmotion-Based Detection of Intestinal Contractions
Laura Igual1, Santi Seguí1, Jordi Vitrià1, Fernando Azpiroz2, and Petia Radeva1
1 Computer Vision Center and Universidad Autónoma de Barcelona, Bellaterra, Spain
2 Hospital de Vall d’Hebron, Barcelona, Spain
Abstract. Intestinal contractions are one of the main features for analyzing intestinal motility and detecting different gastrointestinal pathologies. In this paper we propose Eigenmotion-based Contraction Detection (ECD), a novel approach for the automatic annotation of intestinal contractions in video capsule endoscopy. Our approach extracts the main motion information of a set of contraction sequences in the form of eigenmotions using Principal Component Analysis. Then, it uses a selection of them to represent the high-dimensional motion data. Finally, this contraction characterization is used to classify the contraction sequences by means of machine learning techniques. The experimental results show that motion information is useful in contraction detection. Moreover, the proposed automatic method is essential to speed up the costly examination of video capsule endoscopy recordings.
1 Introduction
The analysis of small bowel contractions has proved to be a meaningful method for diagnosing several intestinal dysfunctions [1], [2]. Currently, the most widespread diagnostic test for motility disorders is intestinal manometry [3], which measures variations of pressure. This technique is highly invasive and unpleasant, and requires hospitalization of the patient. A recent and very promising alternative acquisition method is Wireless Capsule Video Endoscopy (WCVE) [4], which uses a capsule containing a camera, a battery, and a set of LED lamps for illumination. The capsule is swallowed by the patient and emits a radio frequency signal that is stored in an external device. The result is a video that records the "trip" of the capsule along the intestinal tract at a rate of two frames per second. Unlike manometry, this novel technique is non-invasive and does not require hospitalization of the patient. The resulting images provide a view of the inner gut in which the gut wall and lumen are visualized (Fig. 1 (right)). The visual paradigm of intestinal muscle contractions in WCVE is defined as a close-open movement of the lumen. Fig. 1 (left) shows three examples of intestinal contraction sequences. Each is defined by 9 frames, since a phasic contraction takes on average 4.5 seconds. Non-occlusive contractions (in which the lumen is not completely closed) are hard to detect by classical manometry, since the intestinal walls do not exert enough pressure. However, they are clearly identified in WCVE (Fig. 1 (third row)). Unfortunately, the examination and interpretation of the capsule recordings by the specialists is time consuming (it takes approximately two hours for each study) and very tedious, since the prevalence of contractions in the video is very low (1:50 frames). Thus, this procedure is not feasible as a clinical routine.
Fig. 1. Left: three different examples of contraction sequences. Right: example of a frame from capsule video endoscopy with a description of the parts in it.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 293–300, 2007. © Springer-Verlag Berlin Heidelberg 2007
Several works have been developed for the automatic detection in WCVE of intestinal affections, such as cancer, Crohn’s disease and small bowel ulcers [5], [6], [7]. Nevertheless, to the authors’ knowledge WCVE has not been used to analyze intestinal motility disorders, except for the work presented in [8]. We propose Eigenmotion-based Contraction Detection (ECD), a new method for the automatic detection of intestinal contractions in WCVE based on the dynamic patterns present in the video. Unlike this method, the approach in [8] only uses static information of 9 frames and does not exploit the motion structure of the capsule recordings. From now on, we will refer to the method developed in [8] as Static-based Contraction Detection (SCD). In later sections we show that including motion analysis drastically improves the results. To study the motion present in WCVE, we use a robust method for optical flow estimation [9]. Afterwards, we learn the main motion structures from a particular training set of optical flow fields using Principal Component Analysis (PCA) [10]. Then, we use a selection of the learned eigenmotions to characterize the contraction sequences. This new data representation is applied in the subsequent classification of the video sequences as contraction or non-contraction sequences. In previous works, such as [11] and [12], the authors propose frameworks for learning generic, expressive optical flow priors, which are later used to improve the optical flow computation. We also extract the main optical flow patterns, but our goal is to use them for classification.
The paper is organized as follows: in Section 2 we introduce the methodology employed for the automatic detection of intestinal contractions, in Section 3 we present the experimental results, and finally, in Section 4, we present the conclusions and future work.
2 Methodology
The proposed methodology has five steps: 1) Filtering, 2) Motion Estimation, 3) Eigenmotion Extraction, 4) Eigenmotion Selection and 5) Classification.
1) Filtering: The movement of intestinal contractions in WCVE can be hidden by technical and physiological effects. On the one hand, the free movement of the camera inside the gut leads to an unsteady orientation. On the other hand, intestinal juices mixed with food remains can also prevent the correct visualization of contraction events. These visual artifacts make a subset of the video frames useless. The aim of this step is to reject such frames, which are labelled as turbid, wall and tunnel frames and generally represent 30% of the video frames. Turbid frames are those in which turbid intestinal liquid appears. As proposed in [8], we characterize them in terms of color and classify them. Wall frames are due to a stable orientation of the camera towards the intestinal wall, which keeps the intestinal lumen completely out of the field of view. Tunnel frames are the result of a continuous orientation of the camera towards the intestinal lumen during an undefined period of time in which the gut does not move. Both wall and tunnel frames are described by the sum of the lumen area throughout the sequence of 9 frames, which is compared with two thresholds set empirically with the help of the experts. To estimate the area of the lumen in each frame, a Laplacian of Gaussian (LoG) filter is applied [8].
2) Motion Estimation: Motion estimation is a key problem in computer vision. It consists of computing the apparent motion of the elements present in an image sequence, called the optical flow field. Many approaches exist to carry out this estimation, but we have chosen the method proposed in [9] for its proven robustness [9], [12]. It is a framework based on robust estimation that addresses the problem of improving the accuracy of flow estimates in regions containing multiple motions by relaxing the single motion assumption. It estimates the dominant motion accurately, ignoring other existing motions, by minimizing a functional with robust data and smoothness terms. In addition, the approach detects where the single motion assumption is violated (i.e., where the error measure is large), and these positions are examined to see if they correspond to a consistent motion. Before computing the optical flow field of endoscopy video sequences, we try to correct any deviation of the lumen position due to the capsule movement. For that, we simply translate the image so that the centroid of the lumen coincides with the center of the image. Fig. 2 shows the estimated optical flow fields of consecutive frames of a contraction sequence.
3) Eigenmotion Extraction: The purpose of this step is to find motion models for the class of motions of intestinal contraction sequences, denoted C. This problem has many applications, including dimensionality reduction, visualization, exploratory data analysis and pattern recognition. In some cases the models can be derived analytically by imposing some hypothesis on the data, e.g. the assumption that the motions follow a parametric affine model. However, in many cases, as in the current problem, more complex motions are considered, and PCA can be used to find a basis of optical flow fields from an available training ensemble [10], [12], [13].
Fig. 2. Optical flow computed pairwise between the 1st and 5th frames of a contraction sequence. The optical flow fields are superimposed on the 2nd to 5th frames.
Let us define the training set of representative samples of the class C as a matrix $X = \{v_i\}_{i \in \{1,\dots,N\}}$, where $v_i$ denotes a particular optical flow field and $N$ is the number of optical flow fields in the set. Note that, since there are 8 motions in each contraction sequence, $N = 8c$, where $c$ is the number of contraction sequences in the set. Moreover, we only consider the optical flow within a defined region of interest, which contains the lumen of the video endoscopy images. Thus, each vector $v_i$ has dimension $2D$, where $D$ is the area of the region of interest (15373 pixels). Limiting the optical flow to this region prevents the data from being disturbed by possible boundary effects. We apply PCA to the $2D \times N$ matrix $X$ to obtain a decomposition $X = WH$, where $W = \{w_i\}_{i \in \{1,\dots,R\}}$ is a $2D \times R$ matrix containing an orthonormal basis of optical flow fields in its columns, and $H = \{h_i\}_{i \in \{1,\dots,R\}}$ is an $R \times R$ matrix. The elements of this basis are the eigenvectors of the covariance matrix of the set of optical flow fields, and we call them eigenmotions. An approximation $\tilde{v}_j$ of the optical flow fields in the class C can be obtained as a linear combination of eigenmotions:
$$\tilde{v}_j = \sum_{k=1}^{r} w_k h_k, \qquad j = 1, \dots, R, \qquad (1)$$
where $r \in \{1, \dots, R\}$. The question that now arises is how much "better" the new basis of eigenmotions is as a data representation and how much we can reduce the data dimensionality. The most widespread way to select the best dimension $r$ for the approximations is to evaluate the cumulative energy content of the eigenmotions. Nevertheless, we propose a more intelligent way to select the basis elements, which helps us to better understand the data by indicating which eigenmotions are the important ones. This subject is discussed in the following step.
4) Eigenmotion Selection: In this step we deal with the problem of selecting the "best" eigenmotions. Eigenmotions are actually features of the data; thus, this procedure is called feature selection or eigenmotion selection. By removing the most irrelevant and redundant eigenmotions, the effect of the curse of dimensionality is alleviated, which facilitates the subsequent classification step.
In our case, we apply the method presented in [14] for feature selection. It requires the optimization of a global loss function that contains one term associated with the empirical loss and a regularization constraint on the parameters $w$, plus a further constraint that promotes sparsity in the distribution of the features. Given a new parameter vector $v \in \mathbb{R}^d$, the global loss is defined as follows:
$$G(w, v) = L(X, w, v) + \lambda \left( \|w\|_1 + \|\sigma_k^d(v)\|_1 \right). \qquad (2)$$
$L(X, w, v)$ is the negated log-likelihood estimator for the parameter set, $\lambda$ is a positive real value, and the two terms $\|w\|_1 + \|\sigma_k^d(v)\|_1$ form the imposed $L_1$ restriction on the parameters $(w, v)$, where $\sigma_k^d : \mathbb{R}^d \to \mathbb{R}^d$ is defined as $\sigma_k^d(a) = (\sigma_k(a_1), \dots, \sigma_k(a_d))$ for all $a = (a_1, \dots, a_d) \in \mathbb{R}^d$, $k$ is any positive real value, and $\sigma_k : \mathbb{R} \to \mathbb{R}$ is the sigmoid function defined as $\sigma_k(a) = \frac{1}{1 + \exp(-ka)}$ for all $a \in \mathbb{R}$. The loss function (2) represents a preference for solutions that use a small set of components from the parameters. The optimization of this functional is carried out using the Boosted Lasso algorithm as explained in [14]. The optimization needs an appropriate and carefully selected training set whose samples are very representative of contractions. As a result of this task we obtain a subset of $p$ eigenmotions of the basis $W$, represented as a $2D \times p$ matrix $\bar{W}$. The selected subset of eigenmotions of the basis is used to represent optical flow data. A set of $M$ contraction sequences provides $8M$ optical flow fields, stored in a $2D \times 8M$ matrix $U$. Then we transform $U$ into a $p \times 8M$ matrix $Y$ as follows: $Y = \bar{W}^T U$. The matrix $Y$ contains $M$ data of $8p$ dimensions. Note that we reduce the dimension of the data from $16D$ to $8p$.
5) Classification: Once the optical flow fields are transformed to the eigenmotion-based representation, this new codification is the input of a classifier, which decides whether new video sequences are intestinal contraction sequences or not. In particular, we use the Relevance Vector Machine (RVM) classifier [16], since it has desirable properties, namely soft decisions and low complexity (number of kernels). Unlike the previous step, the present training process requires a rather large data set of video sequences, since our goal now is to achieve the best generalization error.
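Two of the operations in the steps above lend themselves to a short sketch: the cumulative-energy rule mentioned in step 3 for choosing the dimension r, and the change of representation of step 4, in which the 8 flow fields of a sequence are projected onto the p selected eigenmotions. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, vectors are plain Python lists, the eigenvalues and eigenmotions are assumed to be already available from PCA, and the numbers are made up.

```python
from typing import List, Sequence


def components_for_energy(eigenvalues: Sequence[float], fraction: float) -> int:
    """Smallest r whose r largest eigenvalues hold `fraction` of the total energy."""
    total = sum(eigenvalues)
    running = 0.0
    for r, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running >= fraction * total:
            return r
    return len(eigenvalues)


def sequence_descriptor(flows: Sequence[Sequence[float]],
                        eigenmotions: Sequence[Sequence[float]]) -> List[float]:
    """Project each flow field of a sequence onto every selected eigenmotion.

    flows: the 8 flow fields of one sequence, each a vector of length 2D.
    eigenmotions: the p selected basis vectors, each of length 2D.
    Returns the 8p projection coefficients, concatenated over the sequence.
    """
    return [sum(wi * vi for wi, vi in zip(w, v)) for v in flows for w in eigenmotions]


# Made-up example with D = 2 (vectors of length 4) and p = 1 selected eigenmotion.
selected = [[0.5, 0.5, 0.5, 0.5]]                 # unit-norm toy eigenmotion
flows = [[float(t)] * 4 for t in range(8)]         # 8 toy flow fields
descriptor = sequence_descriptor(flows, selected)  # length 8 * p = 8
r = components_for_energy([6.0, 3.0, 1.0], 0.8)    # two components reach 80%
```

With the region of interest reported above (D = 15373), the raw motion data of one sequence has 16D = 245968 values, whereas the descriptor has only 8p values.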
3 Experimental Results
Our tests were carried out with data provided by the Digestive Diseases Research Group of Hospital ”Vall d’Hebron” in Barcelona, Spain, who are currently working with the endoscopy capsule developed by Given Imaging, Ltd., Israel [17]. Videos from healthy subjects and subjects with intestinal dysfunctions (here called patients) were considered. We first generated a set of intestinal contraction sequences extracted from a healthy subject video endoscopy, with the supervision of the specialists for assuring the representativeness of the selected group. This set was used to extract
298
L. Igual et al.
Fig. 3. The six first eigenmotions. The selected one is pointed by a black box.
the eigenmotions, applying steps (2) and (3). In Fig. 3 we show the first six extracted eigenmotions (ordered by decreasing eigenvalue). The first and second ones correspond to translation motions; they are probably due to the global camera motion. The next ones are dilatation, rotation and two shear deformations. The eigenmotion selection performed in step (4), with the parameters set to λ = 0.5 and k = 10, returned only one selected eigenmotion, which is marked by a black box in Fig. 3. As can be seen, it represents contraction-dilatation movements. This motion is the one we would expect to find in the paradigm of intestinal contraction motion. Once the selected eigenmotion was available, new motion data were transformed to the eigenmotion-based representation, resulting in data of dimension 8, since p = 1. We evaluated the performance of step (5) using a ten-fold cross-validation strategy [10] on a larger data set. We considered a pool of 10 different videos from healthy subjects and patients. We first filtered these videos using step (1). The remaining frames were reviewed by a specialist, who indicated the center of each contraction sequence in them. This generated a set of 11030 sequences. These findings were used as the gold standard for testing our system. The test set was composed of these contraction sequences and an under-sampled set of non-contraction sequences randomly chosen from the same pool. The results were validated using several measures, described in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), as follows: Error = FP + FN, Sensitivity = TP/(TP + FN), Specificity = TN/(TN + FP), Precision = TP/(TP + FP) and False Alarm Ratio, FAR = FP/(TP + FN). Table 1 summarizes the results of the Eigenmotion-based Contraction Detection (ECD) using two different data representations.
The first one used the first 51 eigenmotions, which retained 99% of the data variance, and led to data of dimension 408. The second one used the eigenmotion selected in step (4), and the data had dimension 8. As can be seen, the simpler representation gave
Table 1. Classification results of our ECD method for two different data representations

                   Error    Sensitivity  Specificity  Precision  FAR
ECD (dim = 408)    31.23%   69.29%       68.30%       68.96%     32.04%
ECD (dim = 8)      29.51%   70.30%       70.68%       70.57%     29.32%

Table 2. Classification results of the methods SCD and CCD

                   Error    Sensitivity  Specificity  Precision  FAR
SCD (dim = 54)     22.90%   73.55%       80.62%       79.14%     28.46%
CCD (dim = 62)     18.73%   77.58%       84.95%       83.75%     24.20%
better results (second row) than the more complex one (first row). This means that the selected eigenmotion represents an excellent discriminant feature. Finally, in Table 2 (first row), we display the results obtained using the Static-based Contraction Detection (SCD) [8]. This method characterizes intestinal contractions with 54 features, instead of 8 as in our case, and its evaluation measures were, in general, better than the ones obtained with ECD. Nevertheless, the results of the Combined Contraction Detection (CCD), which uses both the static and the dynamic features, led us to conclude that motion information is important. In this case, sequences were defined in a 62-dimensional space. This method led to the best results (second row of Table 2). Moreover, these results represent high specificity and sensitivity rates compared to the rates obtained by the specialists.
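The evaluation measures used in Tables 1 and 2 can be written down as a short sketch. The function name and the toy counts are hypothetical. The text defines Error as the count FP + FN, while the tables report a percentage; the `error_rate` entry below therefore assumes normalization by the total number of test sequences, which the text does not state explicitly.

```python
def evaluation_measures(tp, tn, fp, fn):
    """Return the measures defined in the text from the four confusion counts."""
    total = tp + tn + fp + fn
    return {
        "error_count": fp + fn,               # Error = FP + FN, as defined in the text
        "error_rate": (fp + fn) / total,      # assumption: normalized by all test samples
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "far": fp / (tp + fn),                # False Alarm Ratio
    }

# Toy counts (not taken from the paper): 100 contractions, 100 non-contractions.
m = evaluation_measures(tp=70, tn=70, fp=30, fn=30)
for name, value in m.items():
    print(name, value)
```

Note that FAR divides the false positives by the number of true contractions (TP + FN), so it can exceed 1 when false alarms outnumber real events.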
4 Conclusions and Future Work
Our contribution represents a significant step towards the introduction of the WCVE for the diagnosis of intestinal motility diseases. Our proposal combines classical and new procedures to achieve a robust and efficient method for automatic detection of intestinal contractions. Our method obtains high specificity and sensitivity rates and can be used to drastically reduce the time needed to analyze video capsule endoscopy. The presented work can be extended in several ways. Other methods for dimensionality reduction could be used here instead of PCA, for instance Independent Component Analysis (ICA) or Non-negative Matrix Factorization (NMF). These alternative approaches will be considered in future work. Moreover, beyond general contraction detection, the characterization of motility based on eigenmotions can be useful in other applications, for instance in the definition, classification and detection of different types of contractions. In this direction, future work implies multi-class classification tasks and other sophisticated techniques.
Acknowledgements This work was supported in part by a research grant from Given Imaging Ltd., Yoqneam Israel, and Hospital Universitari ”Vall d’Hebron” Barcelona, Spain, as well as the project TIN2006-15308-C02.
References
1. Kellow, J., Delvaux, M., Azpiroz, F., et al.: Principles of applied neurogastroenterology: physiology motility-sensation. Gut 45(2), 1117–1124 (1999)
2. Quigley, E.M.: Gastric and small intestinal motility in health and disease. Gastroenterology Clinics of North America 25, 113–145 (1996)
3. Hansen, M.B.: Small intestinal manometry. Physiological Research 51, 541–556 (2002)
4. Iddan, G., Meron, G., et al.: Wireless capsule endoscopy. Nature 405, 417 (2000)
5. Tjoa, M.P., Krishnan, S.M.: Feature extraction for the analysis of colon status from the endoscopic images. Biomedical Engineering OnLine 2, 3–17 (2003)
6. Karkanis, S.A., Iakovidis, D.K., et al.: Computer aided tumor detection in endoscopic video using color wavelet features. IEEE Trans. on Information Technology in Biomedicine 7, 141–152 (2003)
7. Zheng, M.M., Krishnan, S.M., et al.: A fusion-based clinical support for disease diagnosis from endoscopic images. Comp. Biology and Medicine 35, 259–274 (2005)
8. Vilariño, F., Spyridonos, P., Vitrià, J., Azpiroz, F., Radeva, P.: Cascade analysis for intestinal contraction detection. CARS, 9–10 (2006)
9. Black, M.J., Anandan, P.: The robust estimation of multiple motions: parametric and piecewise-smooth flow fields. Comput. Vis. Image Underst. 63, 75–104 (1996)
10. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience Publication, New York, NY, USA (2000)
11. Roth, S., Black, M.J.: On the spatial statistics of optical flow (2005)
12. Fleet, D.J., Black, M.J., Yacoob, Y., Jepson, A.D.: Design and use of linear models for image motion analysis. Int. J. Comput. Vision 36(3), 171–193 (2000)
13. Skocaj, D., Bischof, H., Leonardis, A.: A robust PCA algorithm for building representations from panoramic images. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359, pp. 761–775. Springer, Heidelberg (2002)
14. Lapedriza, A., Seguí, S., Masip, D., Vitrià, J.: A sparse Bayesian approach for joint feature selection and classifier learning. Technical Report CVC (2007)
15. Zhao, P., Yu, B.: Boosted lasso (2004)
16. Tipping, M.: The relevance vector machine. In: Advances in Neural Information Processing Systems, Morgan Kaufmann, Seattle, Washington, USA (2000)
17. Given Imaging, L.: (2007) http://www.givenimaging.com
Brain Tissue Classification with Automated Generation of Training Data Improved by Deformable Registration

Daniel Schwarz¹ and Tomas Kasparek²

¹ Institute of Biostatistics and Analyses, Masaryk University, Kamenice 3, 625 00 Brno, Czech Republic
[email protected]
² Clinic of Psychiatry, Faculty Hospital Brno, Jihlavska 20, 625 00 Brno, Czech Republic
[email protected]

Abstract. Methods of tissue classification in MRI brain images play a significant role in computational neuroanatomy, particularly in automated ROI-based volumetry. A well-known and very simple k-NN classifier is used here without the need for user input during the training process. The classifier is trained with the use of tissue probability maps, which are available in selected digital brain atlases. The influence of misalignment between images and the tissue probability maps on the classifier's efficiency is studied in this paper. Deformable registration is used here to align the images and maps. The classifier's efficiency is tested in an experiment with data obtained from the standard Simulated Brain Database.

Keywords: image analysis, image registration, MRI, computational neuroanatomy, brain tissue classification, atlas-based segmentation.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 301–308, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Methods of brain tissue classification in MRI data are often used in whole-brain morphometry methods, such as voxel-based morphometry (VBM) [1]. The brain tissues are segmented and their density loss is analysed in voxel-by-voxel tests [2], [3] in order to clarify how morphological changes are implicated in neuropsychiatric diseases and to illustrate the relationship between brain morphology and function. Despite the widespread use of VBM in the neuroimaging community [4], the conventional region of interest (ROI)-based methods still play an important role and remain the gold standard in volumetry of anatomical structures. These methods often require manual segmentation, which makes them time-consuming, subjective and error-prone. To address the problem of manual segmentation, a number of automated unbiased techniques have emerged. In [5], a deformable registration is used to warp a prelabeled image, which serves as an atlas, to a subject image. Automated segmentation of gross anatomical structures is thus achieved and it is further improved by tissue classification and several heuristic rules.

Methods of image registration are still of interest to many researchers [6], [7]. The most challenging tasks are connected to intersubject multimodal registration, which does not assume that MRI tissue intensities and contrasts remain unchanged across subjects [8]–[10]. Even if a high-dimensional registration is used, it cannot be assumed that applying the resulting deformations to a prelabeled atlas yields a perfect segmentation, mainly in small structures with a complex shape, e.g. the hippocampus. After deforming a hippocampus mask obtained from an atlas, the voxels belonging to the white matter or cerebrospinal fluid classes may be subtracted from the neighbourhood of the mask, refining its shape in this way.

In this paper, an algorithm for tissue classification is tested for its further use in atlas-based segmentation. Tissue probability maps (TPMs) are used to train the classifier and the influence of misalignment between images and TPMs on the classifier's efficiency is studied. The misalignment is caused by intersubject anatomical variability and is partially suppressed with the use of deformable registration.
2 Methods

The efficiency of algorithms for brain tissue segmentation in MRI images is tested here. A simple algorithm, usually used for supervised classification, is tested together with an automated training process based on TPMs. Next, a prelabeled image is deformed according to the transformation found by deformable registration of an atlas with a target image. Simulated images with ground-truth segmentations are used for evaluation purposes.

2.1 Unsupervised Classification

The k-Nearest Neighbour (k-NN) classification rule is proposed here as it is a straightforward and intuitive method. It is a technique for nonparametric supervised pattern classification. It uses a training set of N prototype feature vectors and the corresponding correct classification of each prototype into one of C classes. For a given vector, it searches for the k nearest vectors in the training set according to a given metric, traditionally the Euclidean distance. The class of the vector is then decided by the majority class among the k nearest neighbours [7]. The size of the feature vectors depends on the data to be classified. In the case of MRI brain images the feature vectors contain one or more intensities from T1-weighted, T2-weighted or proton density (PD)-weighted images. In order to avoid user input in the training set construction, an automated selection of the prototypes is proposed here. Decisions about the spatial locations of the prototypes are not made by a user but according to the TPMs. For each spatial location in the stereotaxic space, the TPM value expresses the probability of the occurrence of the given tissue in that location. The spatial probability distribution represents the subject population from which the particular atlas is made. In the proposed algorithm, the set of prototypes contains intensities sampled at the most probable locations of the main brain tissues.
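The k-NN rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `knn_classify`, the prototype intensities and the class labels are hypothetical, and the prototypes are given directly rather than sampled from TPMs.

```python
import math
from collections import Counter

def knn_classify(vector, prototypes, k=5):
    """Label a feature vector by majority vote among its k nearest prototypes.

    prototypes is a list of (feature_tuple, label) pairs; distances are Euclidean.
    """
    nearest = sorted(prototypes, key=lambda p: math.dist(p[0], vector))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy one-dimensional prototypes (made-up intensities, three per tissue class).
prototypes = [
    ((78.0,), "WM"), ((80.0,), "WM"), ((82.0,), "WM"),
    ((118.0,), "GM"), ((120.0,), "GM"), ((122.0,), "GM"),
    ((198.0,), "CSF"), ((200.0,), "CSF"), ((202.0,), "CSF"),
]

print(knn_classify((119.0,), prototypes))  # GM
print(knn_classify((85.0,), prototypes))   # WM
print(knn_classify((190.0,), prototypes))  # CSF
```

Feature vectors are stored as tuples so that the same function works unchanged when several intensities (for example T1, T2 and PD) are combined into one vector.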
The same number of prototypes is taken for each of the tissues: white matter (WM), gray matter (GM) and cerebrospinal fluid (CSF). The number of prototypes affects the final classifier efficiency. A higher number of prototypes means sampling at locations with lower probabilities of the given tissue's occurrence. Hence, it produces more false tissue pairs, i.e. intensity pairs representing
two different tissues in the atlas and in the image. Due to the intersubject anatomical variability, even sampling at the locations with very high probabilities of a given tissue's occurrence can bring a fraction of false tissue pairs. On the other hand, a very low number of prototypes produces a limited coverage of the brain area. Due to the inhomogeneity of brain tissues throughout the brain and due to the MRI intensity nonuniformity, sampling at a very low number of locations is not a good way to prevent false prototypes in the training data. Here, a two-step strategy is proposed to improve the training dataset obtained with the use of TPMs: 1) reduction of the intersubject anatomical variability with the use of deformable registration, 2) pruning away the prototypes which are the most distant from the median prototype for each class.

2.2 Deformable Registration

The algorithm used here for nonlinear multimodal intersubject registration is inspired by Rogelj's approach in [8] and it is described in detail in [9]. The registration method operates directly on image intensity values with no data reduction by segmentation or classification. The scheme of the algorithm is in Fig. 1. At first, local translations f (usually referred to as forces) are calculated at each voxel or pixel as the derivative of a local similarity measure derived from a global mutual information. Then, a spatial deformation model is used to compute the displacement u of the floating image N from the local forces. The process is iterated until the global similarity measure stops increasing. A multiresolution strategy propagating solutions from coarser to finer levels is applied to speed up the convergence of the algorithm and to avoid local optima. The combined elastic-incremental spatial deformation model proposed in [8] is used here. It combines the elastic and the incremental elastic approach to deformable registration based on continuum mechanics. Its design follows the concept of solving the partial differential equation associated with linearized elasticity or viscosity by convolution filtering, where the filter kernel equals the impulse response of the deformable medium. The displacement is computed as a reaction to local forces exerted in the floating image N by [8]:
$$u_f = c \cdot f, \qquad u^{(i)} = \left( u^{(i-1)} + u_f^{(i)} * G_I \right) * G_E. \tag{1}$$
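A one-dimensional sketch may help to read the displacement update. It assumes that the update in (1) has the form u^(i) = (u^(i-1) + u_f^(i) * G_I) * G_E with u_f = c · f, that both kernels are normalized Gaussians truncated at about three standard deviations, and that border samples are replicated; the function names and all numeric values are illustrative only and are not taken from [8] or [9].

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma (assumed cut-off)."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve(signal, kernel):
    """Same-size convolution with replicated borders (an assumed boundary rule)."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def update_displacement(u_prev, forces, c, g_i, g_e):
    """One iteration: u = (u_prev + (c * f) * G_I) * G_E."""
    u_f = [c * f for f in forces]                      # unregularized displacement
    incremental = convolve(u_f, g_i)                   # incremental regularization
    summed = [a + b for a, b in zip(u_prev, incremental)]
    return convolve(summed, g_e)                       # elastic regularization

# Toy example: a single force in the middle of a 21-sample line.
forces = [0.0] * 21
forces[10] = 2.0
u = update_displacement([0.0] * 21, forces, c=0.5,
                        g_i=gaussian_kernel(1.0), g_e=gaussian_kernel(1.5))
print([round(v, 4) for v in u])
```

Because both kernels sum to one, the point force is spread over its neighbourhood while the total displacement stays equal to c · f, which is the smoothing effect the regularization is meant to provide.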
The first part of (1) follows Hooke's law to compute unregularized displacements. It says that the points move proportionally to the applied forces f with a constant c. The second part of (1) regularizes the displacements by the convolution filters G_I and G_E, which define the spatial deformation properties of the modelled material. A Gaussian kernel is used here as a separable approximation to the elastic kernel. It does not provide control over compressibility, due to the independence of spatial dimensions. While this property is disadvantageous in particular registration tasks, it is required in the case of intersubject registration.

2.3 Synthetic Images and Deformations

In real biomedical problems, the correct deformations between images or between an image and an atlas are unknown. To assess the quality of automated classification,
correctly registered images, together with their segmentations, are required in advance. Realistic T1-, T2- and proton density (PD)-weighted data with 3% noise and 20% intensity nonuniformity (INU) from the Simulated Brain Database (SBD) are used here for evaluation purposes. These values of noise and INU are reported as typical artifact severity in [11]. Images with voxels classified according to the tissue they represent are available. A disadvantage of the evaluation technique may reside in the synthetic character of the initial deformations, which may differ from the real intersubject anatomical variability. Thus, the synthetic deformations are produced here by two completely different anatomical variability simulators, which are described below. Additionally, the performance of the proposed methods is evaluated in complex studies involving registration and classification experiments on many images deformed by various synthetic deformations. Due to the computational complexity of the classification and registration methods, the evaluation is done on 2D data. Both anatomical variability simulators are used to produce 20 random, image-dependent deformations for 20 transversal slices extracted from each SBD volume (T1-, T2- and PD-weighted). The deformations are normalized in order to use them for simulating various degrees of anatomical differences, expressed by the maximum absolute initial displacement |uinitMAX| ∈ {5 mm, 10 mm, 12 mm}.

RGsim - simulator based on Rogelj's Gaussians. Rogelj's combined elastic-incremental spatial deformation model is used to regularize random force fields in an iterative process. The standard deviations of the Gaussian convolution filters are set randomly to σGI = 4 ± 2 mm and σGE = 8 ± 2 mm in each iteration. Random forces are generated at 10% randomly selected points from the points which form edges in the image. The maximum force magnitude is gradually decreasing, |fMAX| ~ Ki/i, where Ki = 3 ± 2 is a randomly preset total number of iterations and i is the current iteration. The diffeomorphicity of the obtained deformation with |uinitMAX| = 12 mm is checked with the minimum determinant of its Jacobian in each iteration, and σGI and σGE are linearly increased if the condition of topology preservation is not met.

TPSsim - simulator based on thin-plate splines. Multilevel deformations with thin-plate splines as basis functions are generated. Forces (translations of control points) are set in the centers of blocks. The number of blocks increases in the subdivision scheme from 16 in the second level to 4096 in the sixth level (the first level is left out). The translations are set equal to the displacements obtained in previous levels. Only in randomly selected blocks are the translations set differently to refine the deformation. The probability of the block selection is decreasing, P = 0.6^(level−2), as is the maximum force magnitude, |fMAX| ~ 0.75^(level−2). The total number of changed translations is 95 on average. The diffeomorphicity of the obtained deformation with |uinitMAX| = 12 mm is checked with the minimum determinant of its Jacobian in each iteration, and P and |fMAX| are linearly decreased if the condition of topology preservation is not met.
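The topology-preservation check mentioned for both simulators can be sketched as follows. The sketch assumes a 2D displacement field sampled on a unit pixel grid, the transformation T(x) = x + u(x), and forward differences for the derivatives; the function names and the two toy fields are hypothetical and are not part of either simulator.

```python
def min_jacobian_determinant(ux, uy):
    """Minimum determinant of the Jacobian of T(x) = x + u(x) on a 2D grid.

    ux[r][c] and uy[r][c] are the displacement components (in pixels) along the
    column (x) and row (y) directions; forward differences approximate derivatives.
    """
    rows, cols = len(ux), len(ux[0])
    min_det = float("inf")
    for r in range(rows - 1):
        for c in range(cols - 1):
            dux_dx = ux[r][c + 1] - ux[r][c]
            dux_dy = ux[r + 1][c] - ux[r][c]
            duy_dx = uy[r][c + 1] - uy[r][c]
            duy_dy = uy[r + 1][c] - uy[r][c]
            det = (1.0 + dux_dx) * (1.0 + duy_dy) - dux_dy * duy_dx
            min_det = min(min_det, det)
    return min_det

def preserves_topology(ux, uy):
    """A deformation is accepted here if the Jacobian determinant stays positive."""
    return min_jacobian_determinant(ux, uy) > 0.0

# Toy fields on a 4 x 4 grid: no displacement, and a folding displacement.
zero = [[0.0] * 4 for _ in range(4)]
folding_ux = [[-2.0 * c for c in range(4)] for _ in range(4)]

print(preserves_topology(zero, zero))        # True
print(preserves_topology(folding_ux, zero))  # False
```

A determinant at or below zero means that the mapping folds or collapses the grid locally, which is the situation in which the simulators adjust their parameters.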
3 Experiment and Results

The experiment included five segmentation and classification strategies, see Fig. 1:
1) brain tissue segmentation based on deforming a prelabeled image,
2) k-NN classification with the use of TPMs,
3) k-NN classification with the use of TPMs matched to the classified image with the use of registration results,
4) k-NN classification with the use of TPMs and pruning of the training data,
5) k-NN classification with the matched TPMs and pruning of the training data.
Each part of the experiment was computed on 20 different 2D images obtained by applying various deformations computed
Fig. 1. The scheme of the experiment shows the particular steps and images involved. Synthetic ground truth is constructed from the data from Simulated Brain Database with the use of randomly generated deformations. Results from atlas based segmentation and kNN classification with various setups (see the text for details) are expressed by an overlap measure Jx.
with the simulators described above. The deformations were applied to 20 transversal slices of the prelabeled 3D image from SBD and to 20 corresponding slices of the T1-weighted image with no added noise or INU artifact. In this way, 20 different templates with ground-truth segmentations were prepared. The images for classification were obtained as the corresponding slices of the T2-weighted image with added noise and INU artifact. The average overlap of the labels in T2-weighted images and T1-weighted templates is expressed by the values of J0 in Table 1. The efficiency of classification Jx in all parts of the experiment was computed for each image as the ratio between the number of correctly labeled pixels and the number of all pixels belonging to the GM, WM and CSF classes. Table 1 shows the average efficiency computed from 20 classifications. Deformations of prelabeled images were computed during registrations of T1-weighted templates to T2-weighted images. The elastic-incremental model was set as follows: σGI = 3.5 mm and σGE = 1.5 mm. The average overlaps are expressed as J1 in Table 1. If the registration were perfect, there would be no drop in overlap. The k-NN classifier was trained on 300 prototypes, 100 prototypes for each class, and it searched for the majority class among the k = 5 nearest neighbours. The one-dimensional feature space contained T2-weighted intensities for three different brain tissues. The prototypes were obtained with the use of TPMs, which were used in their original form, with the results expressed by J2 in Table 1, as well as in the form of maps geometrically matched to the T2-weighted image with the use of the deformations obtained from the foregoing registrations, with the results expressed by J3 in Table 1. Further, the pruning strategy was applied to the training datasets, such that the 10% of prototypes most distant from their median were pruned away for each class.
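The pruning step can be sketched for the one-dimensional feature space used in the experiment. The function name and the sample intensities are hypothetical; the only assumption beyond the text is that the distance to the median is the absolute intensity difference.

```python
from statistics import median

def prune_prototypes(intensities, fraction=0.10):
    """Drop the given fraction of prototypes that lie farthest from the class median.

    The returned list is ordered by increasing distance from the median.
    """
    med = median(intensities)
    n_drop = int(round(fraction * len(intensities)))
    ranked = sorted(intensities, key=lambda v: abs(v - med))
    return ranked[:len(ranked) - n_drop]

# Toy class with one outlying prototype (made-up T2-weighted intensities).
gm_prototypes = [100, 101, 99, 102, 98, 100, 101, 99, 100, 160]
kept = prune_prototypes(gm_prototypes)
print(kept)
```

The function is meant to be applied to each tissue class separately, so that an outlier such as a prototype sampled from a mislabeled location is removed without affecting the other classes.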
The resulting average efficiencies of the classifiers with the pruning strategy are represented by J′2 and J′3 in Table 1.

Table 1. Initial average overlaps J0 of the labels in T2-weighted images and T1-weighted templates, and the classification efficiencies. J1: labels assigned with the use of deformable registration, J2: labels assigned by the k-NN classifier trained with the use of TPMs, J3: labels assigned by the k-NN classifier trained with the use of registered TPMs, J′2: labels assigned by the k-NN classifier trained with the use of TPMs and pruning the training dataset, J′3: labels assigned by the k-NN classifier trained with the use of registered TPMs and pruning the training dataset.

Simulator  |uinitMAX|  J0 [%]  J1 [%]  J2 [%]  J3 [%]  J′2 [%]  J′3 [%]
RGsim      5 mm        64.3    87.8    79.1    80.0    78.7     79.7
RGsim      10 mm       49.8    83.1    74.8    80.3    75.2     80.7
RGsim      12 mm       46.3    80.3    73.5    79.8    77.4     81.3
TPSsim     5 mm        57.1    89.1    78.2    82.0    78.9     80.9
TPSsim     10 mm       43.1    86.9    72.2    81.6    72.3     81.2
TPSsim     12 mm       40.5    84.8    68.3    81.3    68.5     81.5
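The efficiency measure Jx reported in Table 1 can be sketched as below. The sketch assumes that "pixels belonging to the GM, WM and CSF classes" refers to the ground-truth labels, so background pixels are excluded from both counts; the function name, the label strings and the toy label lists are illustrative only.

```python
def classification_efficiency(predicted, truth, tissue_labels=("GM", "WM", "CSF")):
    """Fraction of ground-truth tissue pixels that received the correct label."""
    tissue_pairs = [(p, t) for p, t in zip(predicted, truth) if t in tissue_labels]
    if not tissue_pairs:
        raise ValueError("no tissue pixels in the ground truth")
    correct = sum(1 for p, t in tissue_pairs if p == t)
    return correct / len(tissue_pairs)

# Toy flattened label images; "BG" marks a background pixel.
truth = ["GM", "WM", "CSF", "BG", "GM"]
predicted = ["GM", "GM", "CSF", "GM", "GM"]
j = classification_efficiency(predicted, truth)
print(f"J = {100 * j:.1f} %")  # J = 75.0 %
```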
4 Conclusion

Methods of tissue type classification may play an important role in ROI-based volumetry, as they may be applied in the pipeline of atlas-based segmentation. The classifier may refine the shapes of selected anatomical structures found previously by deforming a prelabeled atlas by means of registration. The efficiency of brain tissue classification with the use of a simple k-NN method, improved by unsupervised training, was tested here. It was shown that despite using a high-dimensional deformable registration method for atlas-based segmentation, the overlaps of the ground-truth segmentations and the segmentations obtained by deforming a prelabeled atlas are not perfect. Additional information from intensity-based classification may therefore help to improve the results. The simple k-NN classifier was tested here, as it is suitable for monomodal as well as multimodal image data. Image data from the Simulated Brain Database were used here. A set of prototypes was collected with the use of available tissue probability maps and without user intervention. The approach to automated training of a classifier presented here is similar to [12]. The main difference lies in the strategy of pruning the set of prototype samples. While the training set is pruned after its acquisition in [12], here the prior information used for training set acquisition is modified before the training phase. The results showed a considerable influence of the matching between the maps and the classified image on the classifier's efficiency. This fact encourages applying a deformable registration for reducing the intersubject anatomical variability before intensity-based classification, rather than using a registration based on affine transforms.
References
1. Ashburner, J., Friston, K.J.: Voxel-Based Morphometry–The Methods. NeuroImage 11, 805–821 (2000)
2. Kasparek, T., Prikryl, R., Mikl, M., Schwarz, D., Ceskova, E., Krupa, P.: Prefrontal But Not Temporal Gray Matter Changes in Males with First-Episode Schizophrenia. Progress in Neuro-Psychopharmacology and Biological Psychiatry 31, 151–157 (2007)
3. Sigmundsson, T., et al.: Structural Abnormalities in Frontal, Temporal and Limbic Regions and Interconnecting White Matter Tracts in Schizophrenic Patients with Prominent Negative Symptoms. American Journal of Psychiatry 158, 234–243 (2001)
4. Friston, K.J., Ashburner, J.: Generative and Recognition Models for Neuroanatomy. NeuroImage 23, 21–24 (2004)
5. Collins, D.L., Holmes, C.J., Peters, T.M., Evans, A.C.: Automatic 3D Model-based Neuroanatomical Segmentation. Human Brain Mapping 3, 190–208 (1995)
6. Zitova, B., Flusser, J.: Image Registration Methods: A Survey. Image and Vision Computing 21, 977–1000 (2003)
7. Suri, J.S., Wilson, D.L., Laxminarayan, S.: Handbook of Biomedical Image Analysis: Registration Models. Kluwer Academic Plenum Publishers, New York (2005)
8. Rogelj, P., Kovacic, S., Gee, J.C.: Point Similarity Measures for Non-rigid Registration of Multi-modal Data. Computer Vision and Image Understanding 92, 112–140 (2003)
9. Schwarz, D., Kasparek, T., Provaznik, I., Jarkovsky, J.: A Deformable Registration Method for Automated Morphometry of MRI Brain Images in Neuropsychiatric Research. IEEE Transactions on Medical Imaging 26, 452–461 (2007)
10. D'Agostino, E., Maes, F., Vandermeulen, D., Suetens, P.: A Viscous Fluid Model for Multimodal Non-rigid Image Registration Using Mutual Information. Medical Image Analysis 7, 541–548 (2003)
11. Sled, J.G., Zijdenbos, A.P., Evans, A.C.: A Nonparametric Method for Automatic Correction of Intensity Nonuniformity in MRI Data. IEEE Transactions on Medical Imaging 17, 87–97 (1998)
12. Cocosco, C.A., Zijdenbos, A.P., Evans, A.C.: A Fully Automatic and Robust Brain MRI Tissue Classification Method. Medical Image Analysis 7, 513–527 (2003)
On Simulating 3D Fluorescent Microscope Images

David Svoboda, Marek Kašík, Martin Maška, Jan Hubený, Stanislav Stejskal, and Michal Zimmermann

Centre for Biomedical Image Analysis, Faculty of Informatics, Masaryk University, Brno, Czech Republic
[email protected]

Abstract. In recent years, many biomedical image segmentation methods have appeared. Though typically presented as successful, the majority of them were not properly tested against ground truth images. The usual way of testing the quality of a new segmentation method was visual inspection by a specialist in the given field. A novel 3D biomedical image data simulator is presented in this paper. It offers results of high quality. The generated synthetic data are compared against real image data using standard similarity techniques.

Keywords: synthetic image, simulator, procedural texture, convolution, fluorescent optical microscope.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 309–316, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

When handling biomedical image data, one of the most crucial tasks is their segmentation and, consequently, the verification of the results. Concerning the segmentation, one can utilize manual segmentation, which is easy to manage and offers the best results. Its drawback lies in the amount of user interaction. The utilization of these methods is usually cumbersome and time-consuming, and the results are unrepeatable. In the past, many semiautomatic methods [1, 2, 3, 4, 5, 6, 7] were designed, however without proper further evaluation of the results. Visual inspection by an expert was used instead of a ground truth image. On that account, some authors started to use synthetic images. These images typically contain simple geometric shapes which are easy to recognize. The most popular shapes include circles and ellipses in 2D and spheres and ellipsoids in 3D space because of their similarity to nucleus shapes. With objects of this type it is possible to check the quality of the proposed segmentation method.

The issue of synthetic object generation was partially solved by Manders et al. [8], who verified a novel region growing segmentation algorithm over a large set of Gaussian-like 3D objects arranged in a grid. A simulation toolbox for generating a large set of FISH spots was proposed in [9]. Here, each spot was represented by a blurred sphere randomly placed in 3D space. Two individual spheres were allowed to overlap, but only under given conditions. In [10] a set of artificial spatial objects in the shape of a curved sphere, ellipsoid, disc, banana, satellite disc, and dumbbell was generated. All these objects were affected with Poisson noise to make the images more plausible. The principles were adopted in [11] and further extended. A Gaussian filter was added to roughly approximate the blurring effect of the confocal microscope point spread function (PSF). The objects were subsampled to simulate anisotropic resolution. Finally, Gaussian noise was added to simulate the CCD detector defects. Lehmussola et al. [12] designed a complex simulator capable of generating large populations of cells. However, the toolbox was designed for 2D images only and the extension to higher dimensions was not straightforward.

In this paper a novel fully 3D technique simulating the whole process of image acquisition from an optical microscope is presented. The whole process is split into several independent consecutive sub-processes, which determines the structure of the text in the following section. In the third section the results are presented, together with figures providing information on the quality of the proposed method. In the conclusion, the achieved goals are discussed.
2 Method

The whole engine consists of several image filters applied in a fixed sequence. Each of the filters represents a specific phenomenon that appears when the image is acquired and stored on the computer hard drive. The following paragraphs are ordered according to the time sequence, i.e. when reading the text one can easily imagine the meaning and the order of the simulated events.

(a) Specimen Selection. In the very beginning, the shape of the object which should be simulated is selected. In the case of HL-60 cells (see Fig. 1), we can presume that the shape of the initial object is spherical, as this nucleus is topologically equivalent to a sphere. On the other hand, any reasonable shape can be generated, i.e. the task is not limited to sphere-like objects only. Basic objects like a sphere or an ellipsoid are too simple and regular. Since the aim is to simulate real objects, a certain amount of irregularity is required. For this purpose a PDE method was used to distort the object. The idea is based upon the fact that the object boundary is seen as a deformable surface. The deformation is realized via fast level set methods [13] and a speed function [14] generated from Perlin's noise [15]. The modified object is depicted in Fig. 5b.
Fig. 1. 2D xy cross-sections of HL-60 cell nuclei
On Simulating 3D Fluorescent Microscope Images
(b) Specimen Preparation. Besides the shape, the texture of the nucleus image reveals important information about cell activity. In each stage of the cell cycle, chromatin has different properties and hence looks different when stained. For this reason, the study and measurement of chromatin heterogeneity is an important task [16]. Since the well-known procedural model [15] is often used to simulate the content of artificial cell nuclei, we decided to use the same approach. The texture at location (x, y, z) is defined as a sum of several Perlin's noise functions:
\[ \mathrm{texture}(x, y, z) = \sum_{i=0}^{N-1} \frac{\mathrm{noise}(x, y, z) \cdot \beta^{i}}{\alpha^{i}} \tag{1} \]
where β controls flickering of the texture, α is responsible for smoothness, and N controls whether the result is still coarse (for N < 5) or fine enough. It is noteworthy that some parts of the nucleus cannot be stained with the same dyes as the rest of the nucleus volume. The nucleolus is an example of such an object; it typically appears as a dark spot in the image of the nucleus (see Fig. 5c). It was found empirically that there is only one nucleolus per healthy nucleus in human cells, whereas cancerous cells may contain more than one.
(c) Light Conditions. In traditional light microscopy, either a laser or a mercury lamp is used for illumination. Neither of them is able to spread the light uniformly within the specimen. Therefore, the distribution of the light intensity is assumed to be quadratic over each xy-plane [17]:
\[ \mathrm{light}(x, y) = a_0 + a_1 x + a_2 y + a_3 x y + a_4 x^2 + a_5 y^2 \tag{2} \]
where the vector a = (a_0, a_1, a_2, a_3, a_4, a_5) defines the shape of a quadratic surface, typically a wide one. In some cases, even two quadratic profiles can be found in the image, typically when the mercury lamp is not properly calibrated (see Fig. 2). During correction, a quadratic surface is usually fitted to the original image and then subtracted from it. Since we solve the inverse problem, we can simply add the estimated quadratic surface to the image.
(d) Optical Mechanics. Put simply, each optical system can be described by a point spread function (PSF), the impulse response of the system, which determines the amount and manner of blurring of the input image. The PSF can be either measured empirically (real PSF) or estimated (theoretical PSF). Here, we used a theoretical PSF generated with image analysis software (footnote 1). The theoretical PSF is usually based on prior knowledge of the optical system properties (confocal/widefield microscope, optical lens, colour channel, ...). The generated PSF is subsequently used as a convolution kernel expressing the influence of the optical system on the image (see Fig. 5e).
Footnote 1: Huygens Essential, http://www.svi.nl/products/essential/
D. Svoboda et al.
[Fig. 2 plots: two surface plots of light intensity (vertical axis, 0 to 1) over the x-axis and y-axis (0 to 100).]
Fig. 2. Quadratic polynomial surface representing the uneven distribution of light within the area of each slice in 3D image: (left) non calibrated mercury lamp profile (right) laser lamp profile
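The quadratic illumination model of Eq. (2) can be sketched in a few lines. This is only an illustration under stated assumptions: the coefficient values and the 5x5 slice are hypothetical and are not taken from the paper, and the slice is represented as a plain list of rows.

```python
def light(x, y, a):
    """Quadratic illumination surface of Eq. (2) for coefficients a = (a0..a5)."""
    a0, a1, a2, a3, a4, a5 = a
    return a0 + a1 * x + a2 * y + a3 * x * y + a4 * x * x + a5 * y * y


def add_uneven_light(slice_xy, a):
    """Add the quadratic surface to one xy-slice given as a list of rows."""
    return [[value + light(x, y, a) for x, value in enumerate(row)]
            for y, row in enumerate(slice_xy)]


# Hypothetical coefficients: a wide paraboloid peaking at the centre of a 5x5 slice.
coeffs = (0.0, 0.4, 0.4, 0.0, -0.1, -0.1)
flat = [[1.0] * 5 for _ in range(5)]
lit = add_uneven_light(flat, coeffs)
```

With these coefficients the centre of the slice becomes brighter than its corners. Correcting a real image would subtract a fitted surface instead of adding it, as the text notes.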
(e) CCD Chip Properties. The charge-coupled device (CCD) detector, commonly used in microscope imaging, introduces an error into the acquired signal. This detector noise is typically modelled as photon shot noise with a Poisson distribution (see Fig. 5f):
\[ f_{\lambda}(x) = \frac{e^{-\lambda} \lambda^{x}}{x!} \tag{3} \]
where x is the number of photon occurrences and λ is the expected number of occurrences during the given interval. Besides the photon shot noise, each CCD produces a small number of hot pixels, which appear as either very bright or very dark pixels in the image matrix.
(f) A/D Converter. Similarly to the CCD detector, the analogue-to-digital (A/D) converter produces a certain amount of noise. This type of noise is modelled as additive Gaussian noise with the density:
\[ f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{4} \]
where μ and σ are the mean and the standard deviation of the noise, respectively.
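The two noise stages (e) and (f) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `gain` parameter (photons per intensity unit), the 64x64 test image and the use of Knuth's multiplication method for Poisson sampling are assumptions of this sketch, and hot pixels are not modelled.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def ccd_and_adc_noise(image, gain, mu, sigma, rng):
    """Apply photon shot noise, then additive Gaussian noise, to a 2D image.

    `gain` (photons per intensity unit) is an assumption of this sketch.
    """
    noisy = []
    for row in image:
        out_row = []
        for value in row:
            photons = poisson_sample(max(value, 0.0) * gain, rng)  # Eq. (3)
            out_row.append(photons / gain + rng.gauss(mu, sigma))  # Eq. (4)
        noisy.append(out_row)
    return noisy


rng = random.Random(0)
clean = [[10.0] * 64 for _ in range(64)]
noisy = ccd_and_adc_noise(clean, gain=1.0, mu=0.0, sigma=0.5, rng=rng)
mean = sum(map(sum, noisy)) / (64 * 64)
```

Because both noise sources are unbiased for μ = 0, the mean of the noisy image stays close to the clean intensity while individual pixels fluctuate.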
3 Results
When simulating real processes, one should keep the task complexity in mind. Typically, the parameter space is huge. Since any input parameter may strongly affect the quality of the output, the parameters must be selected and set with care (see Fig. 3 for some results). The plausibility of the results can be assessed in many ways. The most common way of comparing real biomedical image data is visual inspection by an expert. On the one hand, this approach is important for a coarse estimate of whether a given image resembles the class of real images being simulated. On the other hand, it is tedious and cumbersome, and it is very difficult or nearly impossible to recognize all the image features with the naked eye. For this reason, we decided to use some well-defined image similarity metrics common
Fig. 3. Selection of generated images. For simplicity, only 2D cross-sections are presented. The outputs were obtained with prior setting α = 8, β = 4, λ = 0.7, and (μ, σ) = (0.0, 0.5).
[Fig. 4 plots: two intensity histograms; horizontal axis 0 to 300 (left) and 0 to 250 (right), vertical axis 0 to 0.07.]
Fig. 4. Comparison of intensity histograms: (left) real data (right) synthetic data

Table 1. Each column corresponds to the mean of selected measure evaluated over the whole image dataset. Notice the similarity in each column. The small difference in the first and the last column comes from the size fluctuations of real images. The real images vary in size and hence the corresponding value changes as the background covers smaller or larger part of each image.

                  average      central moments      entropy
                  intensity    2nd       3rd
real data         33.25        20.37     25.24      5.38
synthetic data    26.83        20.78     25.64      5.13
in image databases. These are image descriptors [18], from which we selected the intensity histogram, the central moments, and the entropy. To ensure that a real image acquired from an optical microscope is almost the same as the output of
Fig. 5. A sequence of consecutive images created by the simulator: (a) initial object, (b) object after deformation, (c) nucleoli added, (d) procedural texture applied, (e) after convolution with PSF, (f) affected by Poisson noise.
the procedure presented in this paper, the above measures were evaluated over hundreds of synthetic and real images. Some results are depicted in Fig. 3 and Fig. 5. The similarity of real and synthetic data is summarized in Fig. 4 and Table 1.
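The descriptors named above (intensity histogram, central moments, entropy) can be computed from the normalized histogram, as in the following sketch. It assumes integer intensities in the range 0 to `levels - 1` and uses the textbook definitions; the exact normalization behind the values in Table 1 is not stated in the text, so the numbers produced here are not directly comparable with that table.

```python
import math


def descriptors(voxels, levels=256):
    """Mean intensity, 2nd/3rd central moments and entropy of integer intensities."""
    n = len(voxels)
    histogram = [0] * levels
    for v in voxels:
        histogram[v] += 1
    probs = [count / n for count in histogram]  # normalized intensity histogram
    mean = sum(i * p for i, p in enumerate(probs))
    moment2 = sum((i - mean) ** 2 * p for i, p in enumerate(probs))
    moment3 = sum((i - mean) ** 3 * p for i, p in enumerate(probs))
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)  # in bits
    return {"histogram": probs, "mean": mean, "m2": moment2,
            "m3": moment3, "entropy": entropy}


# Toy "image" flattened to a list of voxel intensities.
d = descriptors([0, 0, 2, 2])
```

Averaging such descriptors over a set of real images and over a set of synthetic images gives one row each of a comparison like Table 1.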
4 Conclusion
The approach presented in this paper offers a complex and efficient tool for generating synthetic biomedical images. It simulates the whole process of image acquisition from an optical microscope, from the very beginning, when the specimen is prepared, to the very end, when the analogue signal is converted into its digital form and stored on the hard drive. The toolbox generates 3D objects, but it is not limited to them: it can be extended to higher dimensions (time sequences) and to other types of cells. For the time being, only one object (cell nucleus) is generated per image. In the future, we intend to generate cell clusters, which typically appear in tissue scans.
Acknowledgement. This work was supported by the Ministry of Education of the Czech Republic (Projects No. LC535 and No. 2B06052).
References
[1] Adiga, U., Chaudhuri, B.B.: An efficient method based on watershed and rule-based merging for segmentation of 3-D histo-pathological images. Computer Vision and Pattern Recognition Unit, 1449–1458 (2001)
[2] Bamford, P., Lovell, B.: Unsupervised cell nucleus segmentation with active contours. Signal Processing Special Issue: Deformable Models and Techniques for Image and Signal Processing 71, 203–213 (1998)
[3] Alexopoulos, L.G., Erickson, G.R., Guilak, F.: A method for quantifying cell size from differential interference contrast images: validation and application to osmotically stressed chondrocytes. Journal of Microscopy 205, 125–135 (2002)
[4] Beliën, J.A.M., van Ginkel, H.A.H.M., Tekola, P., Ploeger, L.S., Poulin, N.M., Baak, J.P.A., van Diest, P.J.: Confocal DNA cytometry: A contour-based segmentation algorithm for automated three-dimensional image segmentation. Cytometry 49, 12–21 (2002)
[5] Matula, P., Svoboda, D.: Spherical object reconstruction using star-shaped simplex meshes. In: Figueiredo, M., Zerubia, J., Jain, A.K. (eds.) EMMCVPR 2001. LNCS, vol. 2134, pp. 608–620. Springer, Heidelberg (2001)
[6] Malpica, N., Solórzano, C.O.d., Vaquero, J.J., Santos, A., Vallcorba, I., García-Sagredo, J.M., del Pozo, F.: Applying watershed algorithms to the segmentation of clustered nuclei. Cytometry 28, 289–297 (1997)
[7] Irinopoulou, T., Vassy, J., Beil, M., Nicolopoulou, P., Encaoua, D., Rigaut, J.P.: Three-dimensional DNA image cytometry by confocal scanning laser microscopy in thick tissue blocks of prostatic lesions. Cytometry 27, 99–105 (1997)
[8] Manders, E.M.M., Hoebe, R., Strackee, J., Vossepoel, A.M., Aten, J.A.: Largest contour segmentation: A tool for the localization of spots in confocal images. Cytometry 23, 15–21 (1996)
[9] Grigoryan, A.M., Hostetter, G., Kallioniemi, O., Dougherty, E.R.: Simulation toolbox for 3D-FISH spot-counting algorithms. Real-Time Imaging 8(3), 203–212 (2002)
[10] Lockett, S.J., Sudar, D., Thompson, C.T., Pinkel, D., Gray, J.W.: Efficient, interactive, and three-dimensional segmentation of cell nuclei in thick tissue sections. Cytometry 31, 275–286 (1998)
[11] Solórzano, C.O.d., Rodriguez, E.G., Jones, A., Pinkel, D., Gray, J.W., Sudar, D., Lockett, S.J.: Segmentation of confocal microscope images of cell nuclei in thick tissue sections. Journal of Microscopy 193, 212–226 (1999)
[12] Lehmussola, A., Selinummi, J., Ruusuvuori, P., Niemistö, A., Yli-Harja, O.: Simulating fluorescent microscope images of cell populations. In: Proceedings of the 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'05), pp. 3153–3156 (2005)
[13] Nilsson, B., Heyden, A.: A fast algorithm for level set-like active contours. Pattern Recogn. Lett. 24(9-10), 1331–1337 (2003)
[14] Sethian, J.A.: Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge (1999)
[15] Perlin, K.: An image synthesizer. In: SIGGRAPH '85. Proceedings of the 12th annual conference on Computer graphics and interactive techniques, New York, USA, pp. 287–296. ACM Press, New York (1985)
[16] Rousselle, C., Paillasson, S., Robert-Nicoud, M., Ronot, X.: Chromatin texture analysis in living cells. The Histochemical J. 31(1), 63–70 (1999)
[17] Klein, A.D., van den Doel, L.R., Young, I.T., Ellenberger, S.L., van Vliet, L.: Quantitative evaluation and comparison of light microscopes. In: Optical Investigation of Cells In Vitro and In Vivo, Proc. SPIE, Progress in Biomedical Optics, vol. 3260, pp. 162–173 (1998)
[18] da Costa Jr., L.F., R.M.C.: Shape Analysis and Classification: Theory and Practice. CRC Press, Orlando, USA (2001)
Hierarchical Detection of Multiple Organs Using Boosted Features
Samuel Hugueny and Mikaël Rousson
Department of Imaging and Visualization, Siemens Corporate Research, Princeton, NJ, USA

Abstract. We propose a framework for fast and automated initialization of segmentation algorithms in Computed Tomography images. Based on the idea that time-consuming voxel classification should be done only on spatially constrained areas, we build classifiers at body and slice levels which quickly define a constrained region of interest. Voxel classification is then performed by a divide-and-conquer strategy using a probabilistic-boosting tree. In addition, this framework can incorporate additional information on the volume, if available, such as the position of another organ to improve its accuracy and robustness. The framework is applied to seed extraction in kidneys and liver.
1 Introduction
Statistical model-based segmentation algorithms such as active contours [1], region competition [2] or graph cuts [3] are considered powerful and effective, and are widely used in medical image segmentation [4,5]. They share a common particularity: they need to be provided with one or multiple seeds, or an initial contour. This initialization step is often performed manually, which prevents them from being fully automated. A possible approach to automate the process is machine learning. Being able to combine various sources of information abstractly, machine learning techniques (bagging trees [6], neural networks [7], support vector machines [8], boosting trees [9], etc.) have a major role in computer-assisted diagnosis (CAD). For the purpose of seed extraction from Computed Tomography (CT) 3D scans, direct application of learning algorithms at either image or voxel level is bound to fail. At image level, one would need a large training set of CT scans (hundreds), which is not easily available. At voxel level, inter-organ similarities and intra-organ variability prevent learning from being fast and robust. Moreover, the number of voxels in a 3D scan makes a direct approach computationally expensive, both in the training stage and in the detection phase. Even with divide-and-conquer strategies on the voxel population, detectors are hard to train and their robustness is questionable.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 317–325, 2007. © Springer-Verlag Berlin Heidelberg 2007
However, the organization of the body structures is well defined and remains mostly consistent from one patient to another. Modelling this high-level knowledge of dependencies between organs and tissues can help reduce the complexity of the seed extraction problem. Large parts of the body/scan can be discarded, leading to a faster and better decision boundary. Previous works
in this direction include the use of deformable models [10], or the computation of probability maps based on a previous segmentation of the lungs [11]. In this paper, we present a framework for fully automated multi-organ detection in abdominal CT images, based on the Adaboost algorithm. We propose to progressively incorporate spatial knowledge into the learning process, first by classifying horizontal slices, then by discarding slice areas based on a prediction of the position of the organ of interest, and finally by performing a voxel classification. This leads to an important reduction of both training and detection time as well as improved accuracy. At each step, features are defined and normalized when possible, so that they are independent of the actual intensity dynamics and patient characteristics. Detectors are therefore robust to variations due to unhealthy organs or contrast agents. We also define filters which take into account predefined landmarks, if available, such as results from a previous run of the algorithm on a different organ. Section 2 details the learning procedures of the framework, Section 3 describes the features used at each step, Section 4 shows how detection can be improved with prior knowledge of landmarks, and Section 5 presents some of our results.
2 Learning
2.1 Adaboost
Adaboost, invented by Freund and Schapire [12], and its variants have been successfully applied to many problems in vision and machine learning. Friedman et al. [13] have shown that Adaboost approximates the posterior p(y|x) by selecting and combining a set of weak classifiers into a strong classifier. At each iteration, it increases the weights of the previously misclassified samples, so that the next selected weak classifier will perform better on "hard" examples. The final hypothesis is a weighted linear combination of the T hypotheses, in which hypotheses with lower training error receive larger weights: $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$, with $h_t(x)$ being a weak classifier and $\alpha_t$ its weight. A classifier trained with the standard procedure is used for slice classification. In the detection phase, only voxels which belong to slices detected as positive are passed to the voxel classifier. However, using standard Adaboost at voxel level is questionable for several reasons:
– although Adaboost asymptotically converges to the target distribution, it needs to pick hundreds of weak classifiers, which is computationally expensive. Since a 3D CT image contains millions of voxels with great diversity, picking a single sampled training set would make training a matter of weeks, and detection a matter of minutes.
– the order in which features are selected is not preserved, although it may correspond to high-level semantics and be useful for the understanding of patterns.
– the re-weighting scheme of Adaboost may cause samples previously correctly classified to be misclassified again.
In the case of a rare event detection problem, we want to be able to discard a large number of samples in the early stages, so that later stages deal with samples that are similar to the targeted data. These issues are naturally addressed by a divide-and-conquer (e.g. tree or cascade) approach.

2.2 Probabilistic Boosting Tree (PBT)
Voxel classification is performed using a two-class Probabilistic Boosting Tree (PBT) as presented in [14]. The PBT procedure automatically constructs a tree in which each node is a standard Adaboost classifier (a combination of weak classifiers) that can deal with complex distributions while being resistant to overfitting. Training samples are then divided into two new sets according to their response to the learnt classifier and used to train a left sub-tree and a right sub-tree. Confusing data are passed further down, leading to the expansion of the tree. PBT naturally embeds clustering and has applications in classification, detection, recognition and segmentation [15]. Table 1 details the learning process. The detection phase is consistent with the learning phase. At each node, a sample is passed down to the right sub-tree if its response to the node's strong classifier is positive (meaning that the voxel belongs to the organ of interest), and to the left sub-tree otherwise, until it reaches a leaf. The probability returned is the ratio of positive samples that reached this leaf in the training process.
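The detection phase described above can be sketched as a descent through a small tree whose inner nodes hold a weighted vote of weak classifiers. This is only an illustration: the `Node` structure, the threshold stumps and all numeric values are hypothetical, weak classifiers are assumed to return -1 or +1, and training of the tree is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# (alpha_t, h_t) pairs, with h_t(x) assumed to return -1 or +1
Weak = Tuple[float, Callable[[Sequence[float]], int]]


@dataclass
class Node:
    weak: List[Weak]                # empty for a leaf
    leaf_prob: float = 0.5          # ratio of positive training samples at a leaf
    left: Optional["Node"] = None   # followed when the strong response is not positive
    right: Optional["Node"] = None  # followed when the strong response is positive


def strong_response(node, x):
    """H(x) = sum_t alpha_t * h_t(x), the weighted vote of the weak classifiers."""
    return sum(alpha * h(x) for alpha, h in node.weak)


def pbt_probability(node, x):
    """Descend to a leaf and return its stored ratio of positives."""
    while node.left is not None and node.right is not None:
        node = node.right if strong_response(node, x) > 0 else node.left
    return node.leaf_prob


def stump(threshold):
    """Hypothetical weak classifier on the first feature."""
    return lambda x: 1 if x[0] > threshold else -1


# Toy tree on one-dimensional "voxels": two stumps vote at the root.
root = Node(weak=[(0.7, stump(0.3)), (0.4, stump(0.6))],
            left=Node(weak=[], leaf_prob=0.05),
            right=Node(weak=[], leaf_prob=0.9))
```

Calling `pbt_probability(root, [0.8])` follows the right branch, while `pbt_probability(root, [0.1])` follows the left one; a deeper tree would simply repeat the same test at every inner node.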
3 Defining Features for Seed Detection
Since seed detection is a voxel classification problem, training ultimately has to be done at voxel level. Voxel-based training faces the following problems: a huge number of voxels, intensity similarities between organs and tissues, variations between acquisitions, lack of organ tissue homogeneity within and among different image slices in both shape and texture, abnormalities of unhealthy tissues, and absence of the organ of interest from the considered scan. Most of these problems can be addressed by an ad hoc reduction of the number of potential seeds. Since very general anatomical considerations (on one side of the body, beneath the diaphragm, ...) help to predict an organ's position, we propose to constrain a spatial domain of interest using boosted classifiers at body and slice level. To be independent of scale factors, a normalized coordinate system of the patient's body is defined for the two axial dimensions. Because CT scans do not always represent the same part of a patient's body in the vertical dimension, we cannot have the same normalization in z. The axial dimensions on one
Table 1. PBT procedure used to learn the voxel classifier. The only difference with the procedure presented in [14] is the sampling stage (2) introduced to speed up trainings.
Procedure for training of a tree with maximum depth of L:
1. Given: a set of images {(X_1, Y_1), ..., (X_n, Y_n)}. X_k is a region of interest of a particular image. Y_k is the corresponding ground truth. Training set is S = {(x_1, y_1), ..., (x_p, y_p)}, where x_i ∈ X_k and y_i = 0, 1 for negative and positive respectively.
2. Sample a training set S_sampled of labeled examples {(x_1, y_1), ..., (x_n, y_n)} with respect to the number of positives and negatives in S. w_i is the weight of the training sample.
3. Initialize weights w_i = 1/(2m), 1/(2l) for y_i = 0, 1 respectively, where m and l are the number of negatives and positives respectively.
4. Normalize weights in S_sampled.
5. In S_sampled, train a strong classifier using the standard Adaboost procedure. Exit early if the training error at classifier t > θ (e.g. θ = 0.45).
6. If the current tree depth is L, then exit.
7. Initialize two empty sets S_left and S_right.
8. For each sample in S, compute the probabilities q(1|(x_k, i_k)) and q(−1|(x_k, i_k)) using the strong classifier learned on S_sampled.
9. If q(1|x_i) > 1/2 then (x_i, y_i, 1) → S_right, else (x_i, y_i, 1) → S_left.
10. Repeat the procedure recursively for S_left and S_right.
hand and the vertical dimension on the other hand are treated differently. Our approach consists of three steps:
1. Slice classification: a classifier is trained that gives the probability that a particular slice intersects the organ of interest. A permissive decision threshold is set to obtain an almost perfect recall.
2. Slice area selection: given an easily computed segmentation of the body, a scaled coordinate system is defined for each slice, so that regions are defined unambiguously in every slice. By looking up the organ coordinates in the training images, we identify regions where the organ of interest has very little chance of being observed. This process discards roughly 70% of the pixels of a slice for vertically oriented organs such as kidneys.
3. Voxel classification: a voxel classifier is trained on a large neighborhood of the organ.
Center O_{p,k} and dimensions w_{p,k}, h_{p,k} are determined for each slice S_{p,k} in volume V_p using a slice-by-slice segmentation of the body (obtained by a simple contour-tracing algorithm). These are used to define a coordinate system on each slice, with the center of the slice as its origin and the X and Y axes as its basis axes. Coordinates are normalized by the slice dimensions, so that every voxel in the slice has coordinates between −1 and 1, regardless of the patient's size or corpulence (Fig. 1).
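The slice coordinate normalization can be sketched as below. Two points are assumptions of this sketch rather than statements of the paper: `half_width` and `half_height` are taken to be half the extent of the segmented body in the slice, and image rows are assumed to grow downwards, so the sign of the vertical coordinate is flipped to put (1, 1) in the upper right corner as in Fig. 1.

```python
def slice_coordinates(col, row, center, half_width, half_height):
    """Map pixel (col, row) of one slice to normalized coordinates in [-1, 1]^2.

    `center` is the pixel position of the body center in this slice;
    `half_width` and `half_height` are hypothetical half-extents of the body.
    """
    cx, cy = center
    x = (col - cx) / half_width
    y = (cy - row) / half_height  # rows grow downwards, so flip the sign
    return x, y


# Toy slice: body centered at pixel (200, 150), 200 pixels wide and 100 high.
origin = slice_coordinates(200, 150, (200, 150), 100, 50)
upper_right = slice_coordinates(300, 100, (200, 150), 100, 50)
lower_left = slice_coordinates(100, 200, (200, 150), 100, 50)
```

Because every slice is mapped to the same square, organ positions collected from different patients and slices can be accumulated on one common map.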
Fig. 1. Left: CT slice with coordinate system and reference segmentation used for kidneys. A similar coordinate system is defined for every slice. Right: observation map for both kidneys obtained using 20 volumes. Bottom left corner is (-1,-1), upper right is (1,1) in the slice coordinate system.
3.1 Slice Classification
Vertical normalization would require either the patient's size or two easily detected anatomical landmarks in the CT scan, which cannot be guaranteed. We propose to train an Adaboost classifier that determines the probability that a slice intersects the organ of interest. Features (Table 2) are designed to identify spatial organization, similarities in shape, appearance, symmetry, relative intensities compared to those of the entire body, comparisons with neighbours, entropy, etc. They are computed both for the entire slice and for sub-windows. Since they are slice-level features, they may be more computationally expensive than voxel features. Features are fed to a two-class Adaboost procedure, together with a training set of slices. Slices are labeled positive if they intersect the organ of interest, and negative if they do not. After a regularization based on the fact that organs are connected, the obtained detectors discard a vast majority of slices, while having a nearly perfect recall.

3.2 Slice Area Selection
Using this coordinate system, we report the 2D normalized coordinates of every voxel that belongs to the organ of interest in each training image on an "occurrence map", from which the consistency of the organ's position in the slice is observable. Based on this map, we define a domain where voxels are likely to be positive, according to the number of positive observations at their locations. A permissive bounding box is defined on each slice using the observation map. For kidneys, this step discards 4/5 of the voxels at once. Of course, the smaller the horizontal sections of the organ are, the more voxels are discarded.

3.3 Features Used for Voxel Classification
We use the three dimensional version of the Haar filters combined with the use of an integral image [16]. The computational cost of computing Haar filters is
Table 2. Examples of features used for slice classification. They are designed as cues of symmetry (h3), shape (h1, h5), appearance (h2, h6). Features are computed for the entire slice and for rectangular subwindows.
– $h_1(S_{p,k}) = \frac{\mathrm{card}\{x \,/\, x \in S_{p,k}\}}{4\, w_{p,k}\, h_{p,k}}$
– $h_2(S_{p,k})$ = percentage of air voxels in $S_{p,k}$
– $h_3(S_{p,k}) = \frac{\mathrm{card}\{x \in S_{p,k} \,/\, \vec{x} \cdot \vec{i}_{p,k} > 0\}}{\mathrm{card}\{x \in S_{p,k} \,/\, \vec{x} \cdot \vec{i}_{p,k}\ \ldots}$
… $= \frac{\text{true positives}}{\text{true positives} + \text{false positives}} = \frac{\ldots}{\mathrm{card}\{(x, y) \,/\, H(x) > \alpha\}}$
where 0 ≤ α ≤ 1 is a threshold and 0 ≤ H(x) ≤ 1 is the response of the classifier H for the sample x. A retrieval rate of 30% of the organ voxels is possible with a precision higher than 90% for both liver and kidneys, which proved sufficient to initialize a level set algorithm. In the case of a kidney, a random guess in the entire image would have a precision and a recall lower than 1%. As expected, knowledge of another organ's position improves the average accuracy. Moreover, since positively detected voxels are spatially constrained, geometrical post-processing should in the future lead to increased precision. Detection speed on a standard computer is around 400 000 voxels per second. Robustness is acceptable given the size of our training set, and better results could be achieved with larger datasets.
In this paper, a framework for automated seed extraction is introduced. Learning is carried out in a hierarchical way by divide-and-conquer strategies.
Experiments are reported on liver and kidney. Results show that it is a valid and robust approach to automatically extract seeds in multiple organs. Furthermore, observed recall rates and intrinsic spatial concentration of the retrieved seeds suggest that feature improvement and larger databases could lead to application in segmentation. Future work also includes extending the method to more organs and simultaneous seed extraction in several organs.
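The precision and recall figures discussed in the results follow from thresholding the classifier response H(x) at α. A minimal sketch is given below; the sample responses and labels are made up for illustration, and returning 0.0 when nothing is detected is a convention of this sketch only.

```python
def precision_recall(responses, labels, alpha):
    """Precision and recall of the detector `H(x) > alpha` on labelled samples.

    responses: classifier outputs H(x) in [0, 1]; labels: 1 for organ voxels, else 0.
    """
    detected = [y for h, y in zip(responses, labels) if h > alpha]
    true_pos = sum(detected)
    # Convention of this sketch: report 0.0 when the denominator is empty.
    precision = true_pos / len(detected) if detected else 0.0
    positives = sum(labels)
    recall = true_pos / positives if positives else 0.0
    return precision, recall


# Hypothetical responses for five voxels, three of which belong to the organ.
p, r = precision_recall([0.95, 0.9, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0], alpha=0.8)
```

Raising α trades recall for precision, which is acceptable for seed extraction because only a fraction of the organ voxels is needed to initialize a segmentation.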
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
2. Zhu, S., Yuille, A.: Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(9), 884–900 (1996)
3. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 1222–1239 (2001)
4. Chakraborty, A., Staib, L., Duncan, J.: Deformable boundary finding in medical images by integrating gradient and region information. IEEE Transactions on Medical Imaging 15(6), 859–870 (1996)
5. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. In: ICCV, pp. 105–112 (2001)
6. Hothorn, T., Lausen, B.: Bagging tree classifiers for laser scanning images: a data- and simulation-based strategy. Artificial Intelligence in Medicine 27(1), 65–79 (2003)
7. Sharkey, A., Sharkey, N., Cross, S.: Adapting an ensemble approach for the diagnosis of breast cancer. In: Proceedings of the 6th International Conference on Artificial Neural Networks, pp. 281–286 (1998)
8. Chang, R.F., Wu, W.J., Moon, W.K., Chou, Y.H., Chen, D.R.: Support vector machines for diagnosis of breast tumors on US images. Academic Radiology, 189–197 (2003)
9. Tu, Z.: Probabilistic 3D polyp detection in CT images: The role of sample alignment, pp. 1544–1551 (2006)
10. Karssemeijer, N., van Erning, L.J.T.O., Eijkman, E.G.J.: Recognition of organs in CT-image sequences: a model guided approach. Computers and Biomedical Research 21(5), 434–448 (1988)
11. Zhou, X.: Constructing a probabilistic model for automated liver region segmentation using non-contrast x-ray torso CT images. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4190, pp. 856–863. Springer, Heidelberg (2006)
12. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory, pp. 23–37 (1995)
13. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. In: Dept. of Statistics, Stanford Univ. Technical Report (1998)
14. Tu, Z.: Probabilistic boosting-tree: Learning discriminative models for classification, recognition and clustering. In: 10th IEEE International Conference on Computer Vision (2005)
15. Zheng, S., Tu, Z., Yuille, A., Reiss, A., Dutton, R., Lee, A., Galaburda, A., Dinov, I., Thompson, P., Toga, A.: A learning-based algorithm for automated extraction of the cortical sulci (2006)
16. Viola, P., Jones, M.: Face recognition using boosted local features (2003)
Monitoring of Emotion to Create Adaptive Game for Children with Mild Autistic
P. Ravindra S. De Silva (1), Masatake Higashi (1), Stephen G. Lambacher (2), and Minetada Osano (2)
(1) Computer Aided Design Laboratory, Toyota Technological Institute
(2) Software Engineering Laboratory, University of Aizu, Japan
{ravindra, higashi}@toyota-ti.ac.jp, {steeve, osano}@u-aizu.ac.jp
Abstract. Computer-based interactive systems and robots have become an important technology for improving impaired human social interaction, especially for children with autism. Autism is a lifelong developmental disability, often accompanied by learning disabilities. As a result, people with autism have trouble interacting within our complex social environment and are, for the most part, unable to recognize other people's behaviors. In this paper, we present game-based therapeutic environments for people diagnosed with a mild form of autism. The proposed interactive system traces a child's emotion and its intensity in order to change the game environment so as to trigger the child's emotions. The pedagogical agent provides therapy instruction with motivational support to children by adapting to a child's emotional behaviors.
1 Introduction
Autism is a lifelong developmental disability, often accompanied by learning disabilities. The main impairments characteristic of people with autism are: (1) impaired social interaction: the inability to relate to others in meaningful ways, difficulty in forming social relationships, and the inability to understand others' intentions; (2) impaired social communication: difficulties with verbal and non-verbal communication, e.g., difficulties in understanding gestures and facial expressions; and (3) impaired imagination: difficulty in the development of play, and a limited range of imaginative activities.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 326–333, 2007. © Springer-Verlag Berlin Heidelberg 2007
Current research findings on interactive robots and agent-based systems suggest that people with autism generally feel comfortable in a predictable environment, and especially enjoy interacting with computers [1] and robots [2]. Recent studies have argued that video-based and computer-based games can be considered educational tools for improving a child's emotional skills [3]. In addition, some studies [4] have shown that while educational games are usually highly engaging, they often do not trigger the constructive reasoning necessary for learning. An investigation by Murray [5] has revealed that the attention of people with autism tends to be fixed on isolated objects apart from the surrounding area. Furthermore, Murray's work focused on a computer-based
Monitoring of Emotion to Create Adaptive Game
system that can break into this world by focusing the individual’s attention tunnel on the screen, so that external events can be ignored more easily. Considering the work reported above, we can argue that computers can be used pedagogically in the education and therapy of people with autism to benefit the development of impaired social interaction and communication. Nadel [6] has developed a humanoid robot called Robota for children with autism. Robota acts as a social mediator using basic imitative and turn-taking games. Robota encourages imitation and social interaction skills through different types of games, and the process has been evaluated using different behavioral criteria (including eye gaze, touch, proximity, and imitation) based on video data. The results obtained so far indicate that long-term studies are required to reveal the full potential of Robota in the therapy process for children with autism. Grynszpan’s study [7] uses video sequence analysis to estimate the intensity of emotion for the therapeutic process. This interrupts the therapy process, because the system cannot estimate a child’s intensity of emotion in real time while the therapeutic process continues. Since our goal is to develop the child’s social interaction and social communication skills, we need to investigate and evaluate emotional and cognitive behaviors when children interact within different contexts (e.g., when playing games or when interacting with other persons). Therefore, we propose game-based interactive environments for developing a child’s emotional communication skills. The proposed system can trace a child’s emotional performance in real time, and that information is then used to change the game environment in order to trigger the child’s emotions.
2 Emotion-Based Interactive Game
Our proposed game (see Figure 1) is played by two children. Currently, we trace the behaviors of only one of the children (a child with mild autism); the other child does not have any disability. Each child controls the same game from a different computer. In this game, a child controls a line that can be moved within the game square. The objective of the game is to obtain the longest line while attempting to block the opponent’s line. When a player blocks the opponent’s line, the length of each line is measured and the player with the longest line is declared the winner. If one child’s line moves faster, he or she can obtain a longer line more quickly than the other child. Therefore, winning depends on the speed of the line. Our system changes the speed of the line according to the child’s intensity of emotion. Through this work, we propose a dynamic and flexible emotion intensity estimation model that can be used to adapt the game state (i.e., change the speed of the line), which in turn triggers a change in the emotion intensity of the child playing the game. This process was designed as a therapeutic function for developing the emotional expression of children with mild autism.
P.R.S. De Silva et al.
Fig. 1. The pedagogical agent expressing several emotions according to the child’s emotion and the game state: (a) frustrated, (b) happy
3 The System Architecture
Figure 2 illustrates the agent-based procedure for processing therapeutic functions. One of the children wears a motion capture system, and the system traces the child’s affective gesture information in order to recognize his or her emotion. The Xsens [8] motion capture system was used with 8 markers to capture the child’s upper-body gesture motion information. These 8 markers were attached to the child’s
Fig. 2. System architecture for continuously processing the therapeutic function
upper body parts: front of the head, back of the head, right shoulder, left shoulder, right wrist, left wrist, center of the chest, and middle of the back. Each of the markers gives Euler angles according to a user-defined coordinate system. Three Euler angle measurements were obtained from each sensor, which provided a total of 24 measurements to describe the child’s affective gestures. The system consists of three functional modules: the affective gesture recognition model, the interface update, and the game information. A multiagent-based environment was used to create this interactive system. The system consists of four types of agents (a motion constraint agent, an arousal perception agent, a pedagogical agent, and a situational control agent) and an emotion-appraisal agency. All agents react according to the child’s emotional behaviors. If the child is losing the game, the agent tries to offer help to win the game, and the agent’s interaction with the child is based on his or her current emotional behaviors. Since the agent’s interaction is adapted to the child’s actual emotional state, the system becomes “believable” from the child’s point of view. The motion constraint agent receives 10 frames (motion information) per second from the motion capture system. It traces all of the motion frames to select key frames (the frames with maximum body movement) for describing the child’s upper-body gesture information. A Hidden Markov Model (HMM) [9] uses the selected key frames to recognize the child’s emotions.
3.1 The Intensity of Emotion
We proposed intensity dependent factors for estimating emotion intensity by using only one dimension. The emotion-appraisal agency has three agents: an avalanche-appraisal agent, an emotion elicit-appraisal agent, and a mood-appraisal agent. These agents estimate emotional-appraisal factors (intensity dependent factors). Concepts from cognitive science and psychology were considered in modeling all three agents. The avalanche-appraisal agent estimates excitatory-inhibitory effects from emotion blends. Minsky [10] defined this process as the avalanche process. The emotion elicit-appraisal agent is responsible for finding the emotion elicitor appraisal using sensorimotor information and game information. The mood-appraisal agent considers mood together with game information to find the mood appraisal. These intensity dependent factors were proposed in [11], which gives detailed descriptions. The arousal perception agent communicates with the agency to gather information about the intensity dependent factors in order to estimate a child’s intensity of emotions. The arousal perception agent uses an exponential equation to estimate the intensity of emotions. The total emotion intensity factor value $x_{E_K t}$ is substituted into equation (1), and the resulting value $I_{E_K t}$ gives the intensity of emotion $E_K$ at time $t$. Estimated intensity values vary between 0 and the saturation value (maximum value). Here $I_{E_K t}$ represents the intensity of emotion $E_K$ at time $t$, and $x_{E_K t}$ represents the intensity dependent factors for emotion $E_K$ at time $t$:
$$I_{E_K t} = \frac{1}{1 + \exp\left(\dfrac{-x_{E_K t} + 0.5}{0.1}\right)} \qquad (1)$$
3.2 Probabilistic-Based Game Control Model
The situational control agent controls the emotional behaviors and gives motivational support to improve the child’s performance level by adjusting the game states. As explained previously, the intensity of emotion varies between 0 and its saturation value (maximum value). The situational control agent defines three intensity levels between 0 and the saturation value: 0 to 0.25 as level 1, 0.25 to 0.75 as level 2, and 0.75 to 1 as level 3. It uses two rules for adapting the game: for positive emotions it always tries to keep the child’s affective state at level 2, and for negative emotions it tries to keep the child’s affective state at level 1. The reason for these two rules is that the system always tries to maintain a child’s negative emotion at a low intensity level and a child’s positive emotion at a middle intensity level. To achieve this goal, a probability-based Bayesian network was used. The emotion, the intensity of emotion, and the game state were used to estimate the game adjustment level (the child’s line speed) that helps to trigger a change in the child’s emotion intensity. Figure 3 depicts the Bayesian network for the game control module. Suppose that at time t a child has emotion Eit, intensity of emotion IEit, and game state (probability of winning or losing) Pr(Gt). We can then estimate a suitable probability for the game level (βt+1) at time t + 1, Pr(βt+1 | Eit, IEit, Gt). This probability is used to change the speed of the line, thereby adjusting the game to the new conditions.
Fig. 3. Bayesian network showing game control model
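The intensity estimate of equation (1) and the level rules of Section 3.2 can be illustrated with a minimal Python sketch. It only covers the logistic curve, the three intensity levels, and the two target-level rules stated in the text; the conditional probabilities of the Bayesian network are not given in the paper and are not modeled here. The function names, the `TARGET_LEVEL` table, and the choice to assign a boundary value (0.25 or 0.75) to the upper level are assumptions made for illustration.

```python
import math

def emotion_intensity(x):
    """Equation (1): logistic curve centred at x = 0.5 with scale 0.1."""
    return 1.0 / (1.0 + math.exp((-x + 0.5) / 0.1))

def intensity_level(intensity):
    """Map an intensity in [0, 1] to level 1, 2 or 3.

    Boundaries at 0.25 and 0.75 follow the text; sending a boundary
    value to the upper level is an assumption of this sketch.
    """
    if intensity < 0.25:
        return 1
    if intensity < 0.75:
        return 2
    return 3

# Target levels stated in the text: level 2 for positive emotions,
# level 1 for negative emotions.
TARGET_LEVEL = {"positive": 2, "negative": 1}

def level_gap(valence, intensity):
    """Hypothetical helper: signed distance from the target level.

    Zero means the child is at the target level; a positive value means
    the intensity is above it, a negative value means it is below it.
    """
    return intensity_level(intensity) - TARGET_LEVEL[valence]
```

A controller could use the sign of `level_gap` to decide in which direction to adjust the line speed; how large the adjustment should be is exactly what the Bayesian network in Figure 3 is used for in the paper.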
3.3 The Pedagogical Agent
The pedagogical agent collects information about the child’s emotion, intensity of emotion, and game information from the affective gesture recognition model, the arousal perception agent, and the game module, respectively. The pedagogical agent then uses an encouraging voice message that embeds emotional tones (empathic feedback) to prompt the child to express himself or herself. The agent retrieves behavior data from the emotion expression database and the voice database to express emotion with an encouraging voice (see Figure 1).
4 Performance of the Affective Gesture Recognition Model
Thirty affective gestures for each of the emotions were selected and used for training the model. Twenty children took part in testing the model, and each child was asked to play the game five times. For each game, the following data were recorded: the emotion recognized by the HMM, the estimated intensity of emotion, and the observers’ feedback about the emotion and its intensity. At the end of the experiments, a total of 100 gesture emotions were obtained. According to the observers’ feedback, there were 30 affective gestures for sad, 24 for frustrated, 22 for happy, and 24 for joy. The first task carried out during the experiments was to test the recognition power of the gesture recognition model (HMM). For this purpose, an agreement/disagreement confusion matrix between the recognition model (HMM) and the observers’ feedback was computed. Overall, the confusion matrix showed a 79% agreement rate between the observers’ feedback and the gesture recognition model (HMM). The significance of this agreement was examined with a χ² test applied to the recognition model evaluations and the observers’ feedback, the null hypothesis being that the observers’ feedback is independent of the gesture recognition model evaluations. The resulting value of 159.78 for 9 degrees of freedom was much higher than the critical value of 27.877 for α = 0.001, implying that the null hypothesis can be rejected and that the relationship between the two groups reached significance. To check the relationship between the model’s estimated intensity and the observers’ feedback intensity, the 79 emotions for which the observers and the gesture recognition model agreed were selected. An analysis of variance (ANOVA) was used to check this relationship. A linear trend (F(1,78) = 112.79, P < .001) was present in the relationship between the estimated intensity and the observers’ feedback intensity.
4.1 Testing Improvement of Child’s Emotional Expression
To test how our system improved the children’s emotional expression through affective body gestures, the following experiment was performed. Twenty children participated in two experiments (Experiment 1 and Experiment 2), and they were
Fig. 4. (a) Emotional expression performance without the situational control agent and the pedagogical agent; (b) emotional expression performance with the situational control agent and the pedagogical agent (s = sad, f = frustrated, h = happy, j = joy).
asked to play a game more than 20 times. In Experiment 1, we removed the situational control agent and the pedagogical agent, and both children’s line speed was held constant. The game environment therefore did not change according to the child’s emotional behaviors. Figure 4(a) depicts the game number vs. the child’s emotion and intensity of emotion. According to Figure 4(a), the child played the game 20 times and experienced positive emotions three times. These positive emotions also had low intensity levels, because the child expressed negative emotions a number of times. An interesting result occurred as the number of games increased: the intensity of the child’s (negative) emotions gradually increased. According to prior research, this type of situation causes a child’s self-confidence to decrease. In addition, with regard to the children’s stage of emotional development, it tends to break down the level of emotional development expected at their age. In Experiment 2, we integrated the situational control agent to change each game environment according to the child’s behaviors. We again asked the children to play a game more than 20 times. Figure 4(b) depicts the results for the same children as in Experiment 1. The graph shows some interesting results that allowed us to evaluate the effectiveness of our system. When the situational control agent and the pedagogical agent changed the game environment and gave affective feedback according to the emotional behavior of the children, the children expressed a variety of emotions. The children therefore gained experience in expressing a variety of emotions and received empathic feedback from the system. This interactive system thus seemed to help in developing the children’s emotional expression.
However, as the number of games increased, the intensity of each emotion gradually increased.
5 Conclusion
Overall, the results showed that the affective gesture recognition model recognized a child’s emotion with a considerably high agreement rate of 79%, and the intensity of emotion estimated by the arousal perception agent had a strong relationship with the observers’ feedback, except at low intensity levels. When the situational control agent changed the game environment according to the emotional behavior of the children, they had the opportunity to express a variety of emotions. The children therefore gained experience in expressing a variety of emotions and received empathic feedback from the system. This interactive system appeared to help in developing the children’s emotional expression. However, as the number of games increased, the intensity of each emotion gradually increased, which we expect will help to improve the children’s emotional expression.
Acknowledgements. This study was supported by the Grant-in-Aid for High-tech Research Center for “Space Robotics” and for scientific research (C)(2)(13650289) of the Ministry of Education, Science, Sports and Culture of Japan.
References
1. Powell, S.: The use of computers in teaching people with autism. In: National Autistic Society Conference (1996)
2. Nadel, J., Guerini, C., Peze, A., Rivert, C.: The evolving nature of imitation as a format of communication. Cambridge University Press, Cambridge (1999)
3. DeGroot, D., Broekens, J.: Using negative emotions to impair game play. In: 15th Belgian-Dutch Conference on Artificial Intelligence (2003)
4. Cañamero, D.L., Fredslund, J.: I show you how I like you: Emotional human-robot interaction through facial expression and tactile stimulation. In: IEEE Transactions on Systems, Man, and Cybernetics, Part A, IEEE, 454–459 (2001)
5. Murray, D.: Autism and information technology: therapy with computers. Psychological Review 1, 68–90 (1993)
6. Nadel, J.: Imitation and imitation recognition: functional use in preverbal infants and nonverbal children with autism, pp. 42–62 (2002)
7. Grynszpan, O., Martin, J.C., Oudin, N.: On the annotation of gestures in multimodal autistic behaviour. In: The 5th International Workshop on Gesture and Sign Language based Human-Computer Interaction, pp. 25–33 (2003)
8. Xsens: Motion capture system. http://www.xsens.com/
9. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, pp. 257–286 (1989)
10. Minsky, M.: The Society of Mind. Simon & Schuster, New York (1986)
11. Silva, P.R.D., Osano, M., Marasinghe, A., Madurapperuma, A.P.: Towards recognizing emotion with affective dimensions through body gestures. In: FGR ’06, pp. 269–274. IEEE Computer Society Press, Los Alamitos (2006)
A Simplified Human Vision Model Applied to a Blocking Artifact Metric
Hantao Liu¹ and Ingrid Heynderickx¹,²
¹ Department of Mediamatics, Delft University of Technology, P.O. Box 5031, 2628 CD, Delft, The Netherlands
[email protected]
² Group Visual Experiences, Philips Research Laboratories, Prof. Holstlaan 4, 5656 AA, Eindhoven, The Netherlands
[email protected]
Abstract. A novel approach towards a simplified, though still reliable human vision model based on the spatial masking properties of the human visual system (HVS) is presented. The model contains two basic characteristics of the HVS, namely texture masking and luminance masking. These masking effects are implemented as simple spatial filtering followed by a weighting function, and are efficiently combined into a single visibility coefficient. This HVS model is applied to a blockiness metric by using its output to scale the block-edge strength. To validate the proposed model, its performance in the blockiness metric is determined by comparing it to the same blockiness metric having different HVS-based models embedded. The results show that the proposed model is indeed simple, without compromising its accuracy.
Keywords: Human vision model, image quality assessment, luminance masking, texture masking, blockiness metric.
1 Introduction
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 334–341, 2007. © Springer-Verlag Berlin Heidelberg 2007
During the last decades, a lot of research effort has been devoted to the development of objective image quality metrics, which nowadays are widely used in a broad range of image rendering applications, such as the optimization of video coding or real-time quality monitoring in displays. In the video chain of a current TV set, for example, various objective quality metrics are implemented to enable an improved overall perceived quality for the viewer; they determine the quality of the incoming video signal in terms of blockiness, noise, blur, etc. and adapt the parameters in the video processing algorithms accordingly. Because they are designed to predict perceived quality, objective metrics based on models of the human visual system (HVS) are potentially more reliable for accurate quality prediction [1]. Indeed, including stimulus aspects important to the human eye in an objective metric, while removing the perceptual redundancies inherent in purely signal-based metrics, has been shown to enhance the performance of a metric [1-2]. Advances in human vision research have provided crucial information on the structural and functional mechanisms of the HVS [2], which has been primarily adopted to
design a variety of computational vision models in the literature [1-4]. The essential task of modeling the HVS is to quantitatively simulate its operations, which generally involve some lower level processing (e.g. sensitivity and masking) and some higher level processing (e.g. attention) in the visual system [2], as well as to restrictively incorporate them in a vision model [1]. However, as the HVS is extremely complex, HVS-based objective metrics are often computationally intensive. Hence, from a practical point of view, it is desirable to reduce the computational complexity of the HVS model without significantly compromising its performance. Much work has been done on incorporating HVS properties into quality metrics [4-7]. In some research, parametric vision models including certain HVS aspects were constructed. The parameters in these models were defined based on the results of a number of psychovisual experiments [6-7]. As a consequence, the accuracy of these models largely depends on the parameter selection, and their robustness cannot be fully ensured. In other research, just-noticeable-distortion (JND) profiles, which provide each stimulus being tested with a visibility threshold of the distortion [8-9], are used. In these models, the thresholds for various masking effects are different, which potentially introduces difficulties in combining different masking effects. Instead of only estimating the threshold, the HVS model used in [5] is formulated as a weighting function. The main drawback of this approach, however, is that only one weighting function, intrinsically combining luminance and texture masking, is taken into account, and that efficient integration of different masking effects is not considered.
In our paper, we build on the approach taken in [5] by using an HVS model as a weighting function for visibility, but extend the idea by including both luminance and texture masking, and by combining both masking effects in a simple way into a single visibility coefficient. To evaluate this approach, we used the model in a blockiness metric comparable to [5]. The blocking artifact, which manifests itself as an artificial discontinuity between adjacent blocks, is known as the major type of distortion in block-based DCT coding. We check whether the simple HVS model helps to quantify the visibility of blocking artifacts in grey-scale images.
2 The Simplified Human Vision Model
The human vision model described in this paper adopts two fundamental properties of the HVS, which affect the visibility of a stimulus in the spatial domain: (1) the averaged background luminance surrounding the stimulus, and (2) the spatial non-uniformity in the background luminance [9]. They are well known as luminance masking and texture masking, respectively. Masking is the reduction in the visibility of one image component (the target) due to the presence of another (the masker), and it is strongest when both components have the same or similar frequency, orientation, and location [10]. In our proposed HVS model both spatial masking effects are estimated in a simple way, and efficiently combined into a single visibility coefficient. The approach is illustrated in Figure 1. A window, representing the local surrounding of a stimulus (i.e. in our case a – possibly deviating – pixel value, e.g. a blocking edge), is scanned over all stimuli (i.e. in our case over all pixels in an image). Both
Fig. 1. Schematic overview of the proposed human vision model
masking effects are estimated by analyzing the local signal properties within this window. Based on the results, a visibility coefficient (VC), which reflects the perceptual significance of the stimulus, is defined.
2.1 Local Visibility Due to Texture Masking
Texture masking is modeled by calculating a visibility coefficient (VCt). The higher the value of this coefficient, the smaller the masking effect and, hence, the stronger the visibility of the stimulus. The procedure of modeling texture masking comprises three steps:
− Texture Detection: calculate the local background activity (non-uniformity).
− Thresholding: a classification scheme to capture the active background regions.
− Visibility Transform Function (VTF): obtain a visibility coefficient (VCt) based on the HVS characteristics for texture masking.
$$T_1 = \begin{bmatrix} 1 & 2 & 0 & -2 & -1 \\ 4 & 8 & 0 & -8 & -4 \\ 6 & 12 & 0 & -12 & -6 \\ 4 & 8 & 0 & -8 & -4 \\ 1 & 2 & 0 & -2 & -1 \end{bmatrix}, \qquad T_2 = \begin{bmatrix} 1 & 4 & 6 & 4 & 1 \\ 2 & 8 & 12 & 8 & 2 \\ 0 & 0 & 0 & 0 & 0 \\ -2 & -8 & -12 & -8 & -2 \\ -1 & -4 & -6 & -4 & -1 \end{bmatrix}$$
Fig. 2. The high-pass filters for texture detection, and Visibility Transform Function used
Texture detection can be performed by convolving the signal with some form of high-pass filter. One of the Laws’ texture energy filters [11] is employed here in a slightly modified form. As shown in Figure 2, T1 and T2 are used to measure the background activity in the horizontal and vertical direction, respectively. A pre-defined
threshold $Thr$ ($Thr = 0.15$ in our experiments) is applied to classify the background into ‘flat’ or ‘texture’, resulting in an activity value $I_t(i,j)$, which is given by
$$I_t(i,j) = \begin{cases} 0 & \text{if } t(i,j) < Thr \\ t(i,j) & \text{otherwise} \end{cases} \qquad (1)$$
$$t(i,j) = \frac{1}{48} \sum_{x=1}^{5} \sum_{y=1}^{5} I(i-3+x,\, j-3+y) \cdot T(x,y) \qquad (2)$$
where $I(i,j)$ denotes the pixel intensity at location $(i,j)$, and $T$ is chosen as $T_1$ for the texture calculation in the horizontal direction and $T_2$ for the vertical direction. It should be noted that splitting up the calculation into horizontal and vertical directions, and using a modified version of the texture energy filter in which some template coefficients are removed, is done with the application of a blockiness metric in mind. A visibility transform function (VTF) is proposed in accordance with human perceptual properties, which means that the visibility coefficient $VC_t(i,j)$ is inversely (and nonlinearly) related to the activity value $I_t(i,j)$. Figure 2 shows an example of such a transform function, which can be defined as
$$VC_t(i,j) = \frac{1}{(1 + I_t(i,j))^{\alpha}} \qquad (3)$$
where $VC_t(i,j) = 1$ when the stimulus is in a ‘flat’ background, and $\alpha > 1$ ($\alpha = 5$ in our experiments) is used to adjust the nonlinearity. This shape of the VTF is an approximation, considered to be good enough.
2.2 Local Visibility Due to Luminance Masking
In psychovisual experiments it was found that the human visual system’s sensitivity to variations in luminance is a nonlinear function of the local mean luminance [10]. In this paper, modeling the luminance masking is based on two empirically driven properties of the HVS: (1) a distortion in a dark surrounding tends to be less visible than one in a bright surrounding [9], and (2) a distortion is most visible for a surrounding with an averaged luminance value between 70 and 90 (centered approximately at 81) in 8-bit gray-scale images [5]. The procedure of modeling luminance masking consists of two steps:
− Local Luminance Detection: calculate the local averaged background luminance.
− Visibility Transform Function (VTF): obtain a visibility coefficient (VCl) based on the HVS characteristics for luminance masking.
The local luminance of a certain stimulus is calculated using a weighted low-pass filter as shown in Figure 3, in which some template coefficients are set to ‘0’. The local luminance I l (i, j ) is given by
$$I_l(i,j) = \frac{1}{26} \sum_{x=1}^{5} \sum_{y=1}^{5} I(i-3+x,\, j-3+y) \cdot L(x,y) \qquad (4)$$
where $L$ is chosen as $L_1$ for calculating the background luminance in the horizontal direction, and $L_2$ for the vertical direction. Again, splitting up the calculation into horizontal and vertical directions, and using a modified low-pass filter in which some template coefficients are set to 0, is done with the application of a blockiness metric in mind.
$$L_1 = \begin{bmatrix} 1 & 1 & 0 & 1 & 1 \\ 1 & 2 & 0 & 2 & 1 \\ 1 & 2 & 0 & 2 & 1 \\ 1 & 2 & 0 & 2 & 1 \\ 1 & 1 & 0 & 1 & 1 \end{bmatrix}, \qquad L_2 = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 2 & 2 & 2 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$
Fig. 3. The low-pass filters for local luminance detection, and Visibility Transform Function used
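Equations (2) and (4) share the same form: a weighted sum over a 5×5 window, divided by a normalization constant. The following Python sketch illustrates that form for the horizontal-direction kernels T1 and L1. The function names are hypothetical, and the border handling (replicating edge pixels by clamping indices) is an assumption of this sketch, since the paper does not say how image borders are treated.

```python
# 5x5 kernels for the horizontal direction, as given in Figs. 2 and 3.
T1 = [[1, 2, 0, -2, -1],
      [4, 8, 0, -8, -4],
      [6, 12, 0, -12, -6],
      [4, 8, 0, -8, -4],
      [1, 2, 0, -2, -1]]
L1 = [[1, 1, 0, 1, 1],
      [1, 2, 0, 2, 1],
      [1, 2, 0, 2, 1],
      [1, 2, 0, 2, 1],
      [1, 1, 0, 1, 1]]

def window_sum(image, i, j, kernel, norm):
    """Weighted sum of the 5x5 window centred on pixel (i, j), divided by norm.

    Mirrors Eqs. (2) and (4) with 0-based indices: kernel[x][y] multiplies
    image[i - 2 + x][j - 2 + y]. ASSUMPTION: indices are clamped to the
    image, so edge pixels are replicated at the borders.
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    for x in range(5):
        for y in range(5):
            r = min(max(i - 2 + x, 0), rows - 1)
            c = min(max(j - 2 + y, 0), cols - 1)
            total += image[r][c] * kernel[x][y]
    return total / norm

def texture_response(image, i, j):
    """Eq. (2) with T = T1 (normalization 1/48)."""
    return window_sum(image, i, j, T1, 48)

def local_luminance(image, i, j):
    """Eq. (4) with L = L1 (normalization 1/26)."""
    return window_sum(image, i, j, L1, 26)
```

On a uniform patch the texture response is zero and the local luminance equals the patch value, because the coefficients of T1 sum to 0 and those of L1 sum to 26. Note that the raw response of Eq. (2) is signed; whether its magnitude is taken before thresholding is not stated in the text.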
For simplicity, the relationship between the visibility coefficient $VC_l(i,j)$ and the local luminance $I_l(i,j)$ is modeled by a power law for low background luminance (i.e. below 81), and is approximated by a linear function at higher background luminance (i.e. above 81). This functional behavior is shown in Figure 3, and mathematically described as
$$VC_l(i,j) = \begin{cases} \left(\dfrac{I_l(i,j)}{81}\right)^{1/2} & \text{if } 0 \le I_l(i,j) \le 81 \\ \left(\dfrac{1-\beta}{174}\right) \cdot (81 - I_l(i,j)) + 1 & \text{otherwise} \end{cases} \qquad (5)$$
where $VC_l(i,j)$ achieves the highest value of 1 when $I_l(i,j) = 81$, and $0 < \beta < 1$ ($\beta = 0.7$ in our experiments) is used to adjust the slope of the linear part of this function.
2.3 Integration Strategy
The visibility of a stimulus depends on various masking effects co-existing in the HVS, and how to efficiently integrate them is an important issue in obtaining an accurate perceptual model [8]. Since spatial masking intrinsically is a local phenomenon, the locality in the visibility of a distortion due to masking is maintained in the integration strategy of both masking effects. The resulting approach is schematically given in Figure 4. Based on the local image content surrounding a stimulus first the texture
Fig. 4. Integration strategy of the texture and luminance masking effects
masking is calculated. In case the local activity in the area is larger than a given threshold (see equation (1)), a visibility coefficient VCt is applied, followed by the application of a luminance masking coefficient VCl. In case the local activity in the area is low, only VCl is applied.
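The two visibility transform functions and the integration strategy can be sketched in Python as follows. The constants are the values reported for the experiments (Thr = 0.15, α = 5, β = 0.7). The function names are hypothetical, and combining the two coefficients by multiplication is an assumption: the text says only that VCt is applied, followed by VCl. The scale of the activity value relative to the threshold is also not specified in the paper, so the inputs below are purely illustrative.

```python
THR = 0.15    # texture threshold reported in the text
ALPHA = 5.0   # nonlinearity of the texture VTF, Eq. (3)
BETA = 0.7    # value of the luminance VTF at luminance 255, Eq. (5)

def texture_activity(t, thr=THR):
    """Eq. (1): responses below the threshold are classified as 'flat' (0)."""
    return 0.0 if t < thr else t

def vc_texture(activity, alpha=ALPHA):
    """Eq. (3): visibility decreases as background activity grows."""
    return 1.0 / (1.0 + activity) ** alpha

def vc_luminance(lum, beta=BETA):
    """Eq. (5): square-root law up to 81, linear decrease above 81."""
    if 0 <= lum <= 81:
        return (lum / 81.0) ** 0.5
    return ((1.0 - beta) / 174.0) * (81.0 - lum) + 1.0

def visibility(t, lum):
    """Integration as in Fig. 4.

    ASSUMPTION: the coefficients are combined by multiplication. Since
    vc_texture(0) == 1, skipping VCt for a flat background gives the same
    result as always multiplying.
    """
    activity = texture_activity(t)
    vc = vc_luminance(lum)
    if activity > 0:
        vc *= vc_texture(activity)
    return vc
```

With these definitions a stimulus on a flat background at luminance 81 has visibility 1, and the visibility drops towards β as the background luminance approaches 255.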
3 Blockiness Metric Using Proposed Model
Given a DCT-coded image, the block-edge strength (BS) can be defined as the inter-pixel difference across block boundaries (e.g. $BS_h(i,j) = I(i,j) - I(i,j+1)$ is defined as the inter-pixel difference across horizontal block boundaries, where $(i,j)$ denotes the pixel location) [5]. The output of the proposed human vision model VC can be used to locally weight the BS to produce a visual blocking strength (VBS), which is given by
$$VBS(i,j) = VC(i,j) \times BS(i,j) \qquad (6)$$
The VBS can be easily implemented in a generalized block-edge impairment metric [5], which is formulated as
$$\text{Metric} = \frac{\lVert MO(i,j) \times BS(i,j) \rVert}{\lVert MO(i,j) \times NBS(i,j) \rVert} \qquad (7)$$
where || . || is the L2-Norm, and NBS is defined as the inter-pixel difference between pixels, which are not at block boundaries [5]. MO is used to indicate the output of any HVS model (in our case VC). The horizontal and vertical blocking artifacts can be calculated separately using the appropriate filters for VC, and then added together to give the resultant blockiness score, i.e. Metric = Metric(h) + Metric(v).
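As a rough illustration of equations (6) and (7), the Python sketch below computes the horizontal part of the metric as the ratio of two L2 norms of weighted inter-pixel differences. The function names are hypothetical. The 8-pixel block size and the alignment of the block grid with the image origin are assumptions (typical for JPEG, but not stated in this section), and any additional normalization used in the original metric of [5] is not reproduced.

```python
import math

def l2_norm(values):
    """L2 norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))

def horizontal_blockiness(image, weights, block=8):
    """Metric(h) following Eq. (7) for horizontally adjacent pixels.

    image and weights are lists of rows of equal size; weights[i][j] plays
    the role of MO(i, j) (VC in the proposed model). ASSUMPTION: blocks are
    `block` pixels wide and the block grid starts at column 0.
    """
    boundary, inner = [], []
    for i, row in enumerate(image):
        for j in range(len(row) - 1):
            weighted = weights[i][j] * (row[j] - row[j + 1])
            if (j + 1) % block == 0:
                boundary.append(weighted)   # difference across a block edge (BS)
            else:
                inner.append(weighted)      # difference inside a block (NBS)
    denom = l2_norm(inner)
    return l2_norm(boundary) / denom if denom > 0 else float("inf")
```

The vertical part can be obtained in the same way on the transposed image with the vertical-direction weights, and the two parts are then added as described in the text.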
4 Performance Evaluation
The proposed human vision model is validated by its application to an objective blockiness metric. In order to analyze the model contribution rather than the performance of various blockiness metrics, a comparative evaluation is necessarily conducted by embedding different human vision models to the same blockiness metric. Based on the generalized blockiness metric defined in (7), three options are implemented for MO: (1) our proposed model, (2) the model used in [5], and (3) the JND profile used in [9]. This results in three blockiness metrics, which we refer to as VBSM, GBIM
and JNDM, respectively. The LIVE database [13], which consists of 233 JPEG images with their subjective Mean Opinion Score (MOS), is used to test the performance of these blockiness metrics. According to the Video Quality Expert Group (VQEG) [12], the performance of the objective metrics can be quantitatively measured by the Pearson linear correlation coefficient and the Spearman rank order correlation coefficient between subjective MOS and objective ratings after nonlinear regression.
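The two correlation measures named above can be sketched in plain Python as follows. This is a minimal illustration only: it omits the nonlinear regression step applied before the Pearson correlation is computed, and the function names are hypothetical rather than part of any evaluation toolkit.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            result[k] = mean_rank
        pos = end + 1
    return result

def spearman(xs, ys):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

For a monotonic but nonlinear relationship between metric scores and MOS, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1, which is one reason both are reported.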
Fig. 5. Scatter plots of MOS vs. VBSM, GBIM and JNDM
Table 1. Performance comparison of three objective metrics for image quality assessment
Metric   Pearson Linear Correlation   Spearman Rank Order Correlation
VBSM     0.9517                       0.9251
GBIM     0.9280                       0.9116
JNDM     0.9401                       0.9176
Figure 5 shows the scatter plots of the MOS vs. VBSM, GBIM, and JNDM, respectively. The corresponding correlation coefficients are listed in Table 1. It is verified that a promising performance is achieved by applying our HVS in a blockiness assessment. In contrast to our model, the vision model used in GBIM [5] intrinsically combines luminance and texture masking into a single weighting function. Although this is statistically acceptable, it might degrade the model’s performance in some demanding circumstances, for example when assessing highly textured images. This problem is solved in our model by separating the two masking effects, and by adaptively combining them based on local signal features. Therefore, our model is more reliable in terms of content independency. This is confirmed by repeating the correlation analysis on a limited set of 50 (out of 233) highly textured LIVE database images only. For these images the VBSM gives a Pearson correlation of 0.9391, whereas the GBIM results in a poorer correlation of 0.7695. Our model is comparable to the approach chosen in the JND profile [9] with the exception that the JND profile only considers a threshold, while our model also estimates supra-threshold visibility. This makes our model slightly more accurate and robust (for the limited dataset mentioned above the Pearson correlation for the JNDM is 0.9038). Our model also has the intrinsic advantage that knowledge on the nature of the artifact can simply be taken into account (e.g. by evaluating horizontal and vertical masking separately for blocking artifacts), which makes the model simple and efficient. This simplification is less
obvious in the more generally applicable JND profile model. Nonetheless, we also expect our model to be more generally applicable to the visibility of other artifacts (mainly by changing the size and coefficients of the filters).
5 Conclusion

We have presented a simplified and more efficient human vision model based on estimating spatial masking effects of the HVS, such as luminance and texture masking. These masking effects were estimated using spatial filtering followed by a weighting function, and were efficiently combined into a single visibility coefficient. The application of this model in a blockiness assessment resulted in a strong correlation with subjective ratings. The proposed model is unsupervised and does not need to be trained with subjective data. It can be easily integrated into either full-reference or no-reference approaches for measuring blocking artifacts.
References

1. Winkler, S.: Issues in Vision Modeling for Perceptual Video Quality Assessment. Signal Processing 78(2), 231–252 (1999)
2. Osberger, W., Maeder, A.J., McLean, D.: A Computational Model of the Human Visual System for Image Quality Assessment. In: Proc. DICTA-97, pp. 337–342 (December 1997)
3. Yu, Z., Wu, H.R.: Human Visual System Based Objective Digital Video Quality Metrics. In: Proc. Int. Conf. Signal Processing, vol. II, pp. 1088–1095 (August 2000)
4. Yu, Z., Wu, H.R., Winkler, S., Chen, T.: Vision Model Based Impairment Metric to Evaluate Blocking Artifacts in Digital Video. In: Proc. of the IEEE, pp. 154–169 (January 2002)
5. Wu, H.R., Yuen, M.: A Generalized Block-edge Impairment Metric for Video Coding. IEEE Signal Processing Letters 70(3), 247–278 (1998)
6. Yeh, E.M., Kokaram, A.C., Kingsbury, N.G.: A Perceptual Distortion Measure for Edge-Like Artifacts in Image Sequences. In: Human Vision and Electronic Imaging III, pp. 160–172. SPIE (1998)
7. Karunasekera, S.A., Kingsbury, N.G.: A Distortion Measure for Blocking Artifacts in Images Based on Human Visual Sensitivity. IEEE Trans. Image Processing (1995)
8. Yang, X., Lin, W., Lu, Z., Ong, E., Yao, S.: Motion-Compensated Residue Preprocessing in Video Coding Based on Just-Noticeable-Distortion Profile. IEEE Trans. on Circuits and Systems for Video Technology 15(6), 742–751 (2005)
9. Chou, C.H., Li, Y.C.: A Perceptually Tuned Subband Image Coder Based on the Measure of Just-Noticeable-Distortion Profile. IEEE Trans. on Circuits and Systems for Video Technology (December 1995)
10. Pappas, T.N., Safranek, R.J.: Perceptual Criteria for Image Quality Evaluation. In: Bovik, A.C. (ed.) Handbook of Image and Video Processing. Academic Press, San Diego (May 2000)
11. Laws, K.I.: Texture Energy Measures. In: Proc. DARPA Image Understanding Workshop, Los Angeles, pp. 47–51 (1979)
12. VQEG: Final report from the video quality experts group on the validation of objective models of video quality assessment (August 2003), http://www.vqeg.org/
13. Sheikh, H.R., Wang, Z., Cormack, L., Bovik, A.C.: LIVE image quality assessment database. http://live.ece.utexas.edu/research/quality
Estimating Reflectance Functions Using a Cyberware 3030 Scanner

Matthew P. Dickens and Edwin R. Hancock
Department of Computer Science, The University of York
Abstract. Measuring reflectance properties is important to the fields of computer graphics and vision. We present a novel, rapid measurement technique specifically targeting the reflectance properties of skin. Using a Cyberware laser scanner to capture range and radiance data, a sampling of the surface radiance function under a broad range of incident and view directions is constructed. This function is at the core of local reflectance computation and the recovered data can be used for various tasks including rendering, guiding models, albedo estimation and more. We qualitatively analyse the resulting data, and illustrate a number of possible uses for it.
1 Introduction
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 342–350, 2007. © Springer-Verlag Berlin Heidelberg 2007

The effects of local illumination provide important information about the shape and texture of a surface. Recently, the study of these effects has been important to both graphics and computer vision. The widely used assumption of Lambertian reflectance for diffuse surfaces is not sufficient for the needs of either field. Moreover, the demand for increasingly realistic rendering techniques from both the film and games industries, as well as the pursuit of more complete vision systems, provides an imperative for detailed reflectance models.

The bidirectional reflectance distribution function (BRDF, see Section 2) [1] completely describes the reflectance properties of a surface. Unfortunately, it is often impractical to measure due to its dimensionality and its sensitivity to changes in surface properties. However, various techniques have been described in the literature that attempt to capture samples of the BRDF for a given object. Classically, this involves the use of a gonioreflectometer, as discussed in [2]. In recent years, this single-sample-single-detector approach has been surpassed in popularity by image-based techniques that use CCDs to capture more complete samples in a shorter time. There are several examples of such methods. Amongst the more notable is that of Ward [2], which relies on a hemispheric mirror to direct the full reflected hemisphere onto a CCD, thus capturing the BRDF in a significantly shorter time than classic gonioreflectometer methods. Another approach is that described by Marschner [3], which uses a curved sample and a CCD to capture multiple angles of incidence and reflectance at once. Matusik et al. [4] use a similar method, placing an emphasis on capturing higher-frequency properties, i.e. areas of specularity, in more detail. Jensen [5], on the other hand,
constructs a device capable of directly measuring subsurface transport in skin and other materials in order to capture a more physically plausible model of diffuse reflectance. In [6], Robles-Kelly et al. present a method with the slightly different goal of recovering the surface radiance function of an object from a single image by implicitly exploiting the mapping between the Gauss sphere representation of surface normal distribution and the imaged surface. Although not capable of recovering the full BRDF, the method has been successfully used for applying corrected Lambertian reflectance to curved surfaces, specularity removal and material classification. Although the measured BRDF can be used directly, it is more commonly used to guide a model due to the memory overheads of storing the complete BRDF, as well as to avoid the inaccuracies of direct measurement [2]. This approach has been applied to various models, including the simple Gaussian-based isotropic model presented with the BRDF measurements in [2]. This model is then fit to the data observed with the measurement device described above using a simple least-squares approach. The simplicity of such models with so few parameters has, however, been criticised on the basis of their potential to statistically disregard interesting aspects of the BRDF as measurement error [4]. Matusik et al. observe that many analytical models of the BRDF use only a few parameters, and thus argue that the space of all BRDFs may be represented by a low-dimensional manifold [4]. Using data collected by a system similar to that of Marschner [3], they form both a linear and a non-linear basis for the BRDF space, and present renderings of novel and observed BRDF samples from within the space. The results are promising for materials that are relatively similar to the classes of those measured. However, the data gathering technique relies on spherical samples and thus objects of complex shape cannot easily be represented. 
Debevec et al. [7] capture multiple images from fixed views under different illumination conditions in order to construct a description of the "reflectance field", which substitutes a local description of light source direction for a more distant description. The observed data is split into specular and diffuse components using a method derived from the temporal-color spaces of Sato and Ikeuchi [8], and simple corresponding models are then fitted to the observed data. A wealth of other work has been undertaken in the graphics community with the goal of providing visually pleasing face renderings, with an emphasis on subsurface approximations for the diffuse component. The emerging trend is to opt for physically-based approximations over empirical methods. Notable works in the area include those of Hanrahan [9] and, more recently, Jensen [5]. There has therefore been significant effort to model reflectance properties. However, the acquisition process is still an active area of research, and fast methods are desirable to provide guidance for the many existing and emerging data-driven techniques. To provide such data, in this paper we develop a method for measuring the BRDF using a Cyberware 3030 scanner to simultaneously capture both range and luminance data. Combined with geometric calibration, this data can be used to estimate the reflectance properties of curved objects.
2 Radiometric Definitions
In this section we review the elements of radiometry required to develop our reflectance estimation process, and describe the data-capture process [10]. Figure 1 shows the local geometry of the luminance sampling process.

Fig. 1. Local geometry of light-surface interaction

BRDF: The bidirectional reflectance distribution function (BRDF) described by Nicodemus [1] completely describes the way in which incident irradiance and reflected radiance are related at a given point on the surface:

$$\rho_{bd}(\theta_i, \phi_i, \theta_r, \phi_r, P) = \frac{L_r(P, \theta_r, \phi_r)}{L_i(P, \theta_i, \phi_i)\,\cos\theta_i\,d\omega}$$

This definition of the BRDF is five dimensional, making it cumbersome by nature, and it is not uncommon to make the assumptions of homogeneity and isotropy to reduce the dimensionality. The BRDF is therefore the ratio of reflected radiance in the direction of the viewer to the incident irradiance.

Radiance: Radiance, L(P, θ, φ), describes the configuration of light sources throughout the environment [10]. At any point in any direction, radiance describes the transfer of energy per unit time along the specified direction of travel per unit solid angle. This quantity is based on foreshortened surface area, that is, the area of the surface visible from the source. This can be thought of as the power incident on a surface of area dA cos θ which lies perpendicular to the direction of travel, per unit solid angle.

Irradiance: Irradiance, I(P, θ, φ) = L(P, θ, φ) cos θ dω, is the transfer of energy per unit time per unit area along the direction of travel. The area is not foreshortened, thus defining how much power the surface patch receives from the specified incident direction. By integrating irradiance over the incident hemisphere, the total power incident on a point can be computed.
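As a minimal sketch of how these definitions combine, the following evaluates one BRDF sample as reflected radiance divided by incident irradiance. The function names and the numerical values are hypothetical, and a single small light source subtending the solid angle dω is assumed.

```python
import math

def irradiance(radiance_in, theta_i, d_omega):
    """Irradiance from a small source: I = L_i * cos(theta_i) * d_omega."""
    return radiance_in * math.cos(theta_i) * d_omega

def brdf_sample(radiance_out, radiance_in, theta_i, d_omega):
    """One BRDF sample: reflected radiance divided by incident irradiance."""
    incident = irradiance(radiance_in, theta_i, d_omega)
    if incident <= 0.0:
        raise ValueError("no incident irradiance (grazing or back-facing light)")
    return radiance_out / incident

# Illustrative check with an ideal Lambertian surface of albedo 0.5, whose
# BRDF is the constant albedo / pi regardless of the angles involved.
albedo = 0.5
incident = irradiance(10.0, math.pi / 3, 0.01)
reflected = (albedo / math.pi) * incident
sample = brdf_sample(reflected, 10.0, math.pi / 3, 0.01)
```

The Lambertian case is used only because its BRDF is known in closed form, which makes the ratio easy to verify.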
3 Reflectance Estimation
We present a method of rapidly gathering radiance information for a variety of curved objects, including the human face, using a Cyberware laser scanner. In contrast to previous approaches that use range data in conjunction with a separate imaging device, we use the "luminance map" returned by the scanner, which contains the reflected intensity of the laser at each point on the surface. Using the combination of luminance map and range data, samples of the surface radiance function under varying incident illumination can be obtained.
The goal is to obtain dense samples of observed radiance, L(θr, φr), over as many angles of reflectance as possible for a variety of incident angles. As in [3], under the assumptions of homogeneity and isotropy, the varying surface orientations of an object provide numerous potential samples of the BRDF depending on θr, θi and Δφ. This fact, together with the geometry of the Cyberware scanner, lies at the core of the proposed method. The advantage of this technique is that depth can be captured simultaneously and the whole capture process is completed in the time taken for this single scan (typically 20 seconds).

3.1 Scanner Operation
The following information regarding the scanner is derived from its data-sheet [11]. The Cyberware 3030 scanhead is a range sensor capable of digitising objects with a resolution of up to 1024 × 1440 points. Expressed in cylindrical coordinates in correspondence with the motion of the scanhead, surface points generate entries in the output depth and luminance maps. The mapping between the surface height and luminance is direct, and each surface point in the depth map has an associated luminance. The scanhead operates on the principle of capturing profiles of a surface, which are recorded on a CCD via two "virtual cameras". A plane of laser light is emitted towards the centre of rotation of the scan system. Two mirrors inside the scanhead direct the reflected light towards a CCD, from which the data can be used to compute profiles of the surface, a series of which fully describe the surface. The internal geometry of the camera relies on a series of mirrors but can be conceptualised in a more convenient manner, as illustrated in Figure 3.

Fig. 2. Normalised sample scanner output

Fig. 3. (Left) Scanner description, based on that found in [11]. (Right) Conceptual layout of imaging geometry.

Normal Computation: From the range data, surface normals can be trivially computed per mesh-face or per mesh-vertex using any of a number of standard techniques, with little variation between methods due to the density of the sampling. In order to conveniently maintain a relationship between surface normal and luminance values, the normals are computed per-vertex using the mean weighted-angle approach [12].

Object Shape: Because of the cylindrical nature of the scan, a purely cylindrical object centered exactly on the axis of rotation will return a single datum (assuming homogeneity and isotropy) due to the single incident and exitant angles. This presents the worst-case sampling of the BRDF. However, in most cases the curvature of the object is sufficiently varied to provide a range of surface orientations, which in turn provide a range of incident and exitant angles. The zenith angles can be computed using θi = arccos(N · I) and θr = arccos(N · R), where

$$I = \begin{pmatrix} 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos\alpha & 0 & \sin\alpha \\ 0 & 1 & 0 \\ -\sin\alpha & 0 & \cos\alpha \end{pmatrix}$$

Here

$$\alpha = -\frac{j \times 2 \times \pi}{L_{inc}} + \pi,$$

j is the column of the depth map under consideration and L_inc is the total horizontal resolution of the depth map. R is defined similarly to I, instead using α + arctan(x/d) in the rotation matrix, where x is the horizontal offset of the virtual camera as shown in Figure 3 and d is the distance from the light emitter (laser) to the surface. Values of the difference in azimuth angle Δφ between the emitter and the camera can be computed by projecting I and R onto the normal N to give the points Pi and Pr. The vector defined by I − Pi is a vector Ip in the plane perpendicular to N in the direction of I. The same can be performed for R and Pr, giving Rp, and thus Δφ = arccos(Ip · Rp).

3.2 Practicalities of Measurement
In default operation, the scanner captures luminance data using the combined results of both virtual cameras in a proprietary (undocumented) manner. For the purposes of this method, however, we need a luminance value due to a single reflected ray. To effect this we simply obstruct the path of reflected light to one of the virtual cameras (see footnote 1). This gives a single luminance value with a predictable angle of reflection. Further, the scanner is calibrated and its internal units are computed using a test cylinder manufactured by Cyberware.

Footnote 1: Thanks to Gene Sexton of Cyberware for discussion regarding the practicalities.
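With a single virtual camera, the incident and exitant angles of each sample follow from the geometry of Section 3.1. The following is a minimal sketch of that computation, not the authors' implementation. It assumes a unit-length surface normal, and it normalises the in-plane vectors Ip and Rp before taking the arccosine, a step that is implied but not stated explicitly in the text. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def direction(alpha):
    """(0, 0, 1) multiplied by the rotation about the vertical axis by alpha."""
    return (-math.sin(alpha), 0.0, math.cos(alpha))

def tangential(v, n):
    """Unit vector along the component of v perpendicular to the unit normal n."""
    s = dot(v, n)
    t = tuple(p - s * q for p, q in zip(v, n))
    length = math.sqrt(dot(t, t))
    if length < 1e-12:
        return None  # v is parallel to n, so its azimuth is undefined
    return tuple(p / length for p in t)

def clamp(c):
    """Guard arccos against rounding slightly outside [-1, 1]."""
    return max(-1.0, min(1.0, c))

def sample_angles(normal, j, l_inc, x, d):
    """Return (theta_i, theta_r, delta_phi) for depth-map column j.

    normal: unit surface normal; l_inc: horizontal resolution of the depth map;
    x: horizontal offset of the virtual camera; d: emitter-to-surface distance.
    """
    alpha = -2.0 * math.pi * j / l_inc + math.pi
    i_dir = direction(alpha)
    r_dir = direction(alpha + math.atan(x / d))
    theta_i = math.acos(clamp(dot(normal, i_dir)))
    theta_r = math.acos(clamp(dot(normal, r_dir)))
    i_p = tangential(i_dir, normal)
    r_p = tangential(r_dir, normal)
    if i_p is None or r_p is None:
        delta_phi = 0.0  # convention chosen here for the degenerate case
    else:
        delta_phi = math.acos(clamp(dot(i_p, r_p)))
    return theta_i, theta_r, delta_phi

# Hypothetical example: column 0 of a 512-column map, normal tilted by 30 degrees.
angles = sample_angles((0.0, 0.5, -math.sqrt(3) / 2), 0, 512, 1.0, 1.0)
```

The sign convention for I and R (towards or away from the surface) is taken directly from the matrix expression above and may need to be flipped to match a particular normal orientation.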
Radiometric calibration is a common concern when considering surface reflectance properties. The radiometric quantities are clearly defined, and different approaches from the literature go to varying lengths in order to ensure adherence to these definitions. The internal mechanisms of the scanner make a significant portion of radiometric calibration impractical due to the proprietary way in which the CCD data is processed. A simple distance-squared correction was nevertheless applied to the resultant luminance map, but it had little effect because the variation in surface height is relatively small.
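The exact form of the distance-squared correction is not given in the text. One plausible reading, sketched here purely as an assumption, is to compensate inverse-square falloff by scaling each luminance sample by the squared ratio of its distance to a reference distance. The function name and the reference distance are hypothetical.

```python
def distance_squared_correction(luminance, distance, reference_distance):
    """Scale each luminance sample by (d / d_ref)^2 (assumed form of the correction)."""
    scale = [(d / reference_distance) ** 2 for d in distance]
    return [lum * s for lum, s in zip(luminance, scale)]

# Hypothetical samples: equal raw luminance measured at two different distances.
corrected = distance_squared_correction([1.0, 1.0], [1.0, 2.0], 1.0)
```

When the distances vary only slightly around the reference, the scale factors stay close to 1, which is consistent with the small effect reported above.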
4 Observations
Data Coverage: It is immediately apparent that although the full range of θi and θr values is not covered, the region that is covered is densely sampled. The boundaries of the data seen in Figure 4 are due to the fixed relationship between the emitter and detector. Although the surface height varies, this relative variation is not significant enough to broaden the range of available samples significantly. This configuration has similar implications for the range of Δφ, as is also evident from the figure. Although it is not possible to sample the full BRDF, the data that is available has some notable properties suggesting possible applications, which will be discussed in the following sections.
[Figure 4 plots. Right panel: Δφ (vertical axis, 0 to 3.5) against θi (horizontal axis, 0 to 1.6), with one curve for each θr = 0.1, 0.2, ..., 1.6.]

Fig. 4. (Left) θi against θr. The intensities of the points are also shown. (Right) Values of Δφ for various θi and θr values.
Comparing Radiance Estimates: Figure 5 compares radiance curves over various values of θi for a scan of a human face and for a scan of a plaster bust. A second dataset of the same face is provided for comparison, which shows that the two face scans share very similar properties, even in this simple representation. This similarity could be exploited by recognition or classification techniques, if variation between subjects is sufficiently large. Further use of the recovered radiance estimates for these purposes is left for investigation. Individual analysis of the shapes of the curves is impractical because the Δφ values are so closely linked to θi and θr. However, face-like objects exhibit very similar relationships between incident and exitant angles, and because of this a comparison for recognition purposes is more viable. Figure 5 clearly highlights the difference in radiance curves between plaster and human skin.

[Figure 5 plots: intensity (vertical axis) against θr (horizontal axis), with one curve per value of θi (legend entries θi = 0.1, 0.2, ... in steps of 0.1).]

Fig. 5. (Left) Curves showing θr and intensity for various values of θi for the face. (Right) The same, for the bust. (Bottom) Second dataset for same face.

Albedo Estimation: One practical use for the data is to produce texture maps of objects devoid of shading information due to illumination. Such maps are often referred to as albedo maps, and are useful for recognition purposes as they are illumination independent [13], or for realistic facial rendering. To achieve the estimation, a 256 × 256 look-up table was constructed from the available data. From the radiance data X, 256 vectors v_k are constructed, where the l-th element of v_k is

$$v_{lk} = \frac{1}{N} \sum_{n=1}^{N} x_n .$$

The elements x_n are selected from X in two stages. First, an array is constructed for each of the 256 θr values, containing those elements x_m of X for which

$$\left| x_m^{\theta_r} - \frac{k}{256} \times \frac{\pi}{2} \right| < t,$$

that is, the points with a reflected angle that deviates from the k-th θr value by less than t radians. The set X^i is then constructed from this array, containing those elements x_m for which

$$\left| x_m^{\theta_i} - \frac{l}{256} \times \frac{\pi}{2} \right| < t,$$

and the N elements x_n of X^i are averaged. If there is not sufficient data for a k, l pair then the entry is set to 0. Various values of t were used, with insignificant effects on the outcome; t = 0.1 was chosen for the presented results. For each θi, θr pair, the vector v_k is first chosen using

$$k = \frac{\theta_r}{\pi/2} \times 256 .$$

The radiance is then chosen using a similar function of θi.
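The construction of the look-up table can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: samples are taken to be (θi, θr, radiance) triples, "sufficient data" is interpreted as at least `min_count` samples (a hypothetical parameter, defaulting to 1), and indices are obtained by rounding to the nearest bin.

```python
import math

HALF_PI = math.pi / 2

def build_table(samples, t=0.1, bins=256, min_count=1):
    """Build table[k][l], the mean radiance near the k-th theta_r and l-th theta_i.

    samples: iterable of (theta_i, theta_r, radiance) triples, angles in radians.
    Entries with fewer than min_count contributing samples are left at 0.
    """
    samples = list(samples)
    table = [[0.0] * bins for _ in range(bins)]
    for k in range(bins):
        centre_r = k / bins * HALF_PI
        # Stage 1: samples whose reflected angle is within t of the k-th value.
        near_r = [s for s in samples if abs(s[1] - centre_r) < t]
        if not near_r:
            continue
        for l in range(bins):
            centre_i = l / bins * HALF_PI
            # Stage 2: of those, samples whose incident angle is within t.
            values = [s[2] for s in near_r if abs(s[0] - centre_i) < t]
            if len(values) >= min_count:
                table[k][l] = sum(values) / len(values)
    return table

def lookup(table, theta_i, theta_r):
    """Read the radiance estimate for a (theta_i, theta_r) pair."""
    bins = len(table)
    k = min(bins - 1, round(theta_r / HALF_PI * bins))
    l = min(bins - 1, round(theta_i / HALF_PI * bins))
    return table[k][l]

# Hypothetical data on a coarse 8-bin table to keep the example small.
c3 = 3 / 8 * HALF_PI
demo_samples = [(c3, c3, 2.0), (c3 + 0.05, c3 - 0.05, 4.0), (c3, 1.4, 9.0)]
demo_table = build_table(demo_samples, t=0.1, bins=8)
```

With 256 bins the spacing between neighbouring angles is about 0.006 radians, so a tolerance of t = 0.1 makes each entry an average over many neighbouring bins. This overlap acts as a smoothing filter on the estimated radiance function.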
Fig. 6. (Left) The original luminance map. (Centre Left) Estimated albedo (shading removed). (Centre Right) Render using acquired function. (Right) Lambertian render.
Figure 6 shows an example of the outcome of the process. It is clear from the central image that the majority of the shading information is removed to an impressive extent, especially around the chin and nose regions. A shortcoming of this method is the persistence of the specularity on the nose. Although a significant specular component is present, it is clear from the rendered image that this very high frequency area is not predicted, largely due to the degree of smoothing ascribed to the radiance function estimation stage.
5 Summary
We have presented a novel approach for measuring a part of the BRDF using a Cyberware range scanner. Although not intended for the purpose, the scanner is capable of capturing useful radiance data, which in turn has been shown to be useful for albedo estimation. We have also examined the range of data recovered by the scanner. Further work will attempt to incorporate the reflectance data into a recognition framework, to augment the process with illumination- and view-independent information.
References

1. Nicodemus, F.E., Richmond, J.C., Hsia, J.J., Ginsberg, I.W., Limperis, T.: Geometrical Considerations and Nomenclature for Reflectance. Final Report, National Bureau of Standards, Washington, DC. Inst. for Basic Standards (1977)
2. Ward, G.J.: Measuring and Modeling Anisotropic Reflection. Computer Graphics 26(2), 265–272 (1992)
3. Marschner, S.R., Westin, S.H., Lafortune, E.P.F., Torrance, K.E., Greenberg, D.P.: Image-Based BRDF Measurement Including Human Skin. In: Proc. 10th Eurographics Workshop on Rendering, pp. 139–152 (1999)
4. Matusik, W., Pfister, H., Brand, M., McMillan, L.: Efficient Isotropic BRDF Measurement. In: Eurographics Workshop on Rendering, pp. 241–247 (2003)
5. Jensen, H.W., Marschner, S.R., Levoy, M., Hanrahan, P.: A Practical Model for Subsurface Light Transport. In: Proc. SIGGRAPH, pp. 511–518 (2001)
6. Robles-Kelly, A., Hancock, E.R.: Estimating the Surface Radiance Function from Single Images. Graphical Models 67(6), 518–548 (2005)
7. Debevec, P., Hawkins, T., Tchou, C., Duiker, H.-P., Sarokin, W., Sagar, M.: Acquiring the Reflectance Field of a Human Face. In: Proc. SIGGRAPH, pp. 145–156 (2000)
8. Sato, Y., Ikeuchi, K.: Temporal-Color Space Analysis of Reflection. Journal of Optical Society of America A 11(11), 2990–3002 (1994)
9. Hanrahan, P., Krueger, W.: Reflection from Layered Surfaces due to Subsurface Scattering. In: Proc. SIGGRAPH, pp. 165–174 (1993)
10. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall, Englewood Cliffs (2003)
11. Cyberware Scanhead Operation Manual (last accessed: 29/3/2007), http://www.cyberware.com/guides/cyscan/operation.html
12. Walsh, J.: Normal Computations for Heightfield Lighting (last accessed: 01/04/2007), http://www.gamedev.net/reference/programming/features/normalheightfield/default.asp
13. Smith, W.A.P., Hancock, E.R.: Estimating Facial Albedo from a Single Image. International Journal of Pattern Recognition and Artificial Intelligence (2006)
Are Younger People More Difficult to Identify or Just a Peer-to-Peer Effect

Wai Han Ho, Paul Watters, and Dominic Verity
Macquarie University, NSW, Australia
Abstract. Recent investigations into the effect of age on face identification concluded that it was more difficult to identify younger people than older ones. The identification rates of the different age groups were, however, not measured under identical conditions. There was a significantly higher percentage of younger people in all the face image samples. We found that a person from any age group will find that they look more similar to another person from the same age group, as opposed to someone from another age group. The experiments we carried out using samples that have an evenly distributed age range did not show a statistically significant difference between the sample age groups.

Keywords: Aging, face, identification, biometrics.
1 Introduction
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 351–359, 2007. © Springer-Verlag Berlin Heidelberg 2007

Ho et al. showed in a survey [1] that the amount of research on face recognition had increased rapidly in recent years, but that research into the effect of aging on face recognition was limited compared to research into illumination change, pose, expression and occlusion. In this paper, we discuss whether people of different ages have different levels of difficulty in being identified. We look at how a person's age affects the person's chance of being correctly identified (not the effect of time on a person's identification). Age difference refers to the difference between the ages of the two people involved, and thus their images.

Recent analyses [2,3,4] of the effect of age on face identification, using different face samples, reached a similar conclusion: younger people were more difficult to identify. This research project found that the samples used all had an uneven distribution of different ages, with a higher percentage of younger people. This raised the question of whether younger people were more difficult to identify, or whether the samples made the identification more difficult because of the higher representation of younger faces in them. The experimental results we obtained showed that a person from any age group will find that they look more similar to another person from the same age group, as opposed to someone from another age group. This may explain why a lower identification rate was measured for younger people in the previous studies, which used samples that have more younger people. We found that an
age group’s identification rate could be increased by reducing its proportion in a sample. Finally, we carried out two experiments using samples that have an even distribution of different ages in order to determine whether younger people are really more difficult to recognize. We found that although younger people showed a higher tendency to present a lower identification rate, the difference was not statistically significant. Section 2 describes in detail the correlation found between the age group sizes and identification rates measured in previous experiments. Section 3.1 describes how a change in sample composition changed the identification rates of different age groups. Section 3.2 shows the relationship between the similarity and age difference of two people, and section 3.3 describes the experiments carried out using samples of an even age distribution to determine whether there was a significant difference between the identification rates of older and younger people. Section 4 summarizes our findings.
2 Previous Experiments from Another Perspective
FRVT 2002 [2] analyzed the effect of age on identification using 121,589 images from 37,437 people, each having at least three images. The gallery set contained the oldest images of the people while the probe set had the more recent ones. The gallery and probe sets have image pairs of elapsed time up to 1140 days. Among them, 10,506 or about 14% were between 0-60 days. The people in the sample were classified into 12 age groups of five years apart according to the age of their oldest images. The first group of people were of age 18 to 22, and the last one from 73 to 78. The research showed that the identification rates of the age groups increased with their age for the majority of the algorithms tested, and concluded that younger people were more difficult to recognize. The assumptions we made concerning the data provided in the FRVT02 report were:

(a) As no information was provided in the report about the individual distribution of the elapsed time periods for each age group, we assumed the individual age groups had a similar distribution as the distribution of all the age groups.
(b) As the report provided no information on the change in identification rate for different age groups for individual elapsed time period, we assumed that the different elapsed time periods had similar trends in identification rate.

Fig. 1. FRVT02 Ident. Rate versus Group Size (plot title: FRVT02 Identification Rates of Different Ages vs Corresponding Subject Counts)
[Fig. 1 plot: identification rate (vertical axis, 0.2 to 1) against subject count (horizontal axis) for the age groups from 18−22 to 73−77. Legend: Cognitec, r = −0.94968; Eyematic, r = −0.87523; VisionSphere, r = −0.85691.]
Based on the above assumptions, the '0-60 days' elapsed time period had a similar identification trend as the overall trend of all elapsed time periods. From the FRVT02 report, we found that the face sample used had an uneven distribution of different ages, with a decreasing percentage as age increases. The age group '23 - 27' had the largest number of people, 8054, while the oldest age group had the minimum, 299. The only exception was that age group '18 - 22' had fewer people than age group '23 - 27'. When we disregarded the age and considered only the change in identification rate with group size, we found a high negative correlation between the two. The correlation coefficients between the group sizes and identification rates of the eight algorithms shown in the FRVT02 report were -0.95, -0.88, -0.86, -0.93, -0.90, -0.90, -0.90 and -0.66, very close in magnitude to those between the ages and identification rates, which were 0.93, 0.80, 0.92, 0.99, 0.92, 0.98, 0.95 and 0.45. Figure 1 plots the variation in identification rate with group size for three of the algorithms.

The research done by Ricanek et al. [4] using the MORPH dataset also resulted in a conclusion that younger people were more difficult to identify. According to the paper, the test sample contained 1,724 face images from 515 people, each having two to four images. The elapsed time between the first enrolled and subsequent images ranged from 46 days to 29 years. The people in the dataset were classified into four age groups. Table 1 tabulates the age distribution and the corresponding identification rates when identifying images having an elapsed time of 0-5 years [4]. The correlation when looking at all of the four groups was -0.41. However, as the number of groups was relatively small, any deviation from the norm would probably have rendered a larger drop in correlation. The correlation for the oldest three groups was -0.84.

Table 1. Age Dist. and I. Rate from [4]

Age Range    < 18    18 - 29   30 - 39   40 - 49
Percentage   10%     55%       26%       9%
I. Rate      0.344   0.42      0.452     0.8

Given et al. [3] also showed that older (>40) people were easier to recognize, using images on the FA list of the publicly available FERET [5] database. The authors did not provide the age distribution of the sample used in the study, so we could not determine if a similar correlation existed. However, the version of the FERET database available to us also had fewer older people. Table 2 tabulates the age distribution of people on the FA list with ground truths.

The investigations into the effect of age on face identification all came to a similar conclusion: it was more difficult to identify younger people. However, when studying the experiments from another perspective, we found a high negative correlation between the ages and sizes of the age groups, and thus between the identification rates and sizes. This leads to the question of whether the difference in identification rates measured was actually caused by a difference in the ages of the people studied, or by a difference in proportions.
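The correlations quoted for Table 1 can be recomputed with a short sketch. The helper name `pearson` is hypothetical, and the inputs are the values printed in Table 1. The results may differ slightly from the reported -0.41 and -0.84, for instance because the percentages in the table are rounded.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Values from Table 1: share of each age group and its identification rate.
percentage = [10, 55, 26, 9]             # < 18, 18-29, 30-39, 40-49
ident_rate = [0.344, 0.42, 0.452, 0.8]

r_all_groups = pearson(percentage, ident_rate)
r_oldest_three = pearson(percentage[1:], ident_rate[1:])
```

Both coefficients come out negative, and the one for the oldest three groups is markedly stronger, which matches the pattern described in the text.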
3 Our Study into the Age Effect on Face Recognition
We believe the group sizes played a large part in the identification rates measured in the previous experiments. Subsection 3.1 describes an experiment showing that the identification rates of younger age groups could be increased by reducing their proportions in a sample. Subsection 3.2 shows the relationship between image similarity and the age difference of two people. We believe the higher similarity between people of a similar age is one of the reasons why an age group with a larger representation in a sample showed a lower identification rate. Subsection 3.3 describes the experiments done with an even composition of different ages, in which the difference in identification rate was not statistically significant.

3.1 The Change in Identification Rate with Proportion
The experiment used the Principal Component Analysis (PCA) algorithm developed by CSU [6], and the publicly available FERET face database distributed by FERET [5]. The images in the gallery and probe sets came from the FA and FB lists respectively. The lists consisted of images taken from the same people on the same date but with different expressions. After eliminating people without ground truths, the final sample consisted of images from 986 people of age 10, 20 (two of age 22), 30, 40, 50 and 60. The two people in their 70's were eliminated because of their small percentage. Table 2 tabulates the distribution. A person was said to be identified if the Euclidean distance between his probe and gallery images was shorter than those between the probe image and all other gallery images. The overall identification rate was calculated as the number of images correctly identified out of all probe images tested. The identification rate of an age group was the number of correctly identified images belonging to people of that age group out of all probe images of that age group. Table 2 tabulates the identification rates measured for the age groups. The rates measured generally increased with age. But again, the age groups with a higher identification rate generally had a lower representation in the sample. When the numbers of people at ages 20 and 30 in the sample were reduced to those in Table 3, the identification rates of the two age groups increased to values close to those of the older ages. The identification rates of the older age groups, on the other hand, remained almost unchanged. The test illustrated that the identification rate measured for an age group does change with its proportion.

Table 2. Age Distribution of Our Sample and Corresponding Ident. Rate
Age      10     20     30     40     50     60
Count    18     444    257    163    80     24
I. R.%   66.7   73     85.2   86.5   81.3   91.7

Table 3. Adjusted Age Distribution and Corresponding Identification Rate
Age      10     20     30     40     50     60
Count    18     40     70     163    80     24
I. R.%   77.8   82.5   95.7   88.3   82.5   91.7
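The identification rule and the per-group rates described in this subsection can be sketched as follows. This is a minimal illustration, not the CSU implementation: the feature vectors are hypothetical stand-ins for PCA projections, the function name `identification_rates` is ours, and a tie between the true match and another gallery image is counted as a failure, which is our reading of "shorter than".

```python
import math
from collections import defaultdict

def identification_rates(gallery, probes):
    """Rank-1 identification with Euclidean distance.

    gallery, probes: dict person_id -> (age_group, feature_vector).
    Returns (overall_rate, {age_group: rate}).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pid, (age, probe_vec) in probes.items():
        own = math.dist(probe_vec, gallery[pid][1])
        # Identified only if the own gallery image is strictly the closest one.
        identified = all(
            own < math.dist(probe_vec, vec)
            for gid, (_, vec) in gallery.items()
            if gid != pid
        )
        totals[age] += 1
        hits[age] += int(identified)
    per_group = {age: hits[age] / totals[age] for age in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_group

# Toy data (hypothetical 2-D features): person "b" is closer to "a" than to itself.
gallery = {"a": (20, (0.0, 0.0)), "b": (20, (1.0, 0.0)), "c": (40, (5.0, 5.0))}
probes = {"a": (20, (0.1, 0.0)), "b": (20, (0.4, 0.0)), "c": (40, (5.0, 4.0))}

overall, per_group = identification_rates(gallery, probes)
print(overall, per_group)
```

Removing gallery entries from a crowded age group leaves fewer close competitors for the remaining probes of that group, which is the mechanism the adjusted sample of Table 3 probes.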
3.2 Similarity Score Versus Age Difference of Two People
Table 4 shows the statistics of the normalized distances between the probe and gallery images of Section 3.1. The normalized distance of a probe to a gallery image is its original distance minus the mean of all distances of that particular probe to all the gallery images, divided by their standard deviation. Normalization provides a uniform basis for similarity comparison. The rows display the statistics for the different age groups of the probe set, while the columns display the statistics for the different age groups of the gallery set. For example, the cell in the first data row and second data column shows the statistics of the probe images in the age group of 10 compared to the gallery images of those in the group of 20's. The first line of each row shows the mean and standard deviation, and the second line the 2.5% and 97.5% percentiles. As identification failures usually occur at outliers, we studied the 2.5% percentile distances. Figure 2 plots the 2.5% percentiles of all the age groups. Each line in the figure represents the normalized distances of the probe images belonging to a certain age group to the gallery images. The x-axis presents the age groups of the gallery images, and the y-axis the distances. The figure shows that the younger groups, like those of age 10 and 20, generally had a greater distance to the older groups; the distance increased with age difference. Likewise, the older groups generally had a greater distance to the younger groups; the distance decreased as the age difference decreased. Figure 3 plots the 2.5% percentile distances again but with respect to the age differences between the people of the probe and gallery images. Negative values on the x-axis represent cases where the person of the gallery image was younger than the person of the probe. Positive values represent cases where the person of the gallery image was older than the person of the probe. The value is zero when both people were in the same age group.
The figure clearly shows a ‘V’ shape for all age groups, with the location of the tip close to the zero x value. The age of 20 had the sharpest tip.

Table 4. Normalized Distance between Images of Different Age Groups (rows: probe age group; columns: gallery age group; first line of each row: mean ± standard deviation; second line: 2.5% and 97.5% percentiles)
Age Group  10             20's           30             40             50             60
10         -0.28 ± 1.00    0.05 ± 1.09    0.17 ± 1.02    0.10 ± 0.89    0.00 ± 0.82    0.10 ± 0.83
           -1.94 - 1.92   -1.97 - 2.10   -1.65 - 2.28   -1.52 - 1.93   -1.47 - 1.78   -1.32 - 1.87
20's       -0.14 ± 0.97   -0.05 ± 1.09    0.06 ± 1.04    0.06 ± 0.89   -0.01 ± 0.83    0.05 ± 0.86
           -2.04 - 1.79   -2.15 - 2.01   -1.87 - 2.21   -1.61 - 1.85   -1.57 - 1.74   -1.50 - 1.79
30         -0.10 ± 0.92    0.03 ± 1.04    0.05 ± 1.08   -0.02 ± 0.95   -0.12 ± 0.90   -0.05 ± 0.91
           -1.91 - 1.83   -1.96 - 2.04   -1.95 - 2.28   -1.80 - 1.91   -1.78 - 1.75   -1.70 - 1.82
40         -0.10 ± 0.89    0.10 ± 1.01    0.07 ± 1.10   -0.08 ± 0.99   -0.23 ± 0.91   -0.17 ± 0.89
           -1.77 - 1.79   -1.79 - 2.12   -1.89 - 2.36   -1.86 - 1.96   -1.86 - 1.76   -1.73 - 1.75
50         -0.08 ± 0.89    0.16 ± 1.00    0.09 ± 1.09   -0.09 ± 0.99   -0.31 ± 0.91   -0.28 ± 0.87
           -1.82 - 1.80   -1.69 - 2.15   -1.82 - 2.43   -1.82 - 2.03   -1.86 - 1.78   -1.84 - 1.52
60         -0.04 ± 0.82    0.18 ± 0.99    0.10 ± 1.10   -0.14 ± 0.98   -0.39 ± 0.88   -0.37 ± 0.89
           -1.39 - 1.89   -1.62 - 2.17   -1.74 - 2.39   -1.80 - 1.94   -1.87 - 1.75   -1.89 - 1.49
Fig. 2. Normalized 2.5 Percentile Distance versus Age of Gallery (x-axis: Age, 10 to 60; y-axis: Normalized Distance, −2.2 to −1.3; one curve per probe age group: 10, 20's, 30, 40, 50, 60)
Fig. 3. Normalized 2.5 Percentile Distance versus Age Difference (x-axis: Age Difference, −50 to 50; y-axis: Normalized Distance, −2.2 to −1.3; one curve per probe age group: 10, 20's, 30, 40, 50, 60)
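The per-probe normalization behind Table 4 and the figures can be sketched as below. The sketch makes two assumptions that the text does not settle: the population standard deviation is used, and the 2.5% percentile is taken with the nearest-rank rule. Both function names are ours.

```python
import math
import statistics

def normalize_distances(distances):
    """Z-score the distances of one probe to all gallery images."""
    mean = statistics.fmean(distances)
    std = statistics.pstdev(distances)  # assumption: population standard deviation
    return [(d - mean) / std for d in distances]

def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile (an assumed convention, not taken from the paper)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical raw distances of one probe to five gallery images.
raw = [12.0, 15.0, 9.0, 20.0, 14.0]
z = normalize_distances(raw)
print([round(v, 2) for v in z], percentile_nearest_rank(z, 2.5))
```

After normalization every probe has mean 0 and unit spread over the gallery, so strongly negative values such as the 2.5% percentiles mark the gallery images that compete most closely with the true match.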
The mean distance between a probe and the gallery images of the same age, as shown in Table 4, was also generally among the smallest values of a row. The results indicated that there was a higher similarity between images of two people of the same age than between those of different ages. No matter how old a person is, they look more similar to peers of a similar age than to people of a different age. This implies that age groups having a larger proportion in a sample are more likely to have a lower identification rate, because their images are compared to more images of a higher similarity. As shown in Subsection 3.1, the identification rates of the age groups 20 and 30 increased when their proportions in the sample were reduced from those in Table 2 to those in Table 3.

3.3 Evaluations Using an Even Distribution of Different Ages
Two experiments were carried out using samples that had an equal share of different ages, to evaluate the age effect on identification without the influence of age distribution. The two experiments differed in the number of age groups and the number of people (images) in each age group. The identification rate of the older people was compared to that of the younger ones to see if there was a significant difference between the two. The null hypothesis was that there is no difference between the identification rates of older and younger people.

Both experiments used the Owner-tester setup suggested in [7] instead of the gallery-probe setup. The Owner-tester setup consists of an owner set and a tester set of images from two separate groups of people. Each owner has two images in the owner set, which are tested against the images of all testers in the tester set. An owner is identified if the similarity between the two owner images (one treated as known and the other as unknown) is higher than each of the similarity scores between the unknown image and the tester images. Owners are tested against testers only, not against each other. The observation of any one owner is therefore not influenced by other owners, unlike the gallery-probe setup, where the images of a person are involved not only in the derivation of that person's observation but also in those of others. The identification rate is the number of owners correctly identified out of all owners.

In the first experiment, five samples of 320 people were formed by randomly selecting images from the FERET FA and FB lists of those people having ground truths. Each sample had 80 people randomly chosen from each age group of 20, 30, 40 and 50, who were then randomly divided into two halves and allocated to the owner and tester sets respectively. The final owner and tester sets consisted of images from the respective people allocated to the sets. The identification of each owner was determined by comparing the Euclidean distances between images as described above. The identification rate of a certain age group was the number of correct identifications of that age group out of all owners of that age group. The difference (DS) between the identification rates of the older (OS, 40's and 50's) and younger (YS, 20's and 30's) people was calculated as DS = OS − YS. The 95% confidence interval of DS was calculated using 2000 bootstrap samples. For each bootstrap sample, 40 owners of each age group were randomly selected with replacement from the respective age group in the original owner set to form the bootstrap owner set; the bootstrap tester sets were formed similarly. Table 5 shows the 2.5 and 97.5 percentiles together with the probability for DS > 0 (p0). Samples 1 and 4 showed a significant difference, but samples 2, 3 and 5 did not. The p0 value of sample 5 was actually low, only 0.36.

The same procedure was used in the second experiment, except that there were 160 people for each age group of 20, 30 and 40. The final owner and tester sets each had 80 people from each of the three age groups. People of age 50 were not included in the experiment because there were not enough people of that age. In this experiment, the difference between the identification rates of the 40s (OS2) and 20s (YS2) was calculated (DS2 = OS2 − YS2). Table 6 shows the 2.5 and 97.5 percentiles of DS2, together with the probability for DS2 > 0 (p0). Samples 1 and 5 showed a significant difference, but samples 2, 3 and 4 did not. The p0 value of sample 3 was only 0.04.

Table 5. The percentiles of ident. rate diff. between (40&50) & (20&30). Alg.: PCA. Owners & Testers: 40 people for each age group of 20, 30, 40 and 50.
Sample   2.5 Percentile %   97.5 Percentile %   p0 value
1         5                 12.5                1
2        -2.5                7.5                0.88
3        -2.5               12.5                0.79
4         0                 12.5                0.94
5        -5                  5                  0.36

Table 6. The percentiles of ident. rate diff. between 40 and 20. Alg.: PCA. Owners & testers: 80 people for each age group of 20, 30 and 40.
Sample   2.5 Percentile %   97.5 Percentile %   p0 value
1         0                  7.5                0.93
2        -2.5                5                  0.53
3        -6.25               2.5                0.04
4         0                  6.25               0.88
5         1.88              13.75               0.98
The results of the above two experiments showed that the difference in performance between the younger and older age groups was not significant enough to conclude that it is more difficult to identify younger people than older people. The difference in identification could have been caused by sampling error.
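The owner-tester identification and the bootstrap estimate of DS described in this subsection can be sketched as follows. This is an illustrative reading of the procedure, not the code used in the experiments: feature vectors are hypothetical, all function names are ours, and the percentile bounds are read off the sorted bootstrap differences with a simple index rule that the text does not specify.

```python
import math
import random

def owner_identified(known, unknown, tester_vecs):
    """An owner is identified if its two images are closer than any tester image."""
    own = math.dist(known, unknown)
    return all(own < math.dist(unknown, t) for t in tester_vecs)

def group_rate(owners, testers, ages):
    """Identification rate of the owners whose age group is in `ages`."""
    tester_vecs = [vec for _, vec in testers]
    selected = [o for o in owners if o[0] in ages]
    hits = sum(owner_identified(k, u, tester_vecs) for _, k, u in selected)
    return hits / len(selected)

def resample_by_age(items, rng):
    """Draw, with replacement, as many items per age group as the group holds."""
    by_age = {}
    for item in items:
        by_age.setdefault(item[0], []).append(item)
    resampled = []
    for group in by_age.values():
        resampled.extend(rng.choices(group, k=len(group)))
    return resampled

def bootstrap_ds(owners, testers, old, young, n_boot=2000, seed=0):
    """Return (2.5 percentile, 97.5 percentile, p0) of DS = OS - YS."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        o = resample_by_age(owners, rng)
        t = resample_by_age(testers, rng)
        diffs.append(group_rate(o, t, old) - group_rate(o, t, young))
    diffs.sort()
    low = diffs[int(0.025 * (n_boot - 1))]
    high = diffs[int(0.975 * (n_boot - 1))]
    p0 = sum(d > 0 for d in diffs) / n_boot
    return low, high, p0
```

Here owners are `(age, known_vector, unknown_vector)` tuples and testers are `(age, vector)` tuples. An interval whose lower bound stays above zero would correspond to the samples reported as significant in Tables 5 and 6.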
4 Summary

Previous research into the effect of age on face recognition concluded that younger people were more difficult to identify. However, we showed that the experiments were carried out using face databases that had an uneven distribution of different ages. There was a high correlation between the age group sizes and the identification rates obtained. We illustrated that a person from any age group looked more similar to another person from the same age group than to someone from another age group. Thus, it could be the uneven age distribution that led, totally or partially, to the difference in identification rates in previous studies. We therefore carried out experiments using samples that had an even distribution of ages to further investigate the effect of age on face recognition. The experimental results showed that although younger people had a lower identification rate in some cases, the difference was not significant enough to conclude that they are harder to identify. In fact, older people were found to have a lower identification rate in some instances. As the dataset we used was relatively small and had images from people of six discrete ages only, we could not divide the dataset into a larger number of groups with more people in each group. However, we think our investigation has advanced the research on the age effect on recognition by bringing in new parameters and perspectives from which the problem should be looked at. We hope there will be more research on the age effect using a larger sample with controlled age composition in the future.

Acknowledgements. Portions of the research in this paper use the FERET database [5,8] of facial images collected under the FERET program, sponsored by the DOD Counterdrug Technology Development Program Office.
References
1. Ho, W.H., Watters, P.: Ecological validity of face recognition research. In: The workshop of 2005 International Conference on Computational Intelligence and Security, pp. 212–217 (2005)
2. Phillips, P.J., Grother, P., Micheals, R.J., Blackburn, D.M., Tabassi, E., Bone, M.: Face recognition vendor test 2002 evaluation report. Technical report, FRVT (2003)
3. Given, G., Beveridge, J.R., Draper, B.A., Grother, P., Phillips, P.: How features of the human face affect recognition: a statistical comparison of three face recognition algorithms. In: Proceedings of the CVPR (2004)
4. Ricanek Jr., K., Tesafaye, T.: Morph: A longitudinal image database of normal adult age-progression. In: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (2006)
5. Phillips, P., Wechsler, H., Huang, J., Rauss, P.: The FERET database and evaluation procedure for face recognition algorithms. Image and Vision Computing Journal 16, 295–306 (1998)
6. The CSU face identification evaluation system, version 5.0. Technical report (Colorado State University)
7. Ho, W.H., Watters, P., Verity, D.: Robustness of the new owner-tester approach for face identification experiments. In: Proceedings of the IEEE Computer Society Workshop on Biometrics, in association with CVPR (2007)
8. Phillips, P., Moon, H., Rizvi, S., Rauss, P.J.: The FERET evaluation methodology for face recognition algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence 22, 1090–1104 (2000)
9. Howell, D.C.: Fundamental Statistics for the Behavioral Sciences. Duxbury Press, Boston, Massachusetts (1985)
10. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. CRC Press LLC, Boca Raton, USA (1998)
Lip Biometrics for Digit Recognition

Maycel Isaac Faraj and Josef Bigun

Halmstad University, School of Information Science, Computer and Electrical Engineering (IDE)
Halmstad University, Box 823, SE-301 18 Halmstad
{maycel.faraj, josef.bigun}@ide.hh.se

Abstract. This paper presents a speaker-independent audio-visual digit recognition system that utilizes speech and visual lip signals. The extracted visual features are based on line-motion estimation obtained from video sequences with low resolution (128 × 128 pixels) to increase the robustness of audio recognition. The core experiments investigate lip motion biometrics as a stand-alone modality as well as a merged modality in a speech recognition system. The system uses Support Vector Machines and shows favourable experimental results, with digit recognition rates of 83% to 100% on the XM2VTS database depending on the amount of available visual information.
1 Introduction
Combining visual features with automated speaker and speech recognition has been suggested to improve the recognition rate in acoustically noisy environments [1][2][3][4][5]. Parts of this work, dealing with digit recognition only, are reported in [6], which is related to [7][8]; the latter are concerned with person authentication. A merger of this work and [6] as a journal article is under preparation. The reports [9][10][11][12][13] suggest visual features based on the shape and intensity of the lip region, owing to the changes in the mouth shape, including the lips and tongue. The dynamic visual lip features carry significant phoneme-discrimination information embedded in motion information, which can be modelled by moving-line patterns, also known as normal image velocity [7][14], with no requirement for an iterative process to detect the mouth contour. We exploit direct feature fusion to obtain the audio-visual observation vectors by concatenating the audio and visual features. The fused feature sequences are then modelled with a Support Vector Machine (SVM) classifier for digit recognition.
2 Feature Extraction
2.1 Visual Features
Bigun et al. proposed a motion estimation technique based on the multidimensional structure tensor, which solves an eigenvalue problem [15] and allows, for example, the minimization involved in fitting a line or a plane to be carried out without the Fourier transform. However, this method can be excessive for applications that only need line-motion features.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 360–365, 2007. © Springer-Verlag Berlin Heidelberg 2007

We assume that the local neighbourhood in the lip image contains parallel lines or edges, which is a realistic assumption [8]. Lines translated with a certain velocity in the spatio-temporal image generate planes with a normal, which is estimated by using the total-least-square-error (TLS) [15]. We denote the unit normal of the plane as $\mathbf{k} = (k_x, k_y, k_t)^T$; the projection of $\mathbf{k}$ represents the direction vector of the line's motion. The velocity vector can be written as follows.

\[
\mathbf{v} = v\,\mathbf{a} = -\frac{k_t}{k_x^2 + k_y^2}\,(k_x, k_y)^T
= -\frac{1}{\left(\frac{k_x}{k_t}\right)^2 + \left(\frac{k_y}{k_t}\right)^2}\left(\frac{k_x}{k_t}, \frac{k_y}{k_t}\right)^T,
\tag{1}
\]
where $v$ is the absolute speed in the normal direction, $\mathbf{a}$ is the direction of the velocity and $\mathbf{k} = (k_x, k_y, k_t)^T$ is the normal. The normal velocity estimation problem becomes a problem of solving the tilts $\tan\gamma_1 = k_x/k_t$ and $\tan\gamma_2 = k_y/k_t$ of the motion plane in the $xt$ and $yt$ manifolds. The estimated velocity components can, by simple geometry, be written as follows.

\[
\frac{k_x}{k_t} = \tan\gamma_1 = \tan\left(\tfrac{1}{2}\arg(\tilde{u}_1)\right)
\;\Rightarrow\;
\tilde{v}_x = -\frac{\tan\gamma_1}{\tan^2\gamma_1 + \tan^2\gamma_2}
\tag{2}
\]

\[
\frac{k_y}{k_t} = \tan\gamma_2 = \tan\left(\tfrac{1}{2}\arg(\tilde{u}_2)\right)
\;\Rightarrow\;
\tilde{v}_y = -\frac{\tan\gamma_2}{\tan^2\gamma_1 + \tan^2\gamma_2}
\tag{3}
\]
where the arguments of $\tilde{u}_1$ and $\tilde{u}_2$ represent the TLS estimations of $\gamma_1$ and $\gamma_2$ in the local 2D manifolds $xt$ and $yt$ respectively, but in the double angle representation [16]. The tildes over $v_x$ and $v_y$ denote that these quantities are estimations of $v_x$ and $v_y$. We have dense 2D velocity vectors, $(v_x, v_y)^T$, in each mouth-region frame (128 × 128 pixels). The 2D velocity feature vectors $(v_x, v_y)^T$ at each pixel are reduced to 1D scalars with a sign, + or −, depending on which direction they move relative to their expected spatial directions 0°, 45°, −45°, marked with 3 different greyscale shades in 6 regions in Fig. 1.

\[
f(p, q) = \left\| (v_x(p, q), v_y(p, q)) \right\| \cdot \operatorname{sgn}\bigl(\angle (v_x(p, q), v_y(p, q))\bigr), \quad p, q = 0 \ldots 127.
\tag{4}
\]

Next, we want to quantize the estimated velocities from arbitrary real scalars to a more limited set of values. The quantized speeds are obtained from the data by applying a mean approximation as follows.
\[
g(l, k) = \sum_{p,q=0}^{N-1} f(Nl + q, Nk + p), \quad p, q = 0 \ldots (N - 1), \quad l, k = 0 \ldots (M - 1)
\tag{5}
\]
Fig. 1. Illustration of velocity estimation quantification and reduction
where $N$ and $M$ represent the window size of the boxes (Fig. 1) and the number of boxes, respectively. The lip-motion visual features are represented by 144-dimensional ($M \times M$) feature vectors, whereas the original dimension before reduction is 128 × 128 × 2 = 32768.
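The two ends of this pipeline, the normal velocity of Eq. (1) and the block reduction of Eq. (5), can be sketched in plain Python. The sketch is illustrative only: the text fixes the 128 × 128 frame and the 144 = M × M output, so M = 12 follows, but the window size N = 10 is our assumption (the largest N with N·M ≤ 128), and returning a zero velocity when k_x = k_y = 0 is our convention. The function names are ours, and the sign assignment of Eq. (4) is not reproduced.

```python
def normal_velocity(kx, ky, kt):
    """Normal velocity of Eq. (1) from the plane normal (kx, ky, kt)."""
    denom = kx * kx + ky * ky
    if denom == 0.0:
        # No spatial orientation: the normal velocity is undefined, return zero.
        return (0.0, 0.0)
    scale = -kt / denom
    return (scale * kx, scale * ky)

def block_sums(f, n, m):
    """Eq. (5): sum the signed speeds f over m x m boxes of n x n pixels."""
    return [
        [
            sum(f[n * l + q][n * k + p] for q in range(n) for p in range(n))
            for k in range(m)
        ]
        for l in range(m)
    ]

# A frame of unit signed speeds, reduced with the assumed N = 10 and M = 12.
frame = [[1.0] * 128 for _ in range(128)]
g = block_sums(frame, n=10, m=12)
feature_vector = [value for row in g for value in row]
print(len(feature_vector), feature_vector[0])
```

The formula in `normal_velocity` is invariant to the length of the normal, so an unnormalized TLS estimate can be passed directly.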
2.2 Acoustic Features
We use the Mel-Frequency Cepstral Coefficient (MFCC) to extract acoustical feature information [17]. The MFCC speech features were generated by the Hidden Markov Model Toolkit (HTK) [18] processing the audio-data stream. The MFCC feature vector dimension is 39, containing 12 cepstral coefficients with normalized log energy, 13 delta coefficients (velocity), and 13 delta-delta coefficients (acceleration).
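The composition of the 39-dimensional vector (13 static values, 13 delta and 13 delta-delta coefficients) can be illustrated with a small sketch. It uses the regression formula for delta coefficients given in the HTK book, with the first or last frame replicated at the edges; the window of 2 frames is an assumption, since the text does not state the setting used, and the function names are ours.

```python
def delta(frames, window=2):
    """Regression deltas: sum_th th*(c[t+th] - c[t-th]) / (2 * sum_th th^2)."""
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(th * th for th in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            num = 0.0
            for th in range(1, window + 1):
                ahead = frames[min(t + th, n_frames - 1)][d]  # replicate last frame
                behind = frames[max(t - th, 0)][d]            # replicate first frame
                num += th * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

def stack_39(static13):
    """Concatenate static, delta and delta-delta features frame by frame."""
    d1 = delta(static13)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static13, d1, d2)]

# Ten hypothetical frames of 13 static values (12 cepstra + log energy), a ramp in time.
static = [[float(t)] * 13 for t in range(10)]
features = stack_39(static)
print(len(features), len(features[0]))
```

For a linear ramp the interior deltas equal the slope and the delta-deltas vanish, which is a quick sanity check on the regression weights.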
3 Classification by Support Vector Machine
The Support Vector Machine (SVM) is a discrimination-based binary method that uses a hyperplane $w \cdot x + b = 0$ as a decision boundary between two classes, and extends to several classes by combining binary classifiers. For a linearly separable training dataset of labelled pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where $x_i \in \mathbb{R}^n$ and $y \in \{1, -1\}^l$, the following inequality holds for each observation (here, each feature vector).

\[
d_i (w^T x_i + b) \ge 1 - \xi_i \quad \text{for } i = 1, 2, \ldots, l, \qquad \xi_i > 0,
\tag{6}
\]

where $d_i$ is the label $y_i$ of sample $x_i$, which can be +1 or −1, and $w$ and $b$ are the weights and bias that describe the hyperplane. In our experiments we use an RBF kernel as the inner-product kernel function, along with the one-against-one multi-class approach; further details can be found in [6]. For all the tests we use the SVM toolkit [19].
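The RBF kernel and the one-against-one voting mentioned above can be sketched without any SVM library. The sketch is not the toolkit of [19]: the pairwise decision function below is a hypothetical stand-in that compares kernel similarity to one prototype per class, where a trained binary SVM would be used in practice, and the value of `gamma` is arbitrary.

```python
import math
from collections import Counter
from itertools import combinations

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def make_prototype_decider(prototypes):
    """Stand-in for a trained binary classifier: pick the more similar prototype."""
    def decide(a, b, x):
        return a if rbf_kernel(x, prototypes[a]) >= rbf_kernel(x, prototypes[b]) else b
    return decide

def one_vs_one_predict(x, classes, decide):
    """Each pair of classes casts one vote; the class with most votes wins."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[decide(a, b, x)] += 1
    return votes.most_common(1)[0][0]

digits = list(range(10))
prototypes = {d: (float(d),) for d in digits}   # hypothetical 1-D class centres
decide = make_prototype_decider(prototypes)
n_pairs = len(list(combinations(digits, 2)))
print(n_pairs, one_vs_one_predict((7.2,), digits, decide))
```

With ten digit classes the one-against-one scheme needs 45 binary classifiers, each trained on the samples of two digits only.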
Table 1. Digit-recognition rate of all digits using protocol 2 in one against one SVM
Word   Audio features   Visual features   Audio-Visual features
0      89%              70%               92%
1      90%              77%               100%
2      86%              60%               89%
3      90%              75%               96%
4      89%              55%               85%
5      90%              50%               83%
6      100%             90%               100%
7      93%              100%              100%
8      91%              54%               83%
9      90%              49%               85%
4 Experimental Results
4.1 XM2VTS Database
The experiments in this paper are conducted on the XM2VTS database [20], using the sentence “0 1 2 3 4 5 6 7 8 9” for all 4 sessions. It is important to note that the XM2VTS data is difficult to use as it is for digit recognition experiments, because neither the audio-speech nor the visual-lip data are segmented. For each speaker of the XM2VTS database, the utterance “0 1 2 3 4 5 6 7 8 9” was semi-automatically segmented into single-digit subsequences 0 to 9; further details can be found in [6]. For all our tests we used protocol 2 defined in [6], because XM2VTS is designed for person authentication rather than digit recognition. However, it is currently the largest audio-visual database available.

4.2 Digit Recognition
Digit recognition results for the systems based on only acoustic, only visual and merged audio-visual feature information are presented in Table 1. The best digit recognition rates, 100%, are obtained for 1, 6, and 7, whereas lower rates are obtained for 4, 5, 8, and 9. One notable cause of the variation in Table 1 is the lack of data (especially visual information) for certain digits. We could verify during the experiments that, when the words zero to nine are uttered in a sequence without silence between words, the amount of visual data is notably less for the words 4, 5, 8, and 9 than for the other digits in the XM2VTS database. Moreover, the amount of audio-visual speech for each digit depends on the manner and the speed of the speaker. The average digit recognition rate over all digits is ≈ 68% for the visual-only system and ≈ 90% for the audio-only system.
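The averages quoted above can be recomputed directly from Table 1. The short script below is only a convenience check on the tabulated percentages; the variable names are ours.

```python
# Rows of Table 1: digit -> (audio %, visual %, audio-visual %).
rates = {
    0: (89, 70, 92), 1: (90, 77, 100), 2: (86, 60, 89), 3: (90, 75, 96),
    4: (89, 55, 85), 5: (90, 50, 83), 6: (100, 90, 100), 7: (93, 100, 100),
    8: (91, 54, 83), 9: (90, 49, 85),
}

def column_mean(index):
    """Mean recognition rate of one column of Table 1."""
    return sum(row[index] for row in rates.values()) / len(rates)

audio_mean = column_mean(0)
visual_mean = column_mean(1)
fused_mean = column_mean(2)

# Digits for which the fused system scores below the audio-only system.
fused_below_audio = [d for d, (audio, _, fused) in rates.items() if fused < audio]

print(audio_mean, visual_mean, fused_mean, fused_below_audio)
```

The digits for which fusion falls below audio alone are exactly the four digits singled out in the text as having little visual data.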
5 Conclusion and Discussion
We present a digit recognition system that exploits lip motion information in dynamic image sequences of a large number of speakers, without iterative processing or any assumption of successful lip-contour tracking. The presented experimental results confirm the importance of adding visual lip movement information in digit-recognition systems. Improvements in digit recognition obtained with our motion features, giving higher recognition performance than audio alone, are verified. An important point is that the utterances of 4, 5, 8, and 9 contain less audio and visual-lip information. The poor recognition performance for these digits, along with manual inspection, indicates that the XM2VTS database does not contain sufficient amounts of visual information on lip movements.
References
1. Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.: Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91(9), 1306–1326 (2003)
2. Brunelli, K.R., Falavigna, D.: Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(10), 955–966 (1995)
3. Chibelushi, C., Deravi, F., Mason, J.: A review of speech-based bimodal recognition. IEEE Transactions on Multimedia 4(1), 23–37 (2002)
4. Duc, B., Fischer, S., Bigun, J.: Face authentication with sparse grid gabor information. and Signal Processing 4(21), 3053–3056 (1997)
5. Tang, X., Li, X.: Video based face recognition using multiple classifiers. In: FGR 2004. Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 345–349. IEEE Computer Society Press, Los Alamitos (2004)
6. Faraj, M.I., Bigun, J.: Speaker and speech recognition by audio-visual lip biometrics. In: The 2nd International Conference on Biometrics, Seoul Korea, 2007 (2007)
7. Faraj, M.I., Bigun, J.: Person verification by lip-motion. In: 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), pp. 37–45 (2006)
8. Faraj, M.I., Bigun, J.: Audio-visual person authentication using lip-motion from orientation maps. Article accepted for publication in Pattern Recognition Letters – 2007 (2007)
9. Luettin, J., Maitre, G.: Evaluation protocol for the extended m2vts database xm2vtsdb 1998. In: IDIAP Communication 98-054, Technical report R R-21, number = IDIAP - (1998)
10. Dieckmann, U., Plankensteiner, P., Wagner, T.: Acoustic-labial speaker verification. In: Bigün, J., Borgefors, G., Chollet, G. (eds.) AVBPA 1997. LNCS, vol. 1206, pp. 301–310. Springer, Heidelberg (1997)
11. Jourlin, P., Luettin, J., Genoud, D., Wassner, H.: Acoustic-labial speaker verification. In: Bigün, J., Borgefors, G., Chollet, G. (eds.) AVBPA 1997. LNCS, vol. 1206, pp. 319–326. Springer, Heidelberg (1997)
12. Chen, T.: Audiovisual speech processing. IEEE Signal Processing Magazine 18(1), 9–21 (2001)
13. Liang, L., Zhao, X.L.Y., Pi, X., Nefian, A.: Speaker independent audio-visual continuous speech recognition. In: IEEE International Conference on Multimedia and Expo, 2002. ICME ’02. Proceedings. 2002, vol. 2, pp. 26–29 (2002)
14. Kollreider, K., Fronthaler, H., Bigun, J.: Evaluating liveness by face images and the structure tensor. In: AutoID 2005. Fourth Workshop on Automatic Identification Advanced Technologies, pp. 75–80. IEEE Computer Society Press, Los Alamitos (2005)
15. Bigun, J., Granlund, G., Wiklund, J.: Multidimensional orientation estimation with applications to texture analysis of optical flow. IEEE-Trans Pattern Analysis and Machine Intelligence 13(8), 775–790 (1991)
16. Granlund, G.H.: In search of a general picture processing operator. Computer Graphics and Image Processing 8(2), 155–173 (1978)
17. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980)
18. Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK book (for HTK version 3.0) (2000) http://htk.eng.cam.ac.uk/docs/docs.shtml
19. Chang, C.C., Lin, C.J.: Libsvm–a library for support vector machines (2001) software available at www.csie.ntu.edu.tw/~cjlin/libsvm
20. Messer, K., Matas, J., Kittler, J., Luettin, J.: Xm2vtsdb: The extended m2vts database. In: Second International Conference of Audio and Video-based Biometric Person Authentication, ICSLP’96, pp. 72–77 (1999)
An Embedded Fingerprint Authentication System Integrated with a Hardware-Based Truly Random Number Generator

Murat Erat¹,², Kenan Danışman², Salih Ergün¹, Alper Kanak¹, and Mehmet Kayaoglu¹

¹ TÜBİTAK-National Research Institute of Electronics and Cryptology, Kocaeli, Turkiye
² Dept. of Electronics Engineering, Erciyes University, Kayseri, Turkiye
{erat,salih,alperkanak,mehmet.kayaoglu}@uekae.tubitak.gov.tr, [email protected]

Abstract. Recent advances in information security require randomly selected strong keys. Most of these keys are generated by software-based random number generators. However, implementing a Truly Random Number Generator (TRNG) without using a hardware-supported platform is not reliable. In this paper, a fingerprint authentication system is presented that uses a hardware-based TRNG to produce a private key, which encrypts the fingerprint template of a person. The designed hardware can easily be mounted on a standard or embedded PC via its PCI interface to produce random number keys. The random numbers forming the private key are guaranteed to be true because they pass a two-level randomness test, evaluated first on the FPGA and then on the PC by applying the full NIST test suite. The whole system implements an AES-based encryption scheme to safely store the person's secret on a smart or glossary card. The main contribution of the work is the use of new-generation hardware-based TRNGs to enhance the security of a fingerprint authentication system.
1 Introduction

There is a growing need for information secrecy as a natural result of the emerging demand for electronic official and financial transactions. In this respect, random number generators (RNGs), which are indispensable components of cryptographic systems, have begun to be merged into typical digital communication devices. RNGs are needed for the generation of public/private key-pairs for asymmetric algorithms and of keys for symmetric and hybrid cryptosystems.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 366–373, 2007. © Springer-Verlag Berlin Heidelberg 2007

Generally, two types of RNGs exist: Truly Random Number Generators (TRNGs) and Pseudo-Random Number Generators (PRNGs). TRNGs take advantage of nondeterministic sources (entropy sources) which truly produce random numbers. TRNG output may be either directly used as a random number sequence or fed into a PRNG. PRNGs use specific algorithms to generate bits in a deterministic fashion. In order to appear to be generated by a TRNG, pseudo-random sequences must be seeded from a shorter truly random sequence [4], and no correlation between the seed and any value generated from that seed should be evident. In addition, the production of high-quality Truly Random Numbers (TRNs) may be time consuming, making such a process undesirable when a large quantity of random numbers is needed. Hence, for producing large quantities of random numbers, PRNGs may be preferable in spite of their predictability weaknesses.

Even when the RNG design is known, making a useful prediction about the output should not be possible. To fulfill the requirements for the secrecy of one-time pads, key generation and any other cryptographic application, a TRNG must satisfy the following properties: the output bit stream of the TRNG must pass all the statistical tests of randomness; the random bits must be forward and backward unpredictable; and the same output bit stream of the TRNG must not be reproducible [5]. The best way to generate TRNs is to exploit the natural randomness of the real world by finding random events that occur regularly [5]. There exist fundamentally four different techniques for RNGs, considering various random events: amplification of a noise source [6,7], jittered oscillator sampling [2,3], discrete-time chaotic maps [8,9] and continuous-time chaotic oscillators [10]. Although the use of discrete-time chaotic maps in the realization of RNGs has been well known for some time, it has only recently been shown that continuous-time chaotic oscillators can be used to realize TRNGs as well. Since TRNs might be used to generate digital signatures, integrating a biometric-based person authentication system with cryptographic schemes that use TRN-based keys is a promising field [14].
In this study, a fingerprint verification system in which the fingerprint feature templates are encrypted with private keys is presented. Note that the private keys are generated by the TRNG. A PCI interface for uploading the generated bit sequences makes the proposed design well suited to computer-based cryptographic applications. The TRNG presented in this paper uses a dual oscillator architecture combined with the thermal noise amplification method. The effective throughput of the hardware-implemented TRNG is 130 Kbps, which seems very promising for real-time applications. The main contribution of this study is the integration of the cost-effective Wedge and Ring (W&R) based fingerprint recognition algorithm [16] with a secure feature storage scheme. The system guarantees the generation of reliable private keys composed of TRNs. Another contribution is that the FPGA-based system can easily be mounted on any PC or embedded PC.
2 Hardware Implemented Truly Random Number Generator
Since software-based methods can produce only pseudo-randomness, not true randomness, a hardware-implemented TRNG is used. The design combines a dual oscillator architecture with the thermal noise amplification method in order to increase the output throughput and the statistical quality of the generated bit sequences. The proposed hardware is presented in Fig. 1. The thermal noise generation process is multiplicative and results in the production of a random series of noise spikes. The op-amp in Fig. 1 amplifies the noise voltage over the resistor RSrc by a factor of 500. The amplifier circuit passes signals from 20 Hz to 500 kHz, and the frequency of a slower clock is modulated with the output signal of the amplifier. Note that the noise in a resistor has a white spectrum. A 74HCT4046A voltage-controlled oscillator (VCO) is used to modulate the slower clock frequency with the amplified noise voltage. Then, on the rising edge of the noise-modulated slower clock, the output of a fast clock is sampled using a D flip-flop inside the FPGA. The center frequency of the VCO determines the center frequency of the slower clock and can be adjusted up to 17 MHz for the 74HCT4046A. Drift between the two oscillators makes random bit generation more robust. Because of the nonlinear aliasing phenomenon associated with sampling, the dual oscillator architecture achieves increased output throughput and higher statistical quality [2]. In [1], it is reported that, in order to obtain an uncorrelated random bit stream, the period of the modulated slower oscillator should have a standard deviation much greater than the fast oscillator period. In order to remove bias from the output bit sequence, the fast oscillator should have a balanced duty cycle. To achieve this, the fast oscillator is implemented by dividing a low-jitter 192 MHz crystal oscillator by 4 inside the FPGA. This yields a 48 MHz fast oscillator with an approximately 50% duty cycle. Provided that a balanced duty cycle can be guaranteed, the fast oscillator frequency should be raised. The slow and fast oscillators used in [3] and [1] have center frequency ratios on the order of 1 : 100.

Fig. 1. Hardware Implemented TRNG
In our design, we experimentally obtain successful results from the full NIST test suite when the slower clock frequency is adjusted up to 520 kHz, which determines the throughput data rate. The 48 MHz fast oscillator is sampled on the rising edge of the slower clock using a D flip-flop inside the FPGA. The high jitter level achieved by the noise-modulated oscillator features a standard deviation much greater than the fast oscillator period, so this scheme outputs uncorrelated random bit streams. However, the binary sequence thus obtained may still be biased. In order to remove the unknown bias in this sequence, the well-known Von Neumann de-skewing technique [11] is employed. This technique converts the bit pair 01 into the output 0 and the pair 10 into the output 1, and discards the pairs 00 and 11. Von Neumann processing is implemented in the FPGA. Because it generates approximately 1 output bit from every 4 input bits, this process decreases the frequency of the random signal to 130 kHz.
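The de-skewing rule above can be sketched in software as follows. This is only an illustration of the pairing rule, not the FPGA implementation; the function name `von_neumann_deskew` and the sample bit list are hypothetical. The rate estimate assumes independent, roughly unbiased input bits, in which case half of the pairs are discarded.

```python
def von_neumann_deskew(bits):
    """Map non-overlapping pairs 01 -> 0 and 10 -> 1; discard 00 and 11."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            # Pair (0, 1) yields 0 and pair (1, 0) yields 1,
            # i.e. the output equals the first bit of the pair.
            out.append(first)
    return out


# Hypothetical raw samples: pairs are 01, 10, 00, 11, 10.
raw = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
print(von_neumann_deskew(raw))  # [0, 1, 1]

# With unbiased input, about 1 output bit per 4 input bits survives,
# so a 520 kHz raw stream gives roughly 130 kHz after de-skewing.
raw_rate_khz = 520
expected_rate_khz = raw_rate_khz / 4
print(expected_rate_khz)  # 130.0
```

For biased input the surviving fraction is lower than one quarter, so 130 kHz is best read as an upper estimate of the post-processed rate.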
The candidate random numbers are evaluated by two mechanisms, one implemented in hardware and one in software. The hardware evaluation mechanism is enabled by the software mechanism to start counting the bit streams described in the five basic tests (frequency (mono-bit), poker, runs, long-run and serial tests), which cover the security requirements for cryptographic modules and are recommended statistical tests for random number generators. Each of the five tests is performed by the FPGA on 100,000 consecutive bits of output from the hardware random number generator. When the test program is run, the software starts the randomness tests using the FPGA. During the tests, the software reads and stores the candidate random values from the FPGA. When the tests (Von Neumann algorithm and five statistical tests) are completed, the test results are read from their addresses in the FPGA and evaluated. If the results of all the tests are positive, the stored value is transferred to the “Candidate Random Number Pool” in memory; failing candidates are not stored. If random numbers are required for cryptographic, or more generally security, purposes, random number generation shall not be compromised by fewer than three independent failures, no fewer than two of which must be physically independent. To satisfy this condition, a further test mechanism is added in which the full NIST random number test suite [12] is run in software that is physically independent of the FPGA. Random numbers stored in the “Candidate Random Number Pool” are subjected to the full NIST test suite in software, and those that pass are transferred to the final pool. When the amount of random numbers in the pool falls below 125 Kbytes, the tests are restarted and the data is resampled until the amount of tested values reaches 1250 Kbytes. If the test results are positive, the pool is topped up to 1250 Kbytes using the tested values.
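As an illustration of the simplest of these checks, the frequency (mono-bit) test can be sketched in the p-value form used by the NIST suite [12]. This is an assumption about the formulation: the hardware tests described above count bits inside the FPGA and may use fixed acceptance intervals instead. The function names and the significance level `alpha = 0.01` are illustrative choices.

```python
import math


def monobit_p_value(bits):
    """Frequency (mono-bit) test: p-value for the balance of ones and zeros."""
    n = len(bits)
    # Map 1 -> +1 and 0 -> -1 and sum; a balanced stream sums to about 0.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def passes_monobit(bits, alpha=0.01):
    """Accept the block if the p-value is not below the significance level."""
    return monobit_p_value(bits) >= alpha


balanced = [0, 1] * 50   # 50 ones, 50 zeros
stuck = [1] * 100        # a stuck-at-one source
print(passes_monobit(balanced), passes_monobit(stuck))
```

A candidate block would only enter the pool if this and all the other tests accept it; the sketch covers one test only.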
3 Wedge and Ring (W&R) Based Fingerprint Features
The traditional fingerprint recognition systems usually concentrate on the variation in structural properties of fingerprints, such as ridges, valleys and minutiae [15]. In such methods, matching is done by comparing variable-sized minutiae lists. However, minutiae-based methods require sophisticated enhancement and pre-processing stages to detect structural properties, resulting in a time-consuming scheme. Moreover, the variable size of the minutiae-based representation makes it unsuitable for hardware-oriented applications. These problems have steered researchers toward spectral methods. In this study, a robust FFT-based algorithm allowing good recognition of low-quality fingerprints with inexpensive hardware is implemented. In the proposed scheme, features are extracted from the enhanced images using a W&R classifier [16]. In this method, a reference point is located on the image of the fingerprint and a dart-board pattern of wedges and rings is overlaid on the image, with the center of the board at the reference point. The idea is simple: since fingerprints are broadly composed of periodic structures, it is natural to examine them in the frequency domain. In order to get good results, we first apply a novel enhancement technique based on Short Time Fourier Transform (STFT) analysis and contextual/non-stationary filtering in the Fourier domain [13]. This method is advantageous because all intrinsic images (ridge orientation, frequency and the region mask) are estimated simultaneously from the STFT analysis. In order to extract features, the fingerprint image is first transformed to the spatial frequency domain via the two-dimensional Fourier transform, in which the image f(x, y) and its spectrum F(u, v) are related by Eq. 1:

$$ f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v) \exp\left\{ j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right) \right\} \tag{1} $$
for $x = 0, 1, 2, \ldots, M-1$ and $y = 0, 1, 2, \ldots, N-1$. In the Fourier domain, the information within a fingerprint is mostly contained in the radial bands of frequencies whose angular extent follows from the predominant ridge orientation in the print. Distinguishing characteristics of a fingerprint appear as small deviations from the dominant spatial frequency of the ridges. This means that the spatial frequency information for each individual is located in an annular region, while low spatial frequency information at the center corresponds to background intensities. Since each region in the fingerprint contributes to the whole Fourier domain, the magnitude spectrum is invariant to translation of the finger. In contrast to the original work in [16], instead of using the power spectrum obtained by squaring the magnitude of the transform, the 4th power of the magnitude is used to form more distinctive features. The ring features are formed by summing the 4th-power spectral values in circular rings of varying radius r and constant thickness Δr. The feature of the ith ring out of n rings is given in Eq. 2:

$$ \lambda_i(r) = \sum_{u} \sum_{v} |F(u, v)|^4 \tag{2} $$

where $r \le (u^2 + v^2)^{1/2} \le r + \Delta r$ and $0 \le r \le N/2$, and the maximum image dimension is N. Similarly, wedge segments with varying orientation Φ and segment width ΔΦ are computed as:

$$ \beta_j(\Phi) = \sum_{u} \sum_{v} |F(u, v)|^4 \tag{3} $$

where $\Phi \le \arctan(v/u) \le \Phi + \Delta\Phi$. Here j refers to the jth wedge of m wedges in total. Since the power spectra of real functions are symmetrical, these computations extend over only a half-plane of each spectrum. The resulting feature vector then becomes $f = [\lambda_1, \lambda_2, \ldots, \lambda_n, \beta_1, \beta_2, \ldots, \beta_m]$.

Fig. 2. Wedges and Rings
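The ring and wedge sums of Eqs. 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embedded implementation: it uses a naive DFT instead of an FFT so that it stays self-contained, takes the ring thickness as $r_{max}/n$ with $r_{max}$ half the smaller image dimension, restricts both sums to the disc $r < r_{max}$ in the upper half-plane, and skips the enhancement stage. The names `dft2` and `wedge_ring_features` are hypothetical.

```python
import cmath
import math


def dft2(img):
    """Naive 2-D DFT of a small image given as a list of rows (for illustration only)."""
    M, N = len(img), len(img[0])
    F = [[0j] * N for _ in range(M)]
    for u in range(M):
        for v in range(N):
            acc = 0j
            for x in range(M):
                for y in range(N):
                    acc += img[x][y] * cmath.exp(-2j * math.pi * (u * x / M + v * y / N))
            F[u][v] = acc
    return F


def wedge_ring_features(img, n_rings=12, n_wedges=10):
    """Return [ring sums..., wedge sums...] of |F|^4 over a half-plane."""
    M, N = len(img), len(img[0])
    F = dft2(img)
    r_max = min(M, N) / 2
    rings = [0.0] * n_rings
    wedges = [0.0] * n_wedges
    for u in range(M):
        fu = u if u <= M // 2 else u - M          # signed frequency index
        for v in range(N):
            fv = v if v <= N // 2 else v - N
            if not (fv > 0 or (fv == 0 and fu > 0)):
                continue                            # skip DC and the mirrored half-plane
            r = math.hypot(fu, fv)
            if r >= r_max:
                continue                            # assumption: keep the disc r < r_max
            p4 = abs(F[u][v]) ** 4
            rings[min(int(r / r_max * n_rings), n_rings - 1)] += p4
            phi = math.atan2(fv, fu)                # lies in [0, pi) on this half-plane
            wedges[min(int(phi / math.pi * n_wedges), n_wedges - 1)] += p4
    return rings + wedges


# Hypothetical 8 x 8 "ridge" pattern and a circularly shifted copy of it.
img = [[1.0 + math.cos(2 * math.pi * 2 * y / 8) + 0.5 * math.cos(2 * math.pi * (x + y) / 8)
        for y in range(8)] for x in range(8)]
shifted = [row[3:] + row[:3] for row in img[2:] + img[:2]]
features = wedge_ring_features(img)
print(len(features))  # 22
```

Because only the magnitude of the spectrum is used, the feature vector of the circularly shifted copy agrees with that of the original, which illustrates the translation invariance noted above.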
4 The Proposed Fingerprint Authentication Scheme
The secure authentication scheme is built on the W&R fingerprint features. In fact, this template can easily be adapted to any biometric feature (face, iris, retina, etc.). Moreover, instead of the W&R method, more sophisticated features might be applied if the system resources are sufficient to handle such complex algorithms. Nevertheless, using W&R features in a limited closed set is a good starting point to show the integration of popular concepts such as biometrics, cryptography and random numbers. The whole system requires a personal identification number p_id, which might be stored on a token, smart card or a glossary card, and a private key composed of TRNs generated by our TRNG. Using only a password is not recommended because passwords are often forgotten or easily guessed. The authentication system comprises Enrollment and Verification phases. The enrollment phase, presented in Fig. 3(a), is the registration part, where the user first introduces himself to the system by inserting his smartcard. Note that the smartcard includes both p_id and the private key key_{p_id} of the individual. After the fingerprint image I(x, y) of the person is captured by a sensor, f is computed. Then f is encrypted with key_{p_id}, which is composed of the randomly generated numbers. Finally, the encrypted feature E{f} and the private key key_{p_id} are stored in a database with the corresponding p_id. Here, p_id is the access PIN of the individual, which is also used as his index in the database.
Fig. 3. Enrollment (a) and Verification (b) Phases of the Proposed System
In the verification phase, presented in Fig. 3(b), a query fingerprint image I′(x, y) is captured and its feature vector f′ is computed. Concurrently, the corresponding encrypted feature E{f} is selected using the given p_id, which serves as an index into the fingerprint template database. The encrypted feature is then decrypted, D{E{f}} = f, to recover the stored feature. Finally, the decision mechanism compares f′ with f. Verification is based on the Euclidean Minimum Distance Classifier (EMD), in which recognition is done by computing the Euclidean distance between the features f′ and f. If the verification succeeds, key_{p_id} on the smartcard is replaced with a new private key from the TRNG to obtain full security.
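The decision step of the verification phase can be sketched as a distance test followed by key renewal. Everything named here is an assumption for illustration: the acceptance `threshold` is a hypothetical tuning parameter, the feature vectors are toy values, and `secrets.token_bytes` (the operating system's generator) merely stands in for the hardware TRNG, which it is not. Decryption of the stored template is omitted.

```python
import math
import secrets


def euclidean_distance(f_query, f_stored):
    """Euclidean distance between two feature vectors of equal length."""
    if len(f_query) != len(f_stored):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_query, f_stored)))


def verify(f_query, f_stored, threshold):
    """Accept the claim if the query is close enough to the stored template."""
    return euclidean_distance(f_query, f_stored) <= threshold


def renew_key(n_bytes=16):
    """Stand-in for drawing a fresh 128-bit key from the TRNG."""
    return secrets.token_bytes(n_bytes)


stored = [0.20, 0.35, 0.10]    # hypothetical decrypted template f
query = [0.22, 0.33, 0.11]     # hypothetical query features f'
if verify(query, stored, threshold=0.05):
    new_key = renew_key()      # the key on the card would be replaced here
    print("accepted, key renewed:", len(new_key), "bytes")
else:
    print("rejected")
```

In a deployed system the threshold would be chosen from the genuine and impostor score distributions, for example at the operating point that gives the equal error rate.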
To measure the performance objectively, we run the matching algorithm on images acquired by UPEK’s TouchStrip TCS3 sensor. This is a silicon strip sensor suited to portable devices such as notebook PCs and flash drives. Users simply swipe their finger over the sensor for reliable authentication and protection of their digital and physical assets. The database consists of 440 images of 124 × 180 pixels acquired from 88 distinct fingers (five instances per finger). In order to obtain the performance characteristics, genuine and impostor comparisons are evaluated. For the genuine comparison, each instance of a finger is compared with the rest of the instances of that finger, resulting in 5 × (5 − 1)/2 = 10 tests per finger. For the impostor comparison, the first instance of each finger is compared against the first instance of all other fingers, resulting in a total of 88 × (88 − 1)/2 = 3828 tests. Note that the accuracy of the system depends on the number of wedges and rings used during the extraction of f. Since the proposed embedded system involves time-consuming encryption and preprocessing stages, the numbers of rings and wedges are experimentally selected to be 12 and 10, respectively. Moreover, the arithmetic mean of several feature vectors obtained from the same finger might be computed at the training stage to get better results for each person. Currently, an equal error rate of 8.12% is obtained on the set of 440 enhanced fingerprints, but a more accurate system can be implemented by training each person with more fingerprints. The average duration of the verification phase is less than 1 second on this set. For the encryption back-end, the Advanced Encryption Standard (AES) is used, but the system is modular enough to replace it with another encryption standard. AES has a block size of 128 bits and uses keys of at least 128 bits. The fast performance and high security of AES make it attractive for our system. AES offers markedly higher security margins: a larger block size, potentially longer keys, and (as of 2005) freedom from cryptanalytic attacks.
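The comparison counts quoted in the evaluation can be checked with a few lines of arithmetic. The variable names are illustrative, and the total number of genuine tests (880) is derived here from the per-finger count rather than stated in the text.

```python
import math

fingers = 88
instances_per_finger = 440 // fingers                       # 5 images per finger

genuine_per_finger = math.comb(instances_per_finger, 2)     # unordered pairs of one finger
genuine_total = fingers * genuine_per_finger
impostor_total = math.comb(fingers, 2)                      # first instances, all finger pairs

print(genuine_per_finger, genuine_total, impostor_total)    # 10 880 3828
```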
5 Conclusions
This study presents a fingerprint authentication system in which the encrypted fingerprint templates are safely stored in a database with an index number that might be loaded on any access device (smart or glossary card, token, etc.). The main contribution of this paper is that it reports a practical application of a secure embedded biometric authentication system that integrates new-generation hardware-implemented TRNGs with a fast fingerprint recognition algorithm. Security is enhanced first by using a TRNG instead of a less secure PRNG; second, by dynamically changing the private key at each successful login; and third, by using the fingerprint of an individual. The resulting system can easily be mounted on any PC or embedded PC via its PCI interface to produce a truly random key. Unless the attacker learns the private key of the individual, he cannot recover the encrypted biometric template of the person even if he seizes the whole database. Security is further enhanced by the replacement of private keys at every successful authentication. In this study, W&R-based fingerprint features are used because of their lower computational complexity and better performance on low-quality prints. As further work, various fingerprint features, different types of biometrics or a fusion of biometric modalities might be used. Additionally, the AES-based encryption back-end of the system might be replaced by a more powerful scheme such as an elliptic curve cryptosystem.
References
1. Bucci, M., Germani, L., Luzzi, R., Trifiletti, A., Varanonuovo, M.: A High Speed Oscillator-based Truly Random Number Source for Cryptographic Applications on a SmartCard IC. IEEE Trans. Comput. 52, 403–409 (2003)
2. Petrie, C.S., Connelly, J.A.: A Noise-Based IC Random Number Generator for Applications in Cryptography. IEEE Trans. Circuits & Systems I 47(5), 615–621 (2000)
3. Jun, B., Kocher, P.: The Intel Random Number Generator. Cryptography Research, Inc. white paper prepared for Intel Corp. (1999) http://www.cryptography.com/resources/whitepapers/IntelRNG.pdf
4. Menezes, A., Oorschot, P.V., Vanstone, S.: Handbook of Applied Cryptography. CRC Press, Boca Raton, USA (1996)
5. Schneier, B.: Applied Cryptography, 2nd edn. John Wiley & Sons Ltd, West Sussex, England (1996)
6. Holman, W.T., Connelly, J.A., Dowlatabadi, A.B.: An Integrated Analog-Digital Random Noise Source. IEEE Trans. Circuits & Systems I 44(6), 521–528 (1997)
7. Bagini, V., Bucci, M.: A Design of Reliable True Random Number Generator for Cryptographic Applications. In: Proc. of CHES, pp. 204–218 (1999)
8. Stojanovski, T., Kocarev, L.: Chaos-Based Random Number Generators-Part I: Analysis. IEEE Trans. Circuits & Systems I 48(3), 281–288 (2001)
9. Delgado-Restituto, M., Medeiro, F., Rodriguez-Vazquez, A.: Nonlinear Switched-current CMOS IC for Random Signal Generation. Electronics Letters 29(25), 2190–2191 (1993)
10. Yalcin, M.E., Suykens, J.A.K., Vandewalle, J.: True Random Bit Generation from a Double Scroll Attractor. IEEE Trans. on Circuits & Systems I: Fundamental Theory and Applications 51(7), 1395–1404 (2004)
11. Von Neumann, J.: Various Techniques Used in Connection With Random Digits. Notes by Forsythe, G.E. National Bureau of Standards, Applied Math Series, vol. 12, pp. 36–38 (1951)
12. National Institute of Standard and Technology: A Statistical Test Suite for Random and Pseudo Random Number Generators for Cryptographic Applications. NIST 800-22 (2001) http://csrc.nist.gov/rng/SP800-22b.pdf
13. Chikkerur, S., Cartwright, A.N., Govindaraju, V.: Fingerprint Enhancement Using STFT Analysis. Jour. Pattern Recogn. 40(1), 198–211 (2007)
14. Erat, M., Danisman, K., Ergun, S., Kanak, A.: A Hardware-Implemented Truly Random Key Generator for Secure Biometric Authentication Systems. MRCS, pp. 128–135 (2006)
15. Maio, D., Maltoni, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, Heidelberg (2003)
16. Willis, A.J., Myers, L.: A Cost-effective Fingerprint Recognition System for Use with Low-quality Prints and Damaged Fingertips. Jour. Pattern Recogn. 34, 255–270 (2001)
A New Manifold Representation for Visual Speech Recognition
Dahai Yu, Ovidiu Ghita, Alistair Sutherland, and Paul F. Whelan
School of Computing & Electronic Engineering, Vision Systems Group, Dublin City University, Dublin 9, Ireland
Abstract. In this paper, we propose a new manifold representation that can be applied to visual speech recognition. The real-time input video data is compressed using Principal Component Analysis (PCA), and the low-dimensional points calculated for each frame define the manifolds. Since the number of frames that form the video sequence depends on the word complexity, in order to use these manifolds for visual speech classification they must be re-sampled into a fixed number of keypoints that are used as input for classification. In this paper two classification schemes are evaluated, namely the k Nearest Neighbour (kNN) algorithm used in conjunction with the two-stage PCA, and the Hidden-Markov-Model (HMM) classifier. The classification results for a group of English words indicate that the proposed approach is able to produce accurate classification results.
Keywords: Visual speech recognition, PCA manifolds, spline interpolation, k-Nearest Neighbour, Hidden Markov Model.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 374–382, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Introduction
In recent years, visual speech recognition has become an active research topic and plays an essential role in the development of many multimedia systems such as audiovisual speech recognition (AVSR) [4], mobile phone applications and sign language recognition [10]. The inclusion of lip visual features as additional information in the development of audio or hand recognition algorithms has received interest from the computer vision community because this information is robust to acoustic noise. The aim of this paper is to detail the development of a visual speech recognition system that achieves speech recognition using only the visual features extracted from the input video sequence. A review of the computer vision literature indicates that several approaches have been proposed to address visual speech recognition based on features extracted from the lip contour. In 1995, Luettin et al. [6] applied Active Shape Models to identify the lips and extract the features used for visual speech recognition. In the same year, Bregler and Omohundro [18] proposed a different technique in which the lip motions are encoded using nonlinear manifolds that are used to identify standard English phonemes. Later, Richard Harvey [17] proposed a new approach for speech recognition where the
central part is a morphological transform called the sieve, which is applied to calculate simple one-dimensional (1D) and two-dimensional (2D) measurements that are able to sample the lip shapes. The statistics calculated from these 1D and 2D measurements are concatenated into a feature vector that is used to train a standard HMM classifier. In 2002, Gordan et al. [12] proposed applying a network of support vector machine (SVM) classifiers to visual speech recognition. This work was further advanced by Foo and Lian [9] and Dong et al. [16], where adaptive boosting and HMM classifiers were applied to recognize visual speech elements. More recently, Yau et al. [14] proposed the use of image moments and multi-resolution wavelet images for visual speech recognition. In their approach, the input video data is represented by the motion history image, which is decomposed by applying the discrete stationary wavelet transform. From this short literature review we can conclude that most of the work has focused on the robust identification of small independent speech elements (called visemes [9]), while word recognition is viewed as a simple combination of standard visemes. In visual speech processing, a viseme (a mouth shape or a short sequence of mouth dynamics required to uniquely generate a phoneme or a group of phonemes in the visual domain) is regarded as the smallest unit that can be identified using visual information from the input video data. Although words can theoretically be formed from a combination of standard visemes, in practice, due to various pronunciation styles, similar visemes can be associated with different visual signatures. In addition, viseme identification within words is problematic since the transitions between consecutive visemes are not always easy to identify.
In order to alleviate these problems, in this paper we formulate visual speech recognition as the process of recognizing individual words based on a manifold representation. While this approach is appealing since it provides a generic framework to recognise words without resorting to viseme identification, a few important issues need to be addressed. The first is to evaluate the discriminative power offered by the manifold representation, and the second is to design the classification scheme that returns optimal results. These issues are addressed in detail in the following sections. This paper is organized as follows. Section 2 presents an overview of the proposed system, while Section 3 describes the methodology applied to obtain the manifolds from input data. In Section 4 two classification schemes are detailed, and their performance is analysed in Section 5. Concluding remarks and a discussion of future work are provided in Section 6.
2 System Overview
The developed system for visual speech recognition consists of three main steps. In the first step, the lips are extracted from the input video data. To achieve this, we calculate the pseudo-hue [3] from the RGB components, and the lips are segmented by applying a histogram-based thresholding scheme (see Fig. 1). The aim of the second step is to generate the Expectation Maximization PCA (EM-PCA) manifolds and to perform manifold interpolation and re-sampling. The role of the third step is to classify the manifolds calculated for the input speech sequence into a number of words that are contained in a database. For this implementation two classification schemes are evaluated, namely the second-stage EM-PCA used in conjunction with a kNN classifier, and the HMM classifier. The block diagram of the proposed visual speech recognition system is depicted in Fig. 2.

Fig. 1. Lip detection algorithm. (a) RGB image. (b) Pseudo-hue image. (c) Image resulting after the application of the histogram-based thresholding. (d) Lips region: pseudo-hue. (e) Lips region: grayscale.

Fig. 2. Block diagram of the proposed visual speech recognition system: Step (1) Lip Detection from the pseudo-hue component. Step (2) Features Extraction. Step (3) Classification.
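The first step can be sketched as below. The text cites [3] for the pseudo-hue but does not give the formula, so the commonly used definition R / (R + G) is assumed here; likewise the threshold is passed in as a plain number because the histogram-based rule for choosing it is not described. The function names and pixel values are hypothetical.

```python
def pseudo_hue(r, g):
    """Pseudo-hue of one pixel, assuming the common definition R / (R + G)."""
    total = r + g
    return r / total if total > 0 else 0.0


def lip_mask(rgb_image, threshold):
    """Mark pixels whose pseudo-hue exceeds the threshold as lip candidates."""
    return [[pseudo_hue(r, g) > threshold for (r, g, b) in row] for row in rgb_image]


# Hypothetical 1 x 2 image: a reddish (lip-like) pixel and a neutral pixel.
image = [[(200, 100, 90), (120, 120, 100)]]
print(lip_mask(image, threshold=0.6))  # [[True, False]]
```

In the described system the threshold would come from the histogram of the pseudo-hue image rather than being fixed by hand.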
3 Manifold Representation
3.1 EM-PCA Mathematical Background
Expectation-Maximization PCA (EM-PCA) extends the standard PCA technique by incorporating the advantages of the EM algorithm in terms of estimating the maximum likelihood values for missing information. This technique was originally developed by Roweis [7], and its main advantage over standard PCA is that it is better suited to handling large high-dimensional datasets, especially when dealing with sparse training sets. The EM-PCA procedure has two distinct stages, the E-step and the M-step:

$$ \text{E-step: } W = (V^T V)^{-1} V^T A; \qquad \text{M-step: } V_{new} = A W^T (W W^T)^{-1} \tag{1} $$

where W is the matrix of unknown states, V is the test data vector, A is the observation data matrix, and T denotes the transpose operator.
3.2 Manifold Calculation from Input Data
Human lips are highly deformable objects. During speech they show variations in shape, color and reflection, and their relation to surrounding features such as the tongue and teeth is very complex [5]. For visual speech recognition purposes, we would like to extract the information associated with the lip motions from the frames that define the input video sequence. As indicated in the previous section, the lips are segmented in each frame by thresholding the pseudo-hue component calculated from the RGB data, and we encode the appearance of the lips in each frame as a point in a feature space that is obtained by projecting the input data onto the low-dimensional space generated by the EM-PCA procedure. The feature points obtained after the projection onto the low-dimensional space are joined by a line in frame order, and in this way we generate a surface in the feature space that is called a manifold [18]. In other words, the image region surrounding the lips in each frame (see Fig. 1e) is compressed using the EM-PCA technique detailed in Section 3.1. Since the manifolds encode the lip motion through image compression, the shape of a manifold is strongly related to the words spoken by the speaker and recorded in the input video sequence. Fig. 3 illustrates the variation between three independent image sequences of the same word in the EM-PCA feature space. It can be noted that the shapes of the manifolds are very similar and can be interpreted as a word “signature”.
Fig. 3. Manifolds generated from three image sequences representing the same word. The appearances of the manifolds indicate that their shapes are similar and contain information about the word spoken.
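The E-step and M-step of Section 3.1, followed by the per-frame projection described above, can be sketched for a single principal component. This is a simplified illustration under stated assumptions: V is treated as the current estimate of the principal direction (as in Roweis' formulation), the data are mean-centred first, only one component is extracted so that the matrix inverses reduce to scalar divisions, and the tiny data set is hypothetical.

```python
import math


def em_pca_first_component(A, iterations=50):
    """EM-PCA for one component. A is d x n: one column per frame, one row per pixel."""
    d, n = len(A), len(A[0])
    means = [sum(row) / n for row in A]
    X = [[A[i][j] - means[i] for j in range(n)] for i in range(d)]   # centred data
    v = [1.0 + i for i in range(d)]                                  # arbitrary non-zero start
    for _ in range(iterations):
        # E-step: W = (V^T V)^-1 V^T A  (a 1 x n row of states)
        vv = sum(x * x for x in v)
        w = [sum(v[i] * X[i][j] for i in range(d)) / vv for j in range(n)]
        # M-step: V_new = A W^T (W W^T)^-1  (a d x 1 column)
        ww = sum(x * x for x in w)
        if ww == 0.0:
            raise ValueError("start vector is orthogonal to the data")
        v = [sum(X[i][j] * w[j] for j in range(n)) / ww for i in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    # Project every frame onto the component: one feature point per frame.
    scores = [sum(v[i] * X[i][j] for i in range(d)) for j in range(n)]
    return v, scores


# Hypothetical data: 2 "pixels" observed over 4 frames, varying along (1, 1).
frames = [[0.0, 1.0, 2.0, 3.0],
          [0.0, 1.0, 2.0, 3.0]]
direction, feature_points = em_pca_first_component(frames)
print(direction, feature_points)
```

With several components the same two steps apply with small matrix inverses, and the sequence of projected points, taken in frame order, traces the manifold.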
While the shape of the manifolds can potentially be used to discriminate between different words, the manifolds cannot be used directly to train a classifier or to recognize an unknown input image sequence. The reason is that the number of frames contained in the input image sequence is variable and depends on the complexity of the word spoken by the speaker. Short words such as “I”, “shy”, etc. are associated with a small number of frames, and as a result their manifolds are defined by a small number of feature points. Conversely, longer words such as “banana” and “another” are associated with longer image sequences, and the number of feature points that define the manifolds is larger. This is a real problem when these manifolds are used to train a classifier, since the number of feature points differs from word to word. To circumvent this problem, we interpolate the feature points using a cubic spline and then uniformly re-sample the manifolds into a pre-defined number of keypoints. This procedure standardizes the manifolds so that they can be used to train a classifier or to obtain the classification result for an unknown image sequence.
3.3 Manifold Interpolation Using a Cubic Spline Function
The application of cubic spline interpolation has two main advantages. Firstly, it allows us to generate a smooth surface for the EM-PCA manifolds, and secondly it reduces the effect of noise (and the influence of objects surrounding the lips, such as teeth and tongue) associated with the feature points that form the manifolds in the EM-PCA space. This is shown in Fig. 4, which illustrates the interpolated manifolds generated for the words “slow” and “shy”.
3.4 Manifold Re-sampling into a Predefined Number of Keypoints
As mentioned earlier, in order to generate standard data for training and recognition we need to uniformly re-sample the manifolds into a pre-defined number of keypoints. This re-sampling procedure allows the identification of a standard set of keypoints, as illustrated in Fig. 5. We use uniform re-sampling since this procedure generates keypoints that are equally spaced along the interpolated manifold and accurately sample the intrinsic information associated with the manifold’s shape.
Fig. 4. Manifolds resulting after cubic spline interpolation. (a) Word “Slow”: two image sequences; (b) Word “Shy”: two image sequences.

Fig. 5. Uniform re-sampling of the interpolated manifolds (keypoints = 20). (a) Two image sequences: word “Art”; (b) two image sequences: word “Slow”; (c) two image sequences: word “Shy”. Note the good correspondence between the keypoints generated by re-sampling manifolds for similar words.
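Sections 3.3 and 3.4 can be sketched together as spline interpolation followed by equal-arc-length re-sampling. The details below are assumptions rather than the authors' exact procedure: each coordinate is fitted with a natural cubic spline parametrised by frame index, the curve is sampled densely (`dense_per_segment` is a hypothetical setting), and keypoints are then placed at equal distances along that dense curve. At least two frames and two keypoints are assumed.

```python
import math


def spline_second_derivs(y):
    """Second derivatives of a natural cubic spline through y at knots 0, 1, ..., n-1."""
    n = len(y)
    m2 = [0.0] * n
    size = n - 2
    if size < 1:
        return m2
    c, d = [0.0] * size, [0.0] * size
    for i in range(size):                      # Thomas algorithm, forward sweep
        rhs = 6.0 * (y[i] - 2.0 * y[i + 1] + y[i + 2])
        denom = 4.0 - (c[i - 1] if i > 0 else 0.0)
        c[i] = 1.0 / denom
        d[i] = (rhs - (d[i - 1] if i > 0 else 0.0)) / denom
    for i in reversed(range(size)):            # back substitution
        m2[i + 1] = d[i] - c[i] * m2[i + 2]
    return m2


def spline_eval(y, m2, t):
    """Evaluate the spline at parameter t in [0, n-1]."""
    i = min(int(t), len(y) - 2)
    a, b = (i + 1) - t, t - i
    return ((m2[i] * a ** 3 + m2[i + 1] * b ** 3) / 6.0
            + (y[i] - m2[i] / 6.0) * a + (y[i + 1] - m2[i + 1] / 6.0) * b)


def resample_manifold(points, n_keypoints=20, dense_per_segment=50):
    """Interpolate the feature points and return n_keypoints equally spaced along the curve."""
    n, dim = len(points), len(points[0])
    coords = [[p[k] for p in points] for k in range(dim)]
    derivs = [spline_second_derivs(c) for c in coords]
    steps = (n - 1) * dense_per_segment
    dense = [[spline_eval(coords[k], derivs[k], s / dense_per_segment) for k in range(dim)]
             for s in range(steps + 1)]
    cum = [0.0]                                # cumulative arc length of the dense curve
    for p, q in zip(dense, dense[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    keypoints, j = [], 0
    for k in range(n_keypoints):
        target = cum[-1] * k / (n_keypoints - 1)
        while j < len(cum) - 2 and cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        frac = (target - cum[j]) / seg if seg > 0 else 0.0
        keypoints.append([p + frac * (q - p) for p, q in zip(dense[j], dense[j + 1])])
    return keypoints


short_word = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]   # 4 frames
long_word = [[0.1 * t, math.sin(0.5 * t)] for t in range(9)]    # 9 frames
print(len(resample_manifold(short_word)), len(resample_manifold(long_word)))  # 20 20
```

Both sequences end up with the same number of keypoints regardless of how many frames they contain, which is what makes them usable as fixed-length classifier inputs.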
Fig. 6. Five words plotted in the second stage PCA space. Interpolated manifolds were resampled into 20 keypoints.
4 Classification
4.1 Second Stage EM-PCA and kNN Classification
The second-stage EM-PCA is applied to encode the temporal characteristics of the manifolds produced by the first stage of EM-PCA, which generates the feature points that form the manifold [11]. In this implementation, this is achieved by taking the keypoints obtained after manifold re-sampling for each word from the training data and compressing this data using EM-PCA (only the three largest components are retained). In this way, data points are generated for each class of words, and these are used to train a kNN classifier (k = 3). This is illustrated in Fig. 6, where 10 different instances of five words are depicted in the second-stage PCA space. Once the training is complete, the unknown data point is compressed using EM-PCA and is classified according to the distance to the nearest neighbour contained in the database [8].
4.2 HMM Classifier
The second classification scheme evaluated in this paper is the HMM, where the keypoints associated with each class of words are used directly for training [9, 16]. For this implementation we construct an HMM classifier for each word class, i.e. if we have k words in the database then we train k HMMs. The topology of the HMM classifier employed for this implementation is left-to-right, and the number of states is set to three. The states of the HMM are as follows. The first state encodes the transition from the initial state of the visual speech to articulation, the second state describes the articulation process, and the last state models the transition from articulation to the end of the visual speech. These three states define the visual speech model that is used in the classification process. The length of the observation sequence is set to the number of keypoints, and the maximum number of training iterations is set to 30.
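The scoring side of the HMM scheme in Section 4.2 can be sketched as a forward pass through a three-state left-to-right model, with one model per word and the best-scoring word chosen. Several simplifications are assumed here and are not taken from the paper: observations are scalars instead of multi-dimensional keypoints, emissions are single Gaussians, the self-transition probability `stay` is a fixed hypothetical value, and the model parameters are given by hand rather than trained by the iterative procedure mentioned above.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def left_to_right_log_likelihood(obs, means, variances, stay=0.6):
    """Forward algorithm (log domain) for a left-to-right HMM that starts in state 0."""
    n_states = len(means)
    log_a = [[NEG_INF] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            log_a[i][i] = math.log(stay)          # remain in the current state
            log_a[i][i + 1] = math.log(1 - stay)  # advance to the next state
        else:
            log_a[i][i] = 0.0                     # the final state absorbs
    alpha = [NEG_INF] * n_states
    alpha[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        alpha = [logsumexp([alpha[i] + log_a[i][j] for i in range(n_states)])
                 + log_gauss(x, means[j], variances[j])
                 for j in range(n_states)]
    return logsumexp(alpha)


def classify(obs, models):
    """Pick the word whose HMM gives the observation sequence the highest likelihood."""
    return max(models, key=lambda w: left_to_right_log_likelihood(obs, *models[w]))


# Hypothetical word models: (state means, state variances).
models = {"slow": ([0.0, 5.0, 10.0], [1.0, 1.0, 1.0]),
          "shy": ([10.0, 5.0, 0.0], [1.0, 1.0, 1.0])}
print(classify([0.1, 0.2, 4.9, 5.1, 9.8, 10.0], models))  # slow
```

The three states play the roles described in the text: onset, articulation and release, visited strictly in that order.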
5 Experimental Results

A number of experiments were carried out to assess the performance of the proposed system. For this study, we created a database consisting of 10 words (30 examples for each word) generated by one speaker, where 10 examples of each word are used for training and 20 examples are used for testing. The input database is divided into 10 classes, and the classification results achieved by the kNN classifier when used in conjunction with second stage PCA are illustrated in Fig. 7. The classification results achieved by the HMM classifier are depicted in Fig. 8. By comparing the error recognition rate and the rate for incorrect recognition, it can be concluded that the HMM classifier clearly outperforms the kNN classifier. The experimental results indicate that the best results are obtained when the interpolated manifolds are re-sampled to 20-30 keypoints. Another important finding of this investigation is that the manifolds offer good discrimination (the average classification success rate is 95% for the HMM classifier) and are suitable features for visual speech recognition.

Fig. 7. Classification error rate achieved by the kNN classifier. (a) Manifold re-sampling: 10 keypoints. (b) Manifold re-sampling: 20 keypoints. (c) Manifold re-sampling: 50 keypoints. (d) Manifold re-sampling: 100 keypoints. (e) Classification accuracy (average 85%).

Fig. 8. Classification error rate achieved by HMM. (a) Error recognition rate; (b) Rate for incorrect recognition. The average classification accuracy is 95%. From (a), it can be observed that the recognition rate is higher than that achieved by the kNN based classification scheme.
6 Conclusions

This paper describes the development of a visual speech recognition system, with the main emphasis placed on evaluating the discriminative power offered by a new manifold representation. The manifolds are generated from the image data surrounding the lips, and this data is compressed using an EM-PCA procedure into a low-dimensional feature space. Since the manifolds are defined by different numbers of frames, they cannot be used directly as inputs for classification. To address this problem, we propose to interpolate the manifolds and then re-sample them uniformly into a predefined number of keypoints. It should be noted that our experiments used image data generated by a single speaker and that the word database contained a small number of words. We also noticed that the recognition rate for complex words such as “banana” and “another” is generally lower than that obtained for simpler words such as “I” and “shy”. In the future, we aim to evaluate the proposed approach on a larger number of words generated by different individuals and to include the temporal information as an additional cue in the recognition process.
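The interpolate-then-resample step summarised above can be sketched as follows. This is only an illustration under stated assumptions: piecewise-linear interpolation along the arc length of the trajectory is assumed (the interpolation scheme actually used may differ), and resample_manifold is a hypothetical name.

```python
import numpy as np

def resample_manifold(points, n_keypoints=20):
    """Resample a trajectory of shape (n_frames, dim) into n_keypoints points
    that are equally spaced in arc length (linear interpolation assumed)."""
    points = np.asarray(points, dtype=float)
    # Length of each segment between consecutive frames.
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    # Cumulative arc length at every frame, starting from zero.
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, arc[-1], n_keypoints)
    # Interpolate every coordinate separately at the target arc lengths.
    return np.column_stack([np.interp(targets, arc, points[:, d])
                            for d in range(points.shape[1])])

# Two utterances with different frame counts map to the same fixed size.
short = resample_manifold([[0, 0, 0], [1, 0, 0], [3, 0, 0]], n_keypoints=4)
longer = resample_manifold(np.linspace([0, 0, 0], [3, 3, 3], 11), n_keypoints=4)
```

Whatever the number of input frames, the output has a fixed shape, which is what makes the manifolds usable as classifier inputs.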
References
1. Eveno, N., Caplier, A., Coulon, P.: Accurate and quasi-automatic lip tracking. IEEE Trans. Circuits Syst. Video Techn. 14(5), 706–715 (2004)
2. Ghahramani, Z.: Machine Learning Toolbox, Version 1.0 01-04-96, University of Toronto
3. Tian, Y.L., Kanade, T.: Robust lip tracking by combining shape colour and motion. In: Proc. of the Asian Conference on Computer Vision, pp. 1040–1045 (2000)
4. Nefian, A.V., Liang, L.H., Liu, X., Pi, X.: Audio-visual speech recognition. Intel Technology & Research (2002)
5. Eveno, N., Caplier, A., Coulon, P.Y.: A new color transformation for lips segmentation. In: IEEE Fourth Workshop on Multimedia Signal Processing, pp. 3–8, Cannes, France (2001)
6. Luettin, J., Thacker, N.A., Beet, S.W.: Active Shape Models for Visual Speech Feature Extraction. University of Sheffield, U.K., Tech. Rep. 95/44 (1995)
7. Roweis, S.: EM algorithms for PCA and SPCA. Advances in Neural Information Processing Systems 10, 626–632 (1998)
8. Cootes, T., Edwards, G., Taylor, C.: A comparative evaluation of active appearance model algorithms. In: Proc. of the British Machine Vision Conference, pp. 680–689 (1988)
9. Foo, S.W., Lian, Y.: Recognition of visual speech elements using adaptively boosted HMM. IEEE Trans. on Circuits Syst. Video Techn. 14(5), 693–705 (2004)
10. Shamaie, A., Sutherland, A.: Accurate recognition of large number of hand gestures. In: Proc. of Iranian Conference on Machine Vision and Image Processing, University of Technology, Tehran (2003)
11. Das, S.R., Wilson, R.C., Lazarewicz, M.T., Finkel, L.H.: Gait recognition by two-stage principal component analysis. Automatic Face and Gesture Recognition, pp. 579–584 (2006)
12. Gordan, M., Kotropoulos, C., Pitas, I.: Application of support vector machines classifiers to visual speech recognition. In: Proc. of the 2002 Int. Conf. on Image Processing (2002)
13. Hong, X.P., Yao, H.X., Wan, Y.Q., Chen, R.: A PCA based visual DCT feature extraction method for lip-reading. In: Proc. of Intelligent Information Hiding and Multimedia Signal Processing, pp. 321–326 (2006)
14. Yau, W.C., Kumar, D.K., Arjunan, S.P., Kumar, S.: Visual speech recognition using image moments and multi-resolution wavelet images. Computer Graphics, Imaging and Visualisation, pp. 194–199 (2006)
15. Cetingul, H.E., Yemez, Y., Erzin, E., Tekalp, A.M.: Discriminative analysis of lip motion features for speaker identification and speech-reading. IEEE Trans. on Image Processing 15(10), 2879–2891 (2006)
16. Dong, L., Foo, S.W., Lian, Y.: A two-channel training algorithm for Hidden Markov Model and its application to lip reading. EURASIP Journal on Applied Signal Processing 2005(9), 1382–1399 (2005)
17. Harvey, R., Matthews, I., Bangham, J.A., Cox, S.: Lip reading from scale-space measurements. In: Proc. of Computer Vision and Pattern Recognition, pp. 582–587 (1997)
18. Bregler, C., Omohundro, S.M.: Nonlinear manifold learning for visual speech recognition. In: Proc. of the International Conference on Computer Vision, pp. 494–499 (1995)
Fingerprint Hardening with Randomly Selected Chaff Minutiae
Alper Kanak1,2 and İbrahim Soğukpınar2
1 National Research Institute of Electronics and Cryptology, TUBITAK-UEKAE, Kocaeli, Turkiye
2 Gebze Institute of Technology, Kocaeli, Turkiye
[email protected],
[email protected]

Abstract. Since fingerprints provide a reliable alternative to traditional password-based security systems, they are gaining industry and citizen acceptance. However, due to the higher uncertainty and inherent complexity associated with biometrics, pure biometric traits do not provide a reliable security system, especially for large populations. This paper addresses this problem by proposing a hardening scheme which combines the fingerprint minutiae-based template with user-specific pseudo-random data to enhance security. In the proposed scheme, a set of randomly selected user-specific chaff minutiae features is stored in a smartcard and a subset of this set is used at each acquisition. The set of chaff minutiae is combined with the template set and scrambled to form a fixed-length hardened feature. The graph-based dynamic matching algorithm is transparent to the proposed hardening scheme: it runs as if the pure original template and query features were used. Our experiments show that biometric hardening reduces the error rate to 0% with several orders of magnitude separation between genuine and impostor populations.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 383–390, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Among the various computer security techniques, cryptography has been identified as one of the most important solutions in the digital world. Symmetric and asymmetric cryptographic schemes are extensively used as a means of secure information exchange over a network. Although very powerful encryption techniques exist, they rely on very long keys which are hard to remember and which require a reliable key management protocol. Such long keys cause key generation, dissemination and storage problems. Moreover, cryptographic systems authenticate these long keys, not the user. Current authentication systems use tokens, smart cards or just passwords to represent a user, or claim to represent a user, but this does not really distinguish between authorized users and unauthorized people. This problem is usually neglected, especially in small-scale authentication systems. For large-scale authentication systems, however, a more distinguishing trait is required. Biometrics, which are permanently associated with the user, can obviate the need to carry tokens or remember passwords and keys. However, biometrics cannot be directly used to secure data because they raise privacy and security concerns. Since biometrics carry permanent information, they cannot be revoked or cancelled. Moreover, biometric data is uncertain and inherently complex. The biometric data acquired for an individual may differ over time due to biological changes (aging, wounds, illnesses, etc.) and geometrical changes (rotation-scale-translation variance), or may be distorted by environmental effects (noise). Biometrics offer an inextricable link between the authenticator and its owner, something passwords or tokens cannot do. When used in key generation, this link can replace the password, provided that the problems associated with dependency on environmental effects, pose variance, image quality or any other distortion are overcome.

Although there has been a great research effort over the past several years to integrate biometrics with cryptography, the number of resulting studies is limited to just a few. Soutar et al. [3] attempted to hide secret information within fingerprint features. Davida et al. [6] presented a cryptographic key extraction scheme using iris codes. Monrose et al. [7,8] introduced a technique to harden traditional passwords by adding error-tolerant bits derived from the user's voice or keystrokes. The fuzzy commitment scheme [10] was presented as a method of combining auxiliary information with biometric information. The main criticism of this method is that it requires an ordered feature. To relax the strict need for ordered features, the fuzzy vault [9] was proposed, but it is still not obvious how it handles the alignment requirements of the feature representations. A practical application of the fuzzy vault to fingerprints is presented in [12]. In summary, the major weakness of the works reported in the literature is that they either assume aligned features [3,12,10,9] or perform matching in a transformed space domain [2,14].
Another problem of such systems is storage: the biometric template is not stored securely (it is kept on a PC or in a database) [6]. In this paper, we propose a hardening scheme which combines a secret, truly random locking set of sufficient size, stored in a secure environment (smartcard), with the user-specific biometric features. The matching algorithm requires neither pre-alignment nor hashing (as proposed in [2,4,5,14,3]), and it is transparent to the feature set: it does not need to know whether the set was hardened or not. In order to harden the feature set, chaff points are added to the original data, rather than storing the original features as they are. Because the number of minutiae extracted varies between acquisitions, a variable number of chaff minutiae is required to form a fixed-size hardened feature. All these random numbers are selected from a random number pool saved in a smartcard. Additionally, the encrypted version of the hardened biometric is stored in the smartcard to ensure a more secure system. Graph-based matching, namely k-plet, is used as the matching algorithm [15]. K-plets are used directly on the hardened biometric feature, so no further transformation is required.
2 Background Information and Related Works
A wide range of techniques for generating and protecting cryptographic keys is based on biometric hardening and keying. In biometric hardening, the biometric template is combined with user-specific pseudo-random information. This idea is used for concealing biometric data, in which a one-way transformation function is applied to hide the original data [1]. This is similar to password salting in traditional password-based security systems. A good example of biometric hardening was proposed by Teoh et al. (FaceHashing) [2] to derive cryptographic keys from multiple face instances of a person. In that study, a tokenized face hashing method is applied by discretizing an integral transform of the biometric input with a set of tokenized pseudo-random numbers. A cryptographic interpolation is then done using Shamir secret sharing, where the biometric template and the user-specific pseudo-random string are used as shares to derive the key. To obtain well-separated impostor and genuine distributions, the user must carry huge random number sets in his token so that a long enough biometric bit string is obtained. Another hardening attempt was presented in [3], where the secret information is hidden within fingerprints. In that study, the fingerprint image is combined with a user-specific random key to generate a Bioscrypt™ [3]. However, this method does not carry rigorous security guarantees and the resulting equal error rates are still unknown. Other techniques have also been proposed. For instance, Savvides et al. [4] convolve random kernels with face images to generate cancelable templates. Another case study [5] presents several constructs for cancelable templates using feature domain transformations.

In biometric keying, the cryptographic key is generated directly from the biometric signal using an error-tolerant binary representation. In the approach proposed by Davida et al. [6], user-specific Hamming codes are stored on tokens and are used to offset the errors caused during acquisition and feature representation. However, storing error-correcting bits in the database leads to the leakage of user information.
Monrose et al. present a technique to generate a cryptographic key from keystroke dynamics and speech [7,8]. In these studies, biometric features are converted to an error-tolerant bit representation by comparing each value to a fixed global threshold. Although the cryptographic key is strengthened at each successful login, the drawbacks of the system are the difficulty of selecting the right threshold and the uncertainty due to errors during acquisition. Another weakness of this work is that it adds insufficient entropy to the passwords (just 15 and 60 bits obtained from keystroke and voice patterns, respectively). Another example of biometric-based key generation, called the Fuzzy Vault Scheme [9], is an improvement upon the previous work by Juels and Wattenberg [10]. In this scheme a secret is placed in a vault and locked by an unordered set. At the verification phase, the vault is unlocked if two unordered sets overlap to a great extent. Clancy et al. [12] proposed a fingerprint vault based on the fuzzy vault scheme. In this approach, multiple minutiae location sets per finger are used to find the canonical positions of the minutiae which form the registering unordered set. Chaff points are added to strengthen the lock. The drawback of this approach is that it assumes the fingerprints are pre-aligned. To address the frequent impossibility of properly aligning minutiae points, Uludag et al. [13] proposed to use features independent of global rotation and translation. They reported that genuine users can unlock the vault successfully while the complexity of the attacks that can be launched by impostor users is high. One critical limitation of fuzzy-based approaches is their high time complexity, which is due to the need to evaluate multiple point combinations during decoding.

Symmetric hash functions are another approach. In [14], the idea of combining the results of localized matching into the whole fingerprint recognition algorithm is proposed. For each minutia feature vector of length 3 (x, y, θ) and its two nearest neighbors, a secondary feature vector of length 5 is generated, based on the Euclidean distances and orientation differences between the central minutia and its nearest neighbors. Matching is performed on these secondary features, which keep only limited information about neighborhoods, not the exact minutia positions. However, this study involves little hardening based on randomness. A nice case study considering Cartesian, radial and functional transformations relying on a user-specific random permutation is presented in [5]. However, the entropy retention problem still remains in this method.
3 Implementing Hardened Fingerprints
In this study, the original biometric data of a person is hardened by randomly selected chaff data associated with the user. The chaff and the original features are combined to form a larger feature vector. The main advantage of the proposed approach is that no transformation or space conversion is required. This approach may be used not only for hardening but also for canceling the biometric. The difficulty lies in the amount of chaff data used: if too many chaff points are used, the biometric trait of the individual is suppressed and the separation between genuine and impostor data is reduced; if too few are used, the biometric is not hardened enough.

In the proposed hardening scheme, a set of chaff features is stored in a pool which is saved in a smartcard's EEPROM. All the secret data in the smartcard is generated randomly and is associated uniquely with a person. Chaff and real minutiae are represented by triplets (x, y, θ), where x and y denote the Cartesian coordinates and θ the angle of the associated ridge. When the biometric is captured during enrollment or verification, the minutiae triplets are combined with the randomly generated data in the pool. Since the number of extracted minutiae is uncertain, the amount of chaff data to be used is uncertain too. For instance, if N minutiae triplets are extracted at the enrollment stage and k = M − N chaff triplets are selected from the pool to generate a fixed-size hardened biometric feature of size M, the number k may differ at the verification stage because of the usual uncertainty of biometric acquisition. This unordered set may additionally be scrambled, encrypted with the user's private key to increase secrecy, and stored on the smartcard. When a genuine fingerprint is captured, the minutiae triplets are extracted from the image and a different unordered set of randomly selected chaff points from the pool on the smartcard is added to the test data. Meanwhile, the previously locked fingerprint feature set stored on the smartcard is decrypted with the user's private key. Finally, a matching procedure is performed between the query and the template minutiae sets. Note that the matching scheme is transparent to the preceding hardening step. The whole hardening scheme is depicted in Fig. 1.

Fig. 1. The Proposed Hardening Scheme

During verification, a graph-based matching algorithm is used [15]. It relies on a novel representation called k-plet, defined as a local neighborhood of minutiae that is translation and rotation invariant. The local neighborhoods are matched using a dynamic programming based algorithm, and the consolidation of the local matches is done by a coupled breadth first search algorithm, which propagates the local matches simultaneously in both fingerprints. The major advantage of this algorithm is that no explicit alignment of the minutiae sets is required. Naturally, the performance of the matching algorithm depends heavily upon the extracted features.

In order to detect minutiae, a robust enhancement algorithm is applied before feature extraction. We have applied Short Time Fourier Transform (STFT) analysis and contextual/non-stationary filtering in the Fourier domain [16]. This method is advantageous because all intrinsic images (ridge orientation, frequency and the region mask) are estimated simultaneously from the STFT analysis, which yields significantly lower time consumption and reduced space requirements compared with other methods [17]. Once the enhanced fingerprint image is obtained, a simple adaptive binarization algorithm is applied. The resulting binary image is thinned by an iterative morphological process, resulting in a single-pixel-wide map. This map is then scanned sequentially and minutiae (represented as (x, y, θ)) are identified within a 4 × 4 window traversing the whole thinned image. Finally, the spurious points are cleared out by applying some heuristic rules (see Fig. 2).
Fig. 2. Minutiae Extraction Procedure
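The padding of N extracted minutiae with k = M − N chaff triplets described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names (make_chaff_pool, harden), the pool size and the coordinate ranges are assumptions, encryption and smartcard storage are omitted, and Python's seeded pseudo-random generator stands in for the random source on the card purely to keep the example reproducible.

```python
import random

def make_chaff_pool(size, width, height, rng):
    """Hypothetical helper: a pool of random (x, y, theta) chaff triplets."""
    return [(rng.randrange(width), rng.randrange(height), rng.randrange(360))
            for _ in range(size)]

def harden(minutiae, chaff_pool, M, rng):
    """Pad N real minutiae with k = M - N chaff triplets and scramble the set."""
    k = M - len(minutiae)
    if k < 0 or k > len(chaff_pool):
        raise ValueError("M must satisfy N <= M <= N + pool size")
    chaff = rng.sample(chaff_pool, k)   # a fresh random subset of the pool
    hardened = list(minutiae) + chaff
    rng.shuffle(hardened)               # hide which entries are genuine
    return hardened

rng = random.Random(42)                 # seeded only for reproducibility
pool = make_chaff_pool(50, 388, 374, rng)

# Enrollment and verification may extract different numbers of minutiae,
# so k differs while the hardened size M stays fixed.
enrolled = [(10, 20, 30), (50, 60, 90), (100, 120, 45)]
query = [(11, 19, 31), (51, 62, 88)]
template = harden(enrolled, pool, 8, rng)
probe = harden(query, pool, 8, rng)
```

Both hardened sets have the same length M even though N differs, so a minutiae matcher can consume them exactly as it would consume unhardened sets.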
4 Experimental Results and Analysis
In order to measure the performance, we ran the matching algorithm on images from FVC2002 DB1 (DB1) and our own database (DB2), which was acquired with UPEK's TouchStrip TCS3 sensor. This is a silicon strip sensor suited to portable devices such as notebook PCs and flash drives. DB1, which is accepted as the benchmark database in the literature, contains 800 images (100 distinct fingers, 8 instances each) of size 388 × 374 pixels. DB2 consists of 500 images (124 × 180 pixels) acquired from 88 distinct fingers, with 5 impressions for each finger. For the genuine comparison, each instance of a finger is compared with the rest of the instances, resulting in C × (C − 1)/2 tests, where C = (number of instances per finger) × (number of fingers). For the impostor comparison test, the first instance of each finger is compared against the first instance of all other fingers, resulting in a total of P × (P − 1)/2 tests, where P is the number of people in the database. The fixed size M of the resulting hardened minutiae set is selected according to the average size of the original minutiae set (Ñ = 40 and Ñ = 30 for DB1 and DB2, respectively).

It can be seen from Fig. 3 that biometric hardening increases the separation between the genuine and impostor distributions. The degree of separation is strongly related to the number of chaff minutiae: as M (or k) increases, the separation increases too. However, large values of M are not recommended because of computational limitations. Note that for each case depicted in Fig. 3, the N minutiae of each fingerprint image remain the same, whereas the number k of randomly selected chaff minutiae differs, yielding various values of M. Since the matching strategy is based on a dynamic algorithm, the response time of the system depends on the number of compared minutiae. In practice, the total response times for M = 40 and M = 60 are approximately 0.6 and 1.1 seconds, respectively, on a Pentium 4 2.8 GHz CPU with 2 GB RAM when DB2 images are used. For the encryption back-end, AES, DES, 3DES or any other algorithm might be applied.

Table 1 compares the proposed scheme with other minutiae-based hardening schemes in terms of Genuine Accept Rate (GAR), False Accept Rate (FAR) or False Reject Rate (FRR). As shown, the proposed scheme performs better than the other methods from the recognition perspective. However, although the proposed method improves recognition performance, the hardened feature set still has a weak non-invertibility problem, yielding entropy problems. Thus, as a future study, a more sophisticated transformation might be proposed to meet the non-invertibility, entropy retention, certainty and recoverability requirements of cancelable biometrics.

Table 1. Comparison of the Proposed Algorithm with Some Other Major Methods

Method                              FAR(%)   GAR(%)
Cartesian Transformation [5]        10^-4    ∼67
                                    10^-1    ∼91
Radial Transformation [5]           10^-4    ∼84
                                    10^-1    ∼97
Functional Transformation [5]       10^-4    ∼83
                                    10^-1    ∼97
Fuzzy Vault with Helper Data [11]   0        72.6
Symmetric Hash Functions [14]       0        ∼81
Fingerprint Vault [12]              FRR = 20%–30%
Proposed Algorithm                  10^-1    100
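To make the quantities used in this section concrete, the sketch below computes FAR and GAR from two lists of match scores at a given decision threshold, together with the impostor test count P × (P − 1)/2. It is an illustration only: the scores are invented, a higher score is assumed to mean a better match, and P = 100 is assumed for DB1 on the basis of its 100 distinct fingers.

```python
import numpy as np

def far_gar(genuine_scores, impostor_scores, threshold):
    """Return (FAR, GAR) in percent, assuming a comparison is accepted
    when its similarity score is at or above the threshold."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    gar = 100.0 * np.mean(genuine >= threshold)   # genuine pairs accepted
    far = 100.0 * np.mean(impostor >= threshold)  # impostor pairs accepted
    return far, gar

def n_impostor_tests(p):
    """Number of first-instance impostor comparisons, P x (P - 1) / 2."""
    return p * (p - 1) // 2

# Invented scores, for illustration only.
genuine = [0.9, 0.8, 0.75, 0.6]
impostor = [0.1, 0.2, 0.3, 0.65]

far_strict, gar_strict = far_gar(genuine, impostor, threshold=0.7)
far_loose, gar_loose = far_gar(genuine, impostor, threshold=0.5)
db1_impostor_tests = n_impostor_tests(100)  # assumes P = 100 for DB1
```

FRR is simply 100 − GAR, so raising the threshold trades a lower FAR for a higher FRR. The wider the gap between the genuine and impostor score distributions, the easier it is to find a threshold at which both error rates are small.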
Fig. 3. Genuine (straight lines) and Impostor (dashed lines) Population Distributions for DB1 (left two columns) and DB2 (right two columns). The panels show pure minutiae and hardened sets with M ranging from 30 to 90.
5 Conclusion
Biometrics, especially fingerprints, present obvious advantages over traditional security schemes in which a certain amount of data is kept in a password, token or smartcard. However, since biometrics are permanently associated with a person, they cannot be revoked or replaced. In this study, the original minutiae set of a person is hardened by pseudo-randomly selected chaff minutiae triplets to construct revocable templates. In order to improve accuracy, STFT-based enhancement is applied. Matching is based on a graph-based algorithm which gives promising results. One of the main advantages of the proposed scheme is that no space conversion is required for hardening. Thus, the hardening scheme is transparent to the matching algorithm, which means that the proposed scheme is compatible with different minutiae matching algorithms. Moreover, the features are stored inside the smartcard to provide more security.
References
1. Ratha, N.K., Connell, J.H., Bolle, R.: Enhancing Security and Privacy in Biometrics-based Authentication System. IBM Systems Jour. 40(3), 614–634 (2001)
2. Teoh, A.B.J., Ngo, D.C.L., Goh, A.: Personalised Cryptographic Key Generation Based on FaceHashing. Computers & Security 23(7), 606–614 (2004)
3. Soutar, C., Roberge, D., Stojanov, S.A., Gilroy, R., Vijaya Kumar, B.V.K.: Biometric Encryption Using Image Processing. In: Proc. SPIE, Optical Security and Counterfeit Deterrence Techniques II, vol. 3314, pp. 178–188 (1998)
4. Savvides, M., Vijaya Kumar, B.V.K., Khosla, P.K.: Cancelable Biometric Filters for Face Recognition. In: Proc. of Int. Conf. on Pattern Recognition, pp. 922–925 (2004)
5. Ratha, N.K., Connell, J.H., Bolle, R., Chikkerur, S.: Cancelable Biometrics: A Case Study in Fingerprints. In: Proc. of Int. Conf. on Pattern Recognition (2006)
6. Davida, G.I., Frankel, Y., Matt, B.: On Enabling Secure Applications Through OFF-line Biometric Identification. In: Proc. IEEE Symp. Privacy and Security, pp. 148–157 (1998)
7. Monrose, F., Reiter, M.K., Wetzel, S.: Password Hardening Based on Keystroke Dynamics. In: Proc. ACM Conf. Computer & Communication Security, pp. 73–82 (1999)
8. Monrose, F., Reiter, M.K., Li, Q., Wetzel, S.: Using Voice to Generate Cryptographic Keys. In: Proc. 2001: A Speaker Odyssey, Speaker Recognition Workshop, pp. 237–242 (2001)
9. Juels, A., Sudan, M.: A Fuzzy Vault Scheme. Des. Codes Cryptography 38(2), 237–257 (2006)
10. Juels, A., Wattenberg, M.: A Fuzzy Commitment Scheme. In: Tsudik, G. (ed.) Proc. ACM Conf. Computer & Communications Security, pp. 28–36 (1999)
11. Uludag, U., Jain, A.K.: Securing Fingerprint Template: Fuzzy Vault with Helper Data. In: Proc. IEEE Workshop on Privacy Research in Vision, New York City, NY (2006)
12. Clancy, T.C., Kiyavash, N., Lin, D.J.: Secure Smartcard-based Fingerprint Authentication. In: Proc. ACM SIGMM Multimedia, Biometrics Methods & Applications Workshop, pp. 45–52 (2003)
13. Uludag, U., Pankanti, S., Prabhakar, S., Jain, A.K.: Biometric Cryptosystems: Issues and Challenges. In: Proc. of the IEEE, vol. 92(6) (2004)
14. Tulyakov, S., Farooq, F., Govindaraju, V.: Symmetric Hash Functions for Fingerprint Minutiae. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3687, pp. 30–38. Springer, Heidelberg (2005)
15. Chikkerur, S., Cartwright, A.N., Govindaraju, V.: K-plet and CBFS: A Graph based Fingerprint Representation and Matching Algorithm. In: Int. Conf. Biometrics (2006)
16. Chikkerur, S., Cartwright, A.N., Govindaraju, V.: Fingerprint Enhancement Using STFT Analysis. Jour. Pattern Recogn. 40(1), 198–211 (2007)
17. Maio, D., Maltoni, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, Heidelberg (2003)
Wavelet-Based Fingerprint Region Selection
Almudena Lindoso, Luis Entrena, and Judith Liu-Jimenez
University Carlos III of Madrid, Electronic Technology Department, Butarque 15, 28911 Leganes, Madrid, Spain
{alindoso, entrena, jliu}@ing.uc3m.es
Abstract. In this paper a novel approach for detecting fingerprint regions with relevant information is presented. The method is based on the capability of the wavelet transform to select image information while considering the spatial and frequency domains at the same time. The method has been tested with two fingerprint databases, providing excellent results. With this method the fingerprint core can be detected and the background can be removed, providing an efficient region selection for any feature extraction, preprocessing or matching algorithm.

Keywords: Fingerprint, Biometrics, wavelet.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 391–398, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Nowadays the fingerprint is the most widely used and studied biometric trait because of its universality, its distinctiveness, and the decreasing cost of sensing devices [1]. Although well studied, fingerprint recognition is still a challenge in some aspects. The accuracy of matching algorithms is usually determined by the fingerprint image quality. Preprocessing, feature extraction and matching are all affected by bad quality regions inside fingerprint images. Preprocessing usually makes a great effort to increase the quality of bad regions (low contrast, blur, etc.). Most of the time, this preprocessing effort leads to artifacts that reduce the accuracy of the extraction and matching steps. Therefore, a measure of fingerprint image quality or of the quantity of information is generally required.

The wavelet transform has been widely used for image compression, since it makes the detection of redundant information possible. This property has been applied to fingerprint compression [2]. Some works have also proposed the use of wavelets for fingerprint enhancement [3], [4], [5], [6]. Wavelet properties have recently been used for fingerprint aliveness detection [7], [8]. Other approaches have used the capability of the wavelet transform to distinguish the characteristics of the information inside images. From this point of view, several works have focused on determining a fingerprint feature vector extracted from the fingerprint wavelet transform. In [9], [10] the fingerprint feature vector is constructed from the standard deviation of the wavelet coefficients. This vector is extracted from a region near the fingerprint core, but this region is determined manually. In both works several wavelet families were used, with the best results achieved by the Symmlet and Daubechies families. These feature vectors achieve high matching accuracy. In [12] the variance of the coefficients is used to refine the feature vector proposed in [9]. In [13] the feature vector is built from the statistics of the wavelet transform (average and standard deviation) and the co-occurrence matrix (contrast, energy, entropy, local homogeneity, cluster shade, cluster prominence and maximum probability of the wavelet coefficients). In other approaches, feature vectors are built from the energy of selected coefficients of packet wavelets [14] and from the Generalized Gaussian density and energy distribution of the coefficients [15]. Finally, in [11] fingerprint matching is performed by computing the mean square error of directional images extracted from the mid-range frequencies of the wavelet transform of the fingerprint. All these related works using wavelets for fingerprint recognition demonstrate that wavelets have the capability to classify the information confined inside fingerprints.

In this paper a new method for detecting relevant regions inside a fingerprint is presented. The method is based on the capability of the wavelet transform to select image information while considering the spatial and frequency domains at the same time. This approach allows the core location and the best quality areas in a fingerprint to be identified. This information can then be used during the feature extraction or matching steps as a measure of confidence in the results. In addition, the capability of selecting good quality areas is crucial for some fingerprint matching methods, such as correlation-based methods [1]. The remainder of the paper is organized as follows: Section 2 presents the proposed region selection method; Section 3 presents and discusses the experimental results achieved with this method; and finally Section 4 presents the conclusions of this work.
2 Region Selection

The wavelet transform is a hierarchical sub-band decomposition of the information in which the sub-bands are spaced logarithmically in frequency [2]. For images, the dyadic wavelet transform divides each wavelet level into four sub-bands of coefficients. For a level 1 wavelet transform these bands are: low frequencies (LL1), mid-range frequencies (LH1, HL1) and high frequencies (HH1). For successive levels, the transform is applied to the low-frequency coefficients of the previous level, which means that level 2 coefficients are obtained from the wavelet transform of LL1.

Several families can be used for the wavelet transform. The family must be selected in accordance with the characteristics of the image to be transformed. The Daubechies and Symmlet families have proved to be the best for some fingerprint wavelet transform applications [9]. For most images, the amplitude of the wavelet coefficients increases as the wavelet level increases. Fingerprint wavelet coefficients do not show the same distribution. Fingerprints are oscillatory patterns whose wavelet coefficients are concentrated in the mid-range frequency bands [9]. Given this behavior, few wavelet levels are usually required to extract the relevant information.
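The sub-band structure described above can be illustrated with a minimal Python sketch. It uses the Haar family only, because its filters fit in a few lines; it is not the Daubechies filter bank used in the experiments. The function names `haar_level` and `dyadic_transform`, the division by 4 as normalization, and the assignment of the names LH and HL to the two mid-range bands are assumptions made for illustration.

```python
def haar_level(img):
    """One level of a 2D Haar transform.

    img is a list of rows with even height and width. Returns the four
    sub-bands (LL, LH, HL, HH), each half the size of the input.
    """
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(img), 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]          # top pixel pair
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom pixel pair
            ll_row.append((a + b + d + e) / 4.0)     # local average (low frequencies)
            lh_row.append((a + b - d - e) / 4.0)     # top minus bottom difference
            hl_row.append((a - b + d - e) / 4.0)     # left minus right difference
            hh_row.append((a - b - d + e) / 4.0)     # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def dyadic_transform(img, levels):
    """Apply haar_level repeatedly to the LL band of the previous level."""
    bands = {}
    current = img
    for level in range(1, levels + 1):
        ll, lh, hl, hh = haar_level(current)
        bands[level] = {"LH": lh, "HL": hl, "HH": hh}
        current = ll                                 # next level decomposes LL
    bands["LL"] = current
    return bands
```

For an 8x8 image, three levels give detail bands of 4x4, 2x2 and 1x1 coefficients, which shows why small or low-resolution images limit the number of usable levels.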
Wavelet-Based Fingerprint Region Selection
Related work demonstrates that wavelet coefficients can be used for matching fingerprints [9], [10], [11], [12], [13], [14], [15]. These works demonstrate the capability of wavelets to determine fingerprint feature vectors. In our work, this capability is used to determine the regions that contain suitable information for fingerprint matching. In our approach, a level 3 dyadic wavelet transform of the fingerprint is computed. The Haar and Daubechies wavelet families are considered. The original fingerprint is divided into windows of W×H pixels. For all the levels, the sub-band coefficients are associated with the original fingerprint window they were extracted from. This spatial correspondence is necessary to determine the relevance of the information inside a fingerprint area. The behavior of the wavelet coefficients must be characterized in order to measure the quantity of information in the fingerprint. For this purpose, several statistics are computed in each sub-band of each level considering a window of N×M pixels. The statistics considered are the following: average (1), energy (2), standard deviation (3) and entropy (4).
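Average:
$$\frac{\sum C(i,j)}{W \cdot H} \qquad (1)$$

Energy:
$$\frac{\sum \left(C(i,j)\right)^2}{W \cdot H} \qquad (2)$$

Standard deviation:
$$\sqrt{\frac{\sum \left(\mathrm{Average} - C(i,j)\right)^2}{W \cdot H}} \qquad (3)$$

Entropy:
$$-\frac{\sum \mathrm{prob}(C(i,j)) \cdot \log_2(C(i,j))}{W \cdot H} \qquad (4)$$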
In the formulae above, C(i,j) are the wavelet coefficients and W, H are respectively the width and height of the windows used for the statistic computation. After these computations, an average of the statistics of all the sub-bands is computed for each window of the original image. This results in a measure of the quantity of information the fingerprint contains on a per-window basis. With this approach the fingerprint region with the maximum quantity of information can be selected over other non-relevant regions. Depending on the statistic used, the region with maximum magnitude varies. In the next section, the results obtained with each statistic are discussed.
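The four statistics can be sketched in Python as follows. The function `window_statistics` and its `bins` parameter are hypothetical names. The average, energy and standard deviation follow the definitions of the text directly. For the entropy, the sketch makes an assumption: prob(C(i,j)) is estimated from a histogram of the coefficients in the window and the usual Shannon form is computed, which may differ from the exact normalization intended for the entropy statistic.

```python
import math
from collections import Counter


def window_statistics(coeffs, bins=16):
    """Statistics of the wavelet coefficients falling in one window.

    coeffs is a flat list holding the W*H coefficients of the window.
    """
    n = len(coeffs)
    average = sum(coeffs) / n
    energy = sum(c * c for c in coeffs) / n
    std = math.sqrt(sum((average - c) ** 2 for c in coeffs) / n)

    # Assumption: probabilities come from a fixed-size histogram of the window.
    lo, hi = min(coeffs), max(coeffs)
    width = (hi - lo) / bins or 1.0          # avoid a zero bin width on flat windows
    counts = Counter(min(int((c - lo) / width), bins - 1) for c in coeffs)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())

    return {"average": average, "energy": energy, "std": std, "entropy": entropy}
```

A window whose coefficients alternate between 1 and -1 has zero average but unit standard deviation, which is consistent with the observation in the next section that the average discriminates poorly on oscillatory ridge patterns.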
3 Experimental Results

The proposed approach has been tested with two fingerprint databases: FVC 2006 DB2 B [16] (10 fingers, 12 samples per finger), and our own database acquired with the FX3000 sensor of Biometrika [17] (7 fingers, 6 samples per finger). The resolution of the images in both databases is 569 dpi. In order to test the method, a visual application was developed. This application shows the magnitude of the statistics together with the fingerprint. The wavelet decomposition was computed up to level 3 with Haar and Daubechies dyadic wavelets. The statistics were computed over windows of 5x7 pixels for all the sub-bands in each wavelet level. Some larger windows (10x10 pixels) were used for visualization. These visualization windows show the average of the statistic over all the coefficient bands of all the wavelet levels. In order to give more relevance in the average to certain coefficient bands, each of the 4 coefficient bands of each level was weighted by a different factor. A color palette was assigned to the magnitude of the statistics (red to blue, with red the highest magnitude).

Haar wavelets are the simplest wavelet family, but the results achieved with them were worse than the results achieved with Daubechies wavelets. In the experiments, the energy and the average of the windows did not show good discriminatory properties for region selection. The best results were achieved with the standard deviation of the level 2 Daubechies wavelet coefficients, considering only mid-range frequency coefficients for both levels (LH1, HL1, LH2, and HL2). Our experiments have centered on mid-range frequencies because a fingerprint presents an oscillatory pattern with higher-order coefficients in these bands. However, considering only level 1 coefficients (LH1, HL1), the discrimination was not appreciable. The mid-range frequency coefficients of level 2 (LH2, HL2) were then included in the average, achieving higher discrimination.
If all wavelet coefficients are included in the average, the discrimination is again reduced. This demonstrates that the crucial information is concentrated in these sub-bands. Moreover, for this statistic, including level 3 coefficients reduces the discrimination, which demonstrates that a level 2 wavelet computation is enough to distinguish the area with relevant information.

Figures 1, 2 and 3 show the region selection results for several samples of the same finger in our own database and in the FVC 2006 DB2 B database. In these figures the highest-magnitude blocks (red in our palette) are the darkest ones. To make them easy to detect in the images, the highest-magnitude region has been delimited. For all fingerprints in both databases, the core was always inside the region with maximum magnitude (darkest blocks in Figures 1, 2 and 3). Figures 2 and 3 show that the proposed region selection is effective even with low-quality fingerprints. Higher-precision core detection can be achieved with finer windowing. The shade that appears near the image borders in all fingerprints of Figures 1, 2 and 3 is present in the initial image acquired from the sensor. The images shown in Figures 1, 2 and 3 are the superposition of the fingerprints and the average of the statistics of the wavelet coefficients. Note that this shade is not present in the average windowing.
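The selection rule described above (a weighted average of the standard deviation of the mid-range bands, followed by picking the window with the highest magnitude) can be sketched as below. This is an illustration under stated assumptions, not the application used in the experiments: `window_std` and `select_region` are hypothetical names, the window side is assumed to be a multiple of 2 raised to the highest level so that windows align with coefficients, and equal weights are used unless others are supplied, since the actual weighting factors are not given in the text.

```python
import math


def window_std(band, level, x0, y0, win):
    """Standard deviation of the coefficients of one sub-band that map back to
    the win x win pixel window whose top-left corner is (x0, y0)."""
    scale = 2 ** level                       # one coefficient covers scale x scale pixels
    rows = range(y0 // scale, (y0 + win) // scale)
    cols = range(x0 // scale, (x0 + win) // scale)
    vals = [band[r][c] for r in rows for c in cols]
    mean = sum(vals) / len(vals)
    return math.sqrt(sum((mean - v) ** 2 for v in vals) / len(vals))


def select_region(bands, width, height, win=8, weights=None):
    """bands maps (name, level) to a 2D list of coefficients, e.g. ("LH", 1).

    Returns the top-left corner of the best window and the full score map.
    """
    weights = weights or {key: 1.0 for key in bands}
    total = sum(weights[key] for key in bands)
    scores = {}
    for y0 in range(0, height - win + 1, win):
        for x0 in range(0, width - win + 1, win):
            weighted = sum(weights[key] * window_std(band, key[1], x0, y0, win)
                           for key, band in bands.items())
            scores[(x0, y0)] = weighted / total
    best = max(scores, key=scores.get)       # window with the highest magnitude
    return best, scores
```

Passing only the keys ("LH", 1), ("HL", 1), ("LH", 2) and ("HL", 2) reproduces the band choice reported as giving the best discrimination.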
Fig. 1. Region selection for five samples of finger 2 (own database)
Fig. 2. Region selection for five samples of finger 4 (FVC 2006 DB2 B)
The entropy of the Daubechies wavelet level 3 coefficients allows the fingerprint to be separated from the background. In this case, increasing the level gave higher discrimination between fingerprint and background. The level 3 wavelet gave perfect background detection, so a higher wavelet level is not required to increase accuracy. Mid-range coefficients of all the bands were used (LH1, HL1, LH2, HL2, LH3 and HL3). On the other hand, including all coefficients of the three sub-bands decreases the accuracy of the fingerprint detection.
Fig. 3. Region selection for five samples of finger 8 (FVC 2006 DB2 B)
Fig. 4. Background detection for one sample of all the fingers (own database)
Figures 4 and 5 show the proposed background detection for one sample of all the fingers of the two databases. In this case the color palette is reversed: the lightest blocks represent the highest magnitude. The proposed background detection is effective for all the fingerprints of both databases. The uniformity of the edge detection decreases slightly with low-quality fingerprints, but even for those fingerprints the proposed method performs very well. For all the samples tested, nearly all the fingerprint area had the highest entropy magnitude and only marginal border windows fell in the next palette color.
Fig. 5. Background detection for one sample of all the fingers (FVC 2006 DB2 B)
Wavelet computation requires high resolution fingerprint images. To overcome this limitation, some tests have been made with low resolution sensors. These images were preprocessed incorporating interpolated pixels in order to make possible the computation of level 3 wavelets. The results achieved with these low resolution images were similar to the high resolution ones. Thus, the approach is also applicable to low resolution images.
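The text does not state which interpolation was used to enlarge the low-resolution images. As one plausible reading, the sketch below enlarges an image by an integer factor with bilinear interpolation; the function name `upsample_bilinear` and the choice of bilinear weights are assumptions.

```python
def upsample_bilinear(img, factor=2):
    """Enlarge a list-of-rows image by an integer factor, filling the new
    pixels by bilinear interpolation between the original neighbours."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(rows * factor):
        y = min(r / factor, rows - 1)        # source row coordinate, clamped at the border
        y0 = int(y)
        y1 = min(y0 + 1, rows - 1)
        fy = y - y0
        out_row = []
        for c in range(cols * factor):
            x = min(c / factor, cols - 1)    # source column coordinate, clamped
            x0 = int(x)
            x1 = min(x0 + 1, cols - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            out_row.append(top * (1 - fy) + bottom * fy)
        out.append(out_row)
    return out
```

Doubling the image size once adds one usable dyadic level, since each level halves the width and height of the band it decomposes.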
4 Conclusions

In this paper, a new method for extracting relevant regions of fingerprints is presented. The experiments performed with the proposed method have demonstrated excellent results for both region selection and background detection. For all the fingerprints tested in our experiments the core was inside the selected region, and the method showed very good discrimination in most cases. Finer average windowing can lead to core detection with higher precision. The results of this work show that this method is fast, simple and effective for selecting a relevant region of a fingerprint before applying any feature extraction, preprocessing or matching algorithm.

Acknowledgments. This work was supported in part by the Ministerio de Ciencia y Tecnología (Spain) under Project TIC2003-01793.
References

1. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of fingerprint recognition, pp. 83–171. Springer, Heidelberg (2003)
2. Walker, J.S.: A primer on wavelets and their scientific applications. CRC Press, Boca Raton, USA (1999)
3. Hatami, S., Hosseini, R., Kamarei, M., Ahmadi, H.: Wavelet based fingerprint image enhancement. In: IEEE International Symposium on Circuits and Systems, ISCAS 2005, vol. 5, pp. 4610–4613 (2005)
4. Wen, M.-L., Liang, Y., Pan, Q., Zhang, H.-C.: A Gabor filter based fingerprint enhancement algorithm in wavelet domain. In: IEEE International Symposium on Communications and Information Technology, ISCIT 2005, pp. 1468–1471 (2005)
5. Zhang, W.-P., Wang, Q.-R., Tang, Y.Y.: A wavelet-based method for fingerprint image enhancement. In: Proceedings of the 2002 International Conference on Machine Learning and Cybernetics, vol. 4, pp. 1973–1977 (2002)
6. You, X., Yang, J., Tang, Y.Y., Fang, B., Li, L.: Skeletonization of Fingerprint Based-on Modulus Minima of Wavelet Transform. In: Li, S.Z., Lai, J.-H., Tan, T., Feng, G.-C., Wang, Y. (eds.) SINOBIOMETRICS 2004. LNCS, vol. 3338. Springer, Heidelberg (2004)
7. Moon, Y.S., Chen, J.S., Chan, K.C., So, K., Woo, K.C.: Wavelet based fingerprint liveness detection. Electronics Letters 41(20), 1112–1113 (2005)
8. Shuckers, S., Abhyankar, A.: Detecting liveness in fingerprint scanners using wavelets: Results of the test Dataset. In: Maltoni, D., Jain, A.K. (eds.) BioAW 2004. LNCS, vol. 3087. Springer, Heidelberg (2004)
9. Tico, M., Immonen, E., Ramo, P., Kuosmanen, P., Saarinen, J.: Fingerprint recognition using wavelet features. In: The IEEE International Symposium on Circuits and Systems, ISCAS 2001, vol. 2, pp. 21–24 (2001)
10. Tico, M., Kuosmanen, P., Saarinen, J.: Wavelet domain features for fingerprint recognition. Electronics Letters 37(1), 21–22 (2001)
11. Mokju, M., Abu-Bakar, S.A.R.: Fingerprint matching based on directional image constructed using expanded Haar wavelet transform. In: Proceedings of the International Conference on Computer Graphics, Imaging and Visualization, CGIV 2004, pp. 149–152 (2004)
12. Fung, Y.-H., Chan, Y.-H.: Fingerprint recognition with improved wavelet domain features. In: Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 33–36 (2004)
13. Selvaraj, H., Arivazhagan, S., Ganesan, L.: Fingerprint verification using wavelet transform. In: Proceedings of the Fifth International Conference on Computational Intelligence and Multimedia Applications, ICCIMA 2003, pp. 430–435 (2003)
14. Huang, K., Aviyente, S.: Choosing best basis in wavelet packets for fingerprint matching. In: International Conference on Image Processing, ICIP '04, vol. 2, pp. 1249–1252 (2004)
15. Huang, K., Aviyente, S.: Combining generalized Gaussian density and energy distribution in wavelet analysis for texture classification. In: Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 2094–2098 (2004)
16. http://bias.csr.unibo.it/fvc2006/databases.asp
17. http://www.biometrika.it/eng/fx3000.html
Face Shape Recovery and Recognition Using a Surface Gradient Based Statistical Model

Mario Castelán¹ and Edwin R. Hancock²

¹ Centro de Investigación y Estudios Avanzados del I.P.N., Ramos Arizpe 25900, Coahuila, Mexico
[email protected]
² The University of York, Heslington, York YO10 5DD, United Kingdom
[email protected]

Abstract. In previous work [5] we have identified the gradient of the surface as the best representation for constructing Cartesian models of faces. This representation proved capable of capturing variations in facial shape over a sample of training data. The resulting statistical model can be fitted to Lambertian data using a simple non-exhaustive parameter adjustment procedure. In this paper we test the ability of the surface gradient-based model in two directions. First, we deal with non-Lambertian images. Second, we use the model for face recognition purposes. Experiments with real world images suggest that the surface gradient model with the proposed parameter search can be used for accurate face shape recovery, showing a potential for face recognition applications.
1 Introduction
Acquiring surface models of faces is an important problem in computer vision and visualization, since it has significant applications in biometrics, computer games and production graphics. One of the most appealing methods is to use shape-from-shading (SFS) [9], since this is a non-invasive process which mimics the capabilities of the human vision system [10]. For face shape recovery, however, the use of SFS has proved to be an elusive task, since the concave-convex ambiguity can result in the inversion of important features such as the nose. To overcome this problem, domain specific constraints have proved to be essential to improve the quality of the overall reconstructions, and the recovery of accurately detailed facial surfaces still proves to be a challenge. Despite the improvements achieved by using domain specific information, it is fair to say that no SFS scheme has been demonstrated to work as well as statistical SFS. The first to propose this technique were Atick et al. [2]. In their work, shape coefficients are recovered so as to satisfy image irradiance constraints. The eigenmode surfaces, which they referred to as eigenheads, were parameterized in cylindrical coordinates. Unfortunately, a computationally expensive parameter search had to be carried out, since the fitting procedure involves minimizing the error between the rendered facial surface and the observed intensity image.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 399–407, 2007. © Springer-Verlag Berlin Heidelberg 2007
M. Castelán and E.R. Hancock
Later, Blanz and Vetter [4] decoupled the surface texture and shape representations, and performed principal component analysis (PCA) on the two components separately. Using full facial feature correspondences in cylindrical coordinates, they developed a model that could be fitted to image data. Their framework could be applied under pose and illumination changes. Unfortunately, linear combinations of shape and texture had to be constructed separately for the eyes, nose, mouth and surrounding areas. In a manner similar to that of Atick et al., the fitting procedure aims to find a vector of parameters for texture and shape that minimizes the error between the input image and the rendered reconstruction. The results delivered by fitting this morphable model proved to be accurate enough to generate photo-realistic views from an input image, though at the sacrifice of efficiency and simplicity.

Cartesian coordinates were recently used by Dovgard and Basri [6] to construct a statistical model of faces using symmetry constraints. To overcome the problem of alignment errors they expressed the surface gradient in terms of a set of deformation coefficients. This allows shape-from-shading to be transformed into a linear system of equations that can be simply solved for the shape coefficients, and then used to reconstruct the surface height function for the face. Although accuracy is sacrificed, the method is computationally efficient.

Height maps, however, are not the only way to represent 3D information in Cartesian coordinates. Alternative encodings can be drawn from 2.5D information such as the partial derivatives of a surface. The 2.5D representation is less appealing because it must be integrated to recover a surface; however, because of the image irradiance equation, it is closer to the raw image brightness data than a height surface.
In previous work [5] we have identified the gradient of the surface as the best representation for constructing Cartesian models of faces. This representation proved capable of capturing variations in facial shape over a sample of training data. We showed how the model can be fitted to image brightness data using geometric constraints on surface normal direction provided by Lambert’s law [12]. In this paper we test the ability of the surface gradient-based model in two directions. First, we deal with variable albedo images. Second, we use the model for face recognition purposes. Experiments with real world images suggest that the surface gradient model with the proposed parameter search can be used for accurate face shape recovery, showing a potential for face recognition applications. The paper is organized as follows: in Section 2 we explain how the surface gradient model is constructed, Section 3 describes the parameter fitting procedure and Section 4 presents an experimental evaluation of the model. We finally present conclusions and future work in Section 5.
2 The Surface Gradient Based Model
For every visible point on a surface, there exists a corresponding orientation which is usually represented by either surface normal, surface gradient or the
Face Shape Recovery and Recognition
azimuth and zenith angles of the surface normal. In contrast to height data, directional information cannot be used to generate novel views in a straightforward way. However, given the illumination direction and the surface albedo properties, the surface normal plays the central role in the surface radiance generation process. The surface gradient representation is based on the directional partial derivatives of the height function, $p = \partial Z(x,y)/\partial x$ and $q = \partial Z(x,y)/\partial y$. The set of first partial derivatives of a surface is also known as the gradient space. This is a 2D representation of the orientation of visible points on the surface.

To explain how the surface gradient model was constructed, we commence with the representation of the surface gradient w.r.t. x. Each of the N surfaces in the training set may be represented by a long vector $\mathbf{p}_k$ of p values. The mean vector $\bar{\mathbf{p}}$ is given by $\bar{\mathbf{p}} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{p}_k$. We form the matrix of centered long vectors $\mathbf{P} = [(\mathbf{p}_1 - \bar{\mathbf{p}}) \,|\, (\mathbf{p}_2 - \bar{\mathbf{p}}) \,|\, \cdots \,|\, (\mathbf{p}_N - \bar{\mathbf{p}})]$. We calculate the eigenvectors $\hat{\mathbf{u}}^p_k$ of the matrix $\mathbf{P}^T\mathbf{P}$ and construct the model

$$\mathbf{M}^p = \mathbf{P}\hat{\mathbf{U}}^p, \qquad (1)$$

where $\hat{\mathbf{U}}^p = [\hat{\mathbf{u}}^p_1 \,|\, \hat{\mathbf{u}}^p_2 \,|\, \cdots \,|\, \hat{\mathbf{u}}^p_N]$. An out-of-training-sample centered long vector of p values, $\breve{\mathbf{p}} - \bar{\mathbf{p}}$, can be projected onto the model and represented using the vector of coefficients

$$\mathbf{b}^p = (\mathbf{M}^p)^T(\breve{\mathbf{p}} - \bar{\mathbf{p}}). \qquad (2)$$

Let us now extend the above notation to the surface gradient w.r.t. y. We use the superscript q to refer to this representation. Using this notation, the statistical models for the surface gradient are

$$\mathbf{M}^p = \mathbf{P}\hat{\mathbf{U}}^p, \qquad \mathbf{M}^q = \mathbf{Q}\hat{\mathbf{U}}^q. \qquad (3)$$

3 The Parameter Fitting Procedure
In this section we explain the method used to fit the parameters of the models to image brightness data so that the irradiance equation is satisfied. This algorithm is similar to that proposed by Smith and Hancock [11] and draws ideas from the geometric shape-from-shading framework of Worthington and Hancock [12]. However, here we add an integrability enforcement step to the parameter fitting procedure. This is done using the global integration method of Frankot and Chellappa [7].

According to the geometric approach to SFS [12], the image irradiance equation is treated as a hard constraint. Compliance with Lambert's law is effected by rotating an estimated surface normal onto the nearest location on the local irradiance cone. The rotated on-cone surface normal is given by

$$\mathbf{n}' = \Psi\mathbf{n}, \qquad (4)$$

where $\Psi$ is a rotation matrix computed from the cone apex angle and the angle between the current surface normal direction $\mathbf{n}$ at a given pixel in the image and the light source direction $\mathbf{s}$. We have used this technique to fit the surface gradient model to brightness images of faces. If $\mathbf{n}$ is the field of initial surface normals estimated from the brightness data¹, then the iterative steps in the fitting process are defined as follows:

1. Transform the field of normals $\mathbf{n}$ into a surface gradient representation.
2. Subtract the mean shape and calculate the corresponding set of shape parameters using
$$\mathbf{b}^p = (\mathbf{M}^p)^T(\breve{\mathbf{p}} - \bar{\mathbf{p}}) \quad \text{and} \quad \mathbf{b}^q = (\mathbf{M}^q)^T(\breve{\mathbf{q}} - \bar{\mathbf{q}}). \qquad (5)$$
3. Recover the surface shape from the best-fit parameters using
$$\breve{\mathbf{p}} \approx \bar{\mathbf{p}} + \mathbf{M}^p\mathbf{b}^p \quad \text{and} \quad \breve{\mathbf{q}} \approx \bar{\mathbf{q}} + \mathbf{M}^q\mathbf{b}^q. \qquad (6)$$
4. Apply the integrability constraint [7]. This is done by generating a surface from the best-fit parameters, $\breve{\mathbf{p}}$ and $\breve{\mathbf{q}}$.
5. From the reconstructed surface, calculate a field of surface normals. This is done by performing a bi-cubic patch fit to the surface height data. We enforce the irradiance constraint by rotating the recovered surface normals onto the irradiance cone using Equation 4. We then return to step 1.

Instead of searching for valid linear eigenmode combinations that minimize the brightness error using exhaustive search, the parameter fitting procedure attempts to minimize the brightness error using simple geometric operations that satisfy the irradiance constraint provided by Lambert's law. The method hence provides an intuitive and straightforward way of adjusting the shape coefficients to image brightness data.
4 Experiments
Although our main interest in this paper has been the use of the surface gradient representation for reconstructing facial shape, we have also performed some relatively limited experiments aimed at exploring its potential for recognition. To this end we have explored the distribution of the recovered surfaces. We have used multidimensional scaling (MDS) [13] to embed the faces in a low-dimensional pattern space. We built a 200 eigenfaces model based on the ground truth database and determined the dissimilarity measure in the following manner:

¹ We used a standard image gradient initialization. The input brightness data was taken from the Lambertian re-illumination of the subjects in the database. The face database used for building the models was provided by the Max-Planck Institute for Biological Cybernetics in Tuebingen, Germany [3].
Fig. 1. Plot of the first two coordinates defined through MDS on the dissimilarity matrix of the inter-distances of the ground truth database (points) and the recovered surfaces (stars)
1. Calculate the matrix B of coefficient vectors for each in-training element in the database (the columns of B are the parameter vectors b representing the in-training sample faces).
2. Calculate the linear correlation coefficient between the columns of B. A correlation of 1 indicates a dissimilarity of 0; a correlation of -1 indicates a dissimilarity of 1.

We used only the first 40 coefficients for each vector, and these account for at least 90% of the total variance of the model. We repeated this procedure using the recovered height surfaces from the 200 out-of-training-sample examples obtained after the 20th iteration of the algorithm to construct another height model, from which the dissimilarity matrix was calculated in the same way as explained above. In Figure 1 the results of performing MDS are shown as points and stars for the ground truth and the recovered surfaces respectively. The distributions of dissimilarities are very similar for both sets of data. Note how the spatial distribution of faces reveals clusters of similar characteristics (nose shape, face shape, weight, gender, and even misaligned data). This suggests that the height
surfaces recovered using the method reflect the same shape distribution as the ground-truth parameters. In other words, the output of the method may be suitable for the purposes of recognition.

We also present experiments with a number of real world face images. These images are drawn from the Yale B database [8] and are disjoint from the data used to train the statistical model. In the images, the faces are in the frontal pose and were illuminated by a point light source situated approximately in the viewer direction. The surface recovery results after twenty iterations are shown in Figure 2. From left to right we show the input image, the albedo map, the frontal Lambertian re-illumination and re-illuminations using the albedo map. Note that the eye and eyebrow areas were manually removed to avoid reconstruction errors. However, there are errors due to the non-Lambertian nature of the input images. These appear as instabilities around the boundaries of the face and in the proximity of the mouth and nose. Although the method struggled to recover the shape of the eye sockets, the overall structure of the face is well reconstructed. Moreover, the eyebrow location, nose length and width of the face clearly match those of the input images. In addition, Figure 2 (last row) shows profile and close-to-profile views of the recovered surface with the albedo and reflectance texture-mapped onto the surface. There are a number of features to note from the figure. First, the reconstructed images agree well with the input. Second, the overall shape of the profile view is subjectively convincing.

To complement the experiments on facial reconstruction, we have performed a simple face recognition investigation using the recovered surfaces. We tested the stability of the recognition results to changes in illumination. An eigenface model was generated by re-illuminating the facial surfaces recovered from the subjects in the Yale database.
The re-illuminations were obtained using Lambert's law and an albedo estimate. The azimuth angle of the light source direction ranged from −π to π radians, sampled at 12 equally spaced positions. The zenith angle varied from −π/2 to π/2 radians, again sampled at 12 equally spaced locations. For each of the five subjects there were 12 × 12 = 144 re-illuminations, and the resulting 865 re-illuminations were used to construct a statistical model. Each re-illuminated training image was characterized by its parameter vector in the eigenspace. For the test set we used the five subjects from the Yale database in a frontal pose illuminated by 64 different light source directions. We used these 5 × 64 = 320 simulated views for the purposes of recognition by the model. Note that the training set was constructed using renderings from the recovered surfaces (i.e., like the ones shown in the last four columns of Figure 2). The test images correspond to real images of people illuminated from 64 different light source directions, and provided 320 test images. We computed the Euclidean distance between each parameter vector for the in-training (generated) synthetic re-illuminations and the out-of-training (real world) data. We located the training-set element that minimized the parameter-vector distance to the presented
Fig. 2. Examples from the Yale B database used for face shape recovery. From left to right: input image, albedo map, frontal Lambertian re-illumination and re-illuminations with the albedo map. Generated views of the recovered surfaces with the input images mapped onto the surfaces (last row).

Table 1. Incorrect matches per subject

Subject number                 1    2    3    4    5
Number of incorrect matches    3   14    7    6    7
out-of-training-set example. The identity of this element was used to classify the out-of-training-set example. In this manner we aimed to recognize views of the subject illuminated from different light source directions. Recall that light-source effects are responsible for more variability in the appearance of facial images than changes in shape or identity [1]. The results of the recognition tests are shown in Table 1. The average error over the 320 test images was 11.5%, which means that almost 90% of the faces resulted in a correct match, and gives an average of 7.4 incorrect matches per subject over the 64 different light source directions. In Figure 3 we
Fig. 3. Examples of entries that were incorrectly classified in the recognition scheme
present examples of the class of images that gives a common incorrect match in the verification process. Note that in all the cases the light source direction is highly slanted.
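The two comparisons used in this section, the correlation-based dissimilarity that feeds MDS and the nearest-neighbour match in the eigenspace, can be sketched as below. The text fixes only the two endpoints of the dissimilarity (correlation 1 gives 0, correlation -1 gives 1); the linear map (1 - r) / 2 between them is an assumption, as are the function names and the representation of the training set as a list of (identity, parameter vector) pairs.

```python
import math


def correlation(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def dissimilarity(b1, b2, k=40):
    """Dissimilarity of two parameter vectors using their first k coefficients.

    Assumed linear map: correlation 1 gives 0, correlation -1 gives 1.
    """
    return (1.0 - correlation(b1[:k], b2[:k])) / 2.0


def nearest_identity(training, probe):
    """Identity of the training parameter vector closest to the probe
    in Euclidean distance. training is a list of (identity, vector) pairs."""
    return min(training, key=lambda item: math.dist(item[1], probe))[0]
```

With the counts of Table 1, the 3 + 14 + 7 + 6 + 7 = 37 incorrect matches out of 320 probes give the error rate of roughly 11.5% quoted above.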
5 Conclusions
We have demonstrated that 3D statistical models of faces based on Cartesian representations of orientation data can work accurately without special heuristics. Experiments also suggest that face verification can be achieved using surfaces recovered from non-Lambertian inputs. As future work we plan to explore the behavior of the surface gradient representation with alternative methods for shape coefficient adjustment, as well as more robust handling of albedo changes.
References

1. Adini, Y., Moses, Y., Ullman, S.: Face recognition: The problem of compensating for changes in illumination direction. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 721–732 (1997)
2. Atick, J., Griffin, P., Redlich, N.: Statistical approach to shape from shading: Reconstruction of three-dimensional face surfaces from single two-dimensional images. Neural Computation 8, 1321–1340 (1996)
3. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: SIGGRAPH '99, pp. 187–194 (1999)
4. Blanz, V., Vetter, T.: Face recognition based on fitting a 3d morphable model. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1063–1074 (2003)
5. Castelán, M., Hancock, E.R.: Using cartesian models of faces with a data-driven and integrable fitting framework. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4142, pp. 134–145. Springer, Heidelberg (2006)
6. Dovgard, R., Basri, R.: Statistical symmetric shape from shading for 3d structure recovery of faces. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3022, pp. 99–113. Springer, Heidelberg (2004)
7. Frankot, R., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 10, 438–451 (1988)
8. Georghiades, A., Belhumeur, D., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell., pp. 634–660 (2001)
9. Horn, B., Brooks, M.: Shape from Shading. MIT Press, Cambridge (1989)
10. Johnston, A., Hill, H., Carman, N.: Recognising faces: Effects of lighting direction, inversion and brightness reversal. Perception 21, 365–375 (1992)
11. Smith, W.A.P., Hancock, E.R.: Recovering facial shape using a statistical model of surface normal direction. IEEE Trans. Pattern Anal. Mach. Intell. 28(12), 1914–1930 (2006)
12. Worthington, P.L., Hancock, E.R.: New constraints on data-closeness and needle map consistency for shape-from-shading. IEEE Trans. Pattern Anal. Mach. Intell. 21(12), 1250–1267 (1999)
13. Young, F.W., Hamer, R.M.: Theory and Applications of Multidimensional Scaling. Erlbaum Associates, Hillsdale, NJ (1994)
Representation of Facial Features by Catmull-Rom Splines

Marco Maggini, Stefano Melacci, and Lorenzo Sarti
DII, Università degli Studi di Siena, Via Roma, 56, 53100 Siena, Italy
{maggini, mela, sarti}@dii.unisi.it

Abstract. This paper describes a technique for the representation of the 2D frontal view of faces, based on Catmull-Rom splines. It takes advantage of a priori knowledge about the face structure and of the properties of Catmull-Rom splines, such as interpolation, smoothness and local control, in order to define a set of key points that correspond among different faces. Moreover, it can compactly describe the whole face even if the face features have not been completely localized. The proposed model has been tested in practical contexts of face analysis, and promising qualitative results are included to illustrate its versatility and accuracy.
1 Introduction

Face analysis is one of the most important tasks in computer vision. Human perception of faces has been studied by many scientists and psychologists, and many machine-based approaches that try to simulate human behavior have been proposed during the last two decades. They involve automatic face recognition, localization and tracking [1], detection and classification of facial expressions [2], and many more.

In feature-based face analysis, the definition of a suitable face representation model plays a crucial role. Such a representation should be compact and easy to compute. Common models are described by means of feature points, distances between face parts (eyes, mouth, nose, etc.) or parametric descriptions of their contours [3]. In shape matching and retrieval tasks, descriptors based on parametric curves, like splines, have been shown to be very versatile [4,5], but finding correspondences between shape boundaries is computationally expensive.

In frontal-view face pictures, contours of face parts can be found through automatic image segmentation or by manually placing a set of landmarks. In both cases, the whole contours are generally not available. This is frequently due to edge-based segmentation techniques, to particular lighting conditions or to partial occlusions. In such situations, matching parts of different faces becomes a very challenging task.

In this paper we present a feature-point-based face representation that is obtained from the splines that describe the contours of each face part. This technique creates a predefined and consistent face model defined by a set of points that correspond among different faces, even if the localization of the contours is incomplete. The use of a priori knowledge about the face structure and of an appropriate parameterization makes the matching process fast and accurate.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 408–415, 2007. © Springer-Verlag Berlin Heidelberg 2007

We exploit a set of cubic interpolating splines, known in the literature as Catmull-Rom splines, to approximate the missing sections of the localized contours and to interpolate the existing ones. Interpolating splines have been shown to be appropriate for the description and reconstruction of partially localized shape contours [6], and Catmull-Rom splines, although simple and easy to build, have been demonstrated to have optimum shape-preserving properties [7].

This paper is organized as follows. Section 2 presents Catmull-Rom splines and the chosen parameterization. Section 3 describes the face representation model and its extraction process. Section 4 shows some applications of this model and qualitative results in practical contexts; finally, Section 5 reports some conclusions.
2 Catmull-Rom Splines
The contours of the face parts, which can be extracted automatically by segmenting the face image or manually by choosing some landmark points, can be represented using interpolating curves. In this work this representation is obtained using Catmull-Rom splines (CRSs) [8]. CRSs guarantee smoothness and continuity; moreover, they are invariant under affine transformations and allow local control of the shape of the curves. CRSs belong to the class of cubic interpolating splines with a simple piecewise construction; they are frequently used in computer-aided design and animation, and can be evaluated with a simple and fast recursive algorithm [9].

Each segment of a CRS interpolates the curve between two points $p_i$ and $p_{i+1}$, but it is defined by four control points $p_{i-1}, p_i, p_{i+1}, p_{i+2}$. In matrix form, a CRS segment is described as

$$ q(t) = \frac{1}{2} \begin{bmatrix} t^3 & t^2 & t & 1 \end{bmatrix} \begin{bmatrix} -1 & 3 & -3 & 1 \\ 2 & -5 & 4 & -1 \\ -1 & 0 & 1 & 0 \\ 0 & 2 & 0 & 0 \end{bmatrix} \begin{bmatrix} p_{i-1} \\ p_i \\ p_{i+1} \\ p_{i+2} \end{bmatrix}, \qquad (1) $$

where $t \in [0, 1]$ describes the curve from $p_i$ ($t = 0$) to $p_{i+1}$ ($t = 1$). Moreover, the tangents at points $p_i$ and $p_{i+1}$ are computed as

$$ d_i = \frac{1}{2}(p_{i+1} - p_{i-1}), \qquad d_{i+1} = \frac{1}{2}(p_{i+2} - p_i). $$

Given a set of $n$ control points, with $n \geq 4$, a CRS interpolates all the points (see footnote 1), has $C^1$ continuity, and offers local control, meaning that a change in the position of a control point does not require recomputing the whole curve.

Footnote 1: This is always true in the case of closed curves; otherwise the first and the last control points are excluded from the open curve. Those two points can be added to the curve by defining the starting and ending tangents.

Other classes of splines frequently used in shape representation [4,5], like cubic B-splines, offer $C^2$ continuity and local control; nevertheless, they are approximating curves. Generally such splines can be forced to interpolate some points, but this constraint increases the computational cost needed to estimate them. CRSs directly yield interpolating curves, and their $C^1$ continuity can be considered adequate to represent face parts.

CRSs can use an unbounded number of control points to describe either closed or open curves. In the latter case, the tangents at points $p_0$ and $p_{n-1}$ are not defined; in our implementation we set the endpoint tangents to $\frac{1}{2}(p_1 - p_0)$ and $\frac{1}{2}(p_{n-1} - p_{n-2})$, respectively. Finally, CRSs do not satisfy the convex-hull property: the curve is not necessarily contained within the convex hull of the given set of points. However, as described in the next section, the relaxation of the convex-hull property can be a useful feature in the representation of face parts.

A single CRS segment is defined by the parameter $t$, which ranges in $[0, 1]$ (see Eq. 1). Unfortunately, equally spaced samples in the parameter space are not equally spaced along the curve. To overcome this limitation, we redefine CRSs using arc-length parameterization [10]. The normalized arc-length parameter $s \in [0, 1]$ is defined as a function of $t$, $s = L(t)$, where $L$ is a bijective and monotonically increasing function. Hence, a CRS segment $q(t) = [x(t), y(t)]$ with $t \in [0, 1]$ can be expressed w.r.t. $s$ as $r(s) = [x(L^{-1}(s)), y(L^{-1}(s))]$, where $s \in [0, 1]$. The arc length from 0 to $t$ can be computed as
$$ L(t) = \int_0^t \sqrt{(x'(t))^2 + (y'(t))^2} \, dt. $$
Since $t$ ranges between 0 and 1, the length of the whole segment is $L(1)$. To find the spline point that corresponds to a given arc length $s'$, the function $t = L^{-1}(s')$ must be numerically approximated.

Considering a CRS that interpolates a set of $n$ control points, we can describe the whole piecewise-cubic curve using a global parameter $gt \in [0, 1]$. If $gt_i$ is the global parameter value at control point $i$, a value $gt$ refers to the spline segment for which $gt_i \leq gt < gt_{i+1}$ holds. The local segment parameter can be computed by linearly interpolating between $gt_i$ and $gt_{i+1}$. If we use the arc-length parameterization, the global parameter $gs_i$ can be computed recursively as

$$ gs_0 = 0, \qquad gs_i = gs_{i-1} + \frac{L_i(1)}{\sum_{j=1}^{k} L_j(1)}, \quad \text{with } i = 1, \ldots, (n-1), $$

where $L_i(1)$ is the arc length of the segment between control points $i-1$ and $i$, while $k = n$ or $k = n-1$ for closed or open curves, respectively.
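The construction of this section can be illustrated with a short Python sketch. It is only a minimal illustration, not the authors' implementation: the function names (`crs_point`, `segment_length`, `global_arclength_params`) are hypothetical, the curve is assumed to be closed (so k = n), and the arc length L(1) is approximated by summing the chords of a dense polyline rather than by a specific quadrature rule.

```python
import numpy as np

# Basis matrix of Eq. (1), with the 1/2 factor folded in.
CR_BASIS = 0.5 * np.array([
    [-1.0,  3.0, -3.0,  1.0],
    [ 2.0, -5.0,  4.0, -1.0],
    [-1.0,  0.0,  1.0,  0.0],
    [ 0.0,  2.0,  0.0,  0.0],
])


def crs_point(p_prev, p_i, p_next, p_next2, t):
    """Evaluate q(t) of Eq. (1) on the segment from p_i (t=0) to p_next (t=1)."""
    powers = np.array([t ** 3, t ** 2, t, 1.0])
    controls = np.array([p_prev, p_i, p_next, p_next2], dtype=float)
    return powers @ CR_BASIS @ controls


def segment_length(p_prev, p_i, p_next, p_next2, samples=200):
    """Approximate L(1) by the length of a dense polyline (an assumed scheme)."""
    ts = np.linspace(0.0, 1.0, samples + 1)
    pts = np.array([crs_point(p_prev, p_i, p_next, p_next2, t) for t in ts])
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))


def global_arclength_params(points):
    """Return gs_0 .. gs_{n-1} for a closed CRS through `points` (k = n)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Segment j joins control point j-1 to control point j; indices wrap around.
    lengths = np.array([
        segment_length(pts[(j - 2) % n], pts[(j - 1) % n],
                       pts[j % n], pts[(j + 1) % n])
        for j in range(1, n + 1)
    ])
    # gs_i = gs_{i-1} + L_i(1) / sum_j L_j(1), for i = 1 .. n-1.
    return np.concatenate(([0.0], np.cumsum(lengths[:-1]) / lengths.sum()))
```

For four control points placed symmetrically on a square, every segment has the same length, so the global parameters at the control points are equally spaced in [0, 1).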
3 Representation of Facial Features
In order to extract a face representation for shape-based face analysis, we have to deal with an incomplete description of the contours of the face parts. In particular, whether an image segmentation process is used or landmarks are selected manually, contours are represented by sets of pixels that are not equally spaced along the boundaries. Since these pixels do not correspond among different faces, and their number is not predefined, the face parts that they describe are difficult to compare (see Fig. 1(a)). In the general shape matching framework, defining correspondences between shapes involves normalization, shape alignment, and a search for pairs of points that match a specified criterion [5,11]. For example, contour descriptors based on parametric curves require extracting samples from each curve and minimizing an error function in order to find the best match between pairs of shapes [4,5], leading to high computational costs. To overcome these problems, we propose a fixed face representation model, defined by a set of feature points that can be easily extracted from an incomplete, discontinuous and non-uniform description of the boundaries of the face parts. This approach exploits information on the common shape of those boundaries in order to efficiently derive a fixed set of points that allows us to easily compare different faces. Points belonging to different contours can be matched using a priori knowledge about the face structure, instead of considering other curve-related features like points of curvature, which can be more difficult to identify when the curve is affected by noise.
Given a frontal-view face image, its representation can be obtained by
– an automatic or manual extraction of point sets that belong to the boundaries of face parts;
– the interpolation of each set, using CRSs;
– the localization of a set of principal points (PPs) along each CRS, computed using a priori knowledge about the face structure;
– the computation of a fixed number of secondary points (SPs).

Given the coordinates of a set of points belonging to the contour of a face part, they must be ordered counterclockwise (CCW) or clockwise (CW) and interpolated using CRSs, in order to obtain a description of the whole boundary by means of a parametric curve (see Fig. 1(b)). As stated in Section 2, CRSs allow us to compute missing points by approximation and to obtain an accurate description of the real contour, even if the number of given points is small, provided that they are distributed along the whole boundary. Moreover, CRSs are not constrained within the convex hull of the given set of points. Constraining the curve within that region would result in an underestimation of the real contour, especially when the initial points are few and sparse. For instance, even using just four points quite uniformly spaced along the contour of an eye, we obtain an accurate description of the real contour.

Fig. 1. Face representation extraction: (a) Localized face parts - (b) CRS interpolation - (c) Principal points - (d) Principal and secondary points

The proposed face representation model considers the following face parts: face, nose, mouth, eyes and eyebrows (see Fig. 1 [12]). The feature points that define this model are extracted from the CRSs that describe each part. PPs
are located using a priori knowledge on the structure of the face (see Fig. 1(c)) and represent salient face features; they are easy to identify even in the presence of small changes in the head orientation. We can define one-to-one correspondences between PPs belonging to different faces, thanks to the common structure of the faces themselves. Considering the reference frame shown in Fig. 1(b), PPs of eyes, eyebrows and mouth correspond to the points of $x_{min}$ and $x_{max}$ along the spline of each associated contour. Those points describe the left and right "corners" of these face parts. The shape and the thickness of the eyebrows can vary considerably among different faces, therefore we do not add any other PPs for these parts. No other PPs are added to the mouth either, since the central point of the upper lip is difficult to detect, for instance when the depicted person has a beard or a mustache. Moreover, the contour of the face starts and ends at the two points that have the same ordinate as the eyes, discarding the influence of hair in the upper part of the face. These two points and the one at the minimum ordinate, $y_{min}$, which corresponds to the chin, constitute the PPs of the face. Finally, the contour of the nose is described using two points at the eye level and two other points that correspond to the $x_{min}$ and $x_{max}$ of the nose interpolating curve.

PPs can be easily computed given the CRSs that interpolate the contours of face parts, since they generally correspond to maxima or minima along a coordinate. The stationary points can be computed by considering that the first derivative of a CRS segment is defined as

$$ q'(t) = \frac{1}{2} \begin{bmatrix} t^2 & t & 1 \end{bmatrix} \begin{bmatrix} -3 & 9 & -9 & 3 \\ 4 & -10 & 8 & -2 \\ -1 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} p_{i-1} \\ p_i \\ p_{i+1} \\ p_{i+2} \end{bmatrix}. \qquad (2) $$

Moreover, the spline segment that must be considered for a PP is determined by analyzing the coordinates of the control points that delimit it.

SPs are the second type of points constituting the face representation model (see Fig. 1(d)) and are equally spaced between two consecutive PPs. Equally spaced points along the contours are used in shape modelling and statistical shape analysis [13], and constitute a quite accurate way of defining a set of corresponding points between shapes. Since SPs are defined between two consecutive PPs, the presence of noise in a spline portion could affect the positioning of equally spaced points in that portion, but it has no effect on the rest of the curve. This property makes the presented face representation model more robust against errors in the segmentation process than shape description approaches based on the definition of just the starting point for parametric curves.

Fig. 2. Face representation: (a) Original contours, PPs and SPs - (b) Contours rebuilt from PPs and SPs - (c) Comparison of the (dotted) original and rebuilt contours - (d) Example of another face, comparing the (dotted) original and rebuilt contours

Given an arc-length parameterization of the whole CRS (see Section 2), an arbitrary number $m$ of SPs between $PP_1$ and $PP_2$, with associated global spline parameters $gs_1$ and $gs_2$ ($gs_1 < gs_2$), can be computed by uniform sampling of the spline portion between $PP_1$ and $PP_2$.

Most of the computational cost needed to extract this face representation is due to the numerical approximation of the arc lengths. An arc length must be calculated for each control point in order to correctly parameterize the whole curve, and the number of integrations increases with the size of the initial set of points describing a contour. Every evaluation of the spline to find a new SP requires finding the segment that contains the SP and approximating the inverse of the function $L(t)$. In practice, after an automatic image segmentation, the initial set of points is neither small nor sparse, and the computational cost can be reduced by arc-length parameterizing the whole curve but not the local spline segments. With this strategy the computation of $L^{-1}(s)$ can be avoided.

The proposed representation can be computed quickly and easily, and can be used in every task that needs to compare faces. Furthermore, it can lead to a compact parametric approximation of the contours of face parts, by discarding the initially located points and considering the CRSs that interpolate the sets of PPs and SPs (see Fig. 2). In this case, the number of SPs must be chosen as a trade-off between approximation accuracy and representation compactness. In practice, a very small number of SPs is generally needed.
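The two point-extraction steps of this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: `stationary_params` solves the quadratic given by Eq. (2) for one coordinate of a single segment, and `secondary_points` places m points uniformly in normalized arc length on a densely sampled polyline of the spline, assuming that the two PPs themselves are excluded from the SPs. Both function names and the polyline-based inversion are hypothetical choices.

```python
import numpy as np

# Derivative matrix of Eq. (2), with the 1/2 factor folded in.
CR_DERIV = 0.5 * np.array([
    [-3.0,   9.0, -9.0,  3.0],
    [ 4.0, -10.0,  8.0, -2.0],
    [-1.0,   0.0,  1.0,  0.0],
])


def stationary_params(c_prev, c_i, c_next, c_next2, eps=1e-12):
    """Values of t in [0, 1] where one coordinate of q'(t) vanishes.

    The arguments are the x (or y) coordinates of the four control points.
    """
    a, b, c = CR_DERIV @ np.array([c_prev, c_i, c_next, c_next2], dtype=float)
    if abs(a) < eps:                      # derivative is linear (or constant)
        roots = [] if abs(b) < eps else [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            roots = []
        else:
            r = np.sqrt(disc)
            roots = [(-b - r) / (2.0 * a), (-b + r) / (2.0 * a)]
    return sorted(float(t) for t in roots if 0.0 <= t <= 1.0)


def secondary_points(dense_pts, gs1, gs2, m):
    """m points equally spaced in normalized arc length strictly between gs1 and gs2.

    `dense_pts` is a dense polyline sampling of the whole spline, with no
    repeated consecutive points.
    """
    dense_pts = np.asarray(dense_pts, dtype=float)
    steps = np.linalg.norm(np.diff(dense_pts, axis=0), axis=1)
    s = np.concatenate(([0.0], np.cumsum(steps)))
    s /= s[-1]                            # normalized arc length in [0, 1]
    targets = gs1 + (gs2 - gs1) * np.arange(1, m + 1) / (m + 1)
    x = np.interp(targets, s, dense_pts[:, 0])
    y = np.interp(targets, s, dense_pts[:, 1])
    return np.column_stack((x, y))
```

For control-point abscissas 0, 1, 1, 0 the derivative reduces to 0.5 - t, so the only stationary point is at t = 0.5, where x(0.5) = 1.125 exceeds both interpolated values. This is a small instance of the curve leaving the convex hull of its control points, as discussed above.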
4 Applications
We qualitatively evaluated the proposed face model in a face comparison task, using a fairly small number of SPs (32), as can be seen in Fig. 1(d). There is, however, no restriction on the number of SPs that can be defined in each spline interval.
In many face analysis tasks, a mean face is required to find the salient characteristics of an input face. Furthermore, mean faces estimated by categorizing subjects by race or gender can be used in face matching tasks to reduce the search space. We manually marked a dataset of over 200 frontal face pictures (see footnote 2); subsequently, the CRS representation was generated for each image. The mean faces can be seen in Fig. 3. Differences between male and female mean faces are highlighted by this compact face model. Automatic caricature generation is a task that requires discovering and exaggerating singularities in the shape of face parts. We considered the technique described in [14], which obtains a caricature as $p_{car} = p + k(p - p_{mean})$, where $p_{car}$ is the position of the point $p$ in the caricature, $p_{mean}$ is its position in the mean face and $k$ is a weighting term. Some results are reported in Fig. 4. Fig. 4(b) shows the comparison between the original face model and the mean face. The caricature reported in Fig. 4(c) shows how the differences between the two faces are exaggerated. Finally, other examples can be seen in Fig. 4(d).
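Because every face is described by the same ordered set of PPs and SPs, the mean face and the caricature rule above reduce to element-wise array operations. The following Python sketch assumes that each face is stored as an array of shape (number of points, 2) with corresponding rows; the function names are hypothetical.

```python
import numpy as np


def mean_face(faces):
    """Point-wise mean of faces given as an array of shape (num_faces, num_points, 2)."""
    return np.asarray(faces, dtype=float).mean(axis=0)


def caricature(face, mean, k):
    """Apply p_car = p + k * (p - p_mean) to every feature point of one face."""
    face = np.asarray(face, dtype=float)
    return face + k * (face - np.asarray(mean, dtype=float))
```

With k = 0 the face is returned unchanged, while larger values of k push each point further away from its position in the mean face.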
Fig. 3. Mean faces estimated with the proposed model: (a) Male - (b) Female - (c) Both
Fig. 4. Caricaturing process based on our model: (a) Original face - (b) Comparison of the contours of the mean face (solid) and of the original face (dotted) - (c) Exaggeration of salient features - (d) Other examples of faces and caricatures

Footnote 2: The dataset can be downloaded at http://www.dii.unisi.it/~artist/feps/.
5 Conclusions
In this paper we described a face representation based on feature points. CRSs are used to efficiently determine the point positions, interpolating the shape of detected face features and approximating any missing contours. This kind of spline offers some useful properties like native interpolation, local control, affine invariance, and piecewise construction. The feature points of different faces can be compared by defining a set of correspondences based on a priori knowledge about the face structure. Exploiting this knowledge, the matching process is faster and computationally less expensive than other spline-based shape matching approaches. Moreover, the proposed model leads to a compact and accurate parametric description of the shapes of face parts that is independent of the number of points used to describe the contours at the end of the detection process. A preliminary qualitative evaluation, based on mean face estimation and caricature generation, shows interesting results.
References
1. Zhao, W., Chellappa, R., Phillips, P., Rosenfeld, A.: Face Recognition: A Literature Survey. ACM Computing Surveys 35(4), 399–458 (2003)
2. Tian, Y.I., Kanade, T., Cohn, J.: Recognizing action units for facial expression analysis. IEEE TPAMI 23(2), 97–115 (2001)
3. Chellappa, R., Wilson, C., Sirohey, S.: Human and machine recognition of faces: a survey. Proc. of the IEEE 83(5), 705–741 (1995)
4. Cohen, F., Huang, Z., Yang, Z.: Invariant matching and identification of curves using B-splines curve representation. IEEE TIP 4(1), 1–10 (1995)
5. Avrithis, Y., Xirouhakis, Y., Kollias, S.: Affine-invariant curve normalization for object shape representation, classification, and retrieval. Machine Vision and Applications 13(2), 80–94 (2001)
6. Kampel, M., Sablatnig, R.: Automated segmentation of archaeological profiles for classification. In: Intl. Conf. on Pattern Recognition 1 (2002)
7. Shankar Gautam, R.: Shape preservation behavior of spline curves. Arxiv preprint cs/0702026 (2007)
8. Catmull, E., Rom, R.: A class of local interpolating splines. Computer Aided Geometric Design, pp. 317–326 (1974)
9. Barry, P.J., Goldman, R.N.: A recursive evaluation algorithm for a class of Catmull-Rom splines. ACM SIGGRAPH Computer Graphics 22(4), 199–204 (1988)
10. Wang, H., Kearney, J., Atkinson, K.: Arc-length parameterized spline curves for real-time simulation. In: Intl. Conf. on Curves and Surfaces, pp. 387–396 (2002)
11. Belongie, S., Malik, J., Puzicha, J.: Matching shapes. Proc. of ICCV 1, 454–461 (2001)
12. Martinez, A.M., Benavente, R.: The AR face database. CVC TR 24 (1998)
13. Davies, R.: Learning Shape: Optimal Models for Analysing Natural Variability. PhD thesis, Department of Imaging Science, University of Manchester (2002)
14. Koshimizu, H., Tominaga, M., Fujiwara, T., Murakami, K.: On KANSEI facial image processing for computerized facial caricaturing system PICASSO. IEEE Intl. Conf. on Systems, Man, and Cybernetics 6, 294–299 (1999)
Automatic Quantitative Mouth Shape Analysis

Augusto Salazar, Jorge Hernández, and Flavio Prieto
Universidad Nacional de Colombia. Sede Manizales
{aesalazarj, jehernandezl, faprietoo}@unal.edu.co
Abstract. In this paper we present a methodology to automatically analyze the soft tissue of the mouth. The methodology is based on measures obtained from the lip contour. The process starts with face localization, followed by segmentation of the mouth area. A first approximation to the external lip contour is obtained by using active contours. The control points of the contours are used to calculate the four parametric functions that define the mouth template. Results show how the feature extraction algorithms and the active contour adjustment perform. In addition, some tests were carried out on images of children with repaired cleft lip.
1 Introduction
There are many medical applications oriented to morphologic analysis of the human face, such as plastic surgery, cleft lip reconstruction, anthropometric studies, etc. In most cases the analysis is done by comparing a set of key points with a set of automatically detected points [1, 2]. Engineering has contributed to more efficient studies (faster measure estimation) by developing algorithms to estimate the separation between strategically located points [3]. Nevertheless, these kinds of computer-assisted tools have a weakness: they need a supervisor to set the initial measure points. In this paper we propose a methodology based on digital image processing techniques that selects the initial measure points automatically, with the aim of helping in the soft tissue characterization of the mouth in resting position. This paper is organized as follows. In Section 2 we describe the proposed quantitative mouth shape analysis and present the feature extraction processes. Section 3 describes the mouth template design. Experimental results are shown in Section 4. Finally, conclusions are given in Section 5.
2 Feature Extraction

2.1 Quantitative Mouth Shape Analysis
In surgical lip reconstruction, one of the most important factors for evaluating the results of the intervention is the aesthetic outcome [4]. Unfortunately, a large collection of features must be considered to achieve an objective evaluation [5, 4, 6, 7, 8]. We focused this work on the analysis of the external lip contour by selecting a set of features that serve as good descriptors of its morphology (see Fig. 1). Measures are expressed in proportion to the mouth width. This set of features focuses on asymmetry measures because of their relevance for establishing the quality of the surgical reconstruction [1].

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 416–423, 2007. © Springer-Verlag Berlin Heidelberg 2007
Symbol    Feature
F tml     Philtrum Width
Cba       Concavity of the Cupid's Arc
Ls        Upper Lip Contour
Vx        Vertices
Ml        Mouth Width
Mh        Mouth Height
Li        Lower Lip Contour

Fig. 1. Features of the Mouth Area
2.2 Face and Mouth Detection
Characterization begins with subject detection, followed by the identification of the mouth coordinates. The features mentioned in Section 2.1 are then extracted from the region of interest (mouth image). For subject detection, we use the skin detector proposed by [9]. The resulting image (white region in Figure 2(c)) contains the boundary of the entire face. Figure 2 shows the different process stages.
Fig. 2. Face Detection. (a) Input Image. (b) Skin Detector. (c) Face Set.
Once the face has been located, the mouth region can be found by performing a Prevalent Regions Analysis (PRA). This procedure uses the hue component of the HSV color space transformation; the regions with red predominance are weighted using the filter coefficients proposed in [10]. The resulting image is then binarized by thresholding, and the region with the largest area is used as the representation of the mouth. Each stage of the PRA procedure is shown in Figure 3.

2.3 Vertex Detection
The points where the upper and lower lip edges meet are called vertices (Vx). To estimate the coordinates of Vx, two techniques are considered: Vertical Gradient (VG) [10], and the proposed technique based on Searching Space Reduction After Segmentation (SSRAS). The proposed technique is described as follows:
i) The input color image (Figure 3(g)) is mapped to HSV color space, and the hue component is filtered (Figure 4(a)).
ii) The H component image is mapped to a new binary image (Figure 4(b)).
iii) The spatial distribution of white pixels is analyzed in order to estimate its mean ($\mu$) and standard deviation ($\sigma_\omega$). The boundaries of the searching space are defined by its lower limits at $\mu \pm 1.5\sigma_\omega$ and its upper limits at $\mu \pm 2.8\sigma_\omega$ (Figure 4(c)).
iv) The resulting bounded area is dilated to increase the size of the searching space (Figure 4(d)).
v) The original image values are retrieved in gray levels and clipped within the bounded area (Figure 4(e)).
vi) Edge detection is performed using a SUSAN operator [11] (Figure 4(f)).
vii) The coordinates of Vx are computed from the outermost pixels inside each of the bands (Figure 4(g)).
The coordinates of Vx are also used to correct the rotation of the mouth with respect to the horizontal axis.
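Step iii) can be illustrated with a short Python sketch. It rests on an interpretation that the text does not state explicitly: the statistics are assumed to be taken over the column (horizontal) coordinates of the white pixels, and the searching space is assumed to consist of two vertical bands, one on each side of the mean, spanning 1.5 to 2.8 standard deviations. The function name `vertex_search_bands` is hypothetical.

```python
import numpy as np


def vertex_search_bands(mask, inner=1.5, outer=2.8):
    """Column ranges of the left and right search bands for the lip vertices.

    `mask` is a 2D binary image (nonzero = white pixel). The bands are
    [mu - outer*sigma, mu - inner*sigma] and [mu + inner*sigma, mu + outer*sigma],
    computed on the column coordinates of the white pixels (assumed axis).
    """
    cols = np.nonzero(mask)[1].astype(float)
    mu, sigma = cols.mean(), cols.std()
    width = mask.shape[1]

    def clip(v):
        # Round to the nearest column and keep it inside the image.
        return int(min(max(round(float(v)), 0), width - 1))

    left = (clip(mu - outer * sigma), clip(mu - inner * sigma))
    right = (clip(mu + inner * sigma), clip(mu + outer * sigma))
    return float(mu), float(sigma), left, right
```

The remaining steps (dilation, gray-level clipping and SUSAN edge detection) would then operate only on the pixels inside these two bands.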
Fig. 3. PRA Method. (a) Search Region, (b) Eroded Image, (c) Search Region (Color Image), (d) Filtered Hue, (e) Filtered Hue (Thresholded), (f) Mouth Region and (g) Result Image.
Fig. 4. SSRAS Method. (a) Filtered Hue, (b) Binarized Filtered Hue, (c) Search Region, (d) Dilated Image, (e) Search Image, (f) Edge Detection and (g) Vx Localization.
The performance of the automatic vertex detection was evaluated by comparing the vertices obtained automatically ($p_a$) with those obtained from the manually labelled images ($p_m$). The evaluation was performed on 200 ground-truth test images. Using the error between each corresponding pair of points ($\varepsilon_i = p_m - p_a$), the mean error ($\varepsilon$) and the mean quadratic error ($\Delta\varepsilon$) were obtained. The error measures are shown in Table 1. The error values in Table 1 show that the SSRAS method performs better than the VG method. This difference arises because the border between the skin and the lips is not well defined. SSRAS has an average processing time of 63.06 ms, whereas the VG technique takes 62.45 ms.
Table 1. Vx Detection Error. VL (left vertex) and VR (right vertex).

Technique   Point      ε (pixels)   Δε
VG          VL(x,y)    (15,17)      (0.668,0.615)
VG          VR(x,y)    (12,19)      (0.744,0.724)
SSRAS       VL(x,y)    (6,8)        (0.262,0.312)
SSRAS       VR(x,y)    (7,9)        (0.389,0.326)
2.4 Final Lip Segmentation
The segmented region, although it locates the mouth correctly, does not give a suitable description of the contour. The last step is therefore an accurate segmentation of the lip contour. To achieve this we propose a segmenter called Red Enhanced and Dynamic Thresholding (REDT), which is summarized in the following steps:
1. Transform the input image (Figure 5(a)) into the YCbCr color space (Figure 5(b)).
2. Enhance the contrast in red zones (Figure 5(c)) by computing $I_{enh} = (I_R + I_{Cr}) - (I_G + I_B + I_{Cb})$, where $I_R$, $I_G$ and $I_B$ are the RGB components of the input image, and $I_{Cr}$ and $I_{Cb}$ are the red and blue chrominance components of the YCbCr image.
3. Segment the contrast-enhanced image. For the multiple threshold we developed a dynamic strategy:
i) Use the DPRA method to obtain a mouth mask (Figure 5(d)).
ii) Compute the mean value of the distribution (mvd) of gray levels in the masked $I_{enh}$ (Figure 5(e)).
iii) Create a binary image by keeping the gray levels that have more pixels than 10% of mvd.
iv) Choose the largest region as the mouth region (Figure 5(f)).
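Steps 1 and 2 of REDT can be sketched in Python as follows. The text does not say which YCbCr conversion is used, so this sketch assumes the full-range ITU-R BT.601 (JFIF) coefficients; the final rescaling of the result to 8-bit gray levels is also an assumption, added only so that the output can be thresholded as an ordinary image. The function name `red_enhanced` is hypothetical.

```python
import numpy as np


def red_enhanced(rgb):
    """Compute I_enh = (I_R + I_Cr) - (I_G + I_B + I_Cb) for an RGB image.

    `rgb` has shape (height, width, 3) with values in [0, 255].
    Chrominance uses full-range BT.601 coefficients (an assumed conversion).
    """
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    enh = (r + cr) - (g + b + cb)
    # Assumed post-processing: rescale linearly to the 8-bit range.
    lo, hi = enh.min(), enh.max()
    if hi == lo:
        return np.zeros(enh.shape, dtype=np.uint8)
    return np.round(255.0 * (enh - lo) / (hi - lo)).astype(np.uint8)
```

Reddish pixels receive the highest values and bluish pixels the lowest, which is the contrast that the dynamic thresholding of step 3 relies on.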
Fig. 5. REDT Method. (a) Input image (b) YCbCr Image (c) Red Enhanced Image ($I_{enh}$) (d) Mask (e) $I_{enh}$ masked and (f) Segmented Image.
3 Active Contours Based Mouth Template
We use a snake as a first approximation of the external mouth contour (see Figure 6). The model employed is the one presented in [12], where an active contour is a parametric curve $v(u, t) = [x(u, t), y(u, t)]$, $u \in [0, 1]$, and $t$ determines the temporal position of a point in the sequence moving through the spatial domain of an image. A snake is defined by an energy function (Equation 1) to be minimized:

$$ E_{snake} = \int_0^1 \left[ E_{int}(v(u)) + E_{image}(v(u)) + E_{ext}(v(u)) \right] du \qquad (1) $$

where $E_{int}$, $E_{image}$ and $E_{ext}$ are called the internal energy, image energy and external energy, respectively. To reduce the analysis time we used the Fast Active Contours for Sampling algorithm introduced in [13]. After minimizing the contour energy function, we proceeded with the mouth template estimation by fitting four parametric functions. The order of these functions is chosen so that the resulting curve fits the region as closely as possible.

Fig. 6. First Contour Approach. (a) Input (b) Segmentation (c) Fitted Contour.

3.1 Lower Lip
To characterize the lower lip we use one function. Since we consider asymmetries, we must select a function that can represent them. Third-order functions are not able to fit the whole zone of the lip (Figure 7(a)), so we chose a fourth-order function (Figure 7(b)).

3.2 Upper Lip
For the upper lip we used three functions, chosen with the same criterion used for the lower lip: one for the left side, another for the right side, and a third for the Cupid's arc. The first two functions are used to describe Ls on both sides (see Figure 1). A second-order function does not provide a good fit (Figure 7(c)), so a third-order function is chosen (Figure 7(d)). The last function is used for fitting Cba (see Figure 1). A third-order function is not enough to adequately characterize the whole zone (Figure 7(e)), so we chose a fourth-order function (Figure 7(f)).

Fig. 7. Parametric functions to fit the lips: (a) 3rd order, (b) 4th order, (c) 2nd order, (d) 3rd order, (e) 3rd order, (f) 4th order, (g) unconstrained, (h) constrained.

The functions are obtained by Singular Value Decomposition of the overconstrained system of equations built from the contour points. The vertices are the boundary points between the cubic functions (side regions) and the lower lip function. The boundary points between the function representing the middle region and the side sections lie on the projection of the lowest point of the Cupid's arc. To preserve the continuity among the functions, each one is estimated using common points from their neighborhood (see Figure 7(g) and Figure 7(h)).
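The least-squares fit by Singular Value Decomposition mentioned above can be sketched in Python as follows. The text does not specify the form of the parametric functions, so this sketch assumes that each one is a polynomial y = f(x) fitted to the contour points of its region; the function name `fit_polynomial_svd` and the tolerance used to discard negligible singular values are hypothetical choices.

```python
import numpy as np


def fit_polynomial_svd(x, y, order):
    """Least-squares polynomial coefficients (highest power first) via SVD.

    Solves the overconstrained system A c = y, where A is the Vandermonde
    matrix of the contour abscissas, using the pseudo-inverse built from the SVD.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.vander(x, order + 1)           # columns: x^order, ..., x, 1
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    tol = max(A.shape) * np.finfo(float).eps * s.max()
    s_inv = np.zeros_like(s)
    s_inv[s > tol] = 1.0 / s[s > tol]     # drop negligible singular values
    return Vt.T @ (s_inv * (U.T @ y))
```

Under this assumption, a fourth-order fit for the lower lip would be `fit_polynomial_svd(x, y, 4)`, and the continuity constraint described above could be approximated by including the points shared with the neighboring regions in each fit.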
4
Results
We used part of the database described in [14]. The images correspond to 660 children (330 male and 330 female) between 5 and 10 years old. Two pictures were taken of each child, for a total of 1320 images. The images are in JPEG color format, with a size of 2560 x 1920 pixels. Two hundred images of the database (100 girls and 100 boys) were used for comparisons. The ground truth images were produced by an expert using manual segmentation.
4.1
Evaluation of the Contour Fitting
The quality of the contour fitting was evaluated using the confusion matrix, which uses the following parameters: true positives TP, true negatives TN, false positives FP and false negatives FN. In the test, we obtained TP = 93.259%, TN = 99.264%, FP = 0.737% and FN = 6.741%. To evaluate the performance of the method we computed the following metrics: effectiveness ($TPR = \frac{TP}{TP+FN}$), specificity ($TNR = \frac{TN}{TN+FP}$) and precision ($PR = \frac{TP}{TP+FP}$), whose ideal value is 100%; and positive error ($PE = \frac{FN}{TP+FN}$) and negative error ($NE = \frac{FP}{TN+FP}$), whose ideal value is 0%. The obtained values were: TPR = 93.259%, TNR = 99.264%, PR = 99.216%, PE = 6.741% and NE = 0.737%. Although the positive error is large, this is not critical. As can be seen in Figures 8(a)-8(d), the system reaches a high degree of fitting, which results in low measurement errors (see Section 4.2).
As one of our interests is to incorporate this tool into the treatment of children with cleft lip and palate, we made some preliminary tests to get an idea of the performance of the system with this kind of image. Figures 8(e)-8(h) show those results. The main issue in characterization is the definition of the smoothing parameters. Due to the drastic changes on the edges of the corrected cleft lip, a larger region must be analyzed to obtain a continuous contour. However, the results encourage us to continue, mainly because of the characterization of the cupid’s arc and the asymmetry, which allows a quantitative control of the healing evolution.
Fig. 8. Lips contour fitting results, panels (a)-(h)
4.2
Measurement Error
The error metrics used were listed in Section 2.3. Table 2 shows the measurement errors; the values are given in millimeters. These errors are acceptable in medical applications related to the morphology of the face [7, 8]. The concavity of the cupid’s arc and the asymmetry were properly determined in the 200 images.
Table 2. Measurement error
Feature      ε      Δε
Ftml         0.701  0.202
Ml           1.141  0.268
Ls (left)    0.828  0.328
Li           1.286  0.233
Mh           0.655  0.381
Ls (right)   0.996  0.390
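As a cross-check of Section 4.1, the contour-fitting metrics can be recomputed from the four reported confusion-matrix rates. The sketch below is only illustrative: the function name `fitting_metrics` is hypothetical, and the inputs are the percentages quoted in the text.

```python
def fitting_metrics(tp, tn, fp, fn):
    """Return the five metrics of Section 4.1, in percent."""
    return {
        "TPR": 100.0 * tp / (tp + fn),  # effectiveness
        "TNR": 100.0 * tn / (tn + fp),  # specificity
        "PR": 100.0 * tp / (tp + fp),   # precision
        "PE": 100.0 * fn / (tp + fn),   # positive error
        "NE": 100.0 * fp / (tn + fp),   # negative error
    }

# Rates reported in the text (percent).
metrics = fitting_metrics(tp=93.259, tn=99.264, fp=0.737, fn=6.741)
```

With these inputs the precision evaluates to roughly 99.216%, in agreement with the value given in Section 4.1.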
5
Conclusions
We have developed a tool for the automatic extraction of mouth area features, for quantitative analysis of its morphology. The proposed segmentation method uses the whole set of contour points. Active contours were then applied, and four parametric functions were fitted to them. The parametric functions describe the external contour of the lips and ease feature extraction. The performance of the system is good enough to allow its use in anthropometric studies focused on the characterization of a population and/or a specific pathology related to the soft tissue of the mouth area. For mouth asymmetry analysis, additional work is needed to establish tolerances, since when features like concavity and asymmetry are computed, a very small variation produces the same result as a very large one. Nevertheless, the proposed system can provide an index (computed from the feature values of both sides of the mouth), which could help us to establish these tolerance values. Further developments regarding the characterization of the boundary between the upper and lower lips are being studied, allowing for broader applications of the system.
References
1. Ferrario, V.F., Sforza, C., Dellavia, C., Tartaglia, G.M., Colombo, A., Carú, A.: A quantitative three-dimensional assessment of soft tissue facial asymmetry of cleft lip and palate adult patients. The Journal of Craniofacial Surgery 14 (2003)
2. Yeow, V., Huang, M., Lee, S.T., Chong, S.F.: An anthropometric analysis of indices of severity in the unilateral cleft lip. Craniofacial Surgery (2002)
3. Hurwitz, D., Ashby, E., Llull, R., Pasqual, J., Tabor, C., Garrison, L., Gillen, J., Weyant, R.: Computer-assisted anthropometry for outcome assessment of cleft lip (1999)
4. Millard, D.R.: Cleft Craft. The Evolution of its Surgery. Volume I. In: Unilateral deformity, Little, Brown and Company, Boston, Massachusetts (1976)
5. Farkas, L., Bryson, W., Klotz, J.: Is photogrammetry of the face reliable. Plastic and Reconstructive Surgery 66 (1980)
6. Farkas, L.: Antropometría normal y patológica en cabeza y cara. Cirugía Plástica, Reconstructiva y Estética 2 (1994)
7. Sforza, C., Dellavia, C., Dolci, C., Donetti, E., Ferrario, V.: A quantitative three-dimensional assessment of abnormal variations in the facial soft tissues of individuals with Down syndrome. Cleft Palate-Craniofacial Journal 42 (2005)
8. Krimmel, M., Kluba, S., Bacher, M., Dietz, K., Reinert, S., Farkas, L., Bryson, W., Klotz, J.: Digital surface photogrammetry for anthropometric analysis of the cleft infant face. Cleft Palate-Craniofacial Journal 43 (2006)
9. Hsu, R.L., Abdel-Mottaleb, M., Jain, A.K.: Face detection in color images. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002)
10. Torresani, L., Caprile, B., Coianiz, T.: 2d deformable models for visual speech analysis. Technical report, Istituto per la Ricerca Scientifica e Tecnologica (1996)
11. Smith, S., Brady, J.: SUSAN - a new approach to low level image processing. Int. Journal of Computer Vision 23, 45–78 (1997)
12. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. Int. J. Computer Vision 1, 321–331 (1987)
13. Hernández, J., Prieto, F., Redarce, T.: Fast active contour for sampling. In: CERMA ’06. Proceedings of the Electronics, Robotics and Automotive Mechanics Conference, pp. 9–13. IEEE Computer Society, Los Alamitos, CA, USA (2006)
14. Mejía, I.: Extracción automática de características faciales para el estudio antropométrico en niños entre 5 y 10 años de la ciudad de Manizales. Technical report (2004)
Colour Adjacency Histograms for Image Matching
Allan Hanbury1 and Beatriz Marcotegui2
1 PRIP, Institute of Computer-Aided Automation, Vienna University of Technology, Favoritenstraße 9/1832, A-1040 Vienna, Austria
[email protected]
2 Centre de Morphologie Mathématique, Ecole des Mines de Paris, 35, rue Saint-Honoré, 77305 Fontainebleau, France
[email protected]
Abstract. The use of 2D colour adjacency histograms for image matching in image retrieval scenarios is investigated. We present an algorithm for extracting representative colours from an image and a new method for matching 1D colour histograms and 2D colour adjacency histograms obtained from images quantised using different colour palettes. An experimental evaluation of the matching performance is done.
1
Introduction
The comparison of image colour distributions is an important tool in Content-Based Image Retrieval. It is often done by comparing colour histograms of images [1], which eliminates information on the spatial distribution of colours. In this paper, we investigate characterising images for image retrieval by a 2-dimensional histogram which includes information on colour adjacency. Such information in the form of a colour adjacency graph has been used successfully for object recognition [2]. The work closest to the approach presented in this paper is that of Lee et al. [3]. They first quantise the hue into seven hue components. After each pixel in an image has been assigned to one of these components, a matrix summarises the adjacency of the hue components at the pixel level. We propose to use a similar colour adjacency histogram, with the following differences: the histogram uses a palette of colours determined for each image, and the histogram does not summarise the information at the pixel level, but at the region level. These regions are created by a combined colour quantisation and segmentation algorithm. The contributions of this paper are: a new algorithm for extracting representative colours from an image using hierarchical clustering and segmenting the image based on the results of the clustering; the construction of 2D colour adjacency histograms based on the colour quantisation and segmentation; and a new method for matching the 1D and 2D histograms obtained from images quantised using different colour palettes.
This work was supported by the European Union Network of Excellence MUSCLE (FP6-507752), and the Austrian Science Foundation (FWF) under grant SESAME (P17189-N04). The authors wish to thank Fernand Meyer for helpful discussions.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 424–431, 2007. © Springer-Verlag Berlin Heidelberg 2007
The paper is structured as follows. Section 2 discusses our combined colour quantisation and segmentation approach. The histograms used in image matching are described in Section 3. The results of image retrieval experiments are presented in Section 4. Section 5 concludes.
2
Colour Quantisation
There exist many algorithms for reducing or quantising the number of colours in an image, including many based on some form of clustering in colour space [4]. Our colour quantisation is done in two steps: (1) pre-segmentation using the watershed with volume extinction values and (2) a hierarchical clustering, fusing the closest colours of the pre-segmented regions in the CIELAB space. Due to the pre-segmentation, the resulting segmentation has large, regular regions.
Deng et al. [5] also use colour clustering to obtain the dominant colours in an image. They segment an image, apply vector quantisation in each region and then cluster the resulting colours from all the regions by agglomerative clustering such that the minimum distance between two centroids exceeds a preset threshold. To obtain a segmentation, each pixel is assigned to the centroid closest to it in colour space. The resulting regions will however be more fragmented and less regular than those obtained using the method presented in this paper. While this presents no problem for classic one-dimensional histogram calculation, the small regions and irregular boundaries make the result unsuitable for determining colour adjacency and the boundary length between adjacent colours.
The steps of the colour quantisation and segmentation algorithm are presented here and discussed further in the subsections referred to:
1. Pre-segmentation using the watershed with volume extinction values (Section 2.1). The result is a mosaic image, in which each region is replaced by the mean of the RGB values it contains.
2. Conversion of the mosaic image into the CIELAB space.
3. Extraction of one set of CIELAB coordinates for each region.
4. Hierarchical clustering of the extracted colour coordinates to obtain the required number of colour clusters (Section 2.2).
2.1
Pre-segmentation
The watershed using volume extinction values [6] creates a hierarchy based on the fusion of lakes in watershed catchment basins. During the flooding process, a record is kept of the merging in the form of a graph, where each node represents a lake and each edge represents the merging of two adjacent lakes [6]. The weight on each edge is the volume of the smaller of the two adjacent lakes at the moment they merge. It can be shown that the resulting graph is a minimum spanning tree (MST). To obtain a segmentation with k regions, one simply cuts the (k − 1) highest valued edges of the MST. The only parameter of this segmentation algorithm is the number of regions required.
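The cutting step can be sketched in a few lines. This is an illustrative sketch rather than the implementation of [6]: it assumes the MST is already available as a list of weighted edges between the initial catchment basins, and the name `cut_mst` and the toy edge weights are hypothetical.

```python
def cut_mst(num_nodes, mst_edges, k):
    """Split an MST into k regions by removing its k - 1 heaviest edges.

    mst_edges: list of (weight, u, v) tuples forming a spanning tree
    over nodes 0 .. num_nodes - 1. Returns a region label per node.
    """
    if not 1 <= k <= num_nodes:
        raise ValueError("k must be between 1 and the number of nodes")
    # Keep the lightest edges; the k - 1 heaviest ones are cut.
    kept = sorted(mst_edges)[: len(mst_edges) - (k - 1)]

    parent = list(range(num_nodes))  # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for _, u, v in kept:
        parent[find(u)] = find(v)

    # Relabel the roots as consecutive region indices 0 .. k - 1.
    roots = {}
    return [roots.setdefault(find(n), len(roots)) for n in range(num_nodes)]

# Toy tree with five basins; weights play the role of volume extinction values.
edges = [(4.0, 0, 1), (1.0, 1, 2), (9.0, 2, 3), (2.0, 3, 4)]
labels = cut_mst(5, edges, k=3)
```

In this toy example the two heaviest edges (weights 9.0 and 4.0) are cut, leaving three regions: basin 0 alone, basins 1 and 2 together, and basins 3 and 4 together.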
The flooding process is carried out on the gradient of the colour image. We use the gradient found to give the best results in a morphological waterfall segmentation in [7]. This is the saturation weighting-based colour gradient applied in the L1 norm 3D polar coordinate colour space [8]. This gradient gives a larger weight to differences in hue when the saturation is high, and a larger weight to differences in luminance when the saturation is low. In order to simplify the image before segmenting it, thereby eliminating small regions, we use a morphological leveling [9]. The filter used to produce the marker for the leveling operator is the morphological alternating sequential filter [10], where the size of the filter refers to the number of subsequent opening and closing operations. To apply these filters to colour images, we apply the filter separately to each colour component.
2.2
Hierarchical Clustering
A hierarchical clustering method is a procedure for transforming a proximity matrix into a sequence of nested partitions [11]. We use an agglomerative clustering method working on the Euclidean distances between the coordinates of the colours in the CIELAB space (the CIELAB space is approximately perceptually uniform with this metric). At the start, the colour of each region in the mosaic image is taken to be a separate cluster in CIELAB space. The two clusters which have the smallest distance between them are fused into a single cluster, and this process is iterated until a single cluster remains. Distances between clusters are measured as the average Euclidean distance between all pairs of points in cluster r and cluster s (average linkage). The result of this clustering is a tree representing the order in which clusters have been fused with one another. The leaves of the tree are the points corresponding to the original CIELAB coordinates. The distance between two clusters which are fused is also stored. Once the tree has been built, a specified number of clusters k in the CIELAB space is obtained from the corresponding level of the tree. The representative colour of each of the k clusters is then calculated. The representative colour ci of cluster i is the mean of all the colours belonging to cluster i. One then has a palette of k colours F = {ci , i = 1, . . . , k} for an image. Each region in the mosaic image is then given a label indicating the colour cluster to which it belongs. This results in an image labelled by colour cluster membership. The segmentation of the image in Figure 1(a) into 16 colours is shown in Figure 1(b).
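A minimal pure-Python sketch of this clustering step follows. It is a naive implementation intended for small inputs, not the authors' code; the function name `average_linkage_palette` and the sample CIELAB triples are hypothetical. Instead of storing the full tree, it simply stops merging once k clusters remain, which yields the same partition as cutting the tree at the level with k clusters.

```python
from itertools import combinations
from math import dist

def average_linkage_palette(colours, k):
    """Agglomerative clustering of CIELAB triples with average linkage.

    Returns (palette, labels): the mean colour of each of the k clusters
    and, for every input colour, the index of its cluster.
    """
    clusters = [[i] for i in range(len(colours))]  # start: one cluster each

    def linkage(r, s):
        # Average Euclidean distance over all cross-cluster pairs.
        total = sum(dist(colours[a], colours[b]) for a in r for b in s)
        return total / (len(r) * len(s))

    while len(clusters) > k:
        # Fuse the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    palette = [tuple(sum(colours[m][d] for m in c) / len(c) for d in range(3))
               for c in clusters]
    labels = [0] * len(colours)
    for index, members in enumerate(clusters):
        for m in members:
            labels[m] = index
    return palette, labels

# Hypothetical region colours (L*, a*, b*): two dark and three reddish regions.
regions = [(20.0, 0.0, 0.0), (22.0, 1.0, -1.0),
           (60.0, 50.0, 40.0), (62.0, 48.0, 42.0), (58.0, 52.0, 38.0)]
palette, labels = average_linkage_palette(regions, k=2)
```

Recomputing all linkages in every iteration is costly; for images with many pre-segmented regions a library implementation of average linkage would be the more practical choice.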
3
1D and 2D Histograms
We discuss the construction of, and the similarity measures for, the 1D and 2D histograms calculated from the segmentation results.
3.1
Histogram Construction
Two types of histogram are tested: the standard 1D colour histogram and the proposed 2D colour adjacency histogram. The 1D colour histogram of an image with colour palette F containing k colours is P = {pi, i = 1, ..., k}, where pi is the percentage of the image area occupied by colour ci. The 2D colour adjacency histogram H(i, j) of an image containing k colours is a k × k matrix. The row index i and column index j (i, j ∈ {1, 2, ..., k}) index the colours in the colour palette. The matrix entry H(i, j), where i ≠ j, contains the total length of the common boundary between all regions having colour indices i and j. Using a count of the number of times that regions occur next to each other proved to be less effective. This histogram is symmetric, as the adjacency relation is symmetric. The diagonal entries are H(i, i) = 0, as two adjacent regions with the same colour index cannot be distinguished. An example of such a histogram is shown in Figure 1(c).
Fig. 1. (a) Original image. (b) Segmentation into 16 colours. (c) 2D histogram. The colours corresponding to each row and column are shown to the left and above. A higher value in a histogram bin is indicated by a lighter grey.
3.2
Histogram Similarity
Once we have constructed the histograms for all images in a database, we wish to match the images based on the similarity of their histograms. To do this, we require a similarity measure between two histograms. We describe two approaches here. The first is our proposed approach, based on colour matching followed by reorganisation of one of the histograms according to the results of the colour matching. For comparison, we use the quadratic colour histogram distance proposed in [5].
Similarity by Colour Matching and Histogram Reorganisation. This approach requires that each image in the database is quantised so as to have the same number of colours; we use k = 16. As each of the images in the database has a different colour palette, the first task is to match the closest colours in two colour palettes, where each palette contains k sets of CIELAB coordinates. We represent the palettes by F1 = {ci, i = 1, ..., k} and F2 = {bj, j = 1, ..., k}, where F1 is assumed to be the palette of the query image, and F2 the palette of an image to be compared to it. The histograms corresponding to these images are the 1D histograms P1 and P2 and the 2D histograms H1 and H2, corresponding to palettes F1 and F2 respectively. The aim is to match each of the colours in F2 to a colour in F1. We begin by calculating a k × k distance matrix D(i, j), where the value in row i and column j is the Euclidean distance between colour coordinates ci and bj. Two types of colour matching are tested: 1-to-1 matching and 1-to-many matching.
In 1-to-1 matching, each colour in F1 may only be matched to a single colour in F2. Colours are matched in order of increasing Euclidean distance to create a list of matching colour indices. Each index to a colour in F1 and each index to a colour in F2 appears only once in the list. The first items in the list represent the better colour matches (lower Euclidean distance). An alternative to this greedy algorithm could be a global optimisation algorithm, such as one of the algorithms for bipartite graph matching. However, such algorithms attempt to minimise the sum of the distances between the matched colours. This minimum is often reached by making a large number of “average matches” instead of a few extremely bad matches. This is not optimal for our purpose, because if an image contains a completely different colour, a bad match for this colour is better than average matches for all of them.
In 1-to-many matching, each colour in F1 may be matched to more than one colour in F2. In the resulting list of matches, indices to colours in F1 may occur more than once.
After the colours have been matched, the entries in the 1D histogram P2 or the rows and columns in the 2D histogram H2 are reordered based on the colour match list so as to correspond to P1 or H1 as closely as possible. These reordered histograms are labelled P2r and H2r.
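The greedy 1-to-1 matching and the subsequent reordering can be sketched as follows. This is one possible reading of the description above, not the authors' code: the function names `greedy_match` and `reorder`, and the toy palettes and histograms, are hypothetical.

```python
from math import dist

def greedy_match(palette1, palette2):
    """Greedy 1-to-1 matching of two equally sized CIELAB palettes.

    Returns match, where match[i] is the index in palette2 of the colour
    matched to palette1[i]; pairs are accepted by increasing distance.
    """
    k = len(palette1)
    pairs = sorted((dist(palette1[i], palette2[j]), i, j)
                   for i in range(k) for j in range(k))
    match = [None] * k
    used2 = set()
    for _, i, j in pairs:
        if match[i] is None and j not in used2:
            match[i] = j
            used2.add(j)
    return match

def reorder(p2, h2, match):
    """Permute P2 and H2 so that bin i corresponds to colour i of palette 1."""
    p2r = [p2[j] for j in match]
    h2r = [[h2[a][b] for b in match] for a in match]
    return p2r, h2r

# Toy example with k = 3: palette 2 holds nearly the same colours in another order.
f1 = [(50.0, 0.0, 0.0), (60.0, 40.0, 30.0), (30.0, -20.0, 10.0)]
f2 = [(31.0, -19.0, 11.0), (51.0, 1.0, 0.0), (59.0, 41.0, 29.0)]
match = greedy_match(f1, f2)
p2r, h2r = reorder([0.2, 0.5, 0.3],
                   [[0, 4, 1], [4, 0, 7], [1, 7, 0]], match)
```

The 1-to-many variant is not sketched, because the text does not specify how several bins of the second histogram that are assigned to the same colour of F1 are combined.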
To calculate histogram similarity, the histograms P1 and P2r in the 1D case, or H1 and H2r in the 2D case, are compared using histogram intersection, modified to use the distances between the matched colours as weights so as to reduce the effect of poorly matched colours on the similarity calculation. We begin by normalising the histograms (both 1D and 2D) so that the sum of their elements is 1. For weighting, $D_\ell$ denotes the distance between the colour $c_\ell$ in palette F1 and the colour in palette F2 that was matched to it (in practice, this can be read from the colour match list). For 1D histograms, the similarity S(P1, P2r) is calculated as
$$S(P_1, P_{2r}) = \sum_{\ell=1}^{k} \min\left[P_1(\ell), P_{2r}(\ell)\right] w(D_\ell) \qquad (1)$$
where the weight w is a colour similarity measure (between 0 and 1, with 1 indicating identical colours) based on those in [5,12]:
$$w(d) = \begin{cases} 1 - \dfrac{d}{d_{\max}} & \text{if } d \le T \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
where T is a threshold on colour similarity, which we set to dmax. In [12], dmax is set to the largest value in the distance matrix, thereby eliminating the need for a threshold. We set dmax to a constant value of 50, half the distance between black and white in the CIELAB space. The threshold T = dmax then simply prevents the colour similarity value from becoming negative. For 2D histograms, the similarity S(H1, H2r) is
$$S(H_1, H_{2r}) = 2 \sum_{\ell=1}^{k} \sum_{n=\ell+1}^{k} \min\left[H_1(\ell, n), H_{2r}(\ell, n)\right] w(D_\ell)\, w(D_n) \qquad (3)$$
As we are comparing neighbouring colours, two weights are present, indicating how well each of the two colours considered is matched. The sum is taken over only half of the histogram as it is symmetric.
Quadratic Colour Histogram Distance. The quadratic 1D colour histogram distance used is that proposed by Deng et al. [5]. We do the colour quantisation as described in Section 2. This technique can compare images having a different number of colours in their palettes. However, it requires that the distance between the closest pair of quantised colours for an image exceeds a preset threshold Td. Therefore, instead of choosing a preset number of colours, we cut the tree resulting from the hierarchical clustering using a distance threshold of Td. Given two images having a possibly different number of quantised colours F1 = {ci, i = 1, ..., k1} and F2 = {bj, j = 1, ..., k2}, with corresponding histograms P1 = {pi, i = 1, ..., k1} and P2 = {qj, j = 1, ..., k2}, the distance between the histograms is given by
$$D^2(P_1, P_2) = \sum_{i=1}^{k_1} p_i^2 + \sum_{j=1}^{k_2} q_j^2 - \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} 2 a_{ij} p_i q_j \qquad (4)$$
where aij is the similarity coefficient between colours ci and bj given by w (dij ) in Equation 2, where dij is the Euclidean distance between the colours and the threshold T in Equation 2 is set to Td . The value of dmax is taken to be αTd , where we take α = 1.2 as done in [5].
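Equations 1 to 4 translate almost directly into code. The sketch below is illustrative only: the function names are hypothetical, histograms are plain Python lists assumed to be normalised so that their elements sum to 1, and `dists[i]` stands for the distance between colour i of F1 and the colour of F2 matched to it.

```python
from math import dist

def weight(d, d_max=50.0, threshold=None):
    """Colour similarity w(d) of Equation 2; the threshold T defaults to d_max."""
    t = d_max if threshold is None else threshold
    return 1.0 - d / d_max if d <= t else 0.0

def similarity_1d(p1, p2r, dists):
    """Weighted histogram intersection of Equation 1."""
    return sum(min(a, b) * weight(d) for a, b, d in zip(p1, p2r, dists))

def similarity_2d(h1, h2r, dists):
    """Weighted intersection of Equation 3, summed over the upper triangle."""
    k = len(h1)
    w = [weight(d) for d in dists]
    return 2.0 * sum(min(h1[i][n], h2r[i][n]) * w[i] * w[n]
                     for i in range(k) for n in range(i + 1, k))

def quadratic_distance(p1, f1, p2, f2, t_d, alpha=1.2):
    """Quadratic histogram distance of Equation 4.

    a_ij = w(d_ij) with T = t_d and d_max = alpha * t_d, as stated in the text.
    """
    cross = sum(2.0 * weight(dist(c, b), d_max=alpha * t_d, threshold=t_d) * p * q
                for c, p in zip(f1, p1) for b, q in zip(f2, p2))
    return sum(p * p for p in p1) + sum(q * q for q in p2) - cross

# Identical, perfectly matched histograms give similarity 1 and distance 0.
p = [0.25, 0.75]
h = [[0.0, 0.5], [0.5, 0.0]]
palette = [(50.0, 0.0, 0.0), (70.0, 30.0, 20.0)]
s1 = similarity_1d(p, p, [0.0, 0.0])
s2 = similarity_2d(h, h, [0.0, 0.0])
d2 = quadratic_distance(p, palette, p, palette, t_d=10.0)
```

The toy check also illustrates the factor 2 in Equation 3: a normalised symmetric histogram with an empty diagonal carries half of its mass in the upper triangle, so doubling the sum gives a similarity of 1 for identical histograms.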
4
Experiments
In many sets of personal photos, some common locations occur very often and should be characterised by specific colour adjacencies. As it would be useful to find all the photos taken in a specific location, we evaluate the capability of the algorithms to retrieve images of the same location as the query image. We perform retrieval experiments on a dataset of 108 personal photographs taken in four locations: a sofa, the area in front of a house, the forest and the beach. The house images are characterised by the colour of the wall, which is close to the colour of the sofa. The texture is more important in the forest images. To evaluate the matches, we use the R-precision measure, which is the precision at R, where R is the number of relevant images in the dataset. The value of R for each class is shown in the second column of Table 1. We tested both 1D and 2D histograms, using 1-to-1 and 1-to-many matching for the 2D histograms, and 1-to-1 matching and the quadratic colour histogram distance for the 1D histograms. The R-precision values for each of these methods are shown in Table 1, both overall and per class.
Table 1. The per class and overall R-precision percentage values for retrieval using 2D histograms with 1-to-1 and 1-to-many matching (columns 3 and 4), and 1D histograms with 1-to-1 matching and the quadratic colour histogram distance (columns 5 and 6).
Class      R   1-to-1 2D  1-to-many 2D  1-to-1 1D  quaddist 1D
Overall    -   74.2       74.2          76.5       73.9
Sofa       32  66.9       74.0          70.3       68.9
Forest     5   66.7       75.0          69.4       61.1
Beach      9   59.0       54.0          87.0       59.0
Residence  58  81.7       77.7          78.9       80.5
Fig. 2. The first four images retrieved for a query image (shown left) in the (a)–(d) sofa and (e)–(h) beach class. The matching methods used are given below the images: (a), (e) 1-to-1 2D; (b), (f) 1-to-many 2D; (c), (g) 1-to-1 1D; (d), (h) quaddist 1D.
The overall results obtained by all the methods tested are similar. However, it can be seen that for different query images, different matching methods have the best retrieval results. This is particularly evident for the beach class, where the R-precision for the 1D histogram with 1-to-1 matching is notably higher than for the other methods. For the sofa and forest classes, the 2D histograms with 1-to-many matching have the highest R-precision, while for the residence class there is little variation between the methods. These differences can also be seen for two individual queries in Figure 2. For the query image from the sofa class, the top four retrieved images for the 2D histograms are correct, whereas the 1D histogram results both include an image from the residence class. This situation is reversed for the beach class, where the 1D histogram with 1-to-1 matching retrieves four correct images, while all other methods include images from the residence class in the top four. Examining the 2D histograms more closely reveals why the girl in red was retrieved as a good match to the beach. By chance, a few of the adjacent colours having a long common boundary are matched, resulting in 2D histograms having high peaks at the same place. The 2D histograms do not consider the percentage of each colour present in the image. For the very similar beach images, this percentage happens to be the best comparison measure.
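For reference, R-precision can be computed as in the following sketch. The function name `r_precision` and the toy ranking are hypothetical, and the sketch assumes that the query image itself is excluded from the ranked list; the class labels merely reuse the location names from the experiments.

```python
def r_precision(ranked_labels, query_label, num_relevant):
    """Precision at R: fraction of the top R results sharing the query's class.

    ranked_labels: class labels of the retrieved images, best match first
    (assumption: the query image itself is not in the list).
    num_relevant: R, the number of relevant images in the dataset.
    """
    if num_relevant <= 0:
        raise ValueError("R must be positive")
    top = ranked_labels[:num_relevant]
    return sum(label == query_label for label in top) / num_relevant

# Toy ranking for a beach query with R = 4 relevant images in the dataset.
ranking = ["beach", "beach", "residence", "beach", "sofa", "beach"]
score = r_precision(ranking, "beach", num_relevant=4)
```

Here three of the top four results are beach images, so the score is 0.75. How the per-query values are aggregated into the per-class and overall figures of Table 1 is not detailed in the text.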
5
Conclusion
We have investigated the use of 2D colour adjacency histograms for matching images in image retrieval scenarios. An interesting outcome is the class-dependent retrieval performance of the 1D and 2D histograms. However, for each class, at least one of the proposed methods outperforms the quadratic colour histogram distance. Further research will involve investigating the conditions under which 1D or 2D histograms perform better and hence designing an efficient combination of the 1D and 2D distance measures.
References
1. Schettini, R., Ciocca, G., Zuffi, S.: A survey of methods for colour image indexing and retrieval in image databases. In: Luo, R., MacDonald, L. (eds.) Color Imaging Science: Exploiting Digital Media, John Wiley, New York, NY (2001)
2. Matas, J., Marik, R., Kittler, J.: The color adjacency graph representation of multi-coloured objects. Technical Report VSSP-TR-1/95, Univ. of Surrey (1995)
3. Lee, H.Y., Lee, H.K., Ha, Y.H.: Spatial color descriptor for image retrieval and video segmentation. IEEE Trans. on Multimedia 5(3), 358–367 (2003)
4. Scheunders, P.: A comparison of clustering algorithms applied to colour image quantization. Pattern Recognition Letters 18, 1379–1384 (1997)
5. Deng, Y., Manjunath, B.S., Kenney, C., Moore, M.S., Shin, H.: An efficient color representation for image retrieval. IEEE Trans. Image Proc. 10(1), 140–147 (2001)
6. Meyer, F.: An overview of morphological segmentation. International Journal of Pattern Recognition and Artificial Intelligence 15(7), 1089–1118 (2001)
7. Angulo, J., Serra, J.: Color segmentation by ordered mergings. In: Proc. of the Int. Conf. on Image Processing, vol. II, pp. 125–128 (2003)
8. Hanbury, A., Serra, J.: Colour image analysis in 3D-polar coordinates. In: Michaelis, B., Krell, G. (eds.) Pattern Recognition. LNCS, vol. 2781, pp. 124–131. Springer, Heidelberg (2003)
9. Meyer, F.: Levelings, image simplification filters for segmentation. Journal of Mathematical Imaging and Vision 20, 59–72 (2004)
10. Soille, P.: Morphological Image Analysis, 2nd edn. Springer, Heidelberg (2002)
11. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
12. Hafner, J., Sawhney, H.S., Equitz, W., Flickner, M., Niblack, W.: Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. on Pattern Analysis and Machine Intelligence 17(7), 729–736 (1995)
Segmentation of Distinct Homogeneous Color Regions in Images
Daniel Mohr and Gabriel Zachmann
Department of Computer Science, Clausthal University, Germany
{mohr, zach}@in.tu-clausthal.de
Abstract. In this paper, we present a novel algorithm to detect homogeneous color regions in images. We show its performance by applying it to skin detection. In contrast to previously presented methods, we use only a rough skin direction vector instead of a static skin model as a priori knowledge. Thus, higher robustness is achieved in images captured under unconstrained conditions. We formulate the segmentation as a clustering problem in color space. A homogeneous color region in image space is modeled using a 3D gaussian distribution. Parameters of the gaussians are estimated using the EM algorithm with spatial constraints. We transform the image by a whitening transform and then apply a fuzzy k-means algorithm to the hue value in order to obtain initialization parameters for the EM algorithm. A divisive hierarchical approach is used to determine the number of clusters. The stopping criterion for further subdivision is based on the edge image. For evaluation, the proposed method is applied to skin segmentation and compared with a well known method.
1
Introduction
Image segmentation is an important task in computer vision. For instance, when tracking objects, it is used to identify the object to be tracked or parts of it. Other common applications can be found in medical image segmentation, for example to identify tumors or bones. OCR software also uses such algorithms to separate text from background, and segmentation has applications in image/video compression.
One issue during the initialization of a tracking system is to find the object’s spatial location in the image. Especially for nonrigid objects, color is the main feature to accomplish this task. Due to unknown lighting conditions (e.g. colored light), inhomogeneously colored background, different camera hardware, and other influences, the color of an object can be very different from its color under controlled conditions. Thus, a static color model of a target object may fail. A dynamic color model, however, needs a method to initially detect at least a large part of the object. Once this has been achieved, the color model can be adapted. When applied to skin segmentation, a large part of the skin has to be detected first; then previously presented algorithms for skin segmentation such as [1] can be used to update the color model during tracking and, thus, refine the segmentation.
The problem in detecting an object whose color is homogeneous under controlled conditions is that we don’t know the color distribution of the object under uncontrolled conditions. A color model obtained from images taken under different conditions can differ strongly and may fail. To address this problem, our approach estimates the color model instead of using a precomputed one.
Our method can basically be divided into two steps: first, the image has to be segmented correctly into subsets, each representing such an object; second, the correct subset representing the target object has to be identified. In the following, we call those image subsets image regions. Our method is able to identify such a region by exploiting the separability of different regions in color space and the fact that, in most cases, a target object has a certain direction relative to the image background. When applied to skin detection, our approach makes no fixed assumption about the skin color distribution, in contrast to other methods. To identify the image region representing the target, we only need a rough direction of skin relative to the background in color space.
Obviously, clustering should be performed in color space. We compared the RGB, HSV and Lab histograms of several images and could not observe any noticeable improvement in the separability of color regions in HSV or Lab space. Therefore we use the RGB color space and avoid a color space conversion. The regions can be appropriately modeled by three-dimensional Gaussians. We determine clusters by the EM algorithm with additional spatial constraints. Appropriate initialization values have to be calculated before performing the EM algorithm. To determine the number of clusters we use a hierarchical approach.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 432–440, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Related Work
In the past, several image segmentation algorithms have been introduced. Most approaches can be classified as graph cutting or color space clustering methods. Graph cutting methods model the image as a weighted graph and map segmentation to graph cutting with a specific cost function. Color space clustering methods approximate image segments with an appropriate model in color space. Both use criteria chosen such that the resulting segments are similar to those a human would define.

In [2], a graph cutting method with a normalized cut criterion is used. This approach avoids cutting off small sets of isolated pixels. [3] used a multilevel hypergraph to segment gray level images, with special interest in its performance on noisy images.

Several color space clustering approaches have been presented. In [4], morphological clustering followed by Markovian labeling is used for segmentation. [5] presented a k-means clustering in HSI space with application to medical images; special attention was paid to the cyclic property of the hue component. In [6], a novel initialization scheme for fuzzy c-means clustering was introduced. Dominant colors, defined as the most vivid and distinguishable colors, are calculated from reference colors. The colors nearest to the dominant colors are used as initial centroids.

[7] compared histogram-based and mixture model representations of skin and non-skin color. They constructed the color models for the skin and non-skin classes from a dataset of nearly 1 billion hand-labeled pixels. They found that the histogram-based representation is superior for very large training data sets, whereas for small training data sets the mixture model delivers better segmentation results. They reached a detection rate of 80% at a false positive rate of 8.5% for web images. The main disadvantage is the inflexibility of a static skin color model: it may perform poorly on images captured under conditions that differ from those of the training data set. [8] improved skin detection by a variational EM algorithm with spatial constraints. For initialization, they used the skin color model of [7]. In [9], a Gaussian mixture model was used to improve a skin-similar space, which was built from a rough classification with a static skin model. The skin Gaussian was identified by a Support Vector Machine using spatial and shape information. [10] proposed a skin segmentation method in YCbCr space, applying Bayesian decision rules. In [11], a face detector is used to generate a skin model, which is then applied to the image to detect skin. [1] predicted changes of skin color during tracking with a second order Markov model. Skin and non-skin color histograms are updated based on feedback from the current segmentation and prediction. Skin color changes are modeled as translation, scaling and rotation in color space. Their approach requires an initial detection of skin. It may fail if this initial skin segmentation has too low a detection rate or too high a false detection rate.

Fig. 1. Before we perform clustering, the image is transformed into a decorrelated, unit variance space. Panels (a) and (b) show the image in RGB space. Panels (c) and (d) show the image after transformation into a decorrelated, unit variance space. Each example is shown from two different viewpoints.
3 Our Approach
We use images captured under arbitrary lighting conditions and environments as input. The goal of our method is to segment the region of an image which represents the target. In our example, we want to identify skin regions, which are typically closer to red than common backgrounds. Nevertheless, the color distribution can deviate heavily from red. In contrast to previously proposed skin detection methods, we do not start with a fixed assumption about the skin color distribution, which could lead to a high false detection rate. In our input images, we have many unknown influence factors. By the central limit theorem, a point set that is influenced by many small factors can be well approximated by a normal distribution. Therefore, it is useful to model the color distribution of image regions with a mixture of Gaussians (GMM). We have chosen a well-known method to estimate the parameters of a GMM, the EM algorithm [12].

A homogeneous color region is a region in image space that represents an object which has a homogeneous color under white uniform illumination. In images captured under unconstrained conditions, the histograms of such regions can be heavily stretched. To compensate for potentially negative effects on the clustering algorithm, we transform¹ the histogram by

$$y_i = S^{-1/2} U^T (x_i - m)$$

where $m$ is the mean value of the image colors and $U$ and $S$ are obtained from the singular value decomposition $[U, S, V^T] = \mathrm{svd}(C)$ of the covariance matrix $C$ (see Figure 1).

Fig. 2. We use additional spatial constraints for the EM algorithm to get smoother regions. The pictures show an image (left) in which we detect skin without constraints (middle) and with constraints (right).

3.1 Spatial Constraints
A problem of color space clustering is that we often get many isolated pixels or very small regions in image space. An example is shown in Figure 2. To address this problem, we use spatial constraints to get smoother cluster borders. Before we can explain our smoothing method, we need to introduce the edge distance image $D(I)$. The Laplace edge detection operator is applied to the input image $I$. The resulting image, containing the edges of the input image, is denoted by $C(I)$. To each pixel $x_i \in C(I)$ the filter

$$F(x_i) = \max_{x_j \in N(x_i)} \frac{C(x_j)}{\lVert x_i - x_j \rVert + 1} \tag{1}$$

is applied, where $N(x_i)$ is the $k \times k$ image neighborhood of $x_i$. Thus, we obtain the image $\tilde{D}(I)$. After normalizing the pixels of $\tilde{D}(I)$ to $[0, 1]$, the final edge distance image $D(I)$ is obtained.
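The construction of the edge distance image can be illustrated with a short Python sketch. It is not the implementation used for the experiments: the 4-neighbor Laplacian kernel, the zero response at the image border, the normalization by the maximum value, and the function names `laplace_edges` and `edge_distance_image` are assumptions made for illustration. The image is taken to be a single-channel list of rows.

```python
import math

def laplace_edges(img):
    """Absolute 4-neighbor Laplacian response C(I); border pixels stay 0 (assumption)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = abs(img[y - 1][x] + img[y + 1][x] + img[y][x - 1]
                            + img[y][x + 1] - 4.0 * img[y][x])
    return out

def edge_distance_image(img, k=5):
    """Apply the filter of Eq. (1) over a k x k neighborhood, then normalize to [0, 1]."""
    c = laplace_edges(img)
    h, w = len(img), len(img[0])
    r = k // 2
    d = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best = 0.0
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    dist = math.hypot(x - xx, y - yy)
                    best = max(best, c[yy][xx] / (dist + 1.0))
            d[y][x] = best
    # Normalization by the maximum is one possible choice (assumption).
    peak = max(max(row) for row in d)
    if peak > 0.0:
        d = [[v / peak for v in row] for row in d]
    return d
```

With this construction, pixels on a strong edge receive the value 1, and the value decays with the distance to the nearest edge inside the neighborhood.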
¹ This transformation is similar to the whitening transform $y = U S^{-1/2} U^T (x - m)$, but for our algorithm there is no need to perform the leftmost matrix multiplication (by $U$).
Fig. 3. Initialization: White balancing is applied to the image part and the hue values of all pixels are calculated. The hue values are shifted such that values around 0 have minimal density. Fuzzy k-means is applied to these hue values. The resulting probabilities are used as input for the EM algorithm. (Diagram labels: image/image part; RGB histogram; white balancing; shifted hue values; 1D hue values; fuzzy k-means; hue value with min density; probabilities for all pixels for all clusters; EM input.)
The idea of our spatial smoothing method is based on $D(I)$. In a neighborhood $N(x_i)$ of a pixel $x_i$ that contains no edge, all pixels should have a similar probability of belonging to a particular cluster. If an edge is found in the neighborhood, we have no usable information about the membership of the pixels to clusters. Thus, in each iteration step of the EM algorithm, we calculate for all pixels the average probability $\bar{p}(x_i|\theta_i)$ over the neighborhood of size $l \times l$. Then, we use the edge distance image to interpolate between the probability of a pixel belonging to a cluster and the average neighborhood probability. The new probability $p_n(x_i|\theta_i)$ is calculated as

$$p_n(x_i|\theta_i) = p(x_i|\theta_i)\, D(x_i) + (1 - D(x_i))\, \bar{p}(x_i|\theta_i) \tag{2}$$

3.2 Initialization
The initialization step has a significant influence on the resulting clusters because the EM algorithm is only guaranteed to converge to a local optimum. Since we want to segment image regions with homogeneous color, it makes sense to initialize each cluster with a hue that differs as much as possible from the hues of the other clusters. Performing a simple hue clustering is not sufficient, however. Consider an image part whose average color is some red value and which we want to divide into two parts. If we cluster with respect to hue, we get one big red cluster representing the whole image part. Additionally, one cannot presume that the greatest principal axis of a region with homogeneous color in color space is parallel to the gray axis. To take these issues into account, we perform a PCA-based white balancing of each image part we want to segment. We also have to consider the cyclic property of the hue. Working with a metric that takes this cyclic property into account is difficult. To avoid this, we search for a hue value $\alpha_{min} \in [0, 360)$ in this cyclic color space with minimal point density and shift the point set by $-\alpha_{min}$. Finally, fuzzy k-means clustering is applied to these hue values. Figure 3 illustrates the initialization steps.
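The hue shift can be sketched in a few lines of Python. The text does not state how the point density is estimated, so the fixed-width histogram, the default of 36 bins, and the function names `min_density_hue` and `shift_hues` are assumptions made for illustration only.

```python
def min_density_hue(hues, bins=36):
    """Return the center (in degrees) of the emptiest histogram bin on the hue circle."""
    width = 360.0 / bins
    counts = [0] * bins
    for h in hues:
        # The final modulo guards against rounding to exactly 360 degrees.
        counts[int((h % 360.0) / width) % bins] += 1
    emptiest = min(range(bins), key=counts.__getitem__)
    return (emptiest + 0.5) * width

def shift_hues(hues, alpha_min):
    """Shift all hue values by -alpha_min, wrapping around at 360 degrees."""
    return [(h - alpha_min) % 360.0 for h in hues]
```

After the shift, a cluster of reddish hues that originally straddled 0 degrees (for example 350, 355, 0, 5 and 10) forms one contiguous interval, so a clustering method with an ordinary, non-cyclic distance can be applied to the shifted values.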
Fig. 4. Hierarchical clustering: In each iteration, the EM algorithm is performed with two kernels. The edge image is used to decide whether further subdivision is necessary. Clusters that need no further subdivision are compared with the skin direction vector to identify the correct skin cluster.
3.3 Hierarchical Clustering
An important question for clustering methods is the number of clusters into which a data set should be divided. Because it is hard to answer this question prior to clustering our image, we decided to use a hierarchical method. There are two main approaches to hierarchical clustering, agglomerative and divisive. We use a divisive method for two reasons. First, agglomerative clustering can have quadratic complexity. Second, the divisive approach has the advantage that we do not need to subdivide all clusters in our case, which yields a significant further speedup: because we are interested in a homogeneous color region of a particular color, for example skin, we can skip the subdivision of regions whose mean vector direction is too far away from the color of the target object.

Consider two images. The first image has uniform illumination; the second one has strong highlights and shadows, and its colors are poorly saturated. Compared to the first image, the color distributions of the image regions in the second image, including the target region, are closer to the gray axis and stretched along it. Therefore, we need to take the distribution parameters of the image into account when identifying the target region. To do so, we perform a whitening transform of the image colors before calculating the mean value of the target region. Let $m$ be the mean value and $[U, S, V^T] = \mathrm{svd}(C)$ the SVD of the covariance matrix of the image, and let $m_p$ be the mean value of an image part. Then the transformed mean value $\tilde{m}_p$ is calculated as

$$\tilde{m}_p = U S^{-1/2} U^T (m_p - m) \tag{3}$$

The transformed mean value of a cluster has to be compared with the transformed mean vector $\tilde{m}_S$ characterizing the target region. This vector has to be calculated in a preprocessing step. For skin detection, we do this with a small image data set. The images are captured under different illumination conditions, and the skin regions are hand-labeled. During clustering, we calculate the angle, expressed through the normalized dot product and weighted by the inverse distance, between the transformed mean vector $\tilde{m}_i$ of the $i$th cluster and $\tilde{m}_S$:

$$\alpha_i = \frac{\tilde{m}_S \cdot \tilde{m}_i}{\lVert \tilde{m}_S \rVert \, \lVert \tilde{m}_i \rVert \, \lVert \tilde{m}_S - \tilde{m}_i \rVert} \tag{4}$$

If $\alpha_i < \varepsilon$ for some user-defined $\varepsilon$, the cluster is classified as not containing the region that represents the target object. After clustering, the cluster with the largest $\alpha_i$ represents the color distribution of the target object.

The idea for our stopping criterion is based on the edge distance image $D(x_i)$. The better a subdivision of an image part into two parts, computed by clustering in color space, corresponds to a useful segmentation in image space, the nearer to image edges the pixels on the region borders should lie. Therefore, we evaluate the clustering quality through the average edge distance value on these region borders. If

$$\frac{1}{|B_1| + |B_2|} \sum_{x_i \in B_1 \cup B_2} D(x_i) > \delta \tag{5}$$

the clusters are split, otherwise not. $B_1$ and $B_2$ denote the region borders in image space.

Fig. 5. Results: the original images (first row), the results from [7] (second row) and our approach (third row).
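The cluster score of Eq. (4) and the split test of Eq. (5) can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: the mean vectors are assumed to be already transformed by Eq. (3), the border sets are assumed to be disjoint collections of `(x, y)` pixel coordinates, the handling of degenerate inputs is a choice made here, and the names `alpha_score` and `should_split` are hypothetical.

```python
import math

def alpha_score(m_s, m_i):
    """Eq. (4): normalized dot product of two transformed means, divided by their distance."""
    dot = sum(a * b for a, b in zip(m_s, m_i))
    norm_s = math.sqrt(sum(a * a for a in m_s))
    norm_i = math.sqrt(sum(b * b for b in m_i))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(m_s, m_i)))
    if norm_s == 0.0 or norm_i == 0.0:
        return 0.0          # no direction available (assumed convention)
    if dist == 0.0:
        return math.inf     # identical vectors: best possible match (assumed convention)
    return dot / (norm_s * norm_i * dist)

def should_split(d, border1, border2, delta=0.23):
    """Eq. (5): split if the mean edge distance value on both region borders exceeds delta."""
    border = set(border1) | set(border2)
    total = sum(d[y][x] for (x, y) in border)
    return total / (len(border1) + len(border2)) > delta
```

A cluster whose score falls below the user-defined threshold would be excluded from further subdivision, and among the remaining clusters the one with the largest score would be taken as the target region.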
4 Experimental Results
We applied our method to the segmentation of skin in images. Images with different illumination conditions and backgrounds were selected. For the parameter $l$, used in Section 3.1 to determine the neighborhood size for pixel probability averaging, we chose the value 3. We observed no further smoothing improvement for higher values of $l$, and a smaller value would mean no neighborhood or an asymmetric one. The parameter $k$, which determines the neighborhood size used to calculate the edge distance image, depends on $l$, because at each pixel we need to know whether an edge exists in the $l \times l$ neighborhood. Thus, we need $k \geq l$. The edge distance map is also used to calculate the stopping criterion. Because the region boundaries determined by color space clustering normally do not lie exactly at the edge pixels, we need some tolerance. Therefore, a higher value of $k$ would be better. But the higher $k$, the higher the computation cost for the edge distance map. As a compromise we set $k = 5$. For the parameter $\delta$ used in the stopping criterion of Section 3.3, $\delta = 0.23$ seems to work best for our test images. Only for the last image shown, where the edges between the black hair and the neck are very weak due to very dark skin, $\delta = 0.17$ works better.

To our knowledge, previously presented skin segmentation methods use a static skin model or other information, e.g. face detection, for initialization. Our approach only uses the information of a rough direction of skin relative to the background. We compare our method with the well-known approach of [7], because both can be used as initialization for a finer (skin) segmentation. We used the Matlab source code provided by [1], who used the method from [7] to initialize their algorithm. To make a fair comparison, we disabled the morphological filter. Clearly, a morphological or other filter could be applied to both methods as a post-processing step, but this is beyond the scope of this paper.

Figure 5 shows some results obtained with [7] and with our approach. The first three images shown have a resolution of 250 × 250. On an Athlon 64 X2 Dual machine, the algorithm processed each of the images in about 0.5 seconds. The last two images are taken from [1] and demonstrate the performance on dark skin. The examples show that we can obtain a better detection rate. False positives occur only as small regions. More results will be available soon on our web page².

In images in which skin cannot be well approximated by a Gaussian distribution, our algorithm will detect only a smaller part of the skin. If an image has a very uncommon background, for example a saturated red background, our algorithm would have problems identifying the correct region.
5 Conclusions and Future Work
In this paper, we have proposed a new method for homogeneous color region segmentation in images. The method itself can be applied to any kind of homogeneously colored surface. In this paper, we show its application and performance in skin detection. Our approach is based on divisive hierarchical clustering in color space with spatial constraints and combines global color information with local edge information. Robustness and accuracy are gained especially by using the input image itself, and not a fixed distribution, to extract the color distribution of the target region. Homogeneous color regions are modeled as 3D Gaussians whose parameters are estimated by the EM algorithm. The cluster representing the target region, for example skin, is identified by comparing the mean value of each cluster with a vector obtained in a preprocessing step. For this comparison, the image color distribution is taken into account.

In the future, we plan to extend our method to model-based approaches, too. This can also be used to improve the stopping criterion of the subdivision algorithm. Furthermore, we want to extend the color distribution model to handle warped clusters.

² http://cg.in.tu-clausthal.de/research/skinseg
References

1. Sigal, L., Sclaroff, S., Athitsos, V.: Estimation and Prediction of Evolving Color Distributions for Skin Segmentation Under Varying Illumination. In: IEEE Conf. on Computer Vision and Pattern Recognition (2000)
2. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
3. Rital, S., Cherifi, H., Miguet, S.: A segmentation algorithm for noisy images. In: Computer Analysis of Images and Patterns, 11th International Conference (2005)
4. Geraud, T., Strub, P.-Y., Darbon, J.: Color image segmentation based on automatic morphological clustering. In: International Conference on Image Processing, pp. 70–73 (2001)
5. Zhang, C., Wang, P.: A New Method of Color Image Segmentation Based on Intensity and Hue Clustering. In: Proceedings of the International Conference on Pattern Recognition, vol. 3617 (2000)
6. Kim, D.-W., Hyung Lee, K., Lee, D.: A novel initialization scheme for the fuzzy c-means algorithm for color clustering. Pattern Recognition Letters 25, 227–237 (2004)
7. Jones, M.J., Rehg, J.M.: Statistical Color Models with Application to Skin Detection. International Journal of Computer Vision 46(1), 81–96 (2002)
8. Diplaros, A., Gevers, T., Vlassis, N.: Skin detection using the EM algorithm with spatial constraints. In: IEEE International Conference on Systems, Man and Cybernetics, pp. 3071–3075 (2004)
9. Zhu, Q., Cheng, K.-T., Wu, C.T., Wu, Y.L.: Adaptive Learning of an Accurate Skin-Color Model. In: IEEE International Conference on Automatic Face and Gesture Recognition (2004)
10. Chai, D., Bouzerdoum, A.: A Bayesian approach to skin color classification in YCbCr color space. Theme, Intelligent Systems and Technologies for the New Millennium (2000)
11. Wimmer, M., Radig, B.: Adaptive Skin Color Classificator. In: International Conference on Graphics, Vision and Image Processing (2005)
12. Bilmes, J.A.: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021
Estimating the Color of the Illuminant Using Anisotropic Diffusion

Marc Ebner

Universität Würzburg, Lehrstuhl für Informatik II, Am Hubland, 97074 Würzburg, Germany
[email protected]
http://wwwi2.informatik.uni-wuerzburg.de/staff/ebner/welcome.html

Abstract. The human visual system is able to determine the color of objects irrespective of the power distribution illuminating the scene. This ability is called color constancy. It would be highly interesting to mimic this ability of the human visual system. Accurate color reproduction is very important in applications ranging from consumer photography to automatic color-based object recognition. In theory, if we knew the color of the illuminant for each image pixel, we would be able to compute a color corrected image which is independent of the illuminant. We suggest the use of anisotropic diffusion to estimate the illuminant locally for each image pixel.
1 Motivation
The human visual system is able to determine the color of objects irrespective of the type of illuminant used. The ability to compute color constant descriptors from the light entering the eye is called color constancy [1,2]. In contrast, colors measured by a digital sensor are not constant. The response of the sensor depends on the color of the illuminant. Obtaining accurate colors of objects is very important from consumer photography to automatic object recognition based on color. In general, accurate color rendition is required for many algorithms in color image processing. In most cases, we have multiple illuminants. For instance, we may have artificial lighting inside a room and additionally daylight falling through a window. If we have several different light sources, then the illuminant is no longer uniform over the entire image. In this case, we have to estimate the illuminant locally for each image pixel [3,4]. So far, only a few algorithms have been developed which address the problem of a locally varying illuminant. Most of the color constancy literature deals with a uniform illuminant. The algorithms which also take a spatially varying illuminant into account are the algorithms of Land and McCann [5], Horn [6], Blake [7], Moore et al. [8], Rahman et al. [9], Barnard et al. [10], and Ebner [4].
2 Locally Estimating the Color of the Illuminant
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 441–449, 2007. © Springer-Verlag Berlin Heidelberg 2007

Ebner [4] has developed a parallel algorithm for color constancy which estimates the illuminant locally for each image pixel. The algorithm runs on a grid of processing elements and uses isotropic diffusion to compute local space average color. Let $a$ be the current estimate of local space average color and let $c$ be the color of the pixel stored at the current processing element. The data which is available from neighboring elements is averaged, and then a small amount of the color of the current element is added to the average. Let $N(x, y)$ be the set of neighboring processing elements of element $(x, y)$. The following two update equations are iterated until convergence

$$a'(x, y) = \frac{1}{|N(x, y)|} \sum_{(x', y') \in N(x, y)} a(x', y') \tag{1}$$

$$a(x, y) = a'(x, y) \cdot (1 - p) + c(x, y) \cdot p \tag{2}$$
where p is a small percentage larger than zero. During each iteration a new estimate of local space average color is computed. The parameter p determines the extent over which local space average color is computed. For small values of p, local space average color is computed over a large area. For large values of p, local space average color is computed for a small area. For the gray world assumption to hold, we need to compute local space average color for a comparatively large area.
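The two update equations can be illustrated with a small Python sketch for a single color channel. Several details are assumptions made here and are not taken from the text: the 4-connected neighborhood, the synchronous update of all elements, the initialization of the estimate with the input image, and the fixed iteration count used in place of a convergence test. The function name `local_space_average` is hypothetical.

```python
def local_space_average(colors, p=0.01, iterations=1000):
    """Iterate Eqs. (1) and (2) on a 2D list holding one color channel."""
    h, w = len(colors), len(colors[0])
    a = [row[:] for row in colors]  # initial estimate (assumption)
    for _ in range(iterations):
        nxt = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # Boundary elements average only the neighbors that exist.
                nb = [a[yy][xx]
                      for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                      if 0 <= yy < h and 0 <= xx < w]
                avg = sum(nb) / len(nb) if nb else a[y][x]          # Eq. (1)
                nxt[y][x] = avg * (1.0 - p) + colors[y][x] * p      # Eq. (2)
        a = nxt
    return a
```

For a color image, the same function would be applied to each of the three channels. A smaller `p` lets information travel further before the local color is mixed in, which corresponds to the larger area of support described above.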
Fig. 1. Let us assume a linearly changing illuminant and consider the processing element located at the center of the grid. The differences between the averages computed by this processing element will cancel, which results in a series of vertical bands. (Plot: intensity over position x, with the processing elements (PE) labeled Left, Current and Right.)
Let us now assume that the illuminant changes smoothly over the entire image from left to right. Figure 1 illustrates such a smooth horizontal change of the illuminant from left to right. Consider the data computed by the processing element in the center of the image. The average computed by the element on the left will be slightly lower whereas the average computed by the element on the right will be slightly higher than the current average. The data computed by processing elements above and below will be equivalent to the data computed by the current element. If we average the data from the neighboring elements, the differences between the element on the left and the element on the right will cancel. After convergence, one obtains a series of vertical bands. In other words, the estimate of local space average color will be correct, provided that we have a linearly changing illuminant.
Fig. 2. Assuming a non-linear change of the illuminant from left to right, the average computed by the processing element on the left and the average computed by the processing element on the right will not be equivalent. The estimate computed by the processing element located in the center will be too low. (Plot: intensity over position x, with the processing elements (PE) labeled Left, Current and Right.)
Let us now assume that we have a non-linear change of the illuminant as shown in Figure 2. In this case, the average computed by the element on the left and on the right will no longer cancel. The estimate of the illuminant will be slightly too low. We can solve this problem by computing space average color only along the vertical. Thus, we would be estimating the color of the illuminant independently for each column of the image. We could also try to reduce the area of support. For our parallel algorithm, the area of support is determined by the parameter p. If we make the area of support sufficiently small, then the assumption that the illuminant varies linearly may be correct. In practice, it may not be possible to make the area of support very small because then the gray world assumption may no longer hold. However, if the area of support is very large, then the area of support will likely straddle a non-linear transition of the illuminant.
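The linear and the non-linear case can be checked with a short one-dimensional calculation. This is a simplified sketch: it considers only the averaging of the left and right neighbors along $x$ with unit spacing and ignores the contribution of $c$ weighted by $p$. The quadratic profile and the symbols $a_0$, $m$ and $\kappa$ are assumptions introduced for illustration.

```latex
% Linear profile a(x) = a_0 + m x: the neighbor average reproduces the local value.
\frac{a(x-1) + a(x+1)}{2}
  = \frac{(a_0 + m x - m) + (a_0 + m x + m)}{2}
  = a_0 + m x = a(x)

% Quadratic profile a(x) = a_0 + m x + \tfrac{\kappa}{2} x^2: a constant bias remains.
\frac{a(x-1) + a(x+1)}{2} - a(x)
  = \frac{\kappa}{4}\left[(x-1)^2 + (x+1)^2 - 2x^2\right]
  = \frac{\kappa}{2}
```

For $\kappa < 0$ the neighbor average lies below the local value, which is consistent with an estimate that is too low, and the bias vanishes only when the profile is linear.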
3 Use of Anisotropic Diffusion
A more accurate estimate of the illuminant can be obtained if the averaging is done non-uniformly. This leads us to anisotropic diffusion, which is frequently used to segment noisy images [11,12,13]. For our example, we have assumed a horizontally varying illuminant. The illuminant is constant along the vertical. Let us call these vertical lines the lines of constant illumination or iso-illumination lines. If the illuminant were changing from top to bottom, the iso-illumination lines would be oriented horizontally. In this case, we could average the data along the lines of the image. In general, of course, the change of the illuminant may be arbitrary across the image. A spotlight may cause a circular or elliptic change of the illuminant. Suppose that we knew how the iso-illumination lines run across the image. Then we could establish an estimate of the illuminant by averaging the data only along the iso-illumination lines. In other words, we suggest the use of anisotropic diffusion in order to estimate the illuminant locally for each image pixel. If we only average the data in a direction which is perpendicular to the illumination gradient, then the result will be independent of the area of support, provided that it is sufficiently large. This is a very important point. If we average the data along the iso-illumination lines, the scaling parameter $p$ could be made arbitrarily small. It would only depend on the size of the image. In contrast, if we compute isotropic local space average color, the area of support should be small for areas where the illuminant is changing very much and it would have to be large for areas where the illuminant is constant. If we compute local space average color anisotropically along the iso-illumination lines, then the estimate of the illuminant will be correct also for a non-linearly changing illuminant.

Fig. 3. Rotated coordinate system where the directions left/right point along the line of constant illumination. The directions front/back point along the gradient. A curved line of constant illumination, with its center of curvature and the points P1 and P2, is shown on the right.

Let us define a local coordinate system for each image pixel as shown in Figure 3 on the left. For each pixel we have four directions. Two are perpendicular to the iso-illumination line and two point along the iso-illumination line. The gradient of the illuminant defines this coordinate system for each image pixel. The problem is that we need to have some estimate of the illuminant in order to compute the gradient. We can use isotropic diffusion to get a rough estimate of the illuminant. This estimate can then be used to compute the gradient. Let $a$ be the estimate computed using isotropic diffusion or some other method such as applying a Gaussian filter to the input image. Let $[dx_i, dy_i]$ be the gradient of color channel $i$:

$$\nabla a_i = \begin{bmatrix} dx_i \\ dy_i \end{bmatrix} = \begin{bmatrix} \partial a_i / \partial x \\ \partial a_i / \partial y \end{bmatrix}. \tag{3}$$

Since we have a three-band color image, we would obtain slightly different gradients for each color band. We assume that a single gradient $[dx, dy]$ describes the actual distribution of the illuminant. Thus, we need to combine the three gradients into one. One way to do this is to simply average the gradients:

$$\begin{bmatrix} dx \\ dy \end{bmatrix} = \frac{1}{3} \sum_{i \in \{r, g, b\}} \nabla a_i \tag{4}$$
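Equations (3) and (4) can be sketched in Python as follows. The discretization is not specified in the text, so the central differences in the interior, the one-sided differences at the image border, and the function names `channel_gradient` and `combined_gradient` are assumptions made for illustration.

```python
def channel_gradient(a):
    """Eq. (3): finite-difference gradient (dx, dy) of one channel given as a 2D list."""
    h, w = len(a), len(a[0])
    dx = [[0.0] * w for _ in range(h)]
    dy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            if x1 > x0:
                dx[y][x] = (a[y][x1] - a[y][x0]) / (x1 - x0)
            if y1 > y0:
                dy[y][x] = (a[y1][x] - a[y0][x]) / (y1 - y0)
    return dx, dy

def combined_gradient(channels):
    """Eq. (4): average the per-channel gradients (e.g. r, g, b) into a single field."""
    grads = [channel_gradient(a) for a in channels]
    h, w = len(channels[0]), len(channels[0][0])
    n = len(grads)
    dx = [[sum(g[0][y][x] for g in grads) / n for x in range(w)] for y in range(h)]
    dy = [[sum(g[1][y][x] for g in grads) / n for x in range(w)] for y in range(h)]
    return dx, dy
```

The input channels would be the rough illuminant estimate, for example the output of isotropic diffusion, rather than the raw image.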
It would also be possible to use different weights for the three channels. We could also apply weights based on the magnitude of the gradient. Let $a(\text{front})$, $a(\text{back})$, $a(\text{left})$, and $a(\text{right})$ be the estimates of space average color in the corresponding directions front, back, left and right as shown in Figure 3. Let $c(x, y)$ be the color at the current element. It is assumed that the locations front, back, left and right are a unit distance away from the current element. The data at these locations is computed using bilinear interpolation. We can calculate local space average color $a$ by averaging the existing estimates obtained from the left and the right along the line of constant illumination

$$a'(x, y) = \frac{1}{3} \left( a(x, y) + a(\text{left}) + a(\text{right}) \right) \tag{5}$$

$$a(x, y) = a'(x, y) \cdot (1 - p) + c(x, y) \cdot p \tag{6}$$

where $p$ is a small percentage. Data from the front/back direction can also be included by introducing an additional parameter $q$ with $q \in [0, \frac{1}{6}]$ which describes the exchange along the front/back direction.

$$a'(x, y) = \left( \frac{1}{3} - q \right) a(\text{left}) + \left( \frac{1}{3} - q \right) a(\text{right}) + \frac{1}{3} a(x, y) + q\, a(\text{front}) + q\, a(\text{back}) \tag{7}$$

If $q$ is equal to zero, then we average data only along the line of constant illumination. If $q$ is larger than zero, some data will also flow along the front/back direction. The reader should be aware that the update equations are given for the center elements only. Processing elements which are located on the boundary of the image can only average the data which is available.

Note that the line of constant illumination may not be a straight line. In the case of a spotlight it may be circular or elliptic. For actual images which contain local light sources this is likely to be the case. Thus, we have to determine the curvature for each image pixel. The curvature $K$ of a point $(x, y)$ on a surface $F(x, y)$ is defined as [14]

$$K = \frac{\begin{vmatrix} F_{xx} & F_{xy} & F_x \\ F_{yx} & F_{yy} & F_y \\ F_x & F_y & 0 \end{vmatrix}}{(F_x^2 + F_y^2)^{3/2}} \tag{8}$$

with $F_x = \frac{\partial F}{\partial x}$, $F_y = \frac{\partial F}{\partial y}$, $F_{xy} = \frac{\partial^2 F}{\partial x \partial y}$, $F_{yx} = \frac{\partial^2 F}{\partial y \partial x}$, $F_{xx} = \frac{\partial^2 F}{\partial x^2}$, and $F_{yy} = \frac{\partial^2 F}{\partial y^2}$. Let $dx$ and $dy$ be the gradient computed from the estimated illuminant using isotropic diffusion. The curvature of the illuminant is then computed by setting $F_x = dx$, $F_y = dy$, $F_{xy} = \frac{\partial}{\partial x} dy$, $F_{yx} = \frac{\partial}{\partial y} dx$, $F_{xx} = \frac{\partial}{\partial x} dx$ and $F_{yy} = \frac{\partial}{\partial y} dy$. From the curvature $K$, we can calculate the radius $r$ of the curve [14]

$$r = \frac{1}{K} = \frac{(F_x^2 + F_y^2)^{3/2}}{F_x F_{xy} F_y + F_x F_{yx} F_y - F_x^2 F_{yy} - F_y^2 F_{xx}}. \tag{9}$$
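The averaging step of Eq. (7) and the curvature of Eqs. (8) and (9) can be written down directly for scalar inputs. The sketch below is illustrative only: the function names are hypothetical, the values at left, right, front and back are assumed to have been interpolated already, and a nonzero gradient is assumed when the curvature is evaluated.

```python
import math

def anisotropic_average(a_center, a_left, a_right, a_front, a_back, q=0.0):
    """Eq. (7); with q = 0 this reduces to the three-point average of Eq. (5)."""
    if not 0.0 <= q <= 1.0 / 6.0:
        raise ValueError("q must lie in [0, 1/6]")
    side = 1.0 / 3.0 - q
    return side * a_left + side * a_right + a_center / 3.0 + q * (a_front + a_back)

def curvature(fx, fy, fxx, fxy, fyx, fyy):
    """Eq. (8): expanded bordered determinant divided by (Fx^2 + Fy^2)^(3/2)."""
    det = fx * fxy * fy + fx * fyx * fy - fx * fx * fyy - fy * fy * fxx
    return det / (fx * fx + fy * fy) ** 1.5   # requires a nonzero gradient

def radius(fx, fy, fxx, fxy, fyx, fyy):
    """Eq. (9): signed radius r = 1/K; infinite for a straight iso-illumination line."""
    k = curvature(fx, fy, fxx, fxy, fyx, fyy)
    return math.inf if k == 0.0 else 1.0 / k
```

As a plausibility check, for $F(x, y) = x^2 + y^2$ the level curve through the point (3, 4) is a circle of radius 5, and `radius(6, 8, 2, 0, 0, 2)` has magnitude 5.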
The center of the curvature lies on the positive side of the curve normal if $K > 0$ and on the negative side otherwise. If $K = 0$, then the line of constant illumination is indeed a straight line. The two points along the line of constant illumination from which we need to extract our current estimate of local space average color can be found as shown in Figure 3 on the right. We need to determine the intersection points between the unit circle and the circle determined from the current estimate of the illuminant. This will give us the two values $a(P_1)$ and $a(P_2)$ which can be used in the anisotropic averaging operation as described above. In order to compute the intersection points, we assume that the center of the curvature is located at point $(r, 0)$. Then it is straightforward to compute the location of the two intersection points $P_1$ and $P_2$ from the two circles, which are given by

$$x^2 + y^2 = 1 \quad \text{and} \quad (x - r)^2 + y^2 = r^2. \tag{10}$$

We solve for $x$ and obtain $x = \frac{1}{2r}$. The $y$ coordinates are obtained from the equation of the unit circle. We obtain $y_{1/2} = \pm \sqrt{1 - \frac{1}{4r^2}}$. Of course, in the general case, the center of the curvature does not lie on the X axis. In this case, we simply perform an appropriate rotation of the coordinate system. Local space average color at points $P_1$ and $P_2$ is again obtained using bilinear interpolation. Let $\check{a}$ be the previous estimate of local space average color. We compute

$$\check{a}'(x, y) = \frac{1}{3} \left( \check{a}(P_1) + \check{a}(x, y) + \check{a}(P_2) \right). \tag{11}$$

Once the data from neighboring elements is averaged, the color from the current element is slowly faded into the result

$$\check{a}(x, y) = \check{a}'(x, y) \cdot (1 - p) + c(x, y) \cdot p. \tag{12}$$

Again, $p$ is a small value larger than zero. These update equations are iterated for each processing element until convergence. We can now handle curved non-linear transitions of the illuminant by averaging the data along the line of constant illumination.
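The intersection points of Eq. (10), including the rotation into image coordinates, can be sketched as follows. The parameterization is an assumption made for illustration: `(ux, uy)` is taken to be the unit vector pointing from the current pixel toward the center of curvature, coordinates are relative to the current pixel, and the function name `intersection_points` is hypothetical.

```python
import math

def intersection_points(r, ux, uy):
    """Intersect the unit circle with the circle of signed radius r centered at r * (ux, uy)."""
    if abs(r) < 0.5:
        raise ValueError("the circles do not intersect for |r| < 1/2")
    x = 1.0 / (2.0 * r)                      # from Eq. (10); 0 for a straight line
    y = math.sqrt(max(0.0, 1.0 - x * x))     # y = sqrt(1 - 1/(4 r^2))
    # Rotate the local points (x, +y) and (x, -y) so that the local x axis maps to (ux, uy).
    p1 = (x * ux - y * uy, x * uy + y * ux)
    p2 = (x * ux + y * uy, x * uy - y * ux)
    return p1, p2
```

For an infinite radius the two points reduce to the unit steps perpendicular to `(ux, uy)`, which corresponds to the straight-line case with neighbors left and right. The estimates at the returned offsets would then be read with bilinear interpolation before applying Eqs. (11) and (12).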
4 Experimental Results
Barnard et al. [15] created a set of images to be used for color constancy research. We took two images from the data set which show the same scene under two different illuminants and combined them artificially. The illuminant of the first image is bluish. The illuminant of the second image is basically white. Figure 4 shows the combined images. A horizontal gradient was used to create the first image whereas a spotlight effect was simulated for the second image. The intensity of the first image was increased by a factor of three. Apart from having a color gradient, we thus also have an intensity gradient. Figure 4 shows the results for the algorithm described above. The parameter p was set to 0.0001. The parameter q was set to 0. The illuminant is roughly approximated by computing a blurred input image using exponential blur. The intensity was used to estimate the illuminant. For the first image, we obtain horizontal lines of constant illumination, whereas for the second image, we obtain circular lines of constant illumination. Next, we use anisotropic diffusion to estimate the illuminant locally for each image pixel. When we compare the estimate of the illuminant obtained using isotropic diffusion with the estimate obtained using anisotropic diffusion, we see that computing local space average color using anisotropic diffusion estimates the color of the illuminant more accurately. Once the illuminant is known, we can compute a color corrected image. The resulting output images are shown in Figure 4. When comparing the two output images we see that some detail is missing from the resulting image when local space average color is computed using isotropic diffusion. In contrast, when local space average color is computed using anisotropic diffusion, more detail is retained.

[Figure 4 panels: Input Images; Output Images (Isotropic); Output Images (Anisotropic); Est. Iso-Illumination Lines; Isotropic Local Avg. Color; Anisotropic Local Avg. Color]

Fig. 4. Two images from the database of Barnard et al. [15] were merged to simulate a horizontal illumination gradient and a spotlight effect. Results are shown using isotropic diffusion and anisotropic diffusion.
5 Conclusion
Many color constancy algorithms have been proposed. Few of the algorithms address the problem of estimating the illuminant locally for each image pixel. Most algorithms assume that the color of the illuminant is constant over the entire image. We have shown how anisotropic diffusion may be used to estimate the illuminant locally. First the illuminant is roughly approximated using Gaussian or exponential blur, then a refined estimate of the illuminant is computed. The algorithm runs on a grid of processing elements where data is exchanged only between neighboring elements. Each element estimates the line of constant illumination and then averages data along this line. As a result, we obtain an estimate which may also be used if the illuminant varies non-linearly across the image.
References
1. Zeki, S.: A Vision of the Brain. Blackwell Science, Oxford (1993)
2. Ebner, M.: Color Constancy. John Wiley & Sons, England (2007)
3. Ebner, M.: Color constancy using local color shifts. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3023, pp. 276–287. Springer, Heidelberg (2004)
4. Ebner, M.: A parallel algorithm for color constancy. Journal of Parallel and Distributed Computing 64, 79–88 (2004)
5. Land, E.H., McCann, J.J.: Lightness and retinex theory. Journal of the Optical Society of America 61, 1–11 (1974)
6. Horn, B.K.P.: Determining lightness from an image. Computer Graphics and Image Processing 3, 277–299 (1974)
7. Blake, A.: Boundary conditions for lightness computation in mondrian world. Computer Vision, Graphics, and Image Processing 32, 314–327 (1985)
8. Moore, A., Allman, J., Goodman, R.M.: A real-time neural system for color constancy. IEEE Transactions on Neural Networks 2, 237–247 (1991)
9. Rahman, Z., Jobson, D.J., Woodell, G.A.: Method of improving a digital image. United States Patent No. 5,991,456 (1999)
10. Barnard, K., Finlayson, G., Funt, B.: Color constancy for scenes with varying illumination. Computer Vision and Image Understanding 65, 311–321 (1997)
11. Didas, S., Weickert, J., Burgeth, B.: Stability and local feature enhancement of higher order nonlinear diffusion filtering. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) Pattern Recognition. LNCS, vol. 3663, pp. 451–458. Springer, Heidelberg (2005)
12. Monteil, J., Beghdadi, A.: A new interpretation and improvement of the nonlinear anisotropic diffusion for image enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence 21, 940–946 (1999)
13. Weickert, J.: A review of nonlinear diffusion filtering. In: ter Haar Romeny, B.M., Florack, L.M.J., Viergever, M.A. (eds.) Scale-Space 1997. LNCS, vol. 1252, pp. 3–28. Springer, Heidelberg (1997)
14. Bronstein, I.N., Semendjajew, K.A., Musiol, G., Mühling, H.: Taschenbuch der Mathematik (5 edn.) Verlag Harri Deutsch, Thun und Frankfurt/Main (2001)
15. Barnard, K., Martin, L., Funt, B., Coath, A.: A data set for color research. Color Research and Application 27, 147–151 (2002)
Restoration of Color Images Degraded by Space-Variant Motion Blur

Michal Šorel and Jan Flusser

Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 4, 182 08 Praha 8, Czech Republic
{sorel,flusser}@utia.cas.cz
Abstract. We propose an algorithm for restoration from multiple color images degraded by camera motion blur. We consider the special case when the camera moves in one plane perpendicular to the optical axis without any rotations. The algorithm needs to know neither camera motion nor camera parameters. The proposed algorithm belongs to the group of variational methods that estimate simultaneously sharp image and depth map, based on the minimization of a cost functional. Feasibility of the algorithm is demonstrated by two experiments with real images.
1 Introduction
Subject to physical and technical limitations, the output of digital cameras is not perfect and a substantial part of image processing research focuses on removing various types of degradations. One of the frequent degradations is the blur caused by camera motion. For an RGB image, it can be described by the linear relation

$$z^i(x, y) = \iint u^i(x - s, y - t)\, h(x - s, y - t; s, t)\, ds\, dt, \qquad (1)$$

where $u = \{u^R, u^G, u^B\}$ is a sharp image, $h$ is called the point-spread function (PSF) or mask and $z = \{z^R, z^G, z^B\}$ is the blurred image. In this paper, the PSF is considered to be the same for all three color bands. However, the proposed algorithm can be easily modified to work with separate PSFs for each band. If we assume a planar scene perpendicular to the optical axis and steady motion of the pinhole camera in a plane parallel to the scene, it is well known that the PSF is a space-invariant one-dimensional rectangular impulse in the direction of camera motion. In the general case, the PSF can be very complex depending on the camera motion, depth of scene and parameters of the optical system. The important task of finding the sharp image $u$ when we know the blurred image $z$ and possibly the PSF $h$ is called restoration, deblurring or, if $h$ is
Research has been supported by the Czech Ministry of Education, Youth, and Sports under the project 1M0572 (Research Center DAR).
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 450–457, 2007. © Springer-Verlag Berlin Heidelberg 2007
space-invariant, deconvolution. If even the PSF is not known, we speak about blind restoration or deconvolution. Blind restoration from only one image is an ill-posed problem. Therefore, it is often assumed that we have at least two observations of the same scene, which is denoted as multichannel (MC) restoration or deconvolution. There exist many methods for restoration of a single image degraded by known space-invariant blur, so-called space-invariant single channel (SC) non-blind restoration techniques [1]. Many of them are formulated as linear problems; some others, including important anisotropic regularization techniques such as [2], can be reduced to a sequence of linear problems. Extension of these methods to the multichannel case is straightforward and many of them can be used in space-variant situations as well. Blind restoration requires more complicated algorithms as we need to estimate the unknown degradation. In the space-invariant case, there exist several MC methods successfully solving the problem. We use the method [3] as part of the proposed algorithm. If there are no constraints on the shape of the PSF and the way it can change throughout the image (space-variant blind restoration), the task is strongly underdetermined even in the MC case. In the case of space-variant blur caused by camera motion we can constrain the shape of the PSF and so reduce the number of unknowns. The majority of algorithms assume that the PSF is space-invariant in a window of reasonable size. Unfortunately, the condition of space-invariance is often not satisfied, especially at the edges of objects. So far, the only approach giving relatively precise results in such cases is the family of MC variational methods that first appeared in the context of out-of-focus images in [4]. This approach was adopted in [5] where the camera motion blur was modeled by a Gaussian PSF, locally deformed according to the direction and extent of blur. This method can be appropriate for small blurs.
2 Contributions
In this paper, we present a novel variational algorithm for restoration from multiple color images blurred by camera motion. We focused on the special case when the camera moves in one plane perpendicular to the optical axis without any rotations. This limitation includes not only the camera motion during the capture of one image but also the change of camera position between the images, which ensures that the depth map is common for all the images. The algorithm needs to know neither camera motion nor camera parameters. Compared to other variational methods [4,5], the proposed algorithm works for more complex camera motions. To suppress noise, we exploit correlation between color bands. We also propose an auxiliary algorithm for rough estimation of depth maps.
3 Relation Between PSF and Depth of Scene
To model camera motion blur, we need to express the PSF as a function of camera motion and depth of scene. In our case, thanks to the limitation of camera motion to one plane perpendicular to the optical axis [6],

$$h(x, y; s, t) = d^2(x, y)\, h_0\big(s\, d(x, y),\, t\, d(x, y)\big), \qquad (2)$$
where the “prototype” PSF h0 (s, t) corresponds to the path covered by the camera while the shutter is open. Depth map d is given in focal length units. Equation (2) implies that if we know the PSF for an arbitrary fixed distance from camera, we can compute it for any other distance by simple stretching in the ratio of the depths.
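The stretching described by (2) can be checked numerically with a short sketch. The Gaussian prototype used here is only a stand-in chosen for illustration (the paper estimates $h_0$ from the images), and the names `psf_at_depth` and `gaussian_prototype` are hypothetical. The example verifies that the factor $d^2$ keeps the PSF normalized when it is stretched.

```python
import numpy as np

def psf_at_depth(h0, d):
    """Return h(s, t) = d^2 * h0(s*d, t*d) for a constant depth d, as in (2)."""
    return lambda s, t: d**2 * h0(s * d, t * d)

def gaussian_prototype(s, t, sigma=1.0):
    """Illustrative prototype PSF with unit integral (an assumption)."""
    return np.exp(-(s**2 + t**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)

# Riemann-sum check that the rescaled PSF still integrates to one.
step = 0.02
axis = np.arange(-6.0, 6.0, step)
s, t = np.meshgrid(axis, axis)
h = psf_at_depth(gaussian_prototype, d=2.0)
mass = h(s, t).sum() * step**2
```

With $d = 2$ the support of the PSF shrinks by a factor of two relative to the prototype while its integral stays at one, which matches the statement that distant parts of the scene are blurred less.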
4 Algorithm
First, we define notation for the space-variant convolution (1) as $z^i = u^i *_v h$. We also need the operator adjoint to the space-variant convolution [6]

$$\left[u^i \circledast_v h\right](x, y) = \iint u^i(x - s, y - t)\, h(x, y; -s, -t)\, ds\, dt. \qquad (3)$$

Let us denote the blurred images at the input as $z_p = \{z_p^R, z_p^G, z_p^B\}$. The proposed algorithm can be described as minimization of the cost functional

$$E(u, w) = \frac{1}{2} \sum_{p=1}^{P} \sum_{i \in \{R,G,B\}} \left\| u^i *_v h_p(w) - z_p^i \right\|^2 + \lambda_u Q(u) + \lambda_w R(w) \qquad (4)$$

with respect to the sharp image $u = \{u^R, u^G, u^B\}$ and the inverse depth map $w$, $w(x, y) = \frac{1}{d(x, y)}$. As we already mentioned, the depth map is the same for all the images. The first term of (4), called the error term, is a measure of the difference between the blurred images $z_p$ and the image $u$ blurred using the information about depth of scene in $w$. The size of the difference is measured by the $L_2$ norm $\|\cdot\|$. For image $p$, the operator $h_p(w)$ gives the PSF (2) corresponding to the depth map represented by $w$. Its space-variant convolution with image $u$ models the process of blurring. As mentioned above, in our case it is sufficient to know the PSF for one fixed depth, and $h_p$ can be computed for an arbitrary depth using (2). For this purpose, we can apply the space-invariant blind restoration method [3] on a suitable section of the images with approximately space-invariant blur. Besides the restored sections, this method also provides an estimate of the masks (PSFs), which become the prototypes $h_0$ from (2). As we usually do not know the real depth for this part of the scene, the depth map we compute is correct only up to a scale factor. This is however sufficient, since our primary goal is restoration. Note that $h_p$ incorporates the shift of the camera between images. The actual implementation of $h_p$ used in our experiments can be found in [6].
The role of regularization terms

$$Q(u) = \iint \sqrt{|\nabla u^R|^2 + |\nabla u^G|^2 + |\nabla u^B|^2}\; dx\, dy \quad \text{and} \quad R(w) = \iint |\nabla w|^2\, dx\, dy \qquad (5)$$
is to achieve well-posedness of the problem and incorporate prior knowledge about the smoothness of the solution. The chosen image regularization term $Q$ suppresses noise effectively and prevents color artifacts at the edges [7]. The Tikhonov regularization term $R$ is a good choice for most scenes. In the same way as in [4,5], the regularization parameters $\lambda_u$ and $\lambda_w$ are set by trial and error. For details on image regularization, see [1,6,7].

For efficient minimization, we need to know at least the gradient (Fréchet derivative) of the functional. It equals the sum of the gradients of the individual terms. The gradients of the regularization terms are

$$\frac{\partial Q}{\partial u^i} = -\operatorname{div}\left(\frac{\nabla u^i}{\sqrt{|\nabla u^R|^2 + |\nabla u^G|^2 + |\nabla u^B|^2}}\right) \quad \text{and} \quad \frac{\partial R}{\partial w} = -\nabla^2 w, \qquad (6)$$

where the symbol $\nabla^2$ denotes the Laplacian operator. Let us denote the error term as $\Phi$. Then, its gradients can be expressed as [6]

$$\frac{\partial \Phi}{\partial u^i} = \sum_{p=1}^{P} e_p^i \circledast_v h_p(w) \quad \text{and} \quad \frac{\partial \Phi}{\partial w} = \sum_{i \in \{R,G,B\}} u^i \sum_{p=1}^{P} e_p^i \circledast_v \frac{\partial h_p(w)}{\partial w}, \qquad (7)$$

where $e_p^i = u^i *_v h_p(w) - z_p^i$ is the residual of the error term and $\frac{\partial h_p(w)}{\partial w}[x, y; s, t]$ is the derivative of the mask related to image point $(x, y)$ with respect to the value of $w(x, y)$.

Finding the minimum of the cost functional is a high-dimensional nonlinear problem with a huge number of local minima. Experiments confirmed that the right choice of the initial depth map estimate is essential to prevent the algorithm from getting trapped in a local minimum. For this purpose, we developed our own method, which can be described as follows. We compute

$$\sum_{i \in \{R,G,B\}} \left( z_1^i * h_2(w) - z_2^i * h_1(w) \right)^2 \qquad (8)$$

for a sequence of values $w$ covering the interval of possible inverse depths. Experiments have shown that it is sufficient to take the step corresponding to a change in the support of the PSF of about 0.1 pixel. Obviously, if there is no noise and $w$ is correct, the value of (8) should be zero since $z_1^i = u^i * h_1(w)$, $z_2^i = u^i * h_2(w)$ and convolution is commutative. In practice, for each pixel the algorithm simply takes the $w$ minimizing the average value of (8) over some finite window. In addition, if we have an estimate of the noise levels $\sigma_1$ and $\sigma_2$ in images $z_1$ and $z_2$, it proved beneficial to subtract $\sigma_2^2 \|h_1(w)\|^2 + \sigma_1^2 \|h_2(w)\|^2$ from (8), to compensate for the bias produced by the noise. Details can be found in [6]. Let us denote the initial depth map estimate we have got as $w_0$. The algorithm iterates through minimizations in subspaces corresponding to the unknown matrices $u^i$ and $w$:
1. for $n = 1 : N_g$
2.   $u_n = \arg\min_u E(u, w_{n-1})$
3.   $w_n = \arg\min_w E(u_{n-1}, w)$
4. end for
5. $u_{N_g+1} = \arg\min_u E(u, w_{N_g})$

Note that the steps 2, 3 and 5 themselves consist of a sequence of iterations. In the rest of this section we will discuss the minimization methods used in the respective subspaces. The minimization with respect to the color bands $u^i$ (steps 2 and 5) is the well-known problem of non-blind restoration [1]. Similarly to [2], we reduce the problem to a sequence of linear problems solved by the conjugate gradient method. In a very simplified manner, the idea is as follows. Let $u_{n,m}$ be the current estimate of the image minimizing the cost functional (4) for a fixed $w_{n-1}$. We will replace the regularization term $Q$ by the quadratic term

$$\frac{1}{2} \iint \left( \frac{|\nabla u^R|^2 + |\nabla u^G|^2 + |\nabla u^B|^2}{\sqrt{|\nabla u_{n,m}^R|^2 + |\nabla u_{n,m}^G|^2 + |\nabla u_{n,m}^B|^2}} + \sqrt{|\nabla u_{n,m}^R|^2 + |\nabla u_{n,m}^G|^2 + |\nabla u_{n,m}^B|^2} \right) dx\, dy. \qquad (9)$$

Obviously, it has the same value as $Q$ in $u_{n,m}$. The right term of (9) is constant for now and consequently it does not take part in the actual minimization. We have thus obtained a “close” linear problem, the solution of which becomes the new estimate $u_{n,m+1}$. It can be shown [2] that $u_{n,m}$ converges to the desired minimum $u_n$ for $m \to \infty$. In the subspace corresponding to the inverse depth map (step 3) we apply the simple steepest descent algorithm. The optimum step length in one direction is found by the interval bisection method.
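The claim that the quadratic term (9) agrees with $Q$ at $u_{n,m}$ can be checked pointwise with a few lines of NumPy. This is only a sketch under an assumption: $Q$ is taken to have the integrand $\sqrt{S}$ with $S = |\nabla u^R|^2 + |\nabla u^G|^2 + |\nabla u^B|^2$, which is the form implied by the gradient in (6). The variable `S_prev` stands for the same quantity evaluated at $u_{n,m}$, and both function names are hypothetical.

```python
import numpy as np

def tv_integrand(S):
    """Pointwise integrand of Q, assuming Q integrates sqrt(S)."""
    return np.sqrt(S)

def quadratic_surrogate(S, S_prev):
    """Pointwise integrand of the quadratic term (9)."""
    root = np.sqrt(S_prev)
    return 0.5 * (S / root + root)

rng = np.random.default_rng(1)
S_prev = rng.uniform(0.1, 4.0, size=1000)   # squared gradient norm at u_{n,m}
S = rng.uniform(0.0, 4.0, size=1000)        # squared gradient norm at a trial u
gap = quadratic_surrogate(S, S_prev) - tv_integrand(S)
```

The gap equals $(\sqrt{S} - \sqrt{S_{prev}})^2 / (2\sqrt{S_{prev}})$, so it is never negative and vanishes when $S = S_{prev}$. In a practical implementation a small constant would presumably be added under the square root to avoid division by zero in flat regions; that detail is not taken from the paper.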
5 Experiments
We present two experiments with real images. Both were taken with a digital SLR camera mounted on a framework that limits motion or vibrations to one vertical plane. The first experiment documents the behavior of our algorithm for images blurred by one-dimensional harmonic motion of the camera. The scene was chosen to be relatively simple, yet such that the extent of blur varies significantly throughout the image. We took two images, Fig. 1(a) and (b), with the camera vibrating approximately in the horizontal and vertical directions. To achieve a large depth of focus, we set the f-number to F/16. The third image, Fig. 1(h), was taken without vibrations as a ground truth. Two small sections were taken from the right part of the images and the corresponding PSFs were computed (Fig. 1(c)) using the blind deconvolution algorithm [3]. These PSFs serve as the prototype PSFs $h_0$ from relation (2). Similarly, we computed masks in the central part of the image (Fig. 1(d)) and we can see that the extent of blur is about half compared to the PSFs in Fig. 1(c), which is in agreement with our model. Next, we applied algorithm (8) to get
(a) horizontal motion blur
(b) vertical motion blur
(c) sections from the juicebox on the right and corresponding PSFs
(d) sections from the proximity of image center and corresponding PSFs
(e) initial depth map estimate
(f) after minimization, $\lambda_w = 10^{-6}$, $\lambda_u = 10^{-4}$
(g) result of restoration, $\lambda_u^f = 10^{-4}$
(h) ground truth image
Fig. 1. We took two images from the camera mounted on a device vibrating in horizontal (a) and vertical (b) directions. For both images, the shutter speed was set to 5s and aperture to F/16. Image size 870 × 580 pixels.
(a) first blurred image
(b) second blurred image
(c) depth map, $\lambda_w = 10^{-6}$
(d) depth map, $\lambda_w = 5 \times 10^{-6}$
(e) restoration using depth map (c)
(f) restoration using depth map (d)
(g) ground truth image
(h) PSFs computed from (a), (b)
Fig. 2. We took two images from a camera mounted on vibrating framework. For both images the shutter speed was set to 1.3s and aperture to F/22. The prototype PSFs in the upper row of (h) were computed from the blossom area slightly left and down from the image center. The PSFs computed from the upper-right corner of the LCD screen (bottom line of (h)) again correspond to equation (2). Image size 800 × 500 pixels.
an initial estimate of the depth map in Fig. 1(e). In the algorithm, we averaged the error over a 7 × 7 window and the result was smoothed by a 23 × 23 median filter. Finally, we applied the minimization procedure from Section 4, resulting in depth map (f) and image (g). The second experiment, Fig. 2, was set up to show the limitations of the proposed algorithm. The scene is much more complex, with a lot of small details, and there are many places where the depth changes rapidly. The vibrations were also made more complex. The structure of the experiment is the same as in the first experiment except that we demonstrate the result for two different levels of depth map regularization. We can see that if we use less regularization, there are visible wave-like artifacts on the wall in the background. On the other hand, if we use more regularization, visible ringing effects become pronounced in places where the depth changes suddenly.
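The initial depth estimate based on (8), which produced Fig. 1(e), can be illustrated on a one-dimensional toy problem. The sketch below is not the procedure of [6]: it assumes a single inverse depth for the whole signal (so no per-pixel window averaging or median filtering), uses Gaussian prototype PSFs of widths 1.0 and 2.0 purely for illustration, and ignores noise. All names are hypothetical.

```python
import numpy as np

RADIUS = 20  # fixed kernel half-width so that all convolutions have equal length

def gaussian_psf(sigma):
    x = np.arange(-RADIUS, RADIUS + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def psf(p, w):
    """Toy PSF of image p (0 or 1): prototype width scaled by inverse depth w."""
    return gaussian_psf((1.0, 2.0)[p] * w)

def estimate_w(z1, z2, candidates):
    """Pick the candidate w minimizing the summed criterion (8)."""
    costs = []
    for w in candidates:
        diff = np.convolve(z1, psf(1, w)) - np.convolve(z2, psf(0, w))
        costs.append(np.sum(diff**2))
    return candidates[int(np.argmin(costs))]

rng = np.random.default_rng(0)
u = rng.random(200)                       # unknown sharp signal
candidates = np.linspace(0.5, 2.0, 16)    # sampled inverse depths
w_true = candidates[8]
z1 = np.convolve(u, psf(0, w_true))       # first blurred observation
z2 = np.convolve(u, psf(1, w_true))       # second blurred observation
w_est = estimate_w(z1, z2, candidates)
```

Because convolution is commutative, the criterion is zero up to rounding at the true inverse depth and clearly positive at the other candidates, so the search recovers `w_true` in this noise-free setting.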
6 Conclusion
We have presented an algorithm for image restoration from two or more color images of the same scene blurred by camera motion. We considered the special case when the camera moves in one plane perpendicular to the optical axis. This type of motion is more complex than those considered in previously published literature. We showed that the proposed algorithm works well with real images. An important direction of future research is the extension of the algorithm to general camera motion. The basic idea is discussed in [6].
References
1. Banham, M.R., Katsaggelos, A.K.: Digital image restoration. IEEE Signal Processing Mag. 14(2), 24–41 (1997)
2. Chambolle, A., Lions, P.: Image recovery via total variation minimization and related problems. Numer. Math. 76(2), 167–188 (1997)
3. Šroubek, F., Flusser, J.: Multichannel blind iterative image restoration. IEEE Trans. Image Processing 12(9), 1094–1106 (2003)
4. Rajagopalan, A.N., Chaudhuri, S.: An MRF model-based approach to simultaneous recovery of depth and restoration from defocused images. IEEE Trans. Pattern Anal. Machine Intell. 21(7) (1999)
5. Favaro, P., Burger, M., Soatto, S.: Scene and motion reconstruction from defocus and motion-blurred images via anisotropic diffusion. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 257–269. Springer, Heidelberg (2004)
6. Šorel, M.: Multichannel Blind Restoration of Images with Space-Variant Degradations. PhD thesis, Charles University in Prague (submitted 2007)
7. Tschumperlé, D., Deriche, R.: Diffusion PDE’s on vector-valued images. IEEE Signal Processing Mag. 19(5), 16–25 (2002)
Real-Time Elimination of Brightness in Color Images by MS Diagram and Mathematical Morphology

Francisco Ortiz

Image Technological Group, Dept. Physics, Systems Engineering and Signal Theory, University of Alicante, P.O. Box 99, 03080 Alicante, Spain
[email protected]

Abstract. This paper proposes a real-time method for the detection and elimination of brightness in color images. We use a 2D-histogram that allows us to relate the signals of luminance and saturation of a color image and to identify the specularities in a given area of the histogram. This is known as the MS diagram and it is constructed from a polar color model. We use a new connected vectorial filter based on color morphology to eliminate the brightness. This filter operates only in the bright zones previously detected, reducing the high cost of processing of connected filters and avoiding oversimplification, in single-processing and multiprocessing environments.

Keywords: brightness elimination, color mathematical morphology, connected vectorial filters.
1 Introduction

Real-time image processing differs from “ordinary” image processing in that the same correct results must be obtained in critical time. Real-time imaging covers a multidisciplinary range of research areas including image compression, image enhancement and filtering, visual inspection, etc. [1]. Indeed, a goal in computer vision is to identify objects of real scenes in the shortest time possible or within a deadline. Sometimes, this goal is not easy to reach since a bad adjustment of illumination can introduce brightness (highlights or specular reflectance) in the objects captured by the vision system. The presence of brightness causes problems in low-level computer vision methods and in high-level operations. To be able to eliminate the highlights in captured scenes, we must identify them first. The dichromatic reflection model proposed by Shafer [2] is a tool that has been used in many methods for detecting specularities. This model supposes that the interaction between the light and a dielectric material produces different spectral distributions in the object, i.e., the specular and diffuse reflectances. Bajcsy et al. [3] use a chromatic space based on polar coordinates that allows the detection of specular and diffuse reflections by means of previous knowledge of the captured scene. Klinker et al. [4] employ a pixel-clustering algorithm which has been shown to work well in detecting brightness in images of plastic objects. These previous approaches have produced good results but they have requirements that limit their applicability, such as the use of stereo or multiple-view systems, long processing times, previous knowledge of the scene, or the assumption of homogeneous illumination, without considering the inter-reflections present in most typical real scenes.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 458–465, 2007. © Springer-Verlag Berlin Heidelberg 2007

In this paper we explain a new real-time system for the detection and elimination of brightness in color images by means of two main steps:
- Detection: we use a 2D-histogram of achromatic and saturation signals from a 3D-polar coordinate color representation. This new representation allows us to obtain a specular reflectance map of the image.
- Elimination: we develop a real-time vectorial geodesic reconstruction algorithm, which has low cost and avoids the over-simplification of the image.

The organisation of the paper is as follows: In Section 2 we present the color space for processing, together with the MS diagram used to detect the specular reflectance. In Section 3, we develop the color morphology and an extension of the geodesic operations to color images. In Section 4 we show the algorithm for detecting and eliminating specularities. Experimental results in highlights detection are shown in Section 5. Finally, our conclusions are outlined in the last section.
2 Color Space for Processing and MS Diagram

In recent years, color spaces based on polar coordinates (HLS, HSV, HSI…) have been widely used in image processing [5,6,7]. Important advantages of these color spaces are good compatibility with the human intuition of colors and the separability of chromatic values from achromatic values. The intuitive systems are widely used in image processing as they represent the information in a similar way to the brain. They can be represented by a single system, which Levkowitz and Herman define as GLHS [8], adapted for image processing by Serra’s L1-norm [9]. We shall denote the intensity function by m, which is the mean of the r, g and b values, where (r, g, b) are coordinates of the RGB color space (Figure 1.a), with r ∈ [0,255], g ∈ [0,255] and b ∈ [0,255]. As such, m corresponds to the l in the LHS-triangle model:

$$m = \frac{1}{3}(r + g + b) \qquad (1)$$
We define a color space (MSH), where the m signal calculated is a normalization (0 ≤ m ≤ 255) of the achromatic axis of the RGB cube. The value of s is given by Serra’s L1-norm as:

$$s = \begin{cases} \frac{1}{2}(2r - g - b) = \frac{3}{2}(r - m) & \text{if } (b + r) \ge 2g \\ \frac{1}{2}(r + g - 2b) = \frac{3}{2}(m - b) & \text{if } (b + r) < 2g \end{cases} \qquad (2)$$
We propose to exploit the existing relation between m and s that permits the detection of specular reflections in a digital image, independently of the hue of the object in which the brightness appears. Figure 1.b shows the MS diagram as the positive projection of all of the corners of the cube in a normalisation of the achromatic line to the m signal.
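A minimal sketch of (1) and (2) for a single pixel is given below. It assumes, as in Serra's formulation of the L1 norm, that r, g and b in (2) stand for the largest, median and smallest of the three components; this ordering is an assumption of the example and is not stated explicitly in the text. The function name `ms_from_rgb` is hypothetical.

```python
def ms_from_rgb(r, g, b):
    """Return (m, s) for one RGB pixel with components in [0, 255]."""
    hi, mid, lo = sorted((r, g, b), reverse=True)
    m = (r + g + b) / 3.0              # equation (1)
    if hi + lo >= 2 * mid:             # first case of equation (2)
        s = 1.5 * (hi - m)
    else:                              # second case of equation (2)
        s = 1.5 * (m - lo)
    return m, s
```

Under this reading, pure red (255, 0, 0) maps to m = 85 with full saturation s = 255, pure yellow (255, 255, 0) maps to m = 170 with s = 255, and every gray level has s = 0, which matches the corners of the MS diagram sketched in Figure 1.b.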
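[Figure 1: (a) RGB cube with axes r, g, b and the Black and White corners; (b) MS diagram with axes m (black to white) and s, bounded by the lines c1, c2, c3 and c4.]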
Fig. 1. RGB cube and its transformation in MS diagram. (a) 3D projection. (b) Shape and limits of MS diagram.
3 Color Morphology and Vector Connected Filters

The definition of morphological operators needs a totally ordered complete lattice structure [10]. The color pixels do not present, a priori, this structure and it is necessary to impose an order relationship in the color space. Several studies have been carried out on the application of mathematical morphology to color images [11,12,13]. The approach most commonly adopted is based on the use of a lexicographical order, which imposes total order on the vectors. This way, we avoid the false colors in an individual filtering of signals. On the other hand, it is important to define the color space in which operations are to be made. The preference or disposition of the components of the polar model in the lexicographical ordering depends on the application and the properties of the image. For our application we define a lattice with a lexicographical order of $o_{lex} = (m \to s \to h)$. As such, we put more emphasis on the intensity signal m. Afterwards, we analyse the saturation. Next, we compare a hue distance value only if the pixels are colored.

3.1 Connected Vectorial Filters
Morphological filters by reconstruction have the property of suppressing details, preserving the contours of the remaining objects [14,15]. The use of these filters in color images requires an order relationship among the pixels of the image. For the vectorial morphological processing the previously defined lexicographical ordering $o_{lex}$ will be used. As such, the infimum ($\wedge_v$) and supremum ($\vee_v$) will be vectorial operators, and they will select pixels according to their order $o_{lex}$ in the MSH generalised color space [16]. Once the orders have been defined, the morphological operators of reconstruction for color images can be generated and applied. An elementary geodesic operation is the geodesic dilation. Let g denote a marker color image and f a mask color image (if $o_{lex}(g) \le o_{lex}(f)$, then $g \wedge_v f = g$). The vectorial geodesic dilation of size 1 of the marker image g with respect to the mask f can be defined as:

$$\delta_{v,f}^{(1)}(g) = \delta_v^{(1)}(g) \wedge_v f \qquad (3)$$

where $\delta_v^{(1)}(g)$ is the vectorial dilation of size 1 of the marker image g. This propagation is limited by the mask f. The vectorial geodesic dilation of size n of a marker color image g with respect to a mask color image f is obtained by performing n successive geodesic dilations of g with respect to f:

$$\delta_{v,f}^{(n)}(g) = \delta_{v,f}^{(1)}\left[\delta_{v,f}^{(n-1)}(g)\right] \qquad (4)$$

with $\delta_{v,f}^{(0)}(g) = g$. The vectorial reconstruction by dilation of a mask color image f from a marker color image g (both with $D_f = D_g$ and $o_{lex}(g) \le o_{lex}(f)$) can be defined as:

$$R_{v,f}(g) = \delta_{v,f}^{(n)}(g) \qquad (5)$$

where n is such that $\delta_{v,f}^{(n)}(g) = \delta_{v,f}^{(n+1)}(g)$.
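The reconstruction by dilation of (3) to (5) can be sketched on a one-dimensional signal whose pixels are (m, s) tuples. This is an illustrative simplification rather than the paper's filter: Python's built-in tuple comparison is used as the lexicographical order, the hue component and its distance measure are left out, the structuring element is a three-pixel neighbourhood, and the iteration is started from the marker g. All function names are hypothetical.

```python
def dilate1(img):
    """Vectorial dilation of size 1: lexicographic maximum over 3 neighbours."""
    n = len(img)
    return [max(img[max(i - 1, 0):min(i + 2, n)]) for i in range(n)]

def geodesic_dilation(marker, mask):
    """Equation (3): dilate the marker, then take the infimum with the mask."""
    return [min(d, f) for d, f in zip(dilate1(marker), mask)]

def reconstruct_by_dilation(marker, mask):
    """Equations (4) and (5): iterate geodesic dilations until stability."""
    current = list(marker)
    while True:
        nxt = geodesic_dilation(current, mask)
        if nxt == current:
            return current
        current = nxt

mask = [(10, 0), (50, 5), (50, 9), (20, 0), (70, 1), (10, 0)]
marker = [(0, 0), (50, 5), (0, 0), (0, 0), (0, 0), (0, 0)]
result = reconstruct_by_dilation(marker, mask)
```

In this toy signal the bright pixel (70, 1) is not connected to the marker through values of at least its own order, so it is lowered to the level of its surroundings, while the plateau that contains the marker is kept. This is the behaviour exploited to remove highlights without altering the remaining contours.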
4 Algorithm for Detecting and Eliminating Specularities

The specularities in the chromatic image have values of high luminance m and low saturation s. An important consideration is that not all the images have the same dynamic range and, therefore, the m and s values of their specularities do not correspond with the positions of the MS diagram previously presented. We have opted for a neighbourhood-based morphological contrast enhancement which considers the local features of the images. Specifically, we have applied a top-hat contrast operator defined as (WTH is a white top-hat and BTH is a black top-hat):

$$m' = m + WTH(m) - BTH(m) \qquad (6)$$
The result of the local enhancement by the top-hat is that the specular reflectance pixels are positioned on the c3 and c4 lines of the MS diagram (Figure 1.b). These lines identify different specularities along their coordinates, from the most intense to the dullest. The optimum value of saturation ($s_{sp}$) on the c3 and c4 lines for all the brightness in images has been deduced from our numerous experiments [17]. The detection of the specularities stops, in all of the cases, at a maximum saturation of:

$$s_{sp} = \frac{m_{max}}{10} \qquad (7)$$

where $m_{max} = 255$, $s_{max} = 255$, and at higher values no additional pixels in the image are detected as brightness. It is now easy to calculate the value of $m_{sp}$ as:

$$m_{sp} = \frac{2 s_{max} - 3 m_{max}}{-3} \qquad (8)$$
This way, all the specularities are located on c3 and c4 lines of the MS diagram, between [msp, mmax] for m signal and [0, ssp] for s signal. To eliminate the highlight that was previously detected with the MS diagram, we use a vectorial opening by reconstruction (VOR) applied only in the specular areas of the image and their surroundings. The size e of the structural element of the erosion will determine the success of the reconstruction and the final cost of the operations, since this size imposes the process area of the filters. In the geodesic filter, f is first eroded. The eroded sets are then used as sets for a reconstruction of the original image. The new filter (VOR) is defined, taking into account the fact that, in this case, the operation will not affect all the pixels (x,y), but only those in which h(x,y)=1:
\[ \gamma_{v}^{(f,\,n',\,h)} = \left\{ \delta_{v_f}^{(n')}\!\left(\varepsilon_{v}^{(e)}(f)\right) \;\middle|\; \forall f(x,y) \Rightarrow h(x,y) = 1 \right\} \tag{9} \]
where e is an 8-connected structuring element and n' is such that

\[ \delta_{v_f}^{(n')}\!\left(\varepsilon_{v}^{(e)}(f)\right) = \delta_{v_f}^{(n'+1)}\!\left(\varepsilon_{v}^{(e)}(f)\right). \]

The vectorial erosion of the opening by reconstruction is also done with a structuring element of size e. This erosion replaces highlight pixels (high o_lex) by the surrounding chromatic pixels (low o_lex). Next, the vectorial geodesic dilation (iterated until stability) reconstructs the color image without recovering the specularities. This is the same approach successfully used in the detection of color cells in real-time medical imaging [16].

The possibilities for parallel processing in our algorithm are limited. Nevertheless, to obtain the results in less time, an alternative configuration for a multiprocessor environment is possible: the vectorial erosion of the opening by reconstruction can be performed on all the pixels of the original image f (in a second processor), in parallel with the detection step of the algorithm (first processor). This way, the first vectorial erosion (e=1) of the top-hat is re-used. The task graph of this new configuration is shown in Figure 2. We evaluate this alternative in the following section.
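[Figure 2 diagram: task graph with blocks labelled Input image, Morphological contrast enhancement, Pixel segment, Parallel processing, VOR and Image output, built from erosions ε (e = 1 to 4) and dilations δ (iterated n' times).]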
Fig. 2. Task graph for a parallel processing of the algorithm. A minimum cost approach for brightness elimination: multiprocessor configuration.
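The masked opening by reconstruction of Eq. (9) can be sketched for the scalar (greyscale) case as follows. This is a simplification made for illustration only: the paper operates on colour vectors with a lexicographical ordering, which is not reproduced here, and the sketch reconstructs the whole image before selecting the pixels with h(x,y) = 1, so it does not show the cost savings of processing only the specular areas. All function names are hypothetical.

```python
def _window(img, y, x, r):
    """Values in the (2r+1)x(2r+1) window around (y, x), clipped at the border."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - r), min(h, y + r + 1))
            for i in range(max(0, x - r), min(w, x + r + 1))]


def erode(img, r):
    return [[min(_window(img, y, x, r)) for x in range(len(img[0]))]
            for y in range(len(img))]


def dilate(img, r):
    return [[max(_window(img, y, x, r)) for x in range(len(img[0]))]
            for y in range(len(img))]


def masked_opening_by_reconstruction(f, h, e=1):
    """Greyscale opening by reconstruction of f, kept only where h == 1.

    The marker is the erosion of f with a window of half-size e; it is then
    grown by unit geodesic dilations, min(dilate(marker), f), until stability.
    """
    marker = erode(f, e)
    while True:
        grown = dilate(marker, 1)
        nxt = [[min(g, v) for g, v in zip(g_row, f_row)]
               for g_row, f_row in zip(grown, f)]
        if nxt == marker:          # stability reached (n' iterations)
            break
        marker = nxt
    # Pixels outside the specular mask keep their original value.
    return [[r if keep == 1 else v for r, keep, v in zip(r_row, h_row, f_row)]
            for r_row, h_row, f_row in zip(marker, h, f)]
```

A bright structure smaller than the eroding window disappears and is filled from its surroundings, whereas a structure that survives the erosion is restored exactly; this is why a larger e removes larger highlights at a higher cost.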
Real-Time Elimination of Brightness in Color Images
5 Experimental Results and Real-Time Aspects

We now present a sample of the results obtained by applying our method for brightness elimination to more than fifty scenes (Figure 3). In addition, we show a cost comparison for the two configurations of the algorithm: single-processor and multiprocessor. The visual results (Figure 4) show the effectiveness of our method for the detection and elimination of specular reflectance. Over-simplification does not appear, since the reconstruction operates only in bright areas. Furthermore, the results are obtained in a much lower computational time, which is compatible with real-time image processing systems. The reconstruction task is the most critical operation. For this reason, the size e of the structuring element of the morphological operations depends on the application and its real-time requirements: a low e (1, 2) is recommended for visual inspection, and a high e (3, 4, …) is best for multimedia and image restoration.
Fig. 3. Colour images for empirical study. (a) “Apples”, (b) “Tomatoes” and (c) “Balloons”.
In Table 1, we show a comparison of temporal execution costs between the new algorithm for the elimination of specularities in color images and a global geodesic filter that operates in the entire image. As can be seen, the new method avoids the high computational cost of the geodesic processing for textured images (“Balloons”). In other color scenes, the obtained results have been similar (one second of time).
Fig. 4. Elimination of specular reflectance in the real color images of Figure 3. (a) “Apples”, e=(2), (b) “Tomatoes”, e=(3), and (c) “Balloons”, e=(2). Over-simplification is not present in the results.
F. Ortiz
Table 1. Final CPU times (s) for brightness elimination by means of a global geodesic filter and by the proposed algorithm in its single-processing and multiprocessing configurations, for multimedia applications.

Color images and real sizes   Global filter   Proposed algorithm      Proposed algorithm
                                              (single-processing)     (multiprocessing)
                                              times (s)               times (s)
Apples (300x320)              9.09            1.34                    0.85
Tomatoes (440x270)            11.73           1.95                    1.09
Balloons (250x300)            19.45           0.56                    0.39
6 Conclusions

In this paper, we have presented a method for the detection and elimination of specular reflectance in color images for real-time computer vision applications. We have demonstrated that highlights in color images can be eliminated without causing over-simplification. In addition, brightness is eliminated with a very low processing time, in both single-processor and multiprocessor configurations, compared with a global geodesic reconstruction. This makes it possible to meet real-time requirements in image processing, even for highly textured images. The detection and elimination of brightness is independent of the material of the objects on which it appears, and requires neither multiple views nor prior knowledge of the scenes.

Based on the success shown by these results, we are working to improve our method for eliminating specularities. We are currently working on the automatic calculation of the size e of the structuring element required in the vectorial erosion of the algorithm. We are also working with other color spaces and multiprocessor configurations to reduce the processing time required by the vectorial operations as much as possible.
References

1. Laplante, P.: Real-time systems design and analysis: an engineer’s handbook, 3rd edn. Wiley-IEEE Press, New York (2003)
2. Shafer, S.A.: Using color to separate reflection components. Color Research Appl. 10, 210–218 (1985)
3. Bajcsy, R., Lee, S., Leonardis, A.: Detection of diffuse and specular interface reflections and inter-reflections by color image segmentation. International Journal on Computer Vision 17, 241–271 (1996)
4. Klinker, G., Shafer, S.A., Kanade, T.: Image segmentation and reflection analysis through color. In: Proc. SPIE, vol. 937, pp. 229–244 (1988)
5. Palus, H.: Representations of colour images in different colour spaces. In: Sangwine, S., Horne, R. (eds.) The Colour Image Processing Handbook, pp. 67–75 (1998)
6. Wyszecki, G., Stiles, W.S.: Color Science, Concepts and Methods, Quantitative Data and Formulas, 2nd edn. John Wiley, New York, NY (1982)
7. Plataniotis, K., Venetsanopoulos, A.: Color Image Processing and Applications. Springer-Verlag, Berlin (2000)
8. Levkowitz, H., Herman, G.: GHLS, a generalized lightness, hue and saturation color model. Graphical Models and Image Processing 44(4), 271–285 (1993)
9. Serra, J.: Espaces couleur et traitement d’images. Tech. Report N-34/02/MM. Centre de Morphologie Mathématique, École des Mines de Paris (2002)
10. Serra, J.: Image Analysis and Mathematical Morphology, vol. I (1982), and Image Analysis and Mathematical Morphology, vol. II: Theoretical Advances. Academic Press, London (1988)
11. Hanbury, A., Serra, J.: Morphological operators on the unit circle. IEEE Transactions on Image Processing 10(12), 1842–1850 (2001)
12. Comer, M., Delp, E.: Morphological Operations for Colour Image Processing. Journal of Electronic Imaging 8, 279–289 (1999)
13. Angulo, J.: Morphologie mathématique et indexation d’images couleur. Application à la microscopie en biomedicine. PhD Thesis. École des Mines de Paris (2003)
14. Vincent, L.: Morphological Grayscale Reconstruction in Image Analysis: Applications and Efficient Algorithms. IEEE Transactions on Image Processing 2, 176–201 (1993)
15. Crespo, J., Serra, J., Schafer, R.: Theoretical aspects of morphological filters by reconstruction. Signal Processing 47, 201–225 (1995)
16. Ortiz, F., Torres, F., De Juan, E., Cuenca, N.: Colour mathematical morphology for neural image analysis. Journal of Real Time Imaging 8(6), 455–465 (2002)
17. Torres, F., Angulo, J., Ortiz, F.: Automatic detection of specular reflectance in colour images using the MS diagram. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 132–139. Springer, Heidelberg (2003)
Surface Reconstruction Using Polarization and Photometric Stereo

Gary A. Atkinson and Edwin R. Hancock
Department of Computer Science, University of York, York, YO1 5DD, UK
{atkinson,erh}@cs.york.ac.uk
Abstract. This paper presents a novel shape recovery technique that combines photometric stereo with polarization information. First, a set of ambiguous surface normals are estimated from polarization data. This is achieved using Fresnel theory to interpret the polarization patterns of light reflected from dielectric surfaces. The process is repeated using three different known light source positions. Photometric stereo is then used to disambiguate the surface normals. The relative pixel brightnesses for the different light source positions reveal the correct surface orientations. Finally, the resulting unambiguous surface normal estimates are integrated to recover a depth map. The technique is tested on various objects of different materials. The paper also demonstrates how the depth estimates can be enhanced by applying methods suggested in earlier work.
1 Introduction
Reconstructing the 3D surface geometries of objects from one or more 2D images is a much studied area of computer vision. Many attempts at this difficult task aim to recover fields of surface normals of the objects which are then converted into depth maps using integration algorithms [4]. Shape-from-shading [12] and photometric stereo [10] are examples of such efforts. In shape-from-shading, spatial variations in pixel brightnesses are used to estimate the normals subject to illumination and/or geometric assumptions. In photometric stereo, the normals are constrained by using several images of the target object, each illuminated from a different direction. This paper presents a method that combines photometric stereo with polarization information to fully constrain the surface normals.

The polarization state of reflected light has been used extensively for shape recovery in recent years [1,2,6,7,8]. All of this work stems from the fact that light is partially polarized when reflected from surfaces, as predicted by Fresnel theory [5]. The polarization results from the directionality of electronic dipoles induced in the reflecting medium caused by the electromagnetic field of the incident light ray. The surface azimuth angles (the angles of the projection of the surface normals onto the image plane) can be determined by the orientation of the polarization of the light reflected from the surface. The zenith angles (the angles between the surface normals and the viewing direction) can be estimated from the extent to which the light is partially polarized [9].

One of the problems with methods that only use polarization is that the surface normals can only be determined up to an ambiguity. Although some success has been achieved at resolving the ambiguity from a single view [2,7], a complete disambiguation is unfeasible without further information (unless assumptions about the surface geometry are imposed). Rahmann and Canterakis [8] and Atkinson and Hancock [1] resolve the ambiguity using two camera viewpoints.

Drbohlav and Šára [3] have investigated the idea of improving photometric stereo methods using polarization. If only two light source positions are used with traditional photometric stereo, which are both unknown, then the surface normals are only recovered up to a regular transformation. Drbohlav and Šára reduce this ambiguity using polarization. This is at the expense of introducing the need for linearly polarized incident light. The method presented in this paper does not assume polarized incident light, but uses known light source positions.

The method that we propose in this paper commences using standard polarization theory and techniques to ambiguously estimate the surface normals. This is described in Section 2. A variant of photometric stereo is then used to disambiguate the surface normals using three light source directions, as described in Section 3. The method deduces the correct choice of azimuth angle at each pixel independently by comparing the brightnesses as the light source position is moved. We then present a series of experiments on a range of surface geometries and materials to demonstrate the success of the disambiguation process. Full 3D reconstructions of these objects are also presented. Finally, we show how the accuracy of the technique can be enhanced by incorporating ideas from earlier papers in the field. All our experiments are presented in Section 4. We summarise our contribution and suggest directions for future work in Section 5.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 466–473, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Polarization in Computer Vision
Consider the reflection of a ray of light from a smooth surface. Fresnel theory [5] provides a means to calculate the ratio of the incident light intensity to the reflected light intensity for a given angle of incidence. Further to this, if the incident light is unpolarized, the theory predicts that the reflected ray will become partially polarized, again depending on the angle of incidence. In this paper, we study diffuse reflection, where light penetrates the surface and is scattered internally before being re-emitted [9]. This is in contrast to specular reflection, where the light is reflected from the surface directly. For diffuse reflection, Fresnel theory can be applied to light as it is re-emitted from the medium into air. This provides a relation between the polarization state of the reflected light and the angle of the reflection. The standard approach to measure the polarization state of reflected light is to take a succession of images of the reflecting surface with a polarizer placed in front of a camera. The polarizer is rotated to different angles between images.
The measured intensity at each pixel varies sinusoidally with the polarizer angle. By performing (for example) a least squares fit on the measured pixel brightnesses as a function of the polarizer angle, the minimum and maximum intensities on the sinusoid, I_min and I_max, can easily be determined. The polarizer angle corresponding to maximum transmission, φ, can also be found. This leads to the azimuth angle of the surface, α, which is simply given by [9]

\[ \alpha = \phi \quad \text{or} \quad \phi + 180^\circ \tag{1} \]

The zenith angle of the surface, θ, can be estimated using the degree of polarization, which is defined to be

\[ \rho = \frac{I_{max} - I_{min}}{I_{max} + I_{min}} \tag{2} \]
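The least squares fit mentioned above can be sketched as follows. The sketch assumes the sinusoid model I(t) = (I_max + I_min)/2 + ((I_max − I_min)/2) cos(2t − 2φ) and, to keep the fit in closed form, that the N ≥ 3 polarizer angles are evenly spaced over 180°; neither the sampling scheme nor the function name comes from the paper.

```python
import math


def fit_polarization(intensities):
    """Fit I(t) = a0 + a1*cos(2t) + a2*sin(2t) to one pixel's samples.

    Assumes sample k was taken at polarizer angle t_k = k * 180/N degrees
    (k = 0..N-1, N >= 3); for such angles the least-squares solution reduces
    to the discrete Fourier sums below.
    Returns (I_max, I_min, phi in degrees within [0, 180), rho).
    """
    n = len(intensities)
    if n < 3:
        raise ValueError("need at least three polarizer angles")
    angles = [math.pi * k / n for k in range(n)]
    a0 = sum(intensities) / n
    a1 = 2.0 / n * sum(i * math.cos(2 * t) for i, t in zip(intensities, angles))
    a2 = 2.0 / n * sum(i * math.sin(2 * t) for i, t in zip(intensities, angles))
    amplitude = math.hypot(a1, a2)
    i_max, i_min = a0 + amplitude, a0 - amplitude
    phi = math.degrees(0.5 * math.atan2(a2, a1)) % 180.0   # angle of maximum transmission
    rho = (i_max - i_min) / (i_max + i_min)                # Eq. (2)
    return i_max, i_min, phi, rho
```

For unevenly spaced angles the same three-parameter linear model can be fitted with a general least-squares solver instead.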
Fresnel theory can then be used to predict that the degree of polarization is related to the zenith angle by

\[ \rho = \frac{(n - 1/n)^2 \sin^2\theta}{2 + 2n^2 - (n + 1/n)^2 \sin^2\theta + 4\cos\theta\,\sqrt{n^2 - \sin^2\theta}} \tag{3} \]
where n is the refractive index, which we take as 1.4 throughout this work (a typical value). Expressed slightly differently, if a surface point is observed such that the angle between its normal and the viewing direction (the zenith angle) is θ, then the observed degree of polarization is given by (3). Fig. 1 shows the relationship between the degree of polarization and the zenith angle. Fig. 2 shows a greyscale image of a porcelain bear and images that encode the phase angle, φ, and the degree of polarization, ρ. To convert the phase image to a field of azimuth angles, we disambiguate the angles in (1) using the method described in the next section. Since ρ is a monotonic function of θ, the zenith angle can be found unambiguously from (3).

Fig. 1. Relationship between zenith angle and degree of polarization (horizontal axis: zenith angle, 0 to 90°; vertical axis: degree of polarization)
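Because ρ is monotonic in θ, Eq. (3) can be inverted numerically. The following is a minimal sketch using bisection on [0°, 90°]; the choice of bisection, the tolerance and the function names are assumptions for illustration, not details taken from the paper.

```python
import math


def degree_of_polarization(theta, n=1.4):
    """Eq. (3): diffuse degree of polarization for zenith angle theta (radians)."""
    s2 = math.sin(theta) ** 2
    numerator = (n - 1.0 / n) ** 2 * s2
    denominator = (2.0 + 2.0 * n * n - (n + 1.0 / n) ** 2 * s2
                   + 4.0 * math.cos(theta) * math.sqrt(n * n - s2))
    return numerator / denominator


def zenith_from_rho(rho, n=1.4, tol=1e-10):
    """Invert Eq. (3) by bisection, relying on rho increasing with theta."""
    lo, hi = 0.0, math.pi / 2.0
    if not 0.0 <= rho <= degree_of_polarization(hi, n):
        raise ValueError("rho outside the range reachable for this refractive index")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if degree_of_polarization(mid, n) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At θ = 90° the expression reduces to (n − 1/n)/(n + 1/n), so for n = 1.4 the measurable degree of polarization stays below roughly 0.33, which is consistent with the axis range of Fig. 1.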
3 Disambiguation Using Photometric Stereo
The simple and efficient algorithm developed for this task is local and uses the variation in pixel intensities for changing light sources. This allows for the determination of the correct choice of azimuth angle from the two options in (1). Our aim is to obtain a field of unambiguous surface normals using several different illumination directions. We obtain a greyscale image, a phase image and a degree of polarization image for each of three different light source positions. This provides three ambiguous azimuth angle estimates and three zenith angle estimates for each pixel, in addition to the measured intensities.

The first task is to select the most reliable phase and degree of polarization for each pixel. This is done by comparing the measured intensities at each pixel. For any particular pixel, the phase and degree of polarization are used that correspond to the light source which gave the greatest intensity for that pixel. We make an exception to this rule where the maximum intensity is saturated (which we define as an intensity greater than grey level 250). In this case we use the highest intensity that is not saturated. We use the experimental arrangement shown in Fig. 3 for the disambiguation. Assume that the angles subtended by the camera and light sources from the object are equal (θL) and that the distances between the object and the light sources are also equal (DL).

Fig. 2. (a) Greyscale image of a porcelain bear. (b) Phase (light areas have higher values). (c) Degree of polarization (dark areas have higher values).
Fig. 3. Geometry used for disambiguation by photometric stereo
Fig. 4. View of a spherical target object from the camera viewpoint. Azimuth angles are disambiguated differently depending on whether the phase angle is less than 45◦ (regions A and A’), between 45◦ and 135◦ (regions B and B’) or greater than 135◦ (regions C and C’).
Fig. 4 shows the object and light sources from the point of view of the camera. In this case, the object is a perfect sphere. Let the intensity at pixel k observed under light source 1 be I_k^(1). Similarly, under light sources 2 and 3, the intensities are I_k^(2) and I_k^(3) respectively. Considering the geometry of the photometric stereo rig, we can see that for pixels falling into the region labelled A, we would expect I_k^(2) > I_k^(3) > I_k^(1). In region A’, which has the same phase angles as region A, we would expect I_k^(1) > I_k^(3) > I_k^(2). We therefore have a simple means of disambiguating azimuth angles in these regions:

\[ \text{if } \phi_k < 45^\circ \text{ then } \alpha_k = \begin{cases} \phi_k & \text{if } I_k^{(2)} > I_k^{(1)} \\ \phi_k + 180^\circ & \text{otherwise} \end{cases} \tag{4} \]

We use these specific inequalities (as opposed to I_k^(2) > I_k^(3) or I_k^(3) > I_k^(1)) since we expect the difference in intensity to be greatest between I_k^(2) and I_k^(1) for regions A and A’. Similar arguments can be used for the remaining regions of the sphere in Fig. 4. In region B we expect I_k^(3) > I_k^(2) > I_k^(1). Therefore, we have the condition:

\[ \text{if } 45^\circ \le \phi_k < 135^\circ \text{ then } \alpha_k = \begin{cases} \phi_k & \text{if } I_k^{(3)} > I_k^{(1)} \\ \phi_k + 180^\circ & \text{otherwise} \end{cases} \tag{5} \]

Finally, in region C, I_k^(3) > I_k^(1) > I_k^(2), which gives us our final condition:

\[ \text{if } 135^\circ \le \phi_k \text{ then } \alpha_k = \begin{cases} \phi_k & \text{if } I_k^{(3)} > I_k^{(2)} \\ \phi_k + 180^\circ & \text{otherwise} \end{cases} \tag{6} \]
We therefore have a complete means to disambiguate the azimuth angles from three light sources using (4 – 6). Since the zenith angles are fully constrained by (3) for diffuse reflection, the surface normals can be unambiguously determined using our proposed method.
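Conditions (4) to (6), together with the rule for choosing the most reliable measurement at each pixel, can be written down directly. The sketch below works on a single pixel; the function names, the argument layout and the fallback used when all three intensities are saturated are assumptions for illustration.

```python
def select_source(intensities, saturation=250):
    """Index of the light source whose measurement is used for this pixel.

    The brightest source is chosen, ignoring saturated values (grey level
    above `saturation`). Falling back to the overall maximum when every
    value is saturated is an assumption; the text does not cover that case.
    """
    unsaturated = [k for k, v in enumerate(intensities) if v <= saturation]
    candidates = unsaturated if unsaturated else list(range(len(intensities)))
    return max(candidates, key=lambda k: intensities[k])


def disambiguate_azimuth(phi, i1, i2, i3):
    """Apply conditions (4)-(6) to one pixel.

    phi: phase angle in degrees, in [0, 180).
    i1, i2, i3: intensities of the pixel under light sources 1, 2 and 3.
    Returns the azimuth angle in degrees, either phi or phi + 180.
    """
    if phi < 45:
        keep = i2 > i1          # regions A / A', condition (4)
    elif phi < 135:
        keep = i3 > i1          # regions B / B', condition (5)
    else:
        keep = i3 > i2          # regions C / C', condition (6)
    return phi if keep else phi + 180
```

Since each pixel is handled independently, the same two functions can simply be mapped over the phase and intensity images.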
4 Experiments
This section presents the results of the above disambiguation process on real-world data. We also show how the azimuth angles can be used in conjunction with the zenith angles estimated from (3) to reconstruct three-dimensional surfaces. Finally, we show how the results can be improved by incorporating methods from earlier work in the field.

Fig. 5 shows the azimuth angles obtained using our disambiguation method for two smooth porcelain objects, a slightly rough plastic object, an apple and an orange. For this experiment, θL = 45◦ and DL = 1.5 m were used. Note that using too large a value for θL would result in problems due to cast shadows. The azimuth angles have clearly been disambiguated successfully for most regions. Only for small zenith angles, where the intensity has least variation with the light source and the degree of polarization is very low, are there any artifacts that reveal the light source distribution.

To reconstruct the surface height from the unambiguous surface normals, we applied the well-known Frankot-Chellappa integration algorithm [4]. The algorithm uses the Fourier transform to convert a field of surface normals into a depth map by finding the nearest integrable solution. The reconstructions of our test objects are shown in Fig. 6. The figure shows reasonable general reconstructions but with some problems caused by specularities in the rougher surfaces. Also, the surfaces are flatter than the real objects as a result of roughness, inter-reflections and unknown refractive indices. These problems have been noted in previous work in shape recovery from polarization in diffuse reflection [1,7]. Essentially, because the polarizing properties of diffuse reflection are weak, the raw zenith angle estimates are too heavily contaminated by noise to provide sufficiently accurate reconstructions. This is especially true for rough surfaces [2].
Fig. 5. Top row: raw images of the test objects (illuminated from Source 2). Bottom row: disambiguated azimuth angles.
Fig. 6. Surface reconstructions of the objects in Fig. 5.
Miyazaki et al. [7] address this issue by making the assumption that the histogram of zenith angles of the reconstructed surface matches that of a hemisphere. The histogram of the hemisphere takes the form N sin2 θ, where N is the number of points in the histogram. Atkinson and Hancock [2] use a different approach, where a relationship between the surface zenith angle and the pixel brightness is sought. Their method uses robust statistics to fit a curve to the histogram of pixel brightnesses and zenith angle estimates. Although more general to surface shape than the Miyazaki et al. method, it assumes that the light source and camera directions are identical, which necessitates a fourth light source position here. Fig. 7 shows how these two methods improve the surface reconstruction for the porcelain vase. Both methods show significant improvement, with the Atkinson and Hancock method marginally better on this occasion. For comparison, the mean angular errors across the entire vase were 19◦ when the raw estimates were used for the reconstruction (Fig. 7a), 9◦ using the Miyazaki et al. method (b) and 6◦ for the Atkinson and Hancock method (c). It should be noted, however, that these values are affected by alignment error. Both methods add very little computational cost to the algorithm (a few seconds). The Miyazaki et al. method is particularly useful for rough surfaces where direct zenith angle estimates are too degraded to be of any practical use.
Fig. 7. Profile of the vase reconstruction using (a) raw zenith angle estimates, (b) the zenith angles estimated using the Miyazaki et al. method, and (c) the zenith angles using the Atkinson and Hancock method. The exact profile is shown by the thick line.
5 Conclusion
This paper presented a new shape recovery technique for computer vision, which used polarization information and photometric stereo. The method used Fresnel theory to estimate the zenith angles completely and the azimuth angles ambiguously, assuming that the reflection from the surface was diffuse. Three light source positions were then used to reliably disambiguate the azimuth angles. The resulting field of constrained surface normals was then converted into depth using the Frankot-Chellappa surface integration algorithm. Finally, the paper showed how the zenith angle estimates can be improved by histogram modification or by incorporating pixel brightness information. In future work, we hope to extend the technique to allow for more general lighting conditions by modifying conditions (4–6). We also hope to improve the way in which the zenith angles are enhanced by estimating relations between the pixel brightnesses and the zenith and azimuth angles. This would remove the need for the fourth illumination direction that was mentioned in the previous section. Alternatively, the polarization data could be processed using singular value decomposition (SVD) in a similar fashion to Yuille et al. [11]. Previously, SVD has only been used in photometric stereo for shape estimation up to a linear transform, but the additional information contained within polarization should allow for unambiguous reconstruction.
References

1. Atkinson, G.A., Hancock, E.R.: Shape estimation using polarization and shading from two views. IEEE Trans. Patt. Anal. Mach. Intell. (to appear)
2. Atkinson, G.A., Hancock, E.R.: Recovery of surface orientation from diffuse polarization. IEEE Trans. Im. Proc. 15, 1653–1664 (2006)
3. Drbohlav, O., Šára, R.: Unambiguous determination of shape from photometric stereo with unknown light sources. In: Proc. ICCV, pp. 581–586 (2001)
4. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Trans. Patt. Anal. Mach. Intell. 10, 439–451 (1988)
5. Hecht, E.: Optics, 3rd edn. Addison Wesley Longman, London, UK (1998)
6. Miyazaki, D., Kagesawa, M., Ikeuchi, K.: Transparent surface modelling from a pair of polarization images. IEEE Trans. Patt. Anal. Mach. Intell. 26, 73–82 (2004)
7. Miyazaki, D., Tan, R.T., Hara, K., Ikeuchi, K.: Polarization-based inverse rendering from a single view. In: Proc. ICCV, vol. 2, pp. 982–987 (2003)
8. Rahmann, S., Canterakis, N.: Reconstruction of specular surfaces using polarization imaging. In: Proc. CVPR, pp. 149–155 (2001)
9. Wolff, L.B., Boult, T.E.: Constraining object features using a polarisation reflectance model. IEEE Trans. Patt. Anal. Mach. Intell. 13, 635–657 (1991)
10. Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19, 139–144 (1980)
11. Yuille, A.L., Snow, D., Epstein, R., Belhumeur, P.: Determining generative models for objects under varying illumination: Shape and albedo from multiple images using SVD and integrability. Intl. J. Comp. Vis. 35, 203–222 (1999)
12. Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape from shading: A survey. IEEE Trans. Patt. Anal. Mach. Intell. 21, 690–706 (1999)
Curvature Estimation in Noisy Curves

Thanh Phuong Nguyen and Isabelle Debled-Rennesson
LORIA Nancy, Campus Scientifique - BP 239, 54506 Vandoeuvre-lès-Nancy Cedex, France
{nguyentp,debled}@loria.fr
Abstract. An algorithm for estimating the curvature at each point of a general discrete curve in O(n log² n) is proposed. It uses the notion of blurred segment, which extends the definition of a segment of an arithmetic discrete line so that it can be adapted to noisy curves. The proposed algorithm relies on the decomposition of a discrete curve into maximal blurred segments, which is also presented in this paper.
1 Introduction
A lot of applications in image processing require geometric measurements of the discrete objects represented. In the framework of discrete geometry, estimators of geometrical parameters have been proposed, but they rely on the recognition of discrete line segments, which is very sensitive to the noise present in the studied curves [1,2,3]. The boundary of such discrete objects is often noisy because of the acquisition process. The concept of blurred segment was therefore introduced [4]; it allows the flexible segmentation of discrete curves, taking noise into account. Relying on an arithmetic definition of discrete lines [5], it generalizes such lines by admitting that some points are missing. A curvature estimator was proposed in [6] and used in an application to arc detection in technical documents [7]; the complexity of this curvature estimator is O(n²) (where n is the number of points of the studied curve), and it can only be applied to 8-connected simple curves. In this paper, we propose an extension of this algorithm to general curves. First, the recognition algorithm for blurred segments proposed in [4] is extended to general curves [8]; the problem of adding or removing a point is studied. Then, thanks to the decomposition of a curve into maximal blurred segments, we propose an algorithm that calculates the curvature at each point of a general discrete curve in O(n log² n).

The paper is organized as follows. In Section 2, after recalling definitions related to blurred segments, we study the problem of adding (or removing) a point to (from) a blurred segment of width ν in the case of a general discrete curve. Then we propose an extension to noisy curves of the notion of maximal segment of a discrete curve. An algorithm to determine all maximal blurred segments of a discrete curve is given in Section 3. In Section 4, after recalling the definition of the curvature estimator adapted to noisy curves, we propose a new algorithm for the determination of the curvature at each point of a discrete curve. Examples are also given.

This work is supported by the ANR in the framework of GEODIB project.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 474–481, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Blurred Segment of Width ν

2.1 Definitions
The notion of blurred segments relies on the arithmetical definition of discrete lines [5], where a line with slope a/b, lower bound μ and thickness ω (with a, b, μ and ω being integers such that gcd(a, b) = 1) is the set of integer points (x, y) verifying μ ≤ ax − by < μ + ω. Such a line is denoted by D(a, b, μ, ω). Let us recall the definitions [4] that we use in this paper (see Figure 1):

Definition 1. Let us consider a set of 8-connected points Sb. The discrete line D(a, b, μ, ω) is said to be bounding for Sb if all points of Sb belong to D.

Definition 2. Let us consider a set of 8-connected points Sb. A bounding line of Sb is said to be optimal if its vertical distance (i.e. (ω − 1)/max(|a|, |b|)) is minimal, i.e. if its vertical distance is equal to the vertical distance of conv(Sb), the convex hull of Sb.

Definition 3. A set Sb is a blurred segment of width ν if its optimal bounding line has a vertical distance less than or equal to ν, i.e. if (ω − 1)/max(|a|, |b|) ≤ ν.
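The definitions above translate directly into a few predicates. The following sketch uses hypothetical function names and only checks a given line D(a, b, μ, ω); it does not search for the optimal bounding line required by Definition 3.

```python
def in_discrete_line(a, b, mu, omega, x, y):
    """True if the integer point (x, y) belongs to D(a, b, mu, omega)."""
    return mu <= a * x - b * y < mu + omega


def is_bounding(a, b, mu, omega, points):
    """Definition 1: D(a, b, mu, omega) is bounding if it contains every point."""
    return all(in_discrete_line(a, b, mu, omega, x, y) for x, y in points)


def vertical_distance(a, b, omega):
    """Vertical distance (omega - 1) / max(|a|, |b|) of the line."""
    return (omega - 1) / max(abs(a), abs(b))


def bounds_within_width(a, b, mu, omega, points, nu):
    """True if this particular line bounds the points with vertical distance <= nu.

    This is sufficient for the points to form a blurred segment of width nu,
    but a False result is inconclusive unless the line is the optimal one.
    """
    return is_bounding(a, b, mu, omega, points) and vertical_distance(a, b, omega) <= nu
```

For the line D(5, 8, −8, 11), `vertical_distance(5, 8, 11)` gives 10/8 = 1.25.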
Fig. 1. D(5, 8, −8, 11), optimal bounding line (vertical distance = 10/8 = 1.25) of the sequence of gray points

2.2 Add (or Remove) a Point to (from) the Blurred Segment of Width ν
In this section we study the problem of adding (or removing) a point to (from) a blurred segment of width ν. The recognition algorithm for blurred segments of width ν presented in [4] runs in linear time; however, it only considers the incremental addition of a point to a blurred segment in the first octant. We present here the general case, which requires the incremental calculation of both the height and the width of the convex hull after adding or removing a point. To do that, we use the results given in [9] and [8], which we briefly recall below.
Dynamic Estimation of the Convex Hull

The problem of dynamically maintaining the convex hull of a set of points when a point is added to (or removed from) this set was studied by M.H. Overmars and J. van Leeuwen [9]. In this work, a convex hull is considered as the union of two parts: the upper convex hull (Uhull) and the lower convex hull (Lhull). Uhull and Lhull are updated after each addition or removal of a point; the cost of these operations is given by the following theorem [9].

Theorem 1. The convex hulls Uhull and Lhull of a set S of n points can be maintained dynamically at a worst-case cost of O(log² n) per addition or removal.

Determination of Height and Width of the Convex Hull

We use the double binary search technique [8] to determine the height and width of the convex hull. In [8], the convex hull is also considered as the union of two parts, Uhull and Lhull. The double binary search finds the vertical width of the convex hull in O(log² n) by using the concavity of the function height(x) = Uhull(x) − Lhull(x).
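To make the quantity height(x) = Uhull(x) − Lhull(x) concrete, the sketch below computes the vertical width of a point set in a deliberately naive, non-incremental way. It rebuilds both hulls from scratch with a monotone-chain scan and evaluates height(x) at every hull vertex, so it has neither the O(log² n) update cost of the dynamic structure of [9] nor the double binary search of [8]; it is only meant as a reference for checking such implementations. All names are hypothetical.

```python
def _cross(o, a, b):
    """Cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_upper_hulls(points):
    """Lower and upper convex hulls, both listed from left to right."""
    pts = sorted(set(points))
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) >= 0:
            upper.pop()
        upper.append(p)
    return lower, upper


def _chain_y(chain, x, pick):
    """Ordinate of a hull chain at abscissa x (pick resolves vertical edges)."""
    ys = [py for px, py in chain if px == x]
    if ys:
        return pick(ys)
    for (x0, y0), (x1, y1) in zip(chain, chain[1:]):
        if x0 < x < x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the hull")


def vertical_width(points):
    """Maximum of height(x) = Uhull(x) - Lhull(x).

    height is concave and piecewise linear, so its maximum is reached at the
    abscissa of a hull vertex; all of them are simply tried here.
    """
    lower, upper = lower_upper_hulls(points)
    xs = {px for px, _ in lower} | {px for px, _ in upper}
    return max(_chain_y(upper, x, max) - _chain_y(lower, x, min) for x in xs)
```

Comparing `vertical_width(points)` with ν then tests the vertical-distance condition for the current set; the horizontal case can be handled in the same way after swapping the coordinates.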
3 Maximal Blurred Segments of Width ν

3.1 Definitions and First Property
The notion of maximal segment of a discrete curve was proposed in [1,3] and relies on discrete line segments. This structure gives a global understanding of the discrete curve under analysis. We propose here an extension of that notion to blurred segments, adapted to noisy curves, using the same notations as in [3]. Let us consider a discrete curve C whose points are indexed from 0 to n − 1. C is a general curve and its points may be disconnected. We denote by Ci,j the set of successive points of C ordered increasingly from index i to j.
Definition 4. The predicate "Ci,j is a blurred segment of width ν" is denoted by BS(i, j, ν). The first index j, i ≤ j, such that BS(i, j, ν) and ¬BS(i, j + 1, ν) is called the front of i and denoted F(i). Symmetrically, the first index i such that BS(i, j, ν) and ¬BS(i − 1, j, ν) is called the back of j and denoted B(j).
Definition 5. Ci,j is called a maximal blurred segment of width ν, denoted MBS(i, j, ν), iff BS(i, j, ν) and ¬BS(i, j + 1, ν) and ¬BS(i − 1, j, ν).
An equivalent characterization of a maximal blurred segment of width ν, MBS(i, j, ν), is that F(i) = j and B(j) = i. In this work, we also use the notion of a blurred segment of width ν which is maximal on the right or left side:
Definition 6. Ci,j is called a maximal blurred segment of width ν on the right side (resp. on the left side), denoted MBS_R(i, j, ν) (resp. MBS_L(i, j, ν)), if F(i) = j (resp. B(j) = i).
Property 1. Let C be a discrete curve and MBS_ν(C) the sequence of maximal blurred segments of width ν of the curve C. Then, MBS_ν(C) =
Curvature Estimation in Noisy Curves
{MBS(B1, E1, ν), MBS(B2, E2, ν), ..., MBS(Bm, Em, ν)} and satisfies B1 < B2 < ... < Bm. So we have: E1 < E2 < ... < Em.
Proof: We consider two consecutive maximal blurred segments MBS(Bi, Ei, ν) and MBS(Bi+1, Ei+1, ν). By hypothesis, Bi < Bi+1. Let us suppose that Ei ≥ Ei+1; then MBS(Bi+1, Ei+1, ν) is a part of MBS(Bi, Ei, ν). Therefore MBS(Bi+1, Ei+1, ν) is not a maximal blurred segment, which is a contradiction.
3.2 Algorithm for the Segmentation of a Curve C into Maximal Blurred Segments
We propose an algorithm (see Algorithm 1) which determines all maximal blurred segments of width ν of a discrete curve C according to the conditions given in Section 3.1. To do so, Property 1 is used.
Complexity
Each point of the curve is scanned at most twice by this algorithm. The cost of determining a new optimal bounding discrete line when we add (or remove) a point to (from) a blurred segment is in O(log² n). Hence the complexity of this algorithm is in O(n log² n).
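The two-pointer structure implied by Property 1 can be sketched independently of the convex hull machinery. The code below is a simplified illustration, not the paper's Algorithm 1: it assumes a user-supplied predicate `is_segment(i, j)` standing in for BS(i, j, ν), assumes that this predicate holds for single points and for every sub-range of a range on which it holds, and re-evaluates it from scratch instead of updating a hull incrementally. The name `maximal_segments` is hypothetical.

```python
def maximal_segments(n, is_segment):
    """Return the maximal segments (begin, end) of a curve with n points.

    is_segment(i, j) plays the role of BS(i, j, nu). It is assumed to be
    true for i == j and hereditary (true on every sub-range of a range on
    which it is true), so that the front F(i) never decreases with i.
    """
    result = []
    front = 0
    previous_front = -1
    for begin in range(n):
        front = max(front, begin)
        # Push the front as far to the right as the predicate allows.
        while front + 1 < n and is_segment(begin, front + 1):
            front += 1
        # [begin, front] is maximal iff it could not be extended to the
        # left, i.e. iff the front moved since the previous start index.
        if front > previous_front:
            result.append((begin, front))
        previous_front = front
    return result
```

With a toy predicate such as "the values in the range differ by at most 1", the returned begin and end indices are both strictly increasing, as Property 1 states.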
4 Discrete Curvature of Width ν
4.1 Definition
We recall in this section the curvature estimator adapted to noisy curves [6]. It is directly deduced from the estimator proposed by D. Coeurjolly [2] for 2D curves without noise. This technique can be seen as a generalization of the classical order m normalized curvature [10]. Let C be a discrete curve and Ck a point of the curve. Let us consider the points Cl and Cr of C such that: l < k < r, BS(l, k, ν) and ¬BS(l − 1, k, ν), BS(k, r, ν) and ¬BS(k, r + 1, ν). The curvature of width ν at the point Ck is estimated from the radius of the circle passing through the points Cl, Ck and Cr. To determine the radius Rν(Ck) of the circumcircle of the triangle [Cl, Ck, Cr], we use the formula given in [11] (see Figure 2.a). Let s1 = ||CkCr||, s2 = ||CkCl|| and s3 = ||ClCr|| be the lengths of the corresponding vectors, then
$$R_\nu(C_k) = \frac{s_1 s_2 s_3}{\sqrt{(s_1 + s_2 + s_3)(s_1 - s_2 + s_3)(s_1 + s_2 - s_3)(s_2 + s_3 - s_1)}}$$
Then, the curvature of width ν at the point Ck is Cν(Ck) = s / Rν(Ck) with s = sign(det(CkCr, CkCl)) (it indicates the concavities and convexities of the curve). As indicated in [2], the degenerate cases, which correspond for example to collinear half-tangents, may be tested independently; a null curvature is then assigned to the considered point.
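As an illustration of the circumradius formula and of the sign convention, here is a minimal sketch. It assumes that the three points are given as coordinate pairs and treats an exactly zero determinant as the degenerate case; the function name `curvature_from_triple` is hypothetical.

```python
import math


def curvature_from_triple(cl, ck, cr):
    """Signed curvature 1/R of the circle through cl, ck and cr.

    The sign is that of det(CkCr, CkCl); collinear points give 0.0.
    """
    ux, uy = cr[0] - ck[0], cr[1] - ck[1]  # vector Ck -> Cr
    vx, vy = cl[0] - ck[0], cl[1] - ck[1]  # vector Ck -> Cl
    det = ux * vy - uy * vx
    if det == 0:
        return 0.0  # degenerate case: null curvature
    s1 = math.hypot(ux, uy)
    s2 = math.hypot(vx, vy)
    s3 = math.hypot(cr[0] - cl[0], cr[1] - cl[1])
    product = (s1 + s2 + s3) * (s1 - s2 + s3) * (s1 + s2 - s3) * (s2 + s3 - s1)
    radius = s1 * s2 * s3 / math.sqrt(product)
    return math.copysign(1.0, det) / radius
```

For three points taken on a circle of radius 5, the absolute value of the result is 0.2, and swapping the roles of the left and right points flips the sign.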
Algorithm 1. Algorithm for the segmentation of a curve C into maximal blurred segments of width ν
Data: C - discrete curve with n points, ν - width of the segmentation
Result: MBS_ν - the sequence of maximal blurred segments of width ν
begin
  k = 0; Sb = {C0}; MBS_ν = ∅; a = 0; b = 1; ω = b; μ = 0;
  while (ω − 1)/max(|a|, |b|) ≤ ν do
    k++; Sb = Sb ∪ Ck; Determine D(a, b, μ, ω) of Sb;
  end
  bSegment = 0; eSegment = k − 1;
  MBS_ν = MBS_ν ∪ CbSegment,eSegment;
  while k < n − 1 do
    while (ω − 1)/max(|a|, |b|) > ν do
      bSegment++; Sb = Sb \ CbSegment; Determine D(a, b, μ, ω) of Sb;
    end
    while (ω − 1)/max(|a|, |b|) ≤ ν do
      k++; Sb = Sb ∪ Ck; Determine D(a, b, μ, ω) of Sb;
    end
    eSegment = k − 1;
    MBS_ν = MBS_ν ∪ CbSegment,eSegment;
  end
end
4.2 Algorithm for the Estimation of the Curvature of Width ν at Each Point of C
We propose in this section a new algorithm for determining the curvature of width ν at each of the n points of a curve C. Its complexity is better than that of the naive algorithm, which is in O(n²). The method consists in computing, at each point Ck, the maximal blurred segment on the right side, MBS_R, the maximal blurred segment on the left side, MBS_L, and then the circle passing through the 3 points: left extremity of MBS_L, Ck, right extremity of MBS_R.
Principle of the Algorithm
Let MBS_R(k, r, ν) and MBS_L(l, k, ν) be the maximal blurred segments on the right and left sides of the point Ck. Then there exist r′ ≤ k and l′ ≥ k such that MBS_R(k, r, ν) ⊂ MBS(r′, r, ν) and MBS_L(l, k, ν) ⊂ MBS(l, l′, ν). Let us then consider the decomposition of C into maximal blurred segments: MBS_ν(C) = {MBS(B1, E1, ν), MBS(B2, E2, ν), ..., MBS(Bm, Em, ν)} with B1 < B2 < ... < Bm and E1 < E2 < ... < Em. We look for the indices i and j such that i is the first index such that Ei ≥ k and j is the last index such that Bj ≤ k. It follows that l = Bi, r = Ej and that the curvature of width ν at the point Ck is the inverse of the radius of the circumcircle of the triangle [Cl, Ck, Cr]. More generally, we have the following simple result:
Fig. 2. (a) Estimation of the curvature at the point T with width 2 (radius: 14.7638). (b) Ei (Bi+1) is the front (back) of the points in the first (second) bold edge.
Property 2. Let L(k), R(k) be the functions which respectively give the indices of the left and right extremities of the maximal blurred segments on the left and right sides of the point Ck.
– ∀k such that Ei−1 < k ≤ Ei, L(k) = Bi
– ∀k such that Bi ≤ k < Bi+1, R(k) = Ei
This method is used in Algorithm 2 (see Figure 2.b).
Complexity
Both the labelling step and the estimation of the curvature at each point are executed in linear time. However, the determination of the maximal blurred segments is executed in O(n log² n). Thus the complexity of our method is in O(n log² n). This is more efficient than the existing methods, which are in O(n² log² n) for general curves and in O(n²) for simple curves.
Algorithm 2. Width ν curvature estimation at each point of C
Data: C - discrete curve of n points, ν - width of the segmentation
Result: {Cν(Ck)}k=0..n−1 - curvature of width ν at each point of C
begin
  Build MBS_ν = {MBS(Bi, Ei, ν)}; m = |MBS_ν|; E−1 = −1; Bm = n;
  for i = 0 to m − 1 do
    for k = Ei−1 + 1 to Ei do L(k) = Bi;
    for k = Bi to Bi+1 − 1 do R(k) = Ei;
  end
  for i = 0(*) to n − 1(*) do
    Rν(Ci) = radius of the circumcircle of [CL(i), Ci, CR(i)];
    Cν(Ci) = s / Rν(Ci);
  end
end
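The labelling step of Algorithm 2 can be sketched as follows. This is an illustrative transcription under the assumption that the maximal blurred segments are already available as a list of (Bi, Ei) pairs with strictly increasing begins and ends; the function name `label_extremities` is hypothetical.

```python
def label_extremities(segments, n):
    """Compute L(k) and R(k) for k = 0..n-1 from the maximal segments.

    segments is a list of (B_i, E_i) pairs with increasing B_i and E_i.
    Indices not covered by any segment keep the value None.
    """
    left = [None] * n
    right = [None] * n
    previous_end = -1  # plays the role of E_{-1} = -1
    for i, (begin, end) in enumerate(segments):
        # Points with E_{i-1} < k <= E_i have their left extremity at B_i.
        for k in range(previous_end + 1, end + 1):
            left[k] = begin
        # Points with B_i <= k < B_{i+1} have their right extremity at E_i.
        next_begin = segments[i + 1][0] if i + 1 < len(segments) else n
        for k in range(begin, next_begin):
            right[k] = end
        previous_end = end
    return left, right
```

Each index is written once in each array, which is consistent with the linear cost of the labelling step stated in the complexity paragraph.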
Fig. 3. Examples of curvature extraction with ν = 2. (a) Noisy circle, radius = 20. (b) max = −0.0444, min = −0.0531, mean = −0.0474. (c) Second discrete curve. (d) max = 0.1761, min = −0.201.
(*) The bounds mentioned in the algorithm are correct for a closed curve. In the case of an open curve, the instruction becomes: for i = l to n − 1 − l, with l fixed to a constant value. Indeed, it is not possible to compute a maximal blurred segment on the left side (resp. on the right side) at the first point (resp. at the last point) of the curve. Thus the computation of the curvature begins (resp. stops) at the l-th (resp. (n − 1 − l)-th) point of the curve.
Results
Two discrete curves (3.a and 3.c) are shown in Figure 3 together with the graphs of their curvature values computed at each point with width 2 (3.b and 3.d). The points of curve 3.c corresponding to the peaks of the associated curvature graph 3.d are indicated by black pixels.
5 Conclusion
We have proposed the notion of maximal blurred segments of a discrete curve for a given width as an extension to noisy curves of the definitions proposed in [1,3]. This decomposition of a discrete curve enabled us to propose an optimized
version of the algorithm for computing the curvature of width ν at each point of a general discrete curve. However, we expect to obtain a specific algorithm with a better complexity for simple curves where points are added in one direction. The extension of the results of this paper to 3D discrete curves will be the subject of a future publication.
References
1. Feschet, F., Tougne, L.: Optimal time computation of the tangent of a discrete curve: Application to the curvature. In: Bertrand, G., Couprie, M., Perroton, L. (eds.) DGCI 1999. LNCS, vol. 1568, pp. 31–40. Springer, Heidelberg (1999)
2. Coeurjolly, D., Miguet, S., Tougne, L.: Discrete curvature based on osculating circle estimation. In: Arcelli, C., Cordella, L.P., Sanniti di Baja, G. (eds.) Visual Form 2001. LNCS, vol. 2059, pp. 303–312. Springer, Heidelberg (2001)
3. Lachaud, J.-O., Vialard, A., de Vieilleville, F.: Analysis and comparative evaluation of discrete tangent estimators. In: Andrès, É., Damiand, G., Lienhardt, P. (eds.) DGCI 2005. LNCS, vol. 3429, pp. 240–251. Springer, Heidelberg (2005)
4. Debled-Rennesson, I., Feschet, F., Rouyer-Degli, J.: Optimal blurred segments decomposition of noisy shapes in linear time. Computers & Graphics 30 (2006)
5. Reveillès, J.P.: Géométrie discrète, calculs en nombres entiers et algorithmique. Thèse d'état, Université Louis Pasteur, Strasbourg (1991)
6. Debled-Rennesson, I.: Estimation of tangents to a noisy discrete curve. In: Vision Geometry XII. SPIE, vol. 5300, pp. 117–126 (2004)
7. Salmon, J.P., Debled-Rennesson, I., Wendling, L.: A new method to detect arcs and segments from curvature profiles. In: Proceedings of the 18th International Conference on Pattern Recognition, Hong-Kong, China (2006)
8. Buzer, L.: An elementary algorithm for digital line recognition in the general case. In: Andrès, É., Damiand, G., Lienhardt, P. (eds.) DGCI 2005. LNCS, vol. 3429, pp. 299–310. Springer, Heidelberg (2005)
9. Overmars, M., van Leeuwen, J.: Maintenance of configurations in the plane. J. Comput. and Syst. Sci. 23, 166–204 (1981)
10. Rosenfeld, A., Johnston, E.: Angle detection on digital curves. IEEE Transactions on Computers, 875–878 (1973)
11. Harris, J., Stocker, H.: Handbook of mathematics and computational science. Springer, Heidelberg (1998)
3D+t Reconstruction in the Context of Locally Spheric Shaped Data Observation
Wafa Rekik1, Dominique Béréziat2, and Séverine Dubuisson1
1 Université Pierre et Marie Curie (UPMC), Laboratoire d'Informatique de Paris 6 (LIP6), 104 Avenue du Président Kennedy, 75016 Paris
2 Clime project/INRIA Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France
[email protected]
Abstract. The main focus of this paper is 3D+t shape recovery from 3D spatial data and 2D+t temporal sequences. This reconstruction is particularly challenging due to the great deal of in-depth information loss observed in the 2D+t temporal sequence. Our approach embeds a local geometrical constraint to handle the critical lack of information. This prior constraint is defined by a spherical topology because several applications may be concerned. It allows us to model relevantly the 3D-to-2D transformation that reduces each 3D image into a 2D frame. We can then build an inaccurate 3D inverse reconstruction of each 2D frame belonging to the video, i.e. the 2D+t sequence. These inaccurate 3D images are enhanced by gradual motion compensation using a regularity criterion. Results on synthetic data are displayed.
Keywords: 3D reconstruction, motion compensation, variational formulation, optical flow constraint.
1 Introduction
In some computer vision applications, we deal with volumetric, i.e. 3D, acquisitions framing temporal, i.e. 2D+t, ones (figure 1). The first type of acquisition provides a description of the 3D scene geometry, thus purely spatial information. The second type observes 2D object motion, providing temporal and partial spatial information. A possible approach to exploit these complementary datasets exhaustively is to carry out a complete spatio-temporal, or 3D+t, scene reconstruction. 3D+t sequences are restored by recovering the original 3D volume from each 2D frame belonging to the 2D+t sequence. As a matter of fact, 3D reconstruction is an inverse problem. In this particular case, it is also an ill-posed problem, since a single 2D frame is hardly sufficient to recover the original 3D volume. Indeed, this frame exhibits a great deal of spatial distortion and in-depth loss of information caused by the projective transformation that reduces the real 3D structure into a 2D image. To handle this critical lack of information, we introduce a prior data model. It consists in a geometrical constraint defined by a spherical topology.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 482–489, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Recovering a 3D+t sequence from a couple of 3D images framing a series of 2D projections (panel labels: Volume 1 (3D), 2D projections, Volume 2 (3D), 3D+t full sequence)
Data observing locally spheric shaped objects can be issued from remote sensing acquisitions, especially meteorology, from astronomical ones focusing on the sun, and from bio-cellular microscopic imagery. A relevant biological application is described in [1], where biologists study the dynamics of structures of interest evolving on a spherical cell simulation surface. In this context, for instance, spatial datasets consist in multifocus acquisitions. Hence, they are not actual 3D images, since they are formed by a series of equally sized slices, where the signal in each slice is disturbed by luminosity coming from adjacent ones. A spatial deconvolution stage is therefore needed when biologists correctly calibrate the video microscopy system for cell wall simulation experiments. Due to this limitation and to respect generality, we use synthetic data to validate the implemented algorithms. Our approach uses this geometrical prior constraint in order to model relevantly the 3D-to-2D transformation. This modeling allows us to build an inaccurate 3D inverse reconstruction of each 2D frame belonging to the video sequence. These inaccurate 3D images are enhanced by gradual motion compensation using a regularity criterion. We proceed with a short overview of 3D reconstruction methods using multiple/single projective views in section 2. Then, we move to the description of our 3D+t reconstruction approach in section 3. Experimental results on synthetic datasets are given in section 4.
2 State-of-the-Art
In the literature, a wide scope of approaches addresses the 3D reconstruction problem. Volume based methods aim at restoring an accurate description of luminosity in each voxel of the 3D reconstructed image, while surface based ones provide a 3D model of the object of interest. This model requires shape and texture recovery of the outer visible side of the observed object. Namely, tomographic reconstruction [2] focuses on restoring the original volume from multiple projective views. Stereo-vision [3] approaches make use in general of the triangulation principle in order to recover the surface shape of selected targets.
Moreover, 2D-to-3D registration methods reconstruct 3D features from only one projective view. This registration consists in identifying a geometrical transformation that aligns 2D frames with 3D datasets, often acquired with a different modality. This alignment determines their relative position and orientation and then locates them in the coordinate system of the 3D images. It is of growing interest in the medical imaging field, mainly to register 3D pre-operative images (such as MRI or CT images) to 2D series of intra-operative frames acquired during a surgical intervention. Registration may rely on feature-based methods or intensity-based ones; the reader is referred to [4,5]. Tomography and stereovision require multiple projective views to achieve the reconstruction task. Therefore, approaches aligning 2D projection images with 3D datasets appear to be adapted to our context of study. However, they fit into a registration framework rather than a reconstruction one, unless other projective views are available, which is quite unusual. They can alternatively be used as a preprocessing stage for some of the aforementioned applications. We therefore built an adapted reconstruction model involving a prior geometrical constraint.
3 3D+t Reconstruction Approach
We carry out the 3D+t reconstruction using a couple of 3D data acquired at two different instants as well as the 2D+t sequence filmed in between. Since the intermediary video sequence has an interesting temporal resolution, we propose to restore the underlying 3D structure of each frame. Each frame results from a reduction of the real 3D structure into a 2D projective view at a given time t. We take into account the prior geometrical constraint related to the observed object in order to build an inaccurate 3D inverse reconstruction of each 2D frame. We then attempt to improve this inaccurate 3D representation by motion compensation. We compute a motion vector field matching the inaccurate 3D structure with the 3D one already correctly estimated at time t − 1. The whole 3D+t sequence is reconstructed gradually using the couple of 3D data as initial boundary conditions.
3.1 Modeling Assumption
We carry out the 3D+t reconstruction using a series of 2D+t video sequence acquisitions interleaved with 3D volumetric ones. We adopt the following notations for the remainder of the paper: X = (x, y, z), x = (x, y), W = (u, v, w), I(X, t), I2D(x, t) and Î(X, t) are respectively a 3D position vector, a 2D position vector, a 3D vector field, a 3D+t sequence, a 2D+t one and finally the estimate of a 3D image at time t. Concretely, we have two 3D structures at two different instants: I^0 = I(X; t0) and I^1 = I(X; t1). Between t = t0 and t = t1, we also have a video sequence I2D(x; t) describing all the 2D reductions of the I(X; t) structures evolving during this period. Each 2D frame I2D is a projective view, at a given instant t, of the original 3D volume. This projection depends only on the data acquisition mechanism. We model it by a linear and
stationary transformation, called projection p: p(I)(x, y) = ∫ I(x, y, z) h(z) dz, where h(z) is a measure kernel describing the interaction between the observed object and the input light signal. p integrates the contributions of all 3D slices to produce a unique 2D image. It thus introduces an important loss of information in the z-direction.
3.2 Prior Geometrical Constraint
Let us point out that, in our case of study, structures of interest evolve on a spheroid surface. The transformation p does not introduce distortions in the x- and y-directions; consequently, p reduces merely to a sum of two contributions coming from, respectively, the frontal and the dorsal hemisphere. 2D frames therefore suffer from a positioning ambiguity. Indeed, we are unable to assert whether a projected object belongs to the frontal or to the dorsal hemisphere. We suppose that the spherical support has a constant radius, which we estimate in a pre-processing stage. We map the texture of each 2D frame onto both sides of a sphere with the same parameters as the original surface. This leads to an inaccurate 3D+t sequence IG(X, t), geometrically similar to the original 3D+t sequence I(X, t) but holding errors generated by the positioning ambiguity (figure 2). The sequence IG is formally obtained by IG(X, t) = I2D(G(X), t), where G is defined by G(x, y, z) = (x, y)^T if (x − cx)² + (y − cy)² + (z − cz)² = R². Parameters (cx, cy, cz) and R are respectively the center and the radius of the spherical surface.
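The projection p and the mapping onto both hemispheres can be illustrated on a small voxel grid. The sketch below is not the authors' implementation: it assumes a discrete sum in place of the integral, a volume stored as nested lists indexed as volume[x][y][z], and a rounding of the sphere depth to the nearest slice. The names `project` and `map_on_sphere` are hypothetical.

```python
import math


def project(volume, kernel):
    """Discrete analogue of p: sum over z of I(x, y, z) * h(z)."""
    return [[sum(v * h for v, h in zip(column, kernel)) for column in plane]
            for plane in volume]


def map_on_sphere(frame, center, radius, depth):
    """Build an inaccurate volume I_G by copying every 2D sample onto the
    frontal and dorsal hemispheres of the sphere (center, radius)."""
    cx, cy, cz = center
    nx, ny = len(frame), len(frame[0])
    volume = [[[0.0] * depth for _ in range(ny)] for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            d2 = radius ** 2 - (x - cx) ** 2 - (y - cy) ** 2
            if d2 < 0:
                continue  # (x, y) lies outside the projected disk
            dz = math.sqrt(d2)
            # Assumption: depths are rounded to the nearest slice index.
            for z in {round(cz - dz), round(cz + dz)}:
                if 0 <= z < depth:
                    volume[x][y][z] = frame[x][y]
    return volume
```

Projecting the resulting volume with a uniform kernel doubles each sample that was copied onto two distinct slices, which is one way to see the two-hemisphere contribution described above.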
Fig. 2. From left to right: illustration of the positioning ambiguity problem and recovery of inaccurate 3D images, IG(X, t), from 2D frames I2D(x, t) (panel labels: a given 2D projection I2D(x, t); possible 3D original images; mapping the same texture on frontal and dorsal hemispheres, IG(X, t))
Building the inaccurate 3D+t sequence yields the following data model equation:
$$\hat{I}(X, t) = I_G(X, t), \quad t \in\, ]t_0; t_1[ \qquad (1)$$
3.3 Motion Compensation
In order to restore correctly each luminosity contribution in the corresponding hemisphere, we match the inaccurate 3D representation, IG (X, t), with the 3D
image available at t − 1, I(X, t − 1). This matching is quantified as a displacement vector field W describing voxel motion between both structures IG(X, t) and I(X, t − 1). Since times t and t − 1 are assumed to be very close and objects of interest evolve little during this interval, we assume that a moving voxel keeps the same gray value over time. A straightforward equation, similar to the optical flow constraint, is then obtained under the constant brightness assumption from time t to t − 1:
$$\hat{I}(X, t) = I(X + W, t - 1) \qquad (2)$$
We recall that Î(X, t) denotes the estimate of the 3D image at time t. For notational convenience, we estimate a retrograde vector field defining voxel displacement from t − 1 to t. Expanding the right-hand side of equation (2) in a first-order Taylor series leads to: I(X + W, t − 1) ≈ I(X, t − 1) + ∇I(X, t − 1)·W. This linearization yields the following motion constraint:
$$\hat{I}(X, t) = I(X, t - 1) + \nabla I(X, t - 1) \cdot W \qquad (3)$$
3.4 Numerical Resolution
Combining both constraints, related respectively to motion (3) and to the data model (1), leads to a model equation for the displacement vector field:
$$I(X, t - 1) + \nabla I(X, t - 1) \cdot W - I_G(X, t) = 0 \qquad (4)$$
On the one hand, equation (4) is underdetermined, as it is insufficient on its own to recover the displacement vector field W reliably. On the other hand, a data model constraint involving a corrupted 3D representation, IG(X, t), is embedded in this equation. To cope with the inaccuracy of IG(X, t), we add an adapted smoothness term standing for a spatial regularization of W. This regularization overcomes the aforementioned positioning ambiguity problem. We gradually estimate a dense motion field using a variational formulation. We build a functional E whose minimum, with respect to W, corresponds to the displacement vector field matching IG(X, t) and I(X, t − 1). E is composed of two additive terms:
$$E(W) = \int_\Omega \left( I(X, t - 1) + \nabla I(X, t - 1) \cdot W - I_G(X, t) \right)^2 dX + \alpha \int_\Omega \left( \|\nabla u\|^2 + \|\nabla v\|^2 + \|\nabla w\|^2 \right) dX \qquad (5)$$
where the first term is the squared left-hand side of equation (4) and the second one penalizes high spatial deformations of W. Parameter α tunes the importance of the second term, i.e. the spatial regularization. Differentiation of E with respect to the displacements u, v and w yields a set of Euler-Lagrange equations (the reader is referred to [6]). Discretization of the latter with finite differences leads to a huge linear system, which we solve with a Gauss-Seidel iterative scheme. Estimated 3D images are reconstructed by motion compensation with respect to equation (2). However, the positions of mapped samples computed with this equation do not perfectly match the grid positions in the output image Î(X, t). We then use a method introduced in [7] to fit a smooth surface from the
set of mapped samples. This algorithm makes use of a coarse-to-fine hierarchy of control lattices in order to generate a sequence of bi-cubic B-spline functions whose sum approaches the desired interpolation function. In order to gradually restore the complete 3D+t sequence, we adopt a quite simple reconstruction strategy. We use, respectively, a forward progression procedure taking as first available 3D image I^0, and a backward one using the final available 3D image I^1. Consequently, we generate two displacement vector fields at the median moment of the temporal interval [t0; t1]. The final 3D image is computed merely using the average of both fields.
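The Gauss-Seidel resolution can be illustrated on a one-dimensional toy analogue of the functional. The sketch below is not the authors' 3D solver: it assumes a single displacement component w on a 1D grid, a data term (grad[i]·w[i] − diff[i])² where diff stands for I_G(·, t) − I(·, t − 1), a first-order finite-difference smoothness term, and free boundaries. The name `gauss_seidel_flow_1d` is hypothetical.

```python
def gauss_seidel_flow_1d(grad, diff, alpha, sweeps=200):
    """Gauss-Seidel minimisation of the 1D toy energy

        sum_i (grad[i] * w[i] - diff[i])**2
        + alpha * sum_i (w[i + 1] - w[i])**2

    Assumes alpha > 0 and at least two grid points, so that every
    denominator below is strictly positive.
    """
    n = len(grad)
    w = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            neighbours = [w[j] for j in (i - 1, i + 1) if 0 <= j < n]
            # Setting the derivative of the energy w.r.t. w[i] to zero
            # gives w[i] as a function of its current neighbours.
            numerator = grad[i] * diff[i] + alpha * sum(neighbours)
            denominator = grad[i] ** 2 + alpha * len(neighbours)
            w[i] = numerator / denominator
    return w
```

Where the gradient vanishes, the data term carries no information and the update reduces to an average of the neighbours, which is the effect expected from the spatial regularization.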
4 Results
We present, in this section, a series of experiments in order to assess the performance of the proposed approach. They consist in computing the displacement vector field matching a 2D frame with an anterior 3D structure. As real data are not available, we design two sets of simple synthetic data. The first one is composed of a couple of spheres I(X, t1), I(X, t2) holding two squares lying respectively on the frontal and dorsal hemispheres. From t1 to t2, the center of the frontal square moves towards the bottom right and the dorsal square merely moves downwards. We simulate the 2D transformation of I(X, t2), yielding I2D(x, t). We then build the inaccurate representation IG(X, t2) using I2D(x, t), and finally we compute the displacement vector field matching I(X, t1) and IG(X, t2). The estimate Î(X, t1) is computed by motion compensation. The second set of data is almost identical to the first one, except that the squares occlude each other in the 2D projection. In order to visualize the motion computation results, we use an adapted tool called MAPVIS [8]. This tool displays complex information (scalar and vectorial) lying on real spheroid surfaces or projected ones. In the 3D case, MAPVIS projects the data embedded on the observed part of the actual globe. Moreover, it improves the visualization of texture in perspective projective views of spheric shaped structures. This tool is based on a suitable planar map projection that unrolls the curved surface around a given origin. Therefore, varying the projection origin around the surface allows different views of the sphere to be observed. Since the selected map projection minimizes distortions around the projection origin, the closer a point is to this origin, the more accurate the data recovery. The equatorial aspect of the map projection, with reference to an origin lying on the equator, displays the frontal hemisphere of the observed globe, and the polar one, with reference to the north geographical pole, displays information lying on the northern hemisphere.
We display in figures 3 and 4 the motion results for respectively the first and second sets of synthetic data. Let us point out that motion computation produces a set of correct vector fields (in terms of both amplitude and direction), perfectly tangent to the enclosing spherical surface. Moreover, our reconstruction method reliably handles the positioning ambiguity, as the moving squares are mapped onto the right original hemispheres. This is noticeable when visualizing the difference between the estimated 3D image and the original one, i.e. I(., t2) − Î(., t2). The performance of our algorithm is not affected by the more complex case showing object occlusion (see figure 4), which does not ease the aforementioned structure distinction problem. However, the difference image also shows reconstruction errors, doubtless amplified by the double interpolation procedure involved, first to compute a smooth surface from the samples recovered by motion compensation and second to visualize projective views [8]. These errors are localized on the border of moving objects. They may be caused by spatial derivative computations. Besides, they are due to the smoothing of motion discontinuities introduced by the spatial regularization of the estimated displacement vector field.
Fig. 3. On the first line: from left to right, visualization of the equatorial aspect then polar aspect of respectively I(., t1), W then IG(., t2). On the second line: from left to right, visualization of the equatorial aspect then polar aspect of respectively Î(., t2) then of the difference image I(., t2) − Î(., t2).
Fig. 4. On the first line: from left to right, visualization of the equatorial aspect then polar aspect of respectively I(., t1) then IG(., t2). On the second line: from left to right, visualization of the equatorial aspect then polar aspect of respectively Î(., t2) then of the difference image I(., t2) − Î(., t2).
5 Conclusion
In this paper, we have presented an original approach dedicated to 3D+t scene reconstruction using 3D spatial data and 2D+t temporal sequences. It is based
on motion compensation and involves a prior geometrical data constraint. The latter relies on a spherical topology because several applications may be concerned. It leads us to build an inaccurate 3D inverse reconstruction of each 2D frame. This 3D image is enhanced by motion compensation, involving a regularity criterion under the brightness constancy assumption. The whole 3D+t sequence is reconstructed gradually using both 3D data as initial boundary conditions. Incorporating the prior geometrical constraint reduces the 3D-to-2D transformation to merely a sum of contributions coming from the underlying frontal and dorsal hemispheres. This generates a positioning ambiguity in the intermediate inaccurate 3D representation. Our algorithm reliably overcomes this inaccuracy and handles possible object occlusions. Besides, we use an adapted tool, with relevant visualization properties, in order to display reconstruction results quickly. Currently, the whole sequence is computed frame by frame, causing errors to accumulate along the sequence. This limitation could be circumvented using a spatio-temporal smoothness constraint computed globally on the whole sequence [9]. Regarding other reconstruction issues, it is possible to generalize the prior geometrical constraint to any regular surface. We also propose, in another framework, to solve the 3D+t scene recovery under weaker assumptions on the data model. Moreover, we plan to apply our approach based on spherical topology to microscopic cell wall simulation acquisitions in the context described in [1]. That requires a spatial deconvolution framework to reconstruct each sphere from the original multi-focus image.
References
1. Staneva, G., Angelova, M., Koumanov, K.: Phospholipase A2 promotes raft budding and fission from giant liposomes. Chem. Phys. Lipids 129, 53–62 (2004)
2. Natterer, F.: The mathematics of computerized tomography. Wiley, New York (1986)
3. Huei-Yung, L.: Computer vision techniques for complete 3D model reconstruction. PhD thesis, State University of New York at Stony Brook (2002)
4. Gueziec, A., Kazanzides, P., Williamson, B., Taylor, R.: Anatomy-based registration of CT-scan and intraoperative X-ray images for guiding a surgical robot. IEEE Trans. Med. Imaging 17, 715–728 (1998)
5. Lemieux, L., Jagoe, R., Fish, D., Kitchen, N., Thomas, D.: A patient-to-computed-tomography image registration method based on digitally reconstructed radiographs. Med. Phys. 21, 1749–1760 (1994)
6. Glowinski, R.: Numerical Methods for Nonlinear Variational Problems. Springer Series in Computational Physics, New York (1984)
7. Lee, S., Wolberg, G., Shin, S.: Scattered data interpolation with multilevel B-splines. IEEE Transactions on Visu. and Comput. Graph. 3(3), 228–244 (1997)
8. Rekik, W., Béréziat, D., Dubuisson, S.: Mapvis: a map-projection based tool for visualizing scalar and vectorial information lying on spheroidal surfaces. In: Proceedings of IV05, London (July 2005)
9. Weickert, J., Schnörr, C.: Variational optic flow computation with a spatio-temporal smoothness constraint. Journal of Math. Imag. and Vision 14(3), 245–255 (2001)
Robust Fitting of 3D Objects by Affinely Transformed Superellipsoids Using Normalization
Frank Ditrich and Herbert Suesse
Chair of Digital Image Processing, Department of Computer Science, Friedrich-Schiller-University Jena, D-07737 Jena, Germany
[email protected],
[email protected] http://www.inf-cv.uni-jena.de
Abstract. We present an algorithm for the robust fitting of objects given as voxel data with affinely transformed superellipsoids. Superellipsoids cover a broad range of various forms and are widely used in many application fields. Our approach uses the method of normalization and a new separation of the affine transformation into a shearing, an anisotropic scale and a rotation. It extends our previous work for 2D fitting problems and for fitting rectangular boxes in 3D. Our technique can be used as a valuable tool for solving this fitting task for 3D data. If the exponents describing the superellipsoids to be fitted are known in advance, the method is extremely robust even against major distortions of the object to be fitted.
1 Introduction
In the past, a lot of work was done dealing with the task of object fitting. This is not only important for objects in the 2D domain, but also in 3D, since there are a lot of techniques (e.g. computer tomography) which deliver 3D data. An overview of fitting methods for superellipses in 2D can be found in [4]. In [7] and [8], we developed methods using normalization for fitting 2D objects which are given as closed regions. A solution for rectangular boxes given as a region in a 3D voxel space was shown in [1], but was constrained only to the group of similarity transformations. This paper presents a new solution which uses ideas from 2D normalization and extends them towards a procedure for normalizing affine 3D transformations using an appropriate matrix separation. Compared to other methods, an advantage is the affine invariance of the fitting method itself. The second section of the paper introduces superellipsoids, and the third section briefly explains the principle of normalization. The following section describes our fitting procedure together with the explanation of the necessary
The work of Frank Ditrich was supported by DAKO GmbH Jena, Germany.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 490–497, 2007. © Springer-Verlag Berlin Heidelberg 2007
normalization steps. Some results are given which show the fitting quality we achieved during some experiments. Finally, after a conclusion some further work which is of interest to augment and complete our solution is mentioned.
2 Superellipsoids
In 2D an ellipse can be described by the parametric representation
$$e(u) = \begin{pmatrix} a\cos u \\ b\sin u \end{pmatrix} \quad \text{with } u \in [-\pi, \pi).$$
Superellipses are a generalization of ellipses with the representation
$$s(u) = \begin{pmatrix} a(\cos u)^{\varepsilon} \\ b(\sin u)^{\varepsilon} \end{pmatrix} \quad \text{with } u \in [-\pi, \pi) \text{ and } \varepsilon > 0.$$
The exponent $\varepsilon$ is used to modify the form of the object; some examples are given in Fig. 1. Of course such a generalization is also possible for 3D ellipsoids.
Fig. 1. Some 2D superellipses with their form depending on the exponent ε (values from left to right 0.3, 0.5, 1.0, 2.0, 3.0)
Here, we have two exponents $\varepsilon_1$ and $\varepsilon_2$ to control the object form. A great variety of objects can be described, see Fig. 2. The parametric representation is the following:
$$s(u, v) = \begin{pmatrix} a(\cos u)^{\varepsilon_1}(\cos v)^{\varepsilon_2} \\ b(\cos u)^{\varepsilon_1}(\sin v)^{\varepsilon_2} \\ c(\sin u)^{\varepsilon_1} \end{pmatrix} \quad \text{with } u \in \left[-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right],\ v \in [-\pi, \pi),\ \varepsilon_1, \varepsilon_2 > 0.$$
This is a common representation found in the literature. In real applications one has to handle the cases where some of the sine and cosine expressions become negative by using their absolute values and additionally considering their signs. An implicit representation is given by the equation
$$\left( \left(\frac{x}{a}\right)^{\frac{2}{\varepsilon_2}} + \left(\frac{y}{b}\right)^{\frac{2}{\varepsilon_2}} \right)^{\frac{\varepsilon_2}{\varepsilon_1}} + \left(\frac{z}{c}\right)^{\frac{2}{\varepsilon_1}} = 1.$$
Superellipsoids have a broad range of applications. For example, they are used as basic modelling elements in computer graphics. Other applications are the modelling of bubbles in fluids or the simulation of particles in granular materials. Since we allow here the affine transformation of the superellipsoids described above, the range of forms for which our algorithm can be used becomes even larger.
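The remark about negative sine and cosine values can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names `signed_pow`, `superellipsoid_point` and `implicit_value` are hypothetical, and the implicit equation is evaluated on absolute coordinates so that fractional powers stay real.

```python
import numpy as np

def signed_pow(t, e):
    """Signed power sign(t) * |t|**e, so negative sine/cosine values are handled."""
    return np.sign(t) * np.abs(t) ** e

def superellipsoid_point(u, v, a, b, c, e1, e2):
    """Surface point s(u, v) of an axis-aligned superellipsoid (parametric form)."""
    x = a * signed_pow(np.cos(u), e1) * signed_pow(np.cos(v), e2)
    y = b * signed_pow(np.cos(u), e1) * signed_pow(np.sin(v), e2)
    z = c * signed_pow(np.sin(u), e1)
    return np.array([x, y, z])

def implicit_value(p, a, b, c, e1, e2):
    """Left-hand side of the implicit equation; it should equal 1 on the surface."""
    x, y, z = np.abs(p) / np.array([a, b, c])
    return (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)
```

A point generated by the parametric form should satisfy the implicit equation up to rounding, which gives a quick consistency check between the two representations.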
Fig. 2. Various 3D superellipsoids, controlled by the exponents ε1 and ε2 ((ε1 , ε2 ) from left to right: (0.3, 0.3), (1.0, 1.0), (3.0, 0.3), (0.5, 3.0), (1.0, 2.0), (3.0, 3.0)).
3 Fitting Using Normalization
In this section we briefly explain the principle of normalization for fitting and describe the advantage of this method over conventional fitting methods. Suppose we are given a class $T$ of transformations where each $t \in T$ is described by $n$ parameters $\xi_1, \dots, \xi_n$. The theoretical shapes we use for fitting (e.g. rectangles, ellipses, cubes etc.) are modelled through a class $P(\theta)$ in which each member is described by $m$ parameters $\theta_1, \dots, \theta_m$. If we want to fit a primitive to a given object $O$, we usually derive some features $f(O) = (f_1, f_2, f_3, \dots)$. Some fitting methods then try to solve the problem
$$\|f(O) - f(P(\theta))\|^2 \longrightarrow \min$$
by searching the whole $m$-dimensional space $\Theta$ of all primitives. Using normalization, we first define an appropriate canonical frame for our class of primitives which depends also on the transformation group, for example a unit square for the class of all squares with respect to similarity transformations. Then, we try to calculate a transform which maps the object to be fitted onto this canonical frame according to some normalization conditions, giving the normalized object $O'$. If we also normalize the optimal primitive $P(\theta^*)$ with the same transformation, giving $P'(\theta^*)$, then we have to solve the following optimization problem:
$$\|f(O') - f(P'(\theta^*))\|^2 \longrightarrow \min$$
It has the dimension $m - n$, which is usually significantly smaller than the dimension $m$ of the problem mentioned above. There are many cases where it becomes 0 or 1, see [7] or [8]. After successfully solving this problem we get the optimal primitive by applying the inverse of the normalization transformation to the optimization solution. For the determination of the normalization transformation and the parameter optimization we use volume moments up to fourth order ($p + q + r \le 4$), which are defined as
$$m_{pqr} = \iiint_{\text{object}} x^p y^q z^r \, dx \, dy \, dz.$$
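For voxel data, the volume moments can be approximated by a sum over the occupied voxels. The following is a minimal sketch, assuming a boolean or 0/1 occupancy array whose three axes are taken as x, y and z and unit-sized voxels located at integer coordinates; the function name `volume_moment` is hypothetical.

```python
import numpy as np

def volume_moment(voxels, p, q, r):
    """Approximate m_pqr by summing x^p y^q z^r over all occupied unit voxels."""
    x, y, z = np.nonzero(voxels)  # integer coordinates of the object voxels
    x, y, z = x.astype(float), y.astype(float), z.astype(float)
    return float(np.sum(x**p * y**q * z**r))
```

This is a plain Riemann-sum approximation of the integral; a real implementation might prefer exact integration over each voxel cube.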
4 Normalization in 3D
To get the normalization transformation for an object, the transformation matrix can be separated into some simpler transformations, and a set of appropriate normalization conditions is used to determine their parameters. A detailed description together with some examples for the 2D case can be found in [5,7,8]. For affine 3D transformations we use here a separation which is similar to the well-known standard method for 2D transformations [9]. The transformation is separated into a rotation (considered as three rotations about the coordinate axes), an anisotropic scaling and a shearing:
$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = R_Z \cdot R_Y \cdot R_X \cdot S \cdot D$$
$$= \begin{pmatrix} \cos u & -\sin u & 0 \\ \sin u & \cos u & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos v & 0 & -\sin v \\ 0 & 1 & 0 \\ \sin v & 0 & \cos v \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos w & -\sin w \\ 0 & \sin w & \cos w \end{pmatrix} \begin{pmatrix} \delta & 0 & 0 \\ 0 & \varepsilon & 0 \\ 0 & 0 & \lambda \end{pmatrix} \begin{pmatrix} 1 & \alpha & \beta \\ 0 & 1 & \gamma \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} | & | & | \\ C_1 & C_2 & C_3 \\ | & | & | \end{pmatrix}.$$
Here we only consider transformations which do not contain reflections. If a matrix $A$ is given, then $u$, $v$ and $\delta$ can be determined from
$$C_1 = \begin{pmatrix} \delta \cos u \cos v \\ \delta \sin u \cos v \\ \delta \sin v \end{pmatrix} = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{31} \end{pmatrix}.$$
Since this expression can be interpreted as a spherical coordinates representation for the point $(a_{11}, a_{21}, a_{31})$, we get two possible solutions for $u$ and $v$: if $(u, v)$ fulfills the equation, then $(\pi + u, \pi - v)$ is also a solution. With the substitution of $\varepsilon \cos w$ and $\varepsilon \sin w$ the next line turns into a system of linear equations, and the values of $\alpha$, $\varepsilon$ and $w$ can be calculated from them:
$$C_2 = \begin{pmatrix} \alpha\delta \cos u \cos v - \varepsilon(\sin u \cos w + \cos u \sin v \sin w) \\ \alpha\delta \sin u \cos v + \varepsilon(\cos u \cos w - \sin u \sin v \sin w) \\ \alpha\delta \sin v + \varepsilon \cos v \sin w \end{pmatrix} = \begin{pmatrix} a_{12} \\ a_{22} \\ a_{32} \end{pmatrix}.$$
It is easy to show that the value $\alpha$ is independent of the choice of one of the above solutions for $u$ and $v$. If we get a value $w$ as a solution using the values $u$ and $v$, then the solution for $\pi + u$ and $\pi - v$ is $\pi + w$. But both triples of values lead to the same product $R_Z \cdot R_Y \cdot R_X$.
If these values are known, then $\beta$, $\gamma$ and $\lambda$ can be determined from the last column
$$C_3 = \begin{pmatrix} \beta\delta \cos u \cos v - \varepsilon\gamma(\sin u \cos w + \cos u \sin v \sin w) + \lambda(\sin u \sin w - \cos u \sin v \cos w) \\ \beta\delta \sin u \cos v + \varepsilon\gamma(\cos u \cos w - \sin u \sin v \sin w) - \lambda(\cos u \sin w + \sin u \sin v \cos w) \\ \beta\delta \sin v + \varepsilon\gamma \cos v \sin w + \lambda \cos v \cos w \end{pmatrix} = \begin{pmatrix} a_{13} \\ a_{23} \\ a_{33} \end{pmatrix}.$$
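As a cross-check of this separation, note that $S \cdot D$ is upper triangular, so $A = R \cdot (S D)$ has the shape of a QR decomposition. The sketch below uses that observation instead of the column-by-column solution described above; it is a swapped-in technique, not the procedure of the text. It assumes $\det A > 0$ and positive scale factors, and the names `rotation` and `separate_affine` are hypothetical.

```python
import numpy as np

def rotation(u, v, w):
    """R_Z(u) @ R_Y(v) @ R_X(w), with the sign conventions of the matrices above."""
    cu, su, cv, sv, cw, sw = np.cos(u), np.sin(u), np.cos(v), np.sin(v), np.cos(w), np.sin(w)
    RZ = np.array([[cu, -su, 0.0], [su, cu, 0.0], [0.0, 0.0, 1.0]])
    RY = np.array([[cv, 0.0, -sv], [0.0, 1.0, 0.0], [sv, 0.0, cv]])
    RX = np.array([[1.0, 0.0, 0.0], [0.0, cw, -sw], [0.0, sw, cw]])
    return RZ @ RY @ RX

def separate_affine(A):
    """Split A into rotation R, scales (delta, eps, lam) and shears (alpha, beta, gamma)."""
    Q, U = np.linalg.qr(A)            # A = Q U with U upper triangular
    s = np.sign(np.diag(U))           # flip signs so that the diagonal of U is positive
    Q, U = Q * s, s[:, None] * U      # (Q diag(s)) (diag(s) U) still equals A
    scales = np.diag(U).copy()        # delta, eps, lam
    D = U / scales[:, None]           # unit upper-triangular shearing matrix
    return Q, scales, (D[0, 1], D[0, 2], D[1, 2])
```

The rotation is returned as a whole matrix, which matches the statement that the separation is unique only if the rotation is considered as a whole.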
So, we have shown that this separation is possible and unique if the rotation matrix is considered as a whole, and we can use it for normalization. According to the separation, the normalization steps are the following. First, we handle the shearing matrix. The moments are transformed in this way:
$$m'_{pqr} = \iiint_{\text{object}} (x + \alpha y + \beta z)^p (y + \gamma z)^q z^r \, dx \, dy \, dz.$$
Analogously to the 2D case, here we consider the moments
$$m'_{110} = m_{110} + \gamma m_{101} + \alpha m_{020} + (\alpha\gamma + \beta) m_{011} + \beta\gamma m_{002},$$
$$m'_{101} = m_{101} + \alpha m_{011} + \beta m_{002},$$
$$m'_{011} = m_{011} + \gamma m_{002}.$$
If we state the conditions
$$m'_{110} = 0, \quad m'_{101} = 0, \quad m'_{011} = 0,$$
we can compute the parameters $\alpha$, $\beta$ and $\gamma$ from them in the following way:
$$\alpha = \frac{m_{101} m_{011} - m_{110} m_{002}}{m_{020} m_{002} - m_{011}^2}, \quad \beta = \frac{m_{110} m_{011} - m_{101} m_{020}}{m_{020} m_{002} - m_{011}^2}, \quad \gamma = -\frac{m_{011}}{m_{002}}.$$
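The shearing step can be checked numerically on a point cloud, where sums over points stand in for the volume integrals. This is a sketch under that assumption; the helper names `second_moments`, `shear_parameters` and `apply_shear` are hypothetical.

```python
import numpy as np

def second_moments(pts):
    """Second-order moments of an (N, 3) point array, keyed by (p, q, r)."""
    x, y, z = pts.T
    return {(1, 1, 0): np.sum(x * y), (1, 0, 1): np.sum(x * z), (0, 1, 1): np.sum(y * z),
            (0, 2, 0): np.sum(y * y), (0, 0, 2): np.sum(z * z)}

def shear_parameters(m):
    """alpha, beta, gamma that make the mixed second-order moments vanish."""
    det = m[(0, 2, 0)] * m[(0, 0, 2)] - m[(0, 1, 1)] ** 2
    alpha = (m[(1, 0, 1)] * m[(0, 1, 1)] - m[(1, 1, 0)] * m[(0, 0, 2)]) / det
    beta = (m[(1, 1, 0)] * m[(0, 1, 1)] - m[(1, 0, 1)] * m[(0, 2, 0)]) / det
    gamma = -m[(0, 1, 1)] / m[(0, 0, 2)]
    return alpha, beta, gamma

def apply_shear(pts, alpha, beta, gamma):
    """Map (x, y, z) to (x + alpha*y + beta*z, y + gamma*z, z)."""
    x, y, z = pts.T
    return np.column_stack([x + alpha * y + beta * z, y + gamma * z, z])
```

After `apply_shear`, recomputing the moments should give values of $m'_{110}$, $m'_{101}$ and $m'_{011}$ that are zero up to rounding.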
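In the second step we deal with the anisotropic scaling. Here the transformation of moments is done according to
$$m'_{pqr} = \det \begin{pmatrix} \delta & 0 & 0 \\ 0 & \varepsilon & 0 \\ 0 & 0 & \lambda \end{pmatrix} \cdot \iiint_{\text{object}} (\delta x)^p (\varepsilon y)^q (\lambda z)^r \, dx \, dy \, dz = \delta^{p+1} \varepsilon^{q+1} \lambda^{r+1} m_{pqr},$$
especially
$$m'_{200} = \delta^3 \varepsilon \lambda \, m_{200}, \quad m'_{020} = \delta \varepsilon^3 \lambda \, m_{020}, \quad m'_{002} = \delta \varepsilon \lambda^3 \, m_{002}.$$
With the conditions
$$m'_{200} = 1, \quad m'_{020} = 1, \quad m'_{002} = 1$$
we get
$$\delta = \sqrt[10]{\frac{m_{020} m_{002}}{m_{200}^4}}, \quad \varepsilon = \sqrt[10]{\frac{m_{002} m_{200}}{m_{020}^4}}, \quad \lambda = \sqrt[10]{\frac{m_{200} m_{020}}{m_{002}^4}}.$$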
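The three scale factors follow directly from the tenth-root expressions. A minimal sketch, with the hypothetical function name `scale_parameters` and assuming positive input moments:

```python
def scale_parameters(m200, m020, m002):
    """delta, eps, lam such that the scaled moments m'_200, m'_020, m'_002 all equal 1."""
    delta = (m020 * m002 / m200**4) ** 0.1
    eps = (m002 * m200 / m020**4) ** 0.1
    lam = (m200 * m020 / m002**4) ** 0.1
    return delta, eps, lam
```

Substituting the result back into $\delta^3 \varepsilon \lambda \, m_{200}$, $\delta \varepsilon^3 \lambda \, m_{020}$ and $\delta \varepsilon \lambda^3 \, m_{002}$ is a simple way to confirm the three normalization conditions.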
The third step is the normalization of the rotation. In the 2D case it was possible to determine the rotation angle $\varphi$ directly from a normalization condition (see [5]). Unfortunately, an explicit solution for the 3D case is not yet known to us. Therefore, we used the following conditions to bring the object into an adequate position:
$$m'_{310} = 0, \quad m'_{301} = 0, \quad m'_{130} = 0, \quad m'_{031} = 0, \quad m'_{103} = 0, \quad m'_{013} = 0.$$
Here $m'_{pqr}$ are the moments obtained after the rotation. For a rotation matrix
$$R = \begin{pmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{pmatrix}$$
(with $\det(R) = 1$) they are transformed from $m_{pqr}$ using
$$(x + y + z)^n = \sum_{\substack{0 \le a,b,c \le n \\ a+b+c=n}} \frac{(a+b+c)!}{a!\, b!\, c!} \, x^a y^b z^c = \sum_{\substack{0 \le a,b,c \le n \\ a+b+c=n}} \binom{a+b+c}{b+c} \binom{b+c}{c} x^a y^b z^c$$
(see [2]) as follows:
$$m'_{pqr} = \sum_{\substack{0 \le a,b,c \le p \\ a+b+c=p}} \ \sum_{\substack{0 \le d,e,f \le q \\ d+e+f=q}} \ \sum_{\substack{0 \le h,i,k \le r \\ h+i+k=r}} \binom{a+b+c}{b+c} \binom{b+c}{c} \binom{d+e+f}{e+f} \binom{e+f}{f} \binom{h+i+k}{i+k} \binom{i+k}{k} \cdot R_{11}^a R_{12}^b R_{13}^c R_{21}^d R_{22}^e R_{23}^f R_{31}^h R_{32}^i R_{33}^k \cdot m_{a+d+h,\, b+e+i,\, c+f+k}.$$
We used them to do an optimization
$$(m'_{310}(u, v, w))^2 + (m'_{301}(u, v, w))^2 + (m'_{130}(u, v, w))^2 + (m'_{031}(u, v, w))^2 + (m'_{103}(u, v, w))^2 + (m'_{013}(u, v, w))^2 \longrightarrow \min$$
subject to $u, v, w \in [0, 2\pi)$ to determine the three rotation angles and to bring the object into its final position for the next stage.
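The cost function of this optimization can be sketched as follows. For brevity the sketch rotates a point cloud and recomputes the moments directly, rather than applying the multinomial transformation formula to precomputed moments; for sums over points the two routes should agree. The names `rotation`, `moment` and `rotation_cost` are hypothetical, and the choice of minimizer is left open.

```python
import numpy as np

def rotation(u, v, w):
    """R_Z(u) @ R_Y(v) @ R_X(w), with the sign conventions used for the separation."""
    cu, su, cv, sv, cw, sw = np.cos(u), np.sin(u), np.cos(v), np.sin(v), np.cos(w), np.sin(w)
    RZ = np.array([[cu, -su, 0.0], [su, cu, 0.0], [0.0, 0.0, 1.0]])
    RY = np.array([[cv, 0.0, -sv], [0.0, 1.0, 0.0], [sv, 0.0, cv]])
    RX = np.array([[1.0, 0.0, 0.0], [0.0, cw, -sw], [0.0, sw, cw]])
    return RZ @ RY @ RX

def moment(pts, p, q, r):
    """Discrete moment m_pqr of an (N, 3) point array."""
    x, y, z = pts.T
    return np.sum(x**p * y**q * z**r)

def rotation_cost(angles, pts):
    """Sum of squares of the six fourth-order moments that should vanish."""
    rotated = (rotation(*angles) @ pts.T).T
    indices = [(3, 1, 0), (3, 0, 1), (1, 3, 0), (0, 3, 1), (1, 0, 3), (0, 1, 3)]
    return sum(moment(rotated, *idx) ** 2 for idx in indices)
```

Any general-purpose minimizer over $(u, v, w)$ could be applied to `rotation_cost`; starting from several initial angle triples is a reasonable precaution, since the cost has several minima.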
5 Determination of the Remaining Parameters
In the last step, it is necessary to determine the parameters of the superellipsoid which best describes the normalized object. This is done by performing an optimization using the following moments of the normalized object:
$$m'_{000}, \ m'_{400}, \ m'_{040}, \ m'_{004}, \ m'_{220}, \ m'_{202}, \ m'_{022}.$$
The theoretical values of these moments for given parameters $a, b, c, \varepsilon_1, \varepsilon_2$ can be calculated according to (see [3] for a detailed description)
$$\mu_{pqr}(a, b, c, \varepsilon_1, \varepsilon_2) = \frac{2}{p+q+2} \, a^{p+1} b^{q+1} c^{r+1} \varepsilon_1 \varepsilon_2 \cdot B\!\left((r+1)\frac{\varepsilon_1}{2}, \ (p+q+2)\frac{\varepsilon_1}{2} + 1\right) \cdot B\!\left((q+1)\frac{\varepsilon_2}{2}, \ (p+1)\frac{\varepsilon_2}{2}\right),$$
since all indices are even numbers. Moments with odd indices become 0. Here $B(x, y) = \Gamma(x)\Gamma(y) / \Gamma(x + y)$. The parameters are determined by
$$(m'_{000} - \mu_{000})^2 + (m'_{400} - \mu_{400})^2 + (m'_{040} - \mu_{040})^2 + (m'_{004} - \mu_{004})^2 + (m'_{220} - \mu_{220})^2 + (m'_{202} - \mu_{202})^2 + (m'_{022} - \mu_{022})^2 \longrightarrow \min,$$
where $\mu_{pqr}$ abbreviates $\mu_{pqr}(a, b, c, \varepsilon_1, \varepsilon_2)$,
subject to $a, b, c, \varepsilon_1, \varepsilon_2 > 0$. One aspect we have to note in this situation is that the calculation of the moments given above holds for a superellipsoid in the special position defined by the parametric representation in Section 2 (along the z-axis). The conditions for normalizing the rotation using the moments $m'_{310}$ etc. cannot guarantee that the object is brought into this position. They will be fulfilled if (but not only if) the object is aligned along a coordinate axis, and additionally a rotation of $\pi/4$ about this axis is allowed. So we have to perform the optimization for these six positions and use the best one as the result. In cases in which the superellipsoid is not convex (if one of the exponents $\varepsilon_1, \varepsilon_2$ is greater than 2), there are also special positions different from these six where the rotation normalization conditions are satisfied too. But they can easily be rejected, since they lead to non-meaningful results for the parameters. In our experiments it was always possible to reach a solution for the parameter optimization by simply choosing multiple start values. As the very last step, the inverse of the normalizing transformation composed of the shearing, the anisotropic scaling and the rotation is applied to the optimal superellipsoid found during parameter optimization to get the final fitting solution.
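The closed-form moment expression for $\mu_{pqr}$ is easy to evaluate with the gamma function from the standard library. The sketch below uses the hypothetical names `beta_fn` and `mu`; it only covers even indices $p, q, r$, as stated in the text.

```python
from math import gamma, pi

def beta_fn(x, y):
    """Beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return gamma(x) * gamma(y) / gamma(x + y)

def mu(p, q, r, a, b, c, e1, e2):
    """Theoretical moment mu_pqr of an axis-aligned superellipsoid (even p, q, r only)."""
    return (2.0 / (p + q + 2) * a ** (p + 1) * b ** (q + 1) * c ** (r + 1) * e1 * e2
            * beta_fn((r + 1) * e1 / 2, (p + q + 2) * e1 / 2 + 1)
            * beta_fn((q + 1) * e2 / 2, (p + 1) * e2 / 2))
```

For $\varepsilon_1 = \varepsilon_2 = 1$ the superellipsoid is an ordinary ellipsoid, so `mu(0, 0, 0, a, b, c, 1, 1)` can be compared with the volume $\tfrac{4}{3}\pi abc$ as a sanity check.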
6 Experimental Results
For testing purposes, we randomly generated affinely transformed superellipsoids within a $200^3$ voxel cube. To evaluate the fitting quality, we generated the border of the original superellipsoid and the border of the fitted superellipsoid as voxels. For every voxel of one border the nearest voxel of the other one is determined. The maximum of all these distances is used as a quality measure. Our method provides very good fitting results. For 300 randomly generated superellipsoids we got an average fitting quality of 1.5. Even if the form of the object is distorted by major missing or added parts, we have only small deviations of the fitted object from the original, if the values for the exponents ε1 and ε2 are known before fitting, see Fig. 3. In the left and middle images an affinely transformed superellipsoid with parameters a = 40, b = 60, c = 80, ε1 = 1, ε2 = 1 is used, and a sphere with radius 30 is removed or added. The resulting values for the fitting quality are 3.16 and 8.25. In the right image the value for the exponents is 2.5, the radius of the removed sphere is 20 and the resulting quality is 3.61. So here we have a robustness against distortions like we have already noticed for rectangular boxes [1].
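The quality measure described here resembles a Hausdorff distance between the two border voxel sets. The sketch below assumes the symmetric variant (nearest-neighbour distances are taken in both directions), which the text does not state explicitly; the function name `fitting_quality` is hypothetical.

```python
import numpy as np

def fitting_quality(border_a, border_b):
    """Largest nearest-neighbour distance between two (N, 3) and (M, 3) voxel sets."""
    # Pairwise Euclidean distances; brute force, O(N * M) memory.
    d = np.linalg.norm(border_a[:, None, :] - border_b[None, :, :], axis=2)
    # Worst case over both directions: a -> b and b -> a.
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

For borders inside a $200^3$ cube the brute-force distance matrix can become large, so a spatial index or a distance transform would probably be preferable in practice.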
Fig. 3. Three superellipsoids (gray) with major parts missing or added and the fitting result (white). See Section 6 for explanation.
7 Conclusion and Further Work
In this paper we presented an algorithm for fitting objects with affinely transformed superellipsoids. Its key element is a normalization of an affine transformation by separating it into a shearing, an anisotropic scaling and a rotation. We achieved very good fitting results, even for objects with major missing parts. One drawback of our method is the necessity to determine the rotation by an optimization instead of explicit calculation as it was possible in 2D. To find a solution for this problem is one task of the possible additional work. Perhaps there are also other separations of the affine transformations (as it is the case in 2D) that are also suitable for handling this fitting problem.
References
1. Ditrich, F., Suesse, H.: A New Efficient Algorithm for Fitting Rectangular Boxes and Cubes in 3D. In: Proc. IEEE ICIP 2005, Genoa, Italy, Part II, pp. 1134–1137 (2005)
2. Graham, R.L., Knuth, D.E., Patashnik, O.: Concrete Mathematics. Addison-Wesley, Massachusetts (1994)
3. Jaklič, A., Solina, F.: Moments of Superellipsoids and their Application to Range Image Registration. IEEE Trans. Systems, Man, and Cybernetics, Part B 33(4), 648–657 (2003)
4. Rosin, P.L.: Fitting Superellipses. IEEE Trans. PAMI 20, 726–732 (2000)
5. Rothe, I., Suesse, H., Voss, K.: The Method of Normalization to Determine Invariants. IEEE Trans. PAMI 18, 366–375 (1996)
6. Solina, F., Bajcsy, R.: Recovery of Parametric Models from Range Images: The Case for Superquadrics with Global Deformations. IEEE Trans. PAMI 12, 131–147 (1990)
7. Voss, K., Suesse, H.: Invariant Fitting of Planar Objects by Primitives. IEEE Trans. PAMI 19, 80–83 (1997)
8. Voss, K., Suesse, H.: A New One-Parametric Fitting Method for Planar Object. IEEE Trans. PAMI 21, 646–651 (1999)
9. Wang, K.P.W.: Affin Invariant Moment Method of Three-Dimensional Object Identification, Ph.D. Thesis. Syracuse University (1977)
10. Zhang, Y.: Experimental comparison of superquadratic fitting objective functions. PRL 24(14), 2185–2193 (2003)
Fast and Precise Weak-Perspective Factorization
Levente Hajder^{1,2}, Ákos Pernek^{2}, and Csaba Kazó^{2}
1 Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17., H-1111 Budapest, Hungary
2 Budapest University of Technology and Economics
[email protected] Abstract. We address the problem of moving object reconstruction. Several methods have been published in the past 20 years including stereo reconstruction as well as multi-view factorization methods. In general, reconstruction algorithms estimate the 3D structure of the object and the camera parameters in a non-optimal way and then a nonlinear optimization method refines the estimated camera parameters and 3D object coordinates. In this paper, an adjustment method is proposed which is the fast version of the well-known down-hill alternation method. The novelty which yields the high speed of the algorithm is that the steps of the alternation give optimal solution to the subproblems by closed-form formulas. The proposed algorithm is discussed here and it is compared to the widely used bundle adjustment method.
1 Introduction
3D motion-based object reconstruction is a challenging and developing area of computer vision. Many of the published rigid SfM methods are based on various extensions of the well-known factorization method [9] by Tomasi and Kanade. The original method assumes orthographic projection, while its extensions can cope with weak-perspective [12] as well as real projection [8]. It is well known that there is no optimal closed-form solution to the problem. For this reason, researchers have proposed non-optimal methods and then try to minimize the so-called reprojection error (or another cost function) by adjustment algorithms such as bundle adjustment [2]. It is also well known that these classical algorithms [9,12,8] cannot cope with realistic situations: they assume that all feature points are tracked over all frames. For this reason, several methods have been published to solve the case when data are missing or uncertain [7,4,11]. (The reader is referred to [3] for an excellent review.) However, the full matrix factorization is also an important issue, because Monte Carlo type robust algorithms [10,6,5] can use it to compute the camera motion from few points. The proposed method assumes the weak-perspective camera model, which is applicable if the depth of the object is much smaller than the object-camera distance, but it can be extended easily to the perspective case by using the so-called rescaled measurement matrix [8]. The proposed down-hill alternation method is very similar to the one published by us in 2006 [6]. The main difference is that the proposed method is faster, because the computation steps are given in closed form.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 498–505, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Summary of the Tomasi-Kanade Factorization
Given $P$ feature points of a rigid object tracked across $F$ frames, $x_{fp} = (u_{fp}, v_{fp})^T$, $f = 1, \dots, F$, $p = 1, \dots, P$, the goal of SfM is to recover the structure of the object. For orthogonal projection, the 2D coordinates are calculated as
$$x_{fp} = R_f s_p + t_f, \qquad (1)$$
where $R_f = [r_{f1}, r_{f2}]^T$ is the orthonormal rotation matrix (of size $2 \times 3$, with orthonormal rows), $s_p$ the 3D coordinates of the point and $t_f$ the offset. Under the weak perspective model, the equation is
$$x_{fp} = q_f R_f s_p + t_f, \qquad (2)$$
where $q_f$ is the nonzero scale factor of weak perspective. The offset vector is eliminated by placing the origin of the 2D coordinate system at the centroid of the feature points. For all points in the $f$-th image, the above equations can be rewritten as
$$\underbrace{W_f}_{2 \times P} = (x_{f1} \dots x_{fP}) = \underbrace{M_f}_{2 \times 3} \cdot \underbrace{S}_{3 \times P}, \qquad (3)$$
where $M_f$ is called the motion matrix and $S = (s_1, \dots, s_P)$ the structure matrix. Under orthography $M_f = R_f$, under weak perspective $M_f = q_f R_f$. For all frames, the equations (3) form
$$\underbrace{W}_{2F \times P} = \underbrace{M}_{2F \times 3} \cdot \underbrace{S}_{3 \times P}, \qquad (4)$$
where $W^T = [W_1^T, W_2^T, \dots, W_F^T]$ and $M^T = [M_1^T, M_2^T, \dots, M_F^T]$. The Tomasi-Kanade factorization method [9] has two main steps:
Step 1. In the noise-free case, $W$ has rank 3 after subtracting the centre of gravity in all frames. When noise is present, the method reduces the rank of the noisy measurement matrix by Singular Value Decomposition. In this step, $W$ is factorized into affine motion and affine structure: $W = M_{aff} S_{aff}$.
Step 2. In the reduced space, the method applies a linear transformation to obtain the real (Euclidean) motion and structure data. The transformation is represented by a matrix, $Q$, and the estimated motion and structure data are written as $M = M_{aff} Q$ and $S = Q^{-1} S_{aff}$, respectively. The matrix $Q$ can be optimally determined by imposing orthographic [9] or weak-perspective [12] motion constraints. The final factorization is written as $W = M S$, where the matrix $S$ consists of the 3D coordinates of the moving object, and the matrix $M$ contains the 3D directions of base vectors of the camera planes.
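Step 1 of the factorization can be sketched with a truncated SVD. This is a minimal illustration only: the function name `affine_factorization` is hypothetical, the split of the singular values between the two factors is one common convention among several, and Step 2 (the corrective transformation $Q$) is deliberately left out.

```python
import numpy as np

def affine_factorization(W):
    """Rank-3 affine factorization of a 2F x P measurement matrix (Step 1 only)."""
    Wc = W - W.mean(axis=1, keepdims=True)      # subtract the centroid in every row
    U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
    root = np.sqrt(s[:3])                       # share the singular values evenly
    M_aff = U[:, :3] * root                     # 2F x 3 affine motion
    S_aff = root[:, None] * Vt[:3]              # 3 x P affine structure
    return M_aff, S_aff, Wc
```

In the noise-free case the product `M_aff @ S_aff` reproduces the centred measurement matrix; with noise it is the closest rank-3 matrix in the Frobenius norm.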
If few feature points are tracked, the quality of the Tomasi-Kanade factorization is not sufficient, and this is mainly caused by the SVD-based rank reduction. This reduction technique needs a relatively large number of feature points to obtain a good estimation. The SVD method determines a 3D subspace, and, despite the corrective transformation Q, the result cannot leave this subspace.
3 Proposed Method: Fast Alternation
As discussed in the introduction, the Tomasi-Kanade factorization does not produce an optimal solution. We propose a down-hill alternation method to minimize the reprojection error
$$\|W - M S\|_F^2 \qquad (5)$$
subject to $M_f M_f^T = q_f^2 I$, $\forall f$, where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. Our method guarantees that a local optimum will be reached starting from the initial matrices $M^{(0)}$ and $S^{(0)}$ yielded by the Tomasi-Kanade factorization. Unfortunately, a global optimum cannot be guaranteed. The proposed method is an iterative one consisting of three main steps: the completion step, the S-step and the M-step. It runs until the difference between subsequent error values drops below a given limit $\epsilon$. The whole algorithm is summarized in Alg. 1. We call this algorithm Fast Alternation, because its speed is (relatively) high despite the alternation: all the parameter estimations in the substeps can be written in closed form. The slowest calculation is the pseudo-inverse computation of a 2F × 3 motion matrix.
3.1 Completion Step
The motion submatrix $M_f$ consists of two rows, $m_{f,1}^T$ and $m_{f,2}^T$: $M_f = [m_{f,1}, m_{f,2}]^T$. Let us denote the completed matrix by $\tilde{M}_f = [m_{f,1}, m_{f,2}, m_{f,3}]^T$, where the direction of $m_{f,3}$ is parallel to the cross product of $m_{f,1}$ and $m_{f,2}$, and its length is the average length of the vectors $m_{f,1}$ and $m_{f,2}$. Based on the above strategy, the measurement matrix $W$ is also completed. $\tilde{W}_f$ denotes the completed version of $W_f$, whose third row is calculated as $m_{f,3}^T S$.
3.2 S-Step
The goal of the S-step is to determine the structure matrix $S$ optimally. The structure matrix is not constrained: its entries can take arbitrary real values. The optimal value of $S^{(k)}$ is provided by the well-known least-squares optimization:
$$S^{(k)} = {M^{(k-1)}}^{\dagger} \, \tilde{W}^{(k-1)}, \qquad (6)$$
where ${M^{(k-1)}}^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of matrix $M^{(k-1)}$.
3.3 M-Step
At the M-step, $M^{(k)}$ is estimated from $\tilde{W}^{(k)}$ and $S^{(k)}$. It is trivial that the value of $M_i$ is independent of that of $M_j$ if $i \ne j$. The main idea is that the calculation of $M_f$ from $W_f$ and $S$ is a quasi-registration problem: the only difficulty is that $S$ consists of 3-dimensional vectors, while $W_f$ only contains 2-dimensional vectors. This problem can be eliminated if $W_f$ and $M_f$ are completed as described earlier in this section. After completion, both the completed measurement matrix $\tilde{W}_f$ and the structure matrix $S$ consist of $P$ 3D vectors. The estimated (complete) motion matrix $\tilde{M}_f$ can be decomposed into the product of a scale parameter $q_f$ and an orthonormal matrix $R_f$. According to [1], the rotation is optimally calculated as $R_f = V_f U_f^T$ if $H_f = U_f \Lambda_f V_f^T$ is the singular value decomposition (SVD) of
$$H_f = \sum_{p=1}^{P} s_p \tilde{w}_{fp}^T, \qquad (7)$$
where $s_p$ is the $p$th column of $S$ and $\tilde{w}_{fp}$ is that of $\tilde{W}_f^{(k)}$. The optimal scale is given by the following formula:
$$q_f = \frac{\sum_{p=1}^{P} \tilde{w}_{fp}^T R_f s_p}{\sum_{p=1}^{P} s_p^T s_p}. \qquad (8)$$
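The completion step, the S-step and the M-step can be sketched together in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, rows of $M$ and $W$ are assumed to be stored frame by frame, the pseudo-inverse is applied to the completed $3F \times 3$ motion matrix so that the dimensions match the completed measurement matrix, and no check for a reflection ($\det R_f = -1$) is made.

```python
import numpy as np

def complete(M):
    """Append to every 2x3 block of M a third row along the cross product of its rows."""
    blocks = []
    for f in range(M.shape[0] // 2):
        m1, m2 = M[2 * f], M[2 * f + 1]
        m3 = np.cross(m1, m2)
        # Rescale to the average length of the first two rows.
        m3 = m3 * 0.5 * (np.linalg.norm(m1) + np.linalg.norm(m2)) / np.linalg.norm(m3)
        blocks.append(np.vstack([m1, m2, m3]))
    return np.vstack(blocks)                      # shape (3F, 3)

def complete_W(W, M_t, S):
    """Complete W: the third row of every frame is m_{f,3}^T S."""
    F = W.shape[0] // 2
    W_t = np.empty((3 * F, W.shape[1]))
    for f in range(F):
        W_t[3 * f:3 * f + 2] = W[2 * f:2 * f + 2]
        W_t[3 * f + 2] = M_t[3 * f + 2] @ S
    return W_t

def s_step(W_t, M_t):
    """Least-squares structure via the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(M_t) @ W_t

def m_step(W_t, S):
    """Per-frame rotation from the SVD of H_f and scale from the closed-form ratio."""
    rows = []
    for f in range(W_t.shape[0] // 3):
        Wf = W_t[3 * f:3 * f + 3]                 # completed 3 x P frame
        U, _, Vt = np.linalg.svd(S @ Wf.T)        # H_f = sum_p s_p w_fp^T
        R = Vt.T @ U.T
        q = np.sum(Wf * (R @ S)) / np.sum(S * S)  # sum_p w^T R s / sum_p s^T s
        rows.append(q * R[:2])                    # keep the two image rows
    return np.vstack(rows)                        # shape (2F, 3)
```

On noise-free data generated from the weak-perspective model, one S-step followed by one M-step started at the true motion should return the true structure and motion, which is a convenient sanity check before running the full iteration.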
Algorithm 1. SfM by Fast Alternation
$M^{(0)}, S^{(0)}$ ← Tomasi-Kanade($W$)
$\tilde{W}^{(0)}, \tilde{M}^{(0)}$ ← Complete($W$, $M^{(0)}$, $S^{(0)}$)
$k$ ← 0
repeat
  $k$ ← $k + 1$
  $S^{(k)}$ ← S-Step($\tilde{W}^{(k-1)}$, $\tilde{M}^{(k-1)}$)
  $\tilde{W}^{(k)}$ ← Complete($W$, $\tilde{M}^{(k-1)}$, $S^{(k)}$)
  $\tilde{M}^{(k)}$ ← M-Step($\tilde{W}^{(k)}$, $S^{(k)}$)
  $\tilde{W}^{(k)}$ ← Complete($W$, $\tilde{M}^{(k)}$, $S^{(k)}$)
until $\left| \, \|\tilde{W}^{(k-1)} - \tilde{M}^{(k-1)} S^{(k-1)}\|^2 - \|\tilde{W}^{(k)} - \tilde{M}^{(k)} S^{(k)}\|^2 \, \right| < \epsilon$

0 such that $C_q = 0$ for $q > n$ and each abelian group $C_q$ is finitely generated. All chain complexes considered here are finite and free. A chain $c$ in $C$ is a $q$-cycle if $c \in \operatorname{Ker} d_q$. If $c \in \operatorname{Im} d_{q+1}$ then $c$ is called a $q$-boundary. Denote the groups of $q$-cycles and $q$-boundaries by $Z_q$ and $B_q$ respectively. Define the integer $q$th homology group to be the quotient group $Z_q / B_q$, denoted by $H_q(C)$. We say that $c$ is a representative $q$-cycle of the homology generator $c + B_q$ (denoted by $[c]$). For each $q$, the $q$th homology group $H_q(C)$ is a finitely generated free abelian group. The rank of $H_q$, denoted by $\beta_q$, is called the $q$th Betti number of $C$. Homology is a powerful topological invariant, which characterizes an object by its $q$-dimensional "holes" (connected components, tunnels and cavities).
Let $K$ be a complex (simplicial, cubical or simploidal). A $q$-chain $c$ is a formal sum of $q$-cells (where $q$ is the dimension of the cell) in $K$. Let $\{\sigma_1^q, \dots, \sigma_{m_q}^q\}$ be the set of $q$-cells in $K$, then $c = \sum_{i=1}^{m_q} \lambda_i \sigma_i^q$, where $\lambda_i \in \{0, 1\}$. Alternatively, we can think of $c$ as the set $\{\sigma_i^q \text{ such that } \lambda_i = 1\}$, and the sum of two $q$-chains as their symmetric difference. The $q$-chains together with the addition operation form the group of $q$-chains denoted as $C_q(K)$. The differential of a $q$-cell $\sigma$ in $K$, $d_q(\sigma)$, is the sum of the $(q-1)$-cells in $K$ that belong to the boundary of $\sigma$. By linearity, the differential can be extended to $q$-chains. The chain complex $C(K)$ is the sequence of chain groups $C_q(K)$ connected by the homomorphisms $d_q$. The homology of $K$ is defined as the homology of $C(K)$. Since we work with
A Graph-with-Loop Structure for a Topological Representation
objects embedded in $\mathbb{R}^3$, the homology groups are torsion-free (see [1, ch.10]). Moreover, the Universal Coefficient Theorem [12] ensures that all the homology information can be computed working with coefficients in $\mathbb{Z}/2$. An AT-model for $K$ is established in [8,9] and used to obtain the homology and representative cycles of homology generators of $K$. An AT-model can be computed starting from an ordering of the cells in $K$ [8,9]. We deal here with a particular ordering based on a cover forest $T$ of $K$ (any two vertices are connected by exactly one path in $T$ if and only if they are connected in $K$). Let $T = (V, E)$ be a cover forest of $K$ where $V$ is the set of all the vertices of $K$ and $E$ a subset of edges of $K$. $S = (\sigma_0, \dots, \sigma_m)$ is a $T$-filter if it is an ordering of all the cells in $K$ such that:
– for each $j$ (where $0 \le j \le m$), $\{\sigma_0, \dots, \sigma_j\}$ is a subcomplex of $K$;
– if $i < j$, the dimension of $\sigma_i$ is less than or equal to the dimension of $\sigma_j$;
– if $i < j$, $\sigma_i$ and $\sigma_j$ are two edges and $\sigma_j \in T$, then $\sigma_i \in T$ (that is, the edges of the cover forest come first among the edges in $S$).
Observe that $\sigma_0$ is always a vertex of $K$. An AT-model for $K$ is then defined as the output of the following algorithm, having as the input a complex $K$ and a $T$-filter $S$ of $K$.
Algorithm 1. [8,9] AT-model Algorithm.
Input: a $T$-filter $S = (\sigma_0, \dots, \sigma_m)$ of $K$.
$H := \{\sigma_0\}$, $f(\sigma_0) := \sigma_0$, $g(\sigma_0) := \sigma_0$, $\phi(\sigma_0) := 0$.
For $i = 1$ to $m$ do
  If $f d(\sigma_i) = 0$, then
    $H := H \cup \{\sigma_i\}$, $f(\sigma_i) := \sigma_i$, $\phi(\sigma_i) := 0$, $g(\sigma_i) := \sigma_i + \phi d(\sigma_i)$.
  If $f d(\sigma_i) \ne 0$, then:
    $k := \max \{j \text{ such that } \sigma_j \in f d(\sigma_i),\ j = 1, \dots, i-1\}$,
    $H := H \setminus \{\sigma_k\}$, $f(\sigma_i) := 0$, $\phi(\sigma_i) := 0$.
    For $j = 1$ to $i - 1$ do
      if $\sigma_k \in f(\sigma_j)$, $f(\sigma_j) := f(\sigma_j) + f d(\sigma_i)$, $\phi(\sigma_j) := \phi(\sigma_j) + \sigma_i + \phi d(\sigma_i)$.
Output: the set $(S, H, f, g, \phi)$.
Notice that in the $i$th step of the algorithm ($i = 1, \dots, m$), exactly one homology generator is created or destroyed. The algorithm runs in time at most $O(m^3)$.
Proposition 1. Let $K$ be a complex (simplicial, cubical or simploidal), let $T = (V, E)$ be a cover forest of $K$ and $S$ a $T$-filter of $K$. The output of Algorithm 1, $(S, H, f, g, \phi)$, satisfies that:
– $H$ is a subset of $S$ such that no edge in $T$ is an edge in $H$. $H$ generates a chain complex denoted by $\mathcal{H}$ with null differential;
– The number of vertices, edges and triangles in $H$ equals the number of connected components, tunnels and cavities in $|K|$, respectively. In other words, the homology of $K$ is isomorphic to $\mathcal{H}$.
– $f : C(K) \to \mathcal{H}$ satisfies that if $c$ and $c'$ are two cycles in $K$ such that $f(c) = f(c')$ then $c$ and $c'$ are homologous.
R. Gonzalez-Diaz et al.
– $g : \mathcal{H} \to C(K)$ satisfies that $\{[g(h)] : h \in H\}$ is a set of homology generators of $K$. If $h, h' \in H$, $h \ne h'$, then $g(h)$ and $g(h')$ are not homologous. Moreover, if $a$ is an edge in $H$, then $g(a)$ is a simple cycle in $K$ and all the edges in $g(a) \setminus \{a\}$ are edges in $T$. In fact, $g(a) \setminus \{a\}$ is the simple path in $T$ connecting the endvertices of $a$.
– $\phi : C(K) \to C(K)$ satisfies that if $x \in H$ then $\phi(x) = 0$ and there is no $y \in S$ such that $\phi(y) = x$. Moreover, if $v$ is a vertex in $K$, then $\phi(v)$ is the simple path in $T$ connecting $v$ with the vertex in $H$ that belongs to the same connected component in $K$ as $v$.

S        H        f        g                          φ
⟨0⟩      ⟨0⟩      ⟨0⟩      ⟨0⟩                        0
⟨1⟩               ⟨0⟩                                 ⟨0,1⟩
⟨2⟩               ⟨0⟩                                 ⟨0,2⟩
⟨0,1⟩             0                                   0
⟨0,2⟩             0                                   0
⟨1,2⟩    ⟨1,2⟩    ⟨1,2⟩    ⟨1,2⟩ + ⟨0,2⟩ + ⟨0,1⟩      0

Fig. 3. A filter S of a simplicial complex K and the result of applying Algorithm 1 to S (an AT-model for K)
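The AT-model algorithm can be sketched over $\mathbb{Z}/2$ by storing every chain as a set of cell indices, so that chain addition becomes the symmetric difference of sets. The encoding is an assumption of this sketch: cells are numbered by their position in the filter, `boundary[i]` holds the indices of the cells in $d(\sigma_i)$, and the names `xor_sum` and `at_model` are hypothetical.

```python
def xor_sum(fn, chain):
    """Extend a map given on cells linearly (mod 2) to a chain, i.e. a set of cells."""
    out = set()
    for cell in chain:
        out ^= fn[cell]
    return out

def at_model(boundary):
    """AT-model of a filter; boundary[i] is the set of cell indices in d(sigma_i)."""
    H = {0}
    f, g, phi = {0: {0}}, {0: {0}}, {0: set()}
    for i in range(1, len(boundary)):
        fd = xor_sum(f, boundary[i])        # f d(sigma_i)
        phid = xor_sum(phi, boundary[i])    # phi d(sigma_i)
        if not fd:                          # a new homology generator is created
            H.add(i)
            f[i], phi[i], g[i] = {i}, set(), {i} ^ phid
        else:                               # the youngest generator in f d(sigma_i) is destroyed
            k = max(fd)
            H.discard(k)
            g.pop(k, None)                  # g is only kept for cells in H
            f[i], phi[i] = set(), set()
            for j in range(1, i):
                if k in f[j]:
                    f[j] = f[j] ^ fd
                    phi[j] = phi[j] ^ {i} ^ phid
    return H, f, g, phi
```

For the hollow triangle of Fig. 3, with the filter ⟨0⟩, ⟨1⟩, ⟨2⟩, ⟨0,1⟩, ⟨0,2⟩, ⟨1,2⟩ numbered 0 to 5, the input is `[set(), set(), set(), {0, 1}, {0, 2}, {1, 2}]`, and the resulting `H`, `f`, `g` and `phi` can be compared row by row with the table in the figure.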
3 Computing a Graph-with-Loop Representation of a 3D Object
Let $K$ be a complex (simplicial, cubical or simploidal); $|K|$ its geometric realization in $\mathbb{R}^3$; and $h : |K| \to \mathbb{R}$ a continuous function. Let $e_{xy}$ denote an edge with endvertices $x$ and $y$. We say that the height of a point $p \in |K|$ is $t$ if $h(p) = t$, and the height of a cell $\sigma \in K$ is the minimum of the heights of all the points on $|\sigma|$. We say that $K$ is an $h$-complex if:
– the set of the vertices of $K$ can be partitioned into a finite number of subsets in terms of their height, $V = \bigcup_{i=1}^{r} V_i$, where $V_i = \{v \in V : h(v) = t_i\}$ and $t_1 < \dots < t_r$;
– if $e_{vw}$ is an edge in $K$ then $v$ and $w$ belong to $V_i$ for some $i = 1, \dots, r$, or $v \in V_{i-1}$ and $w \in V_i$ for some $i = 2, \dots, r$.
$h$-Complexes appear in a natural way when they are defined by the neighborhood relations of voxels of a 3D digital image and $h$ is the real function that associates to each point on $|K|$ its elevation. Let $K$ be an $h$-complex and $\sigma$ a cell in $K$. We say that $\sigma$ is horizontal if the heights of all the points on $|\sigma|$ coincide; otherwise, it is vertical. For $i = 1, \dots, r$, let $K_i$ be the collection of all the horizontal cells in $K$ with height $t_i$. $K_i$ is a subcomplex of $K$, and a cell $\sigma$ that is not in any $K_i$ is vertical. Let $T_i = (V_i, E_i)$ be a cover forest of $K_i$ and $S_i$ a $T_i$-filter of $K_i$. Denote by $(S_i, H_i, f_i, g_i, \phi_i)$, $i = 1, \dots, r$, the AT-models obtained using Algorithm 1. Let $V$ be the set of vertices in $K$, $T = (V, E)$ a cover forest of $K$ (obtained
A Graph-with-Loop Structure for a Topological Representation
Fig. 4. From left to right: a digital image and 3 h-complexes associated to it considering the 6, 14 and 26-adjacency, respectively
after adding the vertical edges of K, in increasing order of height, to the graph (V, E1 ∪ · · · ∪ Er)), and S a T-filter of K. Denote by (S, H, f, g, φ) the AT-model obtained using Algorithm 1.

Proposition 2. The AT-model (S, H, f, g, φ) satisfies that:
– If a is a horizontal edge in H, then a ∈ Hi for some i and g(a) = gi(a) is a simple cycle whose edges are in Ki.
– If a is a vertical edge in H, then for each level i, i = 1, . . . , r, g(a) has an even number of vertical edges of height ti.

Now, let us explain how to construct the graph Gh(K) and the function Ψ : Gh(K) → K using the AT-models (Si, Hi, fi, gi, φi), i = 1, . . . , r, and (S, H, f, g, φ) computed before.
First, the vertices in Gh(K) in each level i are the vertices in Hi, i = 1, . . . , r. If v is a vertex in Gh(K), then Ψ(v) = v.
Second, for each level i and each (horizontal) edge a in Hi ∩ H, we add a loop α to Gh(K) whose endvertex is the vertex in level i of Gh(K) which belongs to the same connected component as |a| in |Ki|. Define Ψ(α) = gi(a) = g(a).
Third, we add an edge exy between two vertices x and y in Gh(K) if x ∈ Hi and y ∈ Hi+1 for some i and f(x) = f(y) = z ∈ H (i.e. x and y belong to the same connected component of K). Define Ψ(exy) = φ(x) + φ(y) (the simple path in T connecting the vertices x and y).
Finally, for each vertical edge evw in H, an edge b is added to Gh(K). Since evw is vertical, v ∈ Hi and w ∈ Hi+1 for some i. The endvertices of b are the vertices in Gh(K) which belong to the same connected component as v in Ki and as w in Ki+1, respectively. Define Ψ(b) = evw + φi(v) + φi+1(w).

Theorem 2. Let K be a complex and h : |K| → R a continuous function. If K is an h-complex, then:
1. The graph Gh(K) and the complex K have the same number of tunnels and connected components.
2. For each loop α ∈ Gh(K), Ψ(α) is a simple cycle representative of a homology generator of K. If α1 and α2 are two different loops in Gh(K), then Ψ(α1) and Ψ(α2) are representative cycles of two non-equivalent homology generators.
3. For each edge exy in Gh(K) that comes from a vertical edge evw ∈ H, Ψ(exy) + φ(x) + φ(y) = g(evw) is a representative cycle of a homology generator of K.
4. The graph Gh(K) without loops is a subdivision of the Reeb graph Rh(K).
5. The graph Gh(K) and the function Ψ can be computed in O(m³), where m is the number of cells in K.
R. Gonzalez-Diaz et al.
Proof. The number of tunnels of K is the number of edges in H. By construction, each horizontal edge in H produces a loop in Gh(K) (i.e. a tunnel in Gh(K)). Each vertical edge evw in H produces a vertical edge β in Gh(K). Let v be in Ki and w in Ki+1, and let V and W be the two vertices in Gh(K) that belong to the same connected components as v and w, respectively. Since evw ∈ H, evw created a cycle when it was added. Therefore, v and w belong to the same connected component of K and so, by construction, there exists a path p between V and W in Gh(K) apart from the edge β produced by evw. Then p + β is a cycle in Gh(K). Moreover, Ψ(p + β) = g(b) is a representative cycle of a homology generator of dimension 1 of K. Since representative cycles of homology generators of dimension 1 of K are mapped by ψ to cycles in Rh(K), ψ(g(b)) is a cycle in Rh(K). Since K is an h-complex, a vertex in Rh(K) corresponds to a contour at a level ti, i = 1, . . . , r. Therefore, a vertex in Rh(K) is a vertex in Gh(K).
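The counting argument used in the proof (an edge belongs to H exactly when it closes a cycle at the moment it is added to the cover forest) can be illustrated on the 1-skeleton alone. The following Python sketch is not Algorithm 1 itself, which is not reproduced in this section; it is a simplified illustration under the assumption that only vertices and edges matter, and the function name cycle_generators is hypothetical.

```python
from collections import defaultdict


def cycle_generators(vertices, edges):
    """Return (forest_edges, generators) for an undirected graph.

    Edges are processed in the given order. An edge joins the cover
    forest if it connects two components; otherwise it closes a cycle
    and is stored together with the forest path between its endpoints.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    adjacency = defaultdict(list)  # adjacency lists of the forest so far
    forest, generators = [], []

    def forest_path(src, dst):
        # Depth-first search; paths in a forest are unique.
        stack, seen = [(src, [])], {src}
        while stack:
            node, path = stack.pop()
            if node == dst:
                return path
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, path + [(node, nxt)]))
        raise ValueError("endpoints are not connected in the forest")

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            adjacency[u].append(v)
            adjacency[v].append(u)
            forest.append((u, v))
        else:
            # The closing edge plus the forest path back to its start.
            generators.append([(u, v)] + forest_path(v, u))
    return forest, generators


# Triangle with vertices 0, 1, 2 (the complex of Fig. 3 without its faces).
forest, gens = cycle_generators([0, 1, 2], [(0, 1), (0, 2), (1, 2)])
components = 3 - len(forest)  # vertices minus forest edges
tunnels = len(gens)           # one closing edge per independent cycle
```

For the triangle, the only closing edge is (1, 2), and its cycle consists of that edge plus the forest path through vertex 0, which agrees with the entry g = 1, 2 + 0, 2 + 0, 1 in Fig. 3.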
Fig. 5. From left to right: a cover forest T of the cubical complex K shown on the right of Figure 4; the complexes K0, K1, K2 and K3; a set of representative cycles of the generators of H1(K); and the graph Gh(K)
Example 1. Let K be the cubical complex on the right in Figure 4. A cover forest T of K; the complexes K0, K1, K2 and K3; a set of representative cycles of the generators of H1(K); and the graph Gh(K) are shown in Figure 5. The non-trivial identifications of the edges and loops in Gh(K) and K by Ψ are:

Gh(K)   Ψ
h1      eab + ebc + ecd + edo + eof + eaf
v1      eBj + eij + ehi + egh + edg + ecd + eac + eab + ebC
v2      eBj + eij + ebi + eab + eaC
v3      eBk + ek + em + emn + enC

The representative cycles of generators of H1(K) are:
α0 = Ψ(h1) = eab + ebc + ecd + edo + eof + eaf
α1 = Ψ(v1) + eBC = eBj + eij + ehi + egh + edg + ecd + eac + eab + ebC + eBC
α2 = Ψ(v2) + eBC = eBj + eij + ebi + eab + eaC + eBC
α3 = Ψ(v3) + eBC = eBk + ek + em + emn + enC + eBC
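As a quick sanity check of Example 1, one can verify that a listed chain is a cycle. The sketch below assumes coefficients modulo 2 and assumes that each edge label exy names the single-letter endvertices x and y (so eBj joins B and j); both are readings of the notation rather than statements taken from the text. Under these assumptions a chain of edges is a cycle exactly when every vertex meets an even number of its edges.

```python
from collections import Counter


def is_cycle_mod2(edges):
    """Return True if every vertex meets an even number of the edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return all(d % 2 == 0 for d in degree.values())


# alpha_0 = Psi(h1), each edge e_xy written as the pair (x, y).
alpha0 = [("a", "b"), ("b", "c"), ("c", "d"),
          ("d", "o"), ("o", "f"), ("a", "f")]

# Psi(v2) alone is an open path from B to C; adding e_BC closes it.
psi_v2 = [("B", "j"), ("i", "j"), ("b", "i"), ("a", "b"), ("a", "C")]
alpha2 = psi_v2 + [("B", "C")]

checks = (is_cycle_mod2(alpha0), is_cycle_mod2(psi_v2), is_cycle_mod2(alpha2))
```

The chain for v3 is left out on purpose, because two of its edge labels are incomplete in the listing above.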
4 Conclusions and Future Work
It is possible to obtain representative cycles on the boundary of the given complex K if, when computing a cover forest of K, the edges on the boundary are added first. Another task is the generalization of the method to any dimension. The problem is that the homology of a complex of dimension higher than 3 can have torsion groups. In order to capture the torsion part of the homology we could use the concept of λ-AT-model developed in [10]. A possible extension of this work is the construction of a discrete Morse complex Mh(K) associated to a cell complex K, such that there is not only a one-to-one identification of all the homology generators of Mh(K) with those of K, but also an isomorphism between cohomology rings. Mh(K) can be constructed using a gradient vector field VK associated to a discrete Morse function (see [6,7]) that can be obtained from an AT-model for K.
References
1. Alexandroff, P., Hopf, H.: Topologie I. Springer, Berlin (1935)
2. Biasotti, S., Falcidieno, B., Spagnuolo, M.: Extended Reeb Graphs for Surface Understanding and Description. In: Nyström, I., Sanniti di Baja, G., Borgefors, G. (eds.) DGCI 2000. LNCS, vol. 1953, pp. 185–197. Springer, Heidelberg (2000)
3. Dahmen, W., Micchelli, C.A.: On the Linear Independence of Multivariate B-Splines. Triangulation of Simploids. SIAM J. Numer. Anal. 19 (1982)
4. Cole-McLaughlin, K., Edelsbrunner, H., Harer, J., Natarajan, V., Pascucci, V.: Loops in Reeb Graphs of 2-Manifolds. Discrete Comput. Geom. 32, 231–244 (2004)
5. Fomenko, A.T., Kunii, T.L.: Topological Methods for Visualization. Springer, Heidelberg (1997)
6. Forman, R.: A discrete Morse theory for cell complexes. In: Yau, S.T. (ed.) Geometry, Topology and Physics for Raoul Bott. International Press (1995)
7. Forman, R.: Discrete Morse Theory and the Cohomology Ring. Transactions of the American Mathematical Society 354, 5063–5085 (2002)
8. Gonzalez-Diaz, R., Real, P.: Towards Digital Cohomology. In: Nyström, I., Sanniti di Baja, G., Svensson, S. (eds.) DGCI 2003. LNCS, vol. 2886, pp. 92–101. Springer, Heidelberg (2003)
9. Gonzalez-Diaz, R., Real, P.: On the Cohomology of 3D Digital Images. Discrete Applied Math. 147, 245–263 (2005)
10. Gonzalez-Diaz, R., Jiménez, M.J., Medrano, B., Real, P.: Extending AT-Models for Integer Homology Computation. In: GbR 2007. LNCS, vol. 4538, pp. 330–339. Springer, Heidelberg (2007)
11. Massey, W.M.: A Basic Course in Algebraic Topology. New York (1991)
12. Munkres, J.R.: Elements of Algebraic Topology. Addison-Wesley, London, UK (1984)
13. Reeb, G.: Sur les Points Singuliers d'une Forme de Pfaff Complètement Intégrable ou d'une Fonction Numérique. C. R. Acad. Sciences 222, 847–849 (1946)
Print Process Separation Using Interest Regions
Reinhold Huber-Mörk, Dorothea Heiss-Czedik, Konrad Mayer, Harald Penz, and Andreas Vrabl
Business Unit High Performance Image Processing, Smart Systems Division, Austrian Research Centers GmbH - ARC, A-2444 Seibersdorf, Austria
[email protected], http://www.smart-systems.at
Abstract. For quality inspection of printing systems it is necessary to measure the displacement between printing processes. Tie points are employed in correspondence and displacement estimation between individual print elements. We compare interest point and region descriptors for tie point detection in industrial inspection tasks. Clustering of measured displacements taken from sequences of sample images allows the estimation of the accuracy of printing processes and the alignment of printing processes. Results of an experimental application to banknote printing process inspection are given.
1 Introduction
Modern printing of text and images often involves different printing steps, e.g. four-color process printing using cyan, magenta, yellow, and black (CMYK) inks, or printing employing different technologies [11]. Printing processes are performed sequentially, e.g. an offset printing process is applied prior to an intaglio printing process. The sequential application of different printing processes might cause some relative displacement between elements generated by the different processes. Displacements between different printing processes and within a single printing process are of great interest in high quality printing.
In Fig. 1 two parts of images of banknotes are shown. Both images show an enlarged part of identical size and at the same position relative to the outline of the banknote. The spatial resolution of the images is 0.1 millimeters. The circles and the star on the left of each image were printed using offset printing, whereas the dark object on the right was produced by intaglio printing. It is possible to identify a horizontal shift of approximately four pixels, which corresponds to 0.4 millimeters, for the offset print elements between the two images. The intaglio print element has a stronger displacement on the order of magnitude of 12 pixels, which corresponds to 1.2 millimeters.
In order to automatically separate print elements belonging to different print processes, the following two observations will be utilized in this paper:
1. On a single image, elements belonging to a particular print process tend to be moved coherently with respect to print elements printed using a different technology.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 514–521, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Images of the same area for two different 10 Euro banknotes, panels (a) and (b), each plotted against the horizontal pixel position (0 to 140): Elements printed using offset printing, e.g. the circles and the star, are slightly shifted; the element from intaglio printing, i.e. the element on the right, has a larger displacement
2. From a sample of images one is able to observe different displacements between print elements produced by different technologies on different images.
Tie points are required to estimate displacements between print elements from images. A tie point is typically characterized by its neighborhood, e.g. a square window centered at the tie point. Probably the most prominent field of application for tie points is image registration [16]. More recently, tie points have also been referred to as interest points, and their fields of application have become more widespread: wide baseline stereo vision [6] [12], object recognition [5] and video indexing and retrieval [8] have become new application areas for interest points. In view of the considered application, we restrict ourselves to a single scale and the most prominent interest point and region detectors, i.e. the Harris, Hessian, Moravec and MSER detectors. We compared interest point detectors by an established quality measure [9], where MSERs were found to perform best, see Sec. 2. Therefore, we derive displacements between different printing processes using corresponding MSERs in Sec. 3. Details of the matching procedure are given in [2]. Furthermore, clustering of displacement vectors derived from sequences of banknote images enables us to perform reliable segmentation of print elements, see Sec. 4.
2 Interest Point and Region Detectors
Interest points are usually defined as local image features where the image content varies significantly in more than one direction, e.g. at region corner locations. Interest regions are usually closed areas in the image characterized by small variation in intensity and/or color. For wide baseline stereo vision, the maximally stable extremal regions (MSERs) [6] and intensity extrema based regions (IBR) as well as edge based regions (EBR) [12] have been suggested. The most prominent interest point detectors are Harris [1], Hessian [4] and the Moravec detectors [7]. Combinations of point and region detectors also appeared recently [14]. In order to detect interest points at different resolutions, image pyramid and scale space approaches have been suggested. Among others, one of the most prominent
approaches in this direction is the scale invariant feature transform (SIFT) [5]. Wavelet based descriptors have been presented by Sebe et al. [10]. Concerning interest point detectors, we will focus on properties derived from the local autocorrelation matrix (Harris detector [1]), properties derived from the matrix containing local second order partial derivatives (Hessian detector [4]) and the description of local variance variation (Moravec detector [7]). The MSER [6] interest region detector selects regions characterized by the property that their boundaries are relatively stable over a wide range of pixel intensities:
– Every pixel in the region has an intensity value which is less than or equal to that of every pixel on the contour.
– The change in the total number of pixels in the region over the range of contour intensities g − δ to g + δ is a local minimum, where g is an intensity value and δ is an integer increment or decrement thereof.

2.1 Matching of Interest Points and Regions
Interest points or regions found in a defined search area are required to correspond. For a given reference point or region location in a reference image, a search region in a comparison image is derived. Usually, this is a square window centered on the reference point location or an enlarged bounding box with respect to the reference region bounding box. Correspondence is checked by comparing the similarity measure to a threshold. The similarity of interest points is measured by normalized cross correlation. Similarly, the correspondence of interest regions, i.e. binary image patches containing MSERs, is assessed using a binary distance measure. Due to its efficient implementation and discriminative power [15], the Sokal-Michener measure of similarity for an MSER centered at point (x, y) is used:

$$C(x + l_x, y + l_y) = \frac{\sum_{W} \delta\left[I(x, y),\, I(x + l_x, y + l_y)\right]}{|W|}, \qquad \delta[a, b] = \begin{cases} 1 & \text{if } a = b, \\ 0 & \text{else,} \end{cases} \qquad (1)$$

where lx and ly are the pixel lags of the replicas of the MSER. A binary image patch centered at (x, y) and containing the MSER is denoted by I(x, y). A shifted replica centered at (x + lx, y + ly) is denoted by I(x + lx, y + ly). The window W is a bounding box for the MSER in question.

2.2 Comparison of Detectors
The repeatability rate rmi for an image numbered i with respect to a reference image numbered m is measured as [9]

$$r_{mi} = \frac{|S_{mi}(\nu)|}{\min(n_m, n_i)}, \qquad (2)$$
where Smi (ν) is the set of corresponding interest points in images m and i. The parameter ν describes the search area, i.e. where to look in image i for a
Fig. 2. Mean repeatability for four different detectors and varying reference image
specific point or region found in image m. In practice, ν defines rectangular search windows centered around each reference point location. The numbers nm, ni are the total numbers of interest points or regions found in images m and i. The repeatability of the detectors is summarized in Fig. 2 for the considered set of 95 front side images of 10 Euro banknotes. The mean repeatability for image i is simply calculated as (where N is the total number of sample images)

$$\bar{r}_i = \frac{1}{N - 1} \sum_{s = 1, \ldots, N;\; s \neq i} r_{si}. \qquad (3)$$
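To make Eqs. (2) and (3) concrete, the following minimal Python sketch computes a repeatability rate from feature counts and averages a matrix of pairwise rates. The function names and all numbers in the usage lines are hypothetical illustrations, not values from the experiments.

```python
def repeatability(n_corresponding, n_ref, n_cmp):
    """Eq. (2): corresponding features relative to the smaller feature set."""
    return n_corresponding / min(n_ref, n_cmp)


def mean_repeatability(r, i):
    """Eq. (3): average of r[s][i] over all reference images s != i."""
    n_images = len(r)
    return sum(r[s][i] for s in range(n_images) if s != i) / (n_images - 1)


# Hypothetical counts: 45 correspondences, 50 and 60 detected features.
rate = repeatability(45, 50, 60)

# Hypothetical 3 x 3 matrix of pairwise rates r[s][i].
rates = [[1.0, 0.9, 0.8],
         [0.7, 1.0, 0.6],
         [0.5, 0.4, 1.0]]
mean_for_image_0 = mean_repeatability(rates, 0)
```

Dividing by the smaller of the two feature counts keeps the rate in [0, 1] even when one image yields far more features than the other.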
Interest points or regions were extracted from images printed using two processes, i.e. intaglio and offset printing. While MSERs are typically extracted from either one of the print regions, the corner detectors also tend to extract interest points on borders between print regions. Due to possible shifts between print regions, corners at borders do not necessarily occur in both images. Therefore, the MSER detector significantly outperforms the other detectors. On average, the mean repeatability for Harris was 0.679, for Hessian it was 0.6653, for Moravec it was 0.5664 and MSERs achieved 0.9353. The results indicate that MSERs should be used for this task. Therefore, we will restrict ourselves to MSERs in the remainder of the paper.
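Since MSERs are used from here on, the similarity measure of Eq. (1) that is applied to them can be sketched in a few lines of Python. This is an illustrative reading, not the implementation described in [2]: images are assumed to be binary and stored as lists of rows, the search over integer lags is exhaustive, and the helper names crop and best_lag are hypothetical.

```python
def sokal_michener(patch_ref, patch_cmp):
    """Eq. (1): fraction of window pixels on which two binary patches agree."""
    if len(patch_ref) != len(patch_cmp):
        raise ValueError("patches must cover the same window W")
    agree = total = 0
    for row_ref, row_cmp in zip(patch_ref, patch_cmp):
        if len(row_ref) != len(row_cmp):
            raise ValueError("patches must cover the same window W")
        for a, b in zip(row_ref, row_cmp):
            agree += int(a == b)
            total += 1
    return agree / total


def crop(image, top, left, height, width):
    """Cut a height x width window out of an image given as a list of rows."""
    return [row[left:left + width] for row in image[top:top + height]]


def best_lag(ref, cmp_img, top, left, height, width, max_lag):
    """Return (score, lx, ly) for the lag that maximises the similarity."""
    window = crop(ref, top, left, height, width)
    rows, cols = len(cmp_img), len(cmp_img[0])
    best = None
    for ly in range(-max_lag, max_lag + 1):
        for lx in range(-max_lag, max_lag + 1):
            r, c = top + ly, left + lx
            # Skip lags that would move the window outside the image.
            if r < 0 or c < 0 or r + height > rows or c + width > cols:
                continue
            score = sokal_michener(window, crop(cmp_img, r, c, height, width))
            if best is None or score > best[0]:
                best = (score, lx, ly)
    return best
```

With a reference image containing a small blob and a comparison image containing the same blob moved two pixels to the right, best_lag returns a score of 1.0 at the lag (2, 0). Ties are resolved in favour of the first lag visited, which is a choice of this sketch.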
3 Displacement Estimation Based on Region Correspondence
We will investigate the usefulness of the interest region matching approach in a problem related to banknote print inspection, the measurement of displacement between the output of different printing processes. Figure 3 shows scatter plots for the estimated displacements of all matching MSERs between four pairs of banknotes. The displacements visible to the human observer in Fig. 1, which are approximately 4 and 12 pixels in horizontal direction, correspond well to the centers of the two dominant clusters shown in Fig. 3 (a). The cluster on the
Fig. 3. Examples of scatter plots showing horizontal and vertical displacements for MSERs estimated by region based matching of four different pairs (a-d) of images (axes: horizontal displacement, vertical displacement)
left corresponds to MSERs printed in intaglio and the cluster on the right corresponds to MSERs printed in offset. The sparse measurements with a larger vertical displacement are MSERs located in the hologram stripe on the right side of each 10 Euro banknote. Figures 3 (b) to (d) show other examples of observed displacements in the considered sample of 95 images. Displacements observed in horizontal direction are larger than in vertical direction, which is characteristic of the inspected print. The variance among MSERs attributed to a specific cluster is mainly due to the dynamics, e.g. undulations of the paper and/or belt slip, involved when banknotes are transported at high speed.

3.1 Accuracy of Estimated Displacements
In this subsection, we compare the performance of our approach with respect to the accuracy of manually estimated displacements. Manual evaluation turned out to be quite difficult. Table 1 summarizes the results of the experiments for MSERs numbered 1 to 10. The manually estimated displacement in horizontal (u) and vertical (v) directions are shown in rows ground-truth (GT) and the results of the proposed method using region based matching (RBM) and the absolute error with respect to ground-truth are shown in the remaining two rows. The mean absolute deviation of the proposed method with respect to ground truth is approximately 0.16 pixels.
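The reported mean absolute deviation can be reproduced from the absolute errors listed in Table 1. The short Python sketch below does this; the two lists are the "Abs. Err." rows as read from the table, and the split of the values into u and v rows is a reconstruction from the extracted layout.

```python
# Absolute errors for MSERs 1 to 10, as read from Table 1 (in pixels).
abs_err_u = [0.04, 0.05, 0.04, 0.06, 0.03, 0.00, 0.18, 0.01, 0.04, 0.43]
abs_err_v = [0.14, 0.15, 0.12, 0.36, 0.14, 0.34, 0.76, 0.04, 0.02, 0.33]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


mean_u = mean(abs_err_u)                # horizontal direction only
mean_v = mean(abs_err_v)                # vertical direction only
mean_all = mean(abs_err_u + abs_err_v)  # both directions pooled
print(round(mean_u, 3), round(mean_v, 3), round(mean_all, 3))
```

Pooling both directions gives about 0.164 pixels, in line with the approximately 0.16 pixels stated above; the horizontal errors alone are noticeably smaller than the vertical ones.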
Table 1. Estimated horizontal (u) and vertical (v) displacements in pixels measured manually (GT), using region based matching (RBM), and absolute errors

MSER            1      2      3      4      5       6       7       8       9      10
GT         u  -5.00  -4.00  -5.00  -4.00  -7.00  -10.00  -11.50  -11.00  -15.00  -12.00
           v   0.50   1.00   1.00   1.00   1.00    0.00    0.00   -1.00   -3.00   -0.50
RBM        u  -4.96  -4.05  -4.96  -4.06  -6.97  -10.00  -11.68  -10.99  -14.96  -12.43
           v  -0.36   1.15   1.12   1.36  -1.14   -0.34   -0.76   -0.96   -2.98   -0.17
Abs. Err.  u   0.04   0.05   0.04   0.06   0.03    0.00    0.18    0.01    0.04    0.43
           v   0.14   0.15   0.12   0.36   0.14    0.34    0.76    0.04    0.02    0.33

4 Print Process Separation
Cluster analysis is the process of classifying objects into subsets that have meaning in the context of a particular problem [3]. In the context of the banknote printing process evaluation, meaningful subsets are the different printing processes, i.e. the distinction between offset and intaglio. We define a feature vector fiAB = (ui, vi), in which each pair (ui, vi) corresponds to the displacement estimate for MSER i between a predefined reference image A and a comparison image B. The k-means clustering procedure, where the number of clusters is denoted by k, can be applied in the two-dimensional (u, v)-space. The clustering procedure can be improved by taking more than a single pair of images into account, which results in a higher dimensional feature vector fi = (fiA1, fiA2, . . . , fiAN) for each MSER i present in N image pairs, where image A is taken as the reference image for each pair of images.
Figure 3 shows four scatter plots which correspond to four pairs of images. The reference image A is the same for each pair. Each matching MSER is represented by a cross, which indicates the displacement of the MSER in pixels in both directions. Cluster centers have been calculated by k-means clustering of the feature vectors. The number of clusters k was set to 2. For example, all MSERs closer to the cluster on the left of Fig. 3 (a) correspond to intaglio print and MSERs closer to the cluster on the right correspond to offset print. The images related to the clustering shown in Fig. 3 are 10 Euro banknotes as depicted in Fig. 4 (a). The MSERs extracted from Fig. 4 (a) were matched with MSERs extracted from four comparison banknotes and the resulting 8-dimensional feature vectors were clustered. The image in Fig. 4 (b) shows a manually derived ground truth which partitions the banknote image into offset print areas (black), intaglio print areas (gray), and remaining areas such as background, etc. (white).

4.1 Classification of Print Elements
A classification result based on displacement estimates derived using variable window block matching [13] is shown in Fig. 4 (c), where pixels classified as offset
Fig. 4. Classification of print layer regions based on MSER displacements: a) image of a 10 Euro banknote, b) ground truth for print regions: intaglio print areas are grey, offset print areas are black, c) classification based on block matching d) classified MSERs
are depicted in black and pixels classified as intaglio are gray. The result based on block matching has shortcomings for the application at hand, e.g. homogeneous regions such as the interiors of the large digits are not reliably classified. On the other hand, textured regions, where extraction of MSERs becomes unreliable, are classified correctly. If a dense segmentation of images is required, which is beyond the scope of this paper, combination of block and region based matching is suggested. The classification result using the proposed method is shown in Fig. 4 (d), where MSERs assigned to offset are depicted in black and MSERs assigned to intaglio are gray regions. There is a total number of 143 MSERs, of which 44 MSERs are classified as intaglio print regions and 99 MSERs are classified as offset print regions. There are 4 MSERs classified falsely as intaglio print regions. Three of these regions are enclosed by intaglio print, e.g. the interiors of the 0 digits and the arch. The fourth region falsely assigned to intaglio print is the background region on the bottom right which is adjacent to the intaglio print region. Regarding the determination of displacements between printing processes, those errors are non-critical. Regions falsely classified as offset print regions mainly lie inside the arch element on the banknote. The replicated structure in this area is the main reason for misclassification. The overall achieved correct classification rate was 94.4 percent.
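The clustering step of Sec. 4 that underlies this classification can be sketched as follows. This is a plain Lloyd-style k-means in Python, given only as an illustration: the deterministic farthest-point seeding is a choice of this sketch (the seeding used in the experiments is not stated), and the feature vectors are hypothetical 4-dimensional displacements (u, v for two image pairs), not measured data.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's k-means with deterministic farthest-point seeding."""
    centres = [tuple(points[0])]
    while len(centres) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centres))
        centres.append(tuple(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centre for every point.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centres are too
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster runs empty
                centres[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centres


# Hypothetical features (u_A1, v_A1, u_A2, v_A2), one tuple per MSER.
features = [
    (-4.0, 1.0, -2.0, 0.5),
    (-4.1, 1.1, -2.1, 0.4),
    (-3.9, 0.9, -1.9, 0.6),
    (-12.0, -1.0, -7.0, 0.0),
    (-11.8, -0.9, -7.2, 0.1),
    (-12.2, -1.1, -6.9, -0.1),
]
labels, centres = kmeans(features, k=2)
```

In this toy input the first three vectors end up in one cluster and the last three in the other. Which cluster corresponds to offset and which to intaglio still has to be decided from the cluster centres, e.g. by comparing them with the displacements visible in Fig. 1.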
5 Conclusion
We have presented an interest region based approach for estimating displacements between image regions, which is especially useful in situations when common
block matching algorithms tend to fail, e.g. for homogeneously colored objects and at borders between regions. The advantage of a region based approach, when compared to interest points, is that it takes into account arbitrarily shaped and sized regions, while it ignores the area surrounding the regions. Using common clustering techniques, one is able to distinguish between image regions printed with different technologies. In applications where the main goal is a dense segmentation, classifications based on block matching and on region based matching can be combined. The suggested method is used to decrease setup times of printing systems. Future use could also include inline control of tolerances in printing.
References
1. Harris, C., Stephens, M.J.: A combined corner and edge detector. In: Proc. of ALVEY Vision Conf., pp. 147–152 (1988)
2. Huber-Mörk, R., Ramoser, H., Penz, H., Mayer, K., Heiss-Czedik, D., Vrabl, A.: Region based matching for print process identification. Pat. Rec. Let. (to appear)
3. Jain, A.K., Dubes, R.C.: Clustering methods and algorithms. In: Algorithms for Clustering Data, pp. 55–142. Prentice-Hall, Englewood Cliffs (1988)
4. Kitchen, L., Rosenfeld, A.: Gray-level corner detection. Pat. Rec. Let. 1(2), 95–102 (1982)
5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comp. Vis. 60(2), 91–110 (2004)
6. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. Image Vis. Comp. 22(10), 761–767 (2004)
7. Moravec, H.: Obstacle avoidance and navigation in the real world by a seeing robot rover. Tech. Rep. CMU-RI-TR-3, Robotics Inst., Carnegie-Mellon Univ. (1980)
8. Schaffalitzky, F., Zisserman, A.: Multi-view matching for unordered image sets, or How do I organize my holiday snaps? In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 414–431. Springer, Heidelberg (2002)
9. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. Int. J. Comp. Vis. 37(2), 151–172 (2000)
10. Sebe, N., Tian, Q., Loupias, E., Lew, M.S., Huang, T.S.: Evaluation of salient point techniques. Image Vis. Comp. 21(13-14), 1087–1095 (2003)
11. Stone, M.C., Cowan, W.B., Beatty, J.C.: Color gamut mapping and the printing of digital color images. ACM Trans. Graph. 7(4), 249–292 (1988)
12. Tuytelaars, T., Gool, L.V.: Matching widely separated views based on affine invariant regions. Int. J. Comp. Vis. 59(1), 61–85 (2004)
13. Veksler, O.: Fast variable window for stereo correspondence using integral images. In: Proc. Conf. Comp. Vis. Pat. Rec. (CVPR) (2003)
14. Winter, M., Fraundorfer, F., Bischof, H.: MSCC: Maximally stable corner clusters. In: Kalviainen, H., Parkkinen, J., Kaarna, A. (eds.) SCIA 2005. LNCS, vol. 3540, pp. 45–54. Springer, Heidelberg (2005)
15. Zhang, B., Srihari, S.N.: Properties of binary vector dissimilarity measures. In: Proc. Conf. on Comp. Vis., Pat. Rec. Image Proc. (CVPRIP), vol. 1 (2003)
16. Zitova, B., Flusser, J.: Image registration methods: a survey. Image Vis. Comp. 21, 977–1000 (2003)
Histogram-Based Lines and Words Decomposition for Arabic Omni Font-Written OCR Systems; Enhancements and Evaluation
Mohamed Attia1 and Mohamed El-Mahallawy2
1 The Engineering Company for the Development of Computer Systems; RDI, Egypt
[email protected]
2 Arab Academy for Science & Technology and Maritime Transport; AAST, Egypt
[email protected]

Abstract. Given font/type-written text rectangle bitmaps extracted from digitally scanned pages, inferring the boundaries of lines, and hence of complete words, is a preprocessing step vital to any OCR system, alongside the recognition process itself and the post-processing necessary for producing the recognized text. Histogram-based methods are commonly used, mainly due to their relative implementation simplicity and computational efficiency; however, some authors report their vulnerability to some idiosyncratic textual structure complexities and to noise. This paper elaborates on this approach to produce a more robust algorithm for lines/words decomposition, especially in Arabic, or Arabic dominated, text rectangles from real-life multifont/multisize documents. To evaluate this algorithm, this paper also presents the results of extensive experimentation made on about 1800 documents fairly distributed over different kinds of sources with different noise levels at different scanning resolutions and color depths.
Keywords: Arabic, decomposition, font-written, histogram, OCR, multifont, multisize, omni, segmentation, typeset, typewritten.
1 Introduction

Realizing a word-error-rate (WER) ≤ 0.5% at a speed of 60 words/min. has always been the ultimate threshold for font-written OCR systems to be considered competitive with skilled human typists [2]. While font-written OCR systems working on Latin script can claim to approach such measures under favorable conditions, the best systems working on other scripts (e.g. Chinese, Arabic) are still well behind due to a multitude of complexities [2], [4]; e.g. the best recently reported Arabic omni font-written OCR systems can only claim WERs exceeding 10% under favorable conditions, not to mention real-life ones [5].
Font-written OCR systems need to perform some processing on the input scanned text block prior to the recognition phase. Decomposing the target text block into lines, and hence into words, is perhaps the most fundamental such preprocessing, where errors add to the overall WER of the whole OCR system. This can be shown as follows:
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 522–530, 2007. © Springer-Verlag Berlin Heidelberg 2007
$$a = 1 - e = (1 - e_D) \cdot (1 - e_R) = 1 - (e_D + e_R - e_D \cdot e_R)$$

∵ e_D, e_R 0 and L_{i+1;i} S_Intra. Figure 8 on page 527 shows the result of diagnosing inter-word spaces, hence completing the task of decomposing a text rectangle bitmap into whole words.
3 Evaluation

To assess the lines and words decomposition algorithm presented above, we ran it on a total of 1800 real-life documents evenly covering different kinds of sources, different levels of image quality, and different scanning resolutions and color depths, as explained in Table 1 below. This evaluation DB¹ has been acquired via common hardware: an HP 3800 flatbed scanner with TMA, an HP LaserJet 1100 printer, and a XEROX 256XT photocopier.

Table 1. Results of evaluation experiments on the proposed decomposition algorithm: eD per source of documents and per number of photocopying passes (PC#0, PC#1, PC#2)

Source of documents:   LASER printed documents    Book pages                 Newspapers
                       50 pages, 1081 lines,      50 pages, 987 lines,       50 pages, 1309 lines,
                       12146 words                12524 words                12870 words
Scanning               PC#0    PC#1    PC#2       PC#0    PC#1    PC#2       PC#0    PC#1    PC#2
300 dpi, B&W           0.41%   0.59%   0.73%      0.44%   0.61%   0.71%      0.59%   0.71%   0.90%
300 dpi, Gray Shades   0.36%   0.56%   0.66%      0.42%   0.58%   0.68%      2.10%   2.91%   3.67%
600 dpi, B&W           0.35%   0.50%   0.61%      0.38%   0.53%   0.64%      0.50%   0.63%   0.84%
600 dpi, Gray Shades   0.32%   0.46%   0.57%      0.31%   0.45%   0.59%      0.85%   1.10%   1.43%
The WER of this algorithm is eD ≡ Ne / Ntotal, where Ntotal is the total number of contiguous (i.e. un-spaced) strings in the input text block, and Ne is the number of wrongly decomposed full-word rectangles, i.e. those that either contain more than one contiguous string or contain only part of a contiguous string. Errors due to typing mistakes, tiny isolated noise stains (recognized harmlessly as dots), and wrongly split contiguous numbers (rejoined in the OCR post-processing) are not counted as errors.

Acknowledgments. The authors feel indebted to RDI (The Engineering Company for the Development of Computer Systems; www.RDI-eg.com) for its strong support of this work. Special thanks are due to Prof. Mohsen A. A. Rashwan, RDI's CEO and professor of Electronics & Electrical Communications, Faculty of Eng., Cairo Univ., Egypt.
References
1. Amin, A., Masini, G.: Machine recognition of multi-font printed Arabic texts. In: Proc. 8th International Joint Conf. on Pattern Recognition, Paris, France, pp. 392–395 (October 1986)
2. Al-Badr, B., Mahmoud, S.A.: Survey and Bibliography of Arabic Optical Text Recognition. Elsevier Science, Signal Processing 41, 49–77 (1995)
3. Attia, M., El-Mahallawy, M.: Histogram-Based Decomposition Algorithm Evaluation DB, http://www.RDI-eg.com/OCR_Decomp_Eval_DB
4. Attia, M.: Arabic Orthography vs. Arabic OCR. Multilingual Computing & Technology magazine, USA (December 2004)
5. Bazzi, I., Schwartz, R., Makhoul, J.: An Omni font Open-Vocabulary OCR System for English and Arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(6) (1999)
6. Bouhilali, K., Kamrouni, M., Ellouze, N.: Method of segmentation of Arabic text image into characters. In: Proceedings of the 1st Kuwaiti Computer Conf., Kuwait, pp. 442–446 (March 1989)
7. Fujisawa, H., Nakano, Y., Kurino, K.: Segmentation Methods for Character Recognition: From Segmentation to Document Structure Analysis. Proc. IEEE 80(7), 1079–1091 (1992)
8. Gonzalez, R., Woods, R.: Digital Image Processing, 2nd edn. Prentice-Hall, Englewood Cliffs (2002)
9. Gouda, A.M.: Arabic Handwritten Connected Character Recognition. PhD thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University (November 2004)
10. Kasturi, R., O'Gorman, L.: Document image analysis: A bibliography. Machine Vision and Appl. 5, 231–243 (1992)
11. Parhami, B., Taraghi, M.: Automatic recognition of printed Farsi texts. Pattern Recognition 14(1), 1–6 (1981)
12. Pratt, W.K.: Digital Image Processing, 2nd edn. John Wiley & Sons Inc., West Sussex, England (1991)

¹ The 4 parts of this several-GB DB corresponding to the 1st column in Table 1 are freely downloadable at [3], while the whole DB is freely available by asking either of the authors.
Semi-automatic Training Sets Acquisition for Handwriting Recognition Jerzy Sas and Urszula Markowska-Kaczmar Wroclaw University of Technology, Applied Informatics Institute, Wyb Wyspianskiego 27, Wroclaw, Poland
[email protected],
[email protected]

Abstract. In this paper, a method of semi-automatic training set acquisition for character classifiers used in cursive handwriting recognition is described. The training set consists of character samples extracted from a training corpus by segmentation. The method first splits the word images from the corpus into sequences of graphemes. Then a set of candidate segmentation variants is produced by an evolutionary algorithm, where a segmentation variant determines the subdivision of the grapheme sequences of words into subsequences corresponding to consecutive letters. Segmentation variants are modeled by a chromosome population. Next, each segmentation variant from the final population is tuned in an iterative process and the best chromosome is selected. The character samples resulting from the segmentation modeled by the selected chromosome are then grouped into sets corresponding to the letters of the alphabet. Finally, the most atypical samples are rejected so as to maximize the word recognition accuracy obtained with a character classifier trained on the reduced sample set.
1 Introduction
Many handwriting recognition systems follow the "analytic" paradigm, in which the word is segmented into characters either explicitly or implicitly and the isolated characters are recognized by character classifiers. The character classifiers used in such a process must be trained with a training set consisting of correctly labeled character samples extracted from handwritten text. Creating the training set from a corpus of handwritten texts is a tedious and time-consuming process. An automatic, or at least semi-automatic, method of training set extraction is therefore desirable. In this paper a method is proposed that extracts individual character samples from a set of handwritten word images manually annotated with the sequence of actual characters constituting the words. Because the annotation must be done by a user, we call this approach "semi-automatic".

Handwriting segmentation has been investigated in a number of other works. The proposed methods differ in efficiency and complexity. In [1], the authors describe a method of handwritten word segmentation using simulated annealing, which is very time consuming. In [2] the method is based on approximating a binary raster image with a set of polygons and building a continuous skeleton of these polygons. In [3] the word segmentation is performed by approximating the average character widths and then seeking the segmentation of an individual word that minimizes the difference between the actual and the expected word width. The approach described in [4] combines word segmentation and recognition but requires a lexicon of words that can appear in the text. Its recognition stage also requires a set of samples for each character of the alphabet, so it is unsuitable for our purpose, i.e. training set acquisition. Examples of the application of evolutionary algorithms, which we use in the first step of our segmentation method, can be found in [5] and [6]. They differ from ours in the applied encoding, genetic operators and fitness function.

In this paper we focus on off-line handwritten word recognition. A writer-dependent approach is considered because many researchers, for instance in [7] and [5], claim that writer-dependent recognition achieves higher accuracy than writer-independent approaches. Writer-dependent handwriting recognition requires that an individual classifier be created and trained for each writer. This, in turn, requires thorough preparation of training sets extracted from texts written by the individual writer. Because this typically must be done in an unsupervised manner during system operation, the possibility of creating the training set automatically is of particular importance in this case.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 531–538, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Problem Definition
The aim of the method described here is to semi-automatically create the training set $TS_{CC}$ for character classifiers. $TS_{CC}$ is extracted from the training corpus, which consists of handwritten word images annotated with their correct transcription into ASCII character sequences. $TS_{CC}$ is acquired by a three-stage procedure. In the first stage, a set of candidate corpus segmentation variants is created with an evolutionary algorithm, where the chromosome is a model of the segmentation variant. In the next stage, all segmentation variants corresponding to individuals in the final population are refined in an iterative process, in which the character boundaries are slightly modified as long as this improves the overall segmentation evaluation. The best segmentation variant is selected, and the character samples extracted from the word images according to it are grouped into sets corresponding to the letters of the alphabet. Finally, the obtained family of image sets for alphabet letters is evaluated by using it as a training set for a handwriting recognition procedure. The aim of the third stage is to reject the most atypical samples from the sets for individual letters so as to maximize the accuracy of character recognition.

The method applies word image oversegmentation into graphemes, based on the segmentation presented in [8]. It is assumed that each character consists of a whole number of graphemes, i.e. that actual character boundaries lie on grapheme boundaries. In the first two stages of the proposed procedure our aim is to find the subdivision of the grapheme sequences of words that most closely (optimally) corresponds to the actual character annotations explicitly given in the training corpus.

Fig. 1. Exemplary word segmentation into graphemes

This leads to the optimization problem defined as follows. Let us consider the corpus $S$ of $K$ handwritten word images $I_k$, $k = 1, \dots, K$, each annotated with the sequence of characters actually constituting the word, $w_k = (c_k^{(1)}, c_k^{(2)}, \dots, c_k^{(l_k)})$:

$$S = ((I_1, w_1), (I_2, w_2), \dots, (I_K, w_K)) \qquad (1)$$
For each word image a segmentation procedure can be applied that divides the word image into graphemes. For each word $w_k$ the corresponding grapheme sequence $G_k = (g_k^{(1)}, g_k^{(2)}, \dots, g_k^{(m_k)})$ can be obtained using the grapheme extraction procedure described in [8]. The result of an exemplary word segmentation into graphemes is presented in Fig. 1. Experiments have shown that in the great majority of cases an individual character consists of a sequence of at most 4 consecutive graphemes. Let

$$q_i^k(n) = \left(g_k^{(b_{k,i}^{(n)})}, \dots, g_k^{(e_{k,i}^{(n)})}\right)$$

denote the subsequence of graphemes in the word $w_k$ corresponding to the $i$-th character $c_k^{(i)}$ in a certain variant $V_n^k$ of this word's segmentation. Here $b_{k,i}^{(n)}$ and $e_{k,i}^{(n)}$ denote the indices of the first and the last grapheme assigned to the $i$-th character in the partitioning variant $V_n^k$. By selecting a variant $V_{n_k}^k$ for each word in the corpus $S$, the complete partitioning variant is obtained:

$$\hat{V}_n = (V_{n_1}^1, V_{n_2}^2, \dots, V_{n_K}^K)$$

By gathering the segments of the image $I_k$ corresponding to the graphemes in $q_i^k(n)$, the hypothetical image $u_k^{(i)}(n)$ of the character $c_k^{(i)}$ appearing at the $i$-th position in the word $w_k$ can be obtained. The method is based on the quite natural assumption that various appearances of the same character are similar in a certain sense. Let us define the function $\Delta(k, i, c)$, which equals 1 if the character $c$ appears at the $i$-th position of the word $w_k$ and 0 otherwise:

$$\Delta(k, i, c) = \begin{cases} 1 & \text{if } c_k^{(i)} = c, \\ 0 & \text{otherwise,} \end{cases} \qquad k \in \{1, \dots, K\};\ c \in A \qquad (2)$$

Then the sets of hypothetical character samples for the partitioning variant $\hat{V}_n$ and for all characters from the alphabet $A$ can be defined as:

$$T_n^c = \{u_k^{(i)}(n) : \Delta(k, i, c) = 1\}, \qquad k \in \{1, \dots, K\};\ c \in A \qquad (3)$$
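The construction of the sample sets in (3) can be illustrated with a short sketch. It is only an illustration under stated assumptions: graphemes are represented by arbitrary Python objects, grapheme indices are 0-based, a word's segmentation variant is given as the list of last-grapheme indices of its characters, and the names `decode_word` and `sample_sets` are hypothetical.

```python
from collections import defaultdict


def decode_word(end_indices, graphemes):
    """Split one word's grapheme list into per-character groups.

    end_indices[i] is the (0-based, assumed) index of the last grapheme
    assigned to the i-th character; the first character starts at index 0.
    """
    groups, begin = [], 0
    for end in end_indices:
        groups.append(graphemes[begin:end + 1])
        begin = end + 1
    return groups


def sample_sets(corpus, variant):
    """Group hypothetical character samples by letter, as in equation (3).

    corpus  : list of (graphemes, word) pairs, one per annotated word image
    variant : list of end-index lists, one per word (a complete variant)
    """
    sets = defaultdict(list)
    for (graphemes, word), ends in zip(corpus, variant):
        for letter, group in zip(word, decode_word(ends, graphemes)):
            sets[letter].append(group)
    return dict(sets)
```

In a real system each group of graphemes would be merged into one character image before feature extraction; here the groups are simply kept as lists.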
The similarity between character images is defined by the function $d(u, v)$, whose values come from the range (0, 1). The higher the value of $d(u, v)$, the more similar are the images $u$ and $v$. The overall similarity of the character images resulting from the complete segmentation variant $\hat{V}_n$ can be calculated as:

$$\Gamma(\hat{V}_n) = \sum_{c \in A} \frac{\sum_{u, v \in T_n^c,\ u \neq v} d(u, v)}{\eta(T_n^c)^2 - \eta(T_n^c)} \qquad (4)$$

where $\eta(T_n^c)$ is the cardinality of the set $T_n^c$. If various appearances of the same character in the text corpus are similar, then the segmentation variant $\hat{V}_n$ which maximizes the value of $\Gamma(\hat{V}_n)$ in formula (4) corresponds most closely to the actual segmentation of the words into characters. Our aim is therefore to find the complete partitioning variant $\hat{V}_n^*$ which maximizes $\Gamma(\hat{V}_n)$.

The similarity function $d(u, v)$ was calculated using ten geometrical features of the hypothetical character image. It is normalized to the standard interval (0, 1) and based on the Euclidean distance in the normalized feature space:

$$d(u, v) = 1 - \frac{\sum_{i=1}^{M} (x_i(u) - x_i(v))^2}{M} \qquad (5)$$

where $x_i$ is the $i$-th normalized feature and $M$ is the number of features.
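As a rough illustration of how (4) and (5) fit together, the sketch below evaluates the overall similarity of a family of sample sets. It rests on assumptions not stated in the text: each sample is already reduced to a list of features normalized to [0, 1], equation (5) is read as one minus the mean squared feature difference, and letters with fewer than two samples are skipped because the denominator in (4) would be zero. The function names are hypothetical.

```python
def similarity(x_u, x_v):
    """d(u, v): one minus the mean squared difference of normalized features."""
    m = len(x_u)
    return 1.0 - sum((a - b) ** 2 for a, b in zip(x_u, x_v)) / m


def overall_similarity(sets):
    """Gamma: sum over letters of the average pairwise similarity in each set.

    sets maps a letter to the list of feature vectors of its samples.
    """
    total = 0.0
    for features in sets.values():
        n = len(features)
        if n < 2:
            continue  # assumption: singleton sets contribute nothing
        pair_sum = sum(
            similarity(features[a], features[b])
            for a in range(n)
            for b in range(n)
            if a != b
        )
        total += pair_sum / (n * n - n)
    return total
```

With this reading, a letter whose samples are all identical contributes exactly 1 to the sum, and a letter whose samples differ maximally in every feature contributes 0.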
3 The Proposed Method

3.1 First Stage – Evolutionary Algorithm
The optimization problem described in the previous section is difficult because of the enormous number of possible segmentation variants. An evolutionary algorithm has been applied to find a number of suboptimal segmentation variants because of its proven effectiveness for similar combinatorial optimization problems [9]. The chromosome in the evolutionary algorithm corresponds to the complete segmentation variant $\hat{V}_n$. It is represented by the sequence of trailing grapheme indices for consecutive characters in successive words of the corpus $S$:

$$Chr(\hat{V}_n) = (e_{1,1}^{(n)}, e_{1,2}^{(n)}, \dots, e_{1,l_1}^{(n)}, e_{2,1}^{(n)}, e_{2,2}^{(n)}, \dots, e_{2,l_2}^{(n)}, \dots, e_{K,1}^{(n)}, e_{K,2}^{(n)}, \dots, e_{K,l_K}^{(n)}) \qquad (6)$$

Each index is encoded as an integer. Because the segmentation variants of individual words can be changed independently, it seems reasonable to define the gene as the subsequence $(e_{k,1}^{(n)}, e_{k,2}^{(n)}, \dots, e_{k,l_k}^{(n)})$ in $Chr(\hat{V}_n)$ corresponding to a single word $w_k$.

The initial population is created randomly and then all individuals are evaluated. This requires decoding the chromosome and creating consecutive characters on the basis of the indices it contains. All graphemes between two indices (denoted $b_{k,i}^{(n)}$ and $e_{k,i}^{(n)}$) are merged. They produce a hypothetical character sample $u_k^{(i)}(n)$ from the word. Evaluation of the obtained individual is based on the fitness function. Individuals are selected with the roulette wheel rule. The elitist model of selection is applied, so the best individual from the previous population always survives. The individuals are then subjected to the genetic operations of mutation and crossover. A uniform crossover operator is implemented, which means that crossover consists in randomly exchanging (with an assumed probability) the corresponding genes between parents. The mutation operator relies on the creation of new individuals with an assumed probability. The fitness value of the chromosome corresponding to $\hat{V}_n$ is calculated as $\Gamma(\hat{V}_n)^{\alpha}$. The value of $\alpha$ was determined experimentally. The fastest convergence to the suboptimal solution was obtained for $\alpha \in (10, 20)$.

3.2 Second Stage - Iterative Refinement
The segmentation variants represented by chromosomes from the last population of the evolutionary algorithm can be further improved by moving towards a local maximum of the fitness function. The solution represented by a chromosome is refined in a step-wise iterative procedure, in which the chromosome is repeatedly modified by a series of minimal modifications. In each step only a single element of the chromosome, determining the right boundary position of a single character, is modified. An elementary modification consists in moving the right character boundary one grapheme to the left or to the right (provided that this does not violate the constraints on the assumed limits of the grapheme count per character). If the fitness function value of the modified chromosome increases, the modification is kept; otherwise the unchanged chromosome is processed in the next iterations. A single pass of the method consists in trying the elementary modification successively for those elements of the chromosome that correspond to all but the last characters of the corpus words (the right boundary of the last character in a word is fixed). The pass is repeated until a complete pass yields no increase of the fitness function value. Experiments show that in many cases the highest fitness value is reached by a segmentation variant different from the one that was best after the first stage. For this reason all chromosomes from the final population of the evolutionary algorithm are passed to the refinement stage. The chromosome with the highest fitness value after the refinement stage defines the word segmentation for the next processing stage.

3.3 Third Stage - Validation
The segmentation obtained with the refinement procedure may still contain incorrectly extracted characters. This may seriously impair a word recognizer trained with a set containing incorrectly extracted character samples. Segmentation faults resulting from incorrect segmentation into graphemes, where graphemes are shared by adjacent characters, are particularly detrimental, because they cannot be corrected in the first two segmentation stages, which are based on grapheme grouping. In the last stage the most suspicious character samples are rejected from the sample set. The candidates for rejection are selected in a validation procedure, in which the character sample set produced at the refinement stage is used as the training set for a character classifier, which in turn is applied in a simplified cursive word classifier following the "analytic" handwriting recognition paradigm. The validation procedure starts with the whole sample set. The samples are sorted according to their average similarity to the remaining samples in the set representing the same letter of the alphabet. The most atypical samples are at the beginning of the ranking, while the most typical ones are at the end. In each step of the iterative procedure, a single sample is temporarily removed from the sample set in the order determined by the ranking (the most atypical samples are considered first) and the classifier is trained with the reduced set. If the removal improves the word recognition accuracy, the sample is removed permanently; otherwise it is returned to the training set. The samples that survive the validation constitute the final result of the training set acquisition process.
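The two greedy loops of Sections 3.2 and 3.3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the chromosome is assumed to be a list of genes (one list of 0-based last-grapheme indices per word), the limit of 4 graphemes per character is taken from Section 2, and the callbacks `fitness`, `typicality` and `accuracy` are hypothetical stand-ins for the fitness function, the average similarity ranking and the train-and-evaluate step of the word classifier.

```python
def gene_is_valid(gene, max_len=4):
    """Every character must own between 1 and max_len graphemes."""
    prev = -1
    for end in gene:
        if not 1 <= end - prev <= max_len:
            return False
        prev = end
    return True


def refine(chromosome, fitness, max_len=4):
    """Stage two: move single right boundaries by one grapheme while fitness grows."""
    chromosome = [list(gene) for gene in chromosome]
    best = fitness(chromosome)
    improved = True
    while improved:  # repeat passes until a whole pass changes nothing
        improved = False
        for gene in chromosome:
            for i in range(len(gene) - 1):  # last boundary of a word is fixed
                for step in (-1, 1):
                    gene[i] += step
                    if gene_is_valid(gene, max_len):
                        value = fitness(chromosome)
                        if value > best:
                            best, improved = value, True
                            break  # keep the modification
                    gene[i] -= step  # undo the modification
    return chromosome


def validate(samples, typicality, accuracy):
    """Stage three: try removing samples, least typical first; keep helpful removals.

    typicality(i) scores sample i (low = atypical); accuracy(subset) is assumed
    to train a classifier on the subset and return its recognition accuracy.
    """
    order = sorted(range(len(samples)), key=typicality)
    kept = set(order)
    best = accuracy([samples[i] for i in sorted(kept)])
    for i in order:
        trial = kept - {i}
        value = accuracy([samples[j] for j in sorted(trial)])
        if value > best:
            kept, best = trial, value
    return [samples[i] for i in sorted(kept)]
```

Both loops accept a change only when the score strictly increases, so `refine` terminates after finitely many passes and `validate` performs one training run per sample in addition to the initial one.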
4 Experiments
The method described in the previous section was tested to evaluate its effectiveness and to assess how the procedure applied in each stage contributes to the accuracy of word segmentation. The accuracy of the character classifier trained with the automatically segmented character samples was also compared with the accuracy achieved by the classifier trained with character sample images extracted from the same corpus by a human expert.

In the experiment, three handwritten corpora were used. Each corpus consisted of the same set of Polish words, rewritten by a single writer. The writers were selected so as to obtain samples of three handwriting styles differing in accuracy, precision and readability. The styles can be classified as calligraphic (c), fair (f) and sloppy (s). Each corpus was divided into a training part and a validation part. The training part was subjected to the proposed segmentation procedure. The validation part was used at the validation stage. The training part of each corpus contained the same set of 116 Polish words consisting of 725 appearances of letters of the Latin alphabet. The validation part contained 100 words consisting of 630 characters.

In the first experiment the segmentations obtained by applying the genetic algorithm (GA) and the genetic algorithm with stepwise refinement (RGA) were compared with the manual segmentation. The accuracy of the automatic segmentation for each letter of the alphabet was calculated as the fraction of the letter appearances extracted consistently with the expert segmentation. The obtained accuracies are presented in Table 1.

Table 1. Accuracy of automatic segmentation; c - calligraphic, f - fair, s - sloppy styles

Character   GA algorithm          RGA algorithm
            s     f     c         s     f     c
a           0.62  0.74  0.92      0.69  0.85  0.94
b           0.59  0.71  0.88      0.65  0.82  1.00
c           0.79  0.91  0.97      0.79  0.94  1.00
d           0.72  0.79  0.93      0.83  0.83  0.97
e           0.70  0.84  0.95      0.77  0.88  0.98
f           0.95  0.95  1.00      0.90  1.00  1.00
g           0.85  1.00  1.00      0.85  1.00  1.00
h           0.55  0.64  0.86      0.68  0.68  0.86
i           0.81  0.87  0.94      0.87  0.87  0.94
j           0.55  0.65  0.85      0.65  0.75  0.90
k           0.77  0.77  0.91      0.80  0.80  0.94
l           0.87  0.94  1.00      0.87  0.94  1.00
m           0.78  0.81  0.93      0.78  0.85  0.93
n           0.66  0.76  0.76      0.69  0.83  0.83
o           0.90  0.85  0.95      0.93  0.93  1.00
p           0.81  0.95  0.90      0.95  0.95  0.90
r           0.63  0.69  0.75      0.67  0.77  0.81
s           0.78  0.89  0.89      0.85  0.93  0.93
t           0.53  0.91  0.97      0.53  0.94  0.97
u           0.68  0.91  0.94      0.74  0.91  0.94
w           0.75  0.82  0.82      0.71  0.86  0.82
z           0.85  0.79  0.85      0.85  0.88  0.94
y           0.89  0.94  1.00      0.94  1.00  1.00
Average     0.73  0.82  0.91      0.77  0.87  0.94

In the second experiment, the validation stage was tested. The character samples created by the RGA algorithm were used as the training set for a simplified word recognition algorithm. As a result of the validation procedure described in the previous section, some of the samples were rejected from the training set. The finally selected set of letter images was assessed in two ways: by evaluating its consistency with the manually extracted character images and by determining the character recognition accuracy of the character classifier trained with the final sample set. The latter was compared with the character recognition accuracy achieved with the training set obtained by manual segmentation. Results for the three analyzed handwriting styles are presented in Table 2.

Table 2. Results of isolated character recognition with semi-automatically extracted training sets; c - calligraphic, f - fair and s - sloppy styles, m.seg. - manual segmentation, a.seg. - automatic segmentation

                                                                   s      f      c
Number of rejected samples                                         141    89     39
Number of remaining samples                                        584    636    686
Remaining samples set consistency with the m.seg.                  0.84   0.93   0.95
Character recognition accuracy – train. set acquired with m.seg.   0.84   0.90   0.96
Character recognition accuracy – train. set acquired with a.seg.   0.77   0.86   0.94
5 Conclusions, Further Works
The aim of the method described in this article is to semi-automatically create the training set for character classifiers used in writer-dependent cursive handwriting recognition. The experiments show that by combining the three proposed techniques (an evolutionary algorithm for preliminary segmentation, stepwise refinement of segmentation variants, and sample set validation by rejection of the worst samples) a high degree of consistency with the segmentation done by a human expert can be achieved. Moreover, the isolated character recognition accuracy of a classifier trained with the automatically acquired training set is relatively close to the accuracy attainable with a training set prepared by manual segmentation of the training corpus. The results obtained automatically are particularly close to those of manual segmentation for accurate writing styles: for the calligraphic style the accuracies are 0.94 and 0.96, respectively (Table 2). In such cases, the proposed method can be considered practically useful. It should be emphasized that the recognition test was carried out on correctly isolated characters. To assess the usability of the described method more reliably, it should in the future be tested in application to the recognition of unsegmented words based on the "analytic" paradigm.

Acknowledgement. This work was financed by the Polish Ministry of Science and Higher Education in the years 2005-2007 as research project No 3T11E00528.
References
1. Schomaker, L., Teulings, H.: Unsupervised learning of prototype allographs in cursive script using invariant handwriting features. In: Simon, J.C., S.I. (ed.): From Pixels to Features III, Amsterdam, North-Holland (1992)
2. Mestetskii, L., Reyer, I., Sederberg, T.: Continuous approach to segmentation of handwritten text. In: Eighth International Workshop on Frontiers in Handwriting Recognition, pp. 440–444 (2002)
3. Mackowiak, J., Schomaker, L., Vuurpijl, L.: Semi-automatic determination of allograph duration and position in on-line handwriting words based on the expected number of strokes. In: Progress in Handwriting Recognition, World Scientific, London (1997)
4. Yulyakov, S., Govindaraju, V.: Probabilistic model for segmentation based word recognition with lexicon. In: Proc. of the Sixth International Conference on Document Analysis and Recognition, pp. 164–167. IEEE Press, Orlando, Florida, USA (2001)
5. Sadri, J., Suen, C., Bui, T.D.: A genetic framework using contextual knowledge for segmentation and recognition of handwritten numeral strings. Pattern Recognition 40, 898–919 (2007)
6. Lamprier, S., Amghar, T., Levrat, B., Saubion, F.: Seggen: a genetic algorithm for linear text segmentation. In: Proceedings of IJCAI, pp. 1647–1652 (2007)
7. Connel, S., Jain, A.: Writer adaptation for online handwriting recognition. IEEE Trans. on PAMI 24, 329–346 (2002)
8. Arica, N., Yarman-Vural, F.: Optical character recognition for cursive handwriting. IEEE Trans. on PAMI 24, 801–813 (2002)
9. Knjazew, D.: OmeGA: A Competent Genetic Algorithm for Solving Permutation and Scheduling Problems. Kluwer Academic Publishers, Boston, MA (2002)
Gabor-Based Recognizer for Chinese Handwriting from Segmentation-Free Strategy Tong-Hua Su, Tian-Wen Zhang, De-Jun Guan, and Hu-Jie Huang School of Computer Science and Technology Harbin Institute of Technology, Harbin, 150001, China {tonghuasu,twzhang}@hit.edu.cn
Abstract. A segmentation-free recognizer is presented to transcribe Chinese handwritten documents, incorporating Gabor features and Hidden Markov Models (HMMs). First, each textline is extracted and filtered into Gabor observations by sliding windows. Then the Baum-Welch algorithm is used to train character HMMs. Finally, the best character string under the maximum a posteriori criterion is found by the Viterbi algorithm and output. Experiments are conducted on a collection of Chinese handwriting. The results not only show the feasibility of the segmentation-free strategy, but also demonstrate the advantages of Gabor filters in the transcription of Chinese handwriting. Keywords: Gabor filter, HMM, Chinese handwriting recognition.
1 Introduction
Despite thirty years of research, offline recognition of Chinese handwriting remains one of the most challenging problems. High recognition rates are reported in the literature on high-quality test samples [1]. When real handwriting is fed in, the results deteriorate dramatically [2]. It is therefore advisable to put the emphasis on more complex, realistic handwriting recognition.

We approach this task with a segmentation-free strategy, whereas previous recognizers for Chinese handwriting perform character segmentation prior to the recognition stage. First, each textline is extracted and converted to an observation sequence by sliding windows (no attempt is made to segment the textline into characters). Then the Baum-Welch algorithm [3] is used to train character Hidden Markov Models (HMMs). Finally, the best character string under the maximum a posteriori criterion is found by the Viterbi algorithm [4] and output. One prominent advantage is that the juncture information between adjacent characters can be utilized easily.

Given the complex structure and writing variability of Chinese handwriting, the core issue is to characterize Chinese characters in a discriminative way. Some researchers have claimed that directional features of strokes demonstrate excellent performance in Chinese character recognition [1]. The Gabor filter is one candidate. It has been widely used in pattern recognition, e.g. in face recognition [5], iris recognition [6], and numeral recognition [7], due to its biological evidence [8] and mathematical characteristics [9]. More recently, it has been applied to Chinese handwritten character recognition with promising results [10] [11] [12].

In this paper, we explore Gabor features under the segmentation-free framework. After properly tuning the filter parameters, the recognition rates can be improved greatly. The results not only show the feasibility of the segmentation-free strategy, but also demonstrate the advantages of Gabor filters in the transcription of Chinese handwriting.

The next section details the main components of our system, with emphasis on Gabor feature extraction. Sections 3 and 4 report the experimental results and draw conclusions from this work, respectively.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 539–546, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 System Descriptions
When a textline image is fed into the recognizer, it is converted to a sequence of feature vectors, or observations, $O = (o_1, \dots, o_m)$. The recognition task is to identify the string of characters $S = (s_1, \dots, s_n)$ that maximizes the a posteriori probability $P(S|O)$ (the MAP criterion). The recognizer described in this paper consists of three main components: textline extraction, Gabor feature selection by sliding window, and recognition using HMMs. The system architecture is illustrated in Fig. 1.
Fig. 1. System architecture (block diagram with the blocks: Textline extraction, Sliding window based feature selection, B-W algorithm, Character HMM, Viterbi algorithm, Handwriting database, Reference database, Character string, Performance)
2.1 Textline Extraction
Each piece of handwriting to be recognized consists of about 10 textlines, and each textline includes about 24 characters. To make things easier, we first segment the document into textlines. One straightforward method is horizontal histogram analysis: the zero-crossing points are potential text border lines. Unfortunately, a skewed document exhibits overlap between adjacent textlines and cannot be handled successfully by horizontal histogram analysis. A skew detection algorithm based on the stroke histogram is therefore adopted, and the de-skewed documents are then analyzed by horizontal histogram [2]. With this technique, all textlines are successfully extracted in our handwritten document database, which was written by a single writer (see Sect. 3.1 for more information about the database). Fig. 2 shows a sample whose textlines are extracted after skew correction, whereas they cannot be separated by analysis of the histogram profile before skew correction.

Fig. 2. All lines are extracted after skew correction

2.2 Gabor Feature Selection
Once the textlines are extracted, we parameterize them using a sliding window (see Fig. 3(a)) moving from left to right. We purposely omit the noise removal and textline normalization stages in order to test the recognizer's robustness. Generally, the height of the sliding window is the same as that of the textline. The other two parameters, the width W and the shift step S, must be assigned by the researcher or determined through experiments. In our system we set W = 12 pixels (one-fifth of the average character width), and S is determined by experiments. In addition, we partition the window into 3 zones, as shown in Fig. 3(b). The feature vector is extracted from the body zone instead of the whole window in order to resist the undulation of textlines. The body zone is delimited vertically by the topmost and bottommost foreground pixels.
Fig. 3. Feature selection process using sliding window: (a) the running of the sliding window with shift step S, producing the observations ..., o_i, o_{i+1}, ...; (b) the zone used to extract features (upper blank zone, body zone, lower blank zone)
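The windowing and body-zone steps described above can be sketched in a few lines. The sketch assumes a binary textline image stored as a list of rows with 1 for foreground pixels, uses the stated width W = 12, and takes a shift step of 4 pixels purely as a placeholder, since the text determines S experimentally. The function names are hypothetical.

```python
def sliding_windows(line, width=12, step=4):
    """Yield (start_column, window) pairs from left to right over a textline.

    line is a list of equally long rows of 0/1 pixels; each window keeps the
    full textline height and `width` columns.
    """
    length = len(line[0])
    last_start = max(length - width, 0)
    for start in range(0, last_start + 1, step):
        yield start, [row[start:start + width] for row in line]


def body_zone(window):
    """Return the rows between the topmost and bottommost foreground pixels.

    Returns None for a window without any foreground pixel.
    """
    rows = [r for r, row in enumerate(window) if any(row)]
    if not rows:
        return None
    return window[rows[0]:rows[-1] + 1]
```

Each body zone would then be passed to the feature extractor to produce one observation per window position.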
We present cell features as a baseline system and Gabor-based features as an enhanced one. The extraction of cell features is very simple. The body zone of the window is divided into 8 × 2 cells, and the foreground pixels are summed in each cell. Extracting the Gabor features is somewhat more complicated.

First, we determine the parameters of the Gabor filters. Mathematically, the Gabor filter describes a plane wave modulated by a Gaussian envelope function, whose 2D form has been proven to resemble the receptive fields of simple cells in the visual cortex [8]. Among various wavelet bases, the Gabor filter provides the optimal resolution in both the spatial and the frequency domain [13], and exhibits desirable characteristics of spatial locality and orientation selectivity. The 2D Gabor filters are defined as:

$$\psi_{\mu,\nu}(\zeta) = \frac{\|\kappa_{\mu,\nu}\|^2}{\sigma^2}\, e^{-\frac{\|\kappa_{\mu,\nu}\|^2 \|\zeta\|^2}{2\sigma^2}} \left[ e^{i\zeta\kappa_{\mu,\nu}} - e^{-\frac{\sigma^2}{2}} \right], \qquad (1)$$

where $\mu$, $\nu$ are the indices of orientation and scale of the Gabor filters, $\zeta = x + iy$, and $\kappa_{\mu,\nu}$ is defined in half-octave space as:

$$\kappa_{\mu,\nu} = 2^{-\frac{\nu+2}{2}}\, \pi\, e^{i\varphi_\mu}, \qquad (2)$$

where $\varphi_\mu$ is the $\mu$-th orientation. We use the following parameters: $\mu \in \{0, 1, 2, 3\}$, $\nu = 3$, $\sigma = 2\pi$. The values of $\mu$ and $\nu$ are chosen on the basis of statistical information about Chinese character structure [10].

Second, filter the handwritten textline. The filtered image at the $i$-th step is computed by convolving the body zone in the $i$-th sliding window (denoted $B_i(\zeta)$) with the Gabor filters:

$$J^i_{\mu,\nu}(\zeta) = B_i(\zeta) * \psi_{\mu,\nu}(\zeta). \qquad (3)$$
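For illustration, the kernel of equations (1) and (2) can be sampled on a pixel grid as in the sketch below. Several points are assumptions rather than statements of the text: the orientations are taken as evenly spaced, $\varphi_\mu = \mu\pi/4$ for the four values of $\mu$; the product $\zeta\kappa_{\mu,\nu}$ in the exponent is read as the inner product of the two vectors, as is usual for this kernel; and the half-size of 16 pixels is arbitrary. The function name is hypothetical.

```python
import cmath
import math


def gabor_kernel(mu, nu=3, sigma=2 * math.pi, half_size=16, n_orient=4):
    """Sample the 2D Gabor kernel psi_{mu,nu} on a (2*half_size+1)^2 grid.

    Returns a list of rows of complex values; row index is y, column index is x,
    both running from -half_size to +half_size.
    """
    phi = mu * math.pi / n_orient  # assumed orientation spacing
    kappa = 2.0 ** (-(nu + 2) / 2.0) * math.pi * cmath.exp(1j * phi)
    k2 = abs(kappa) ** 2
    dc = math.exp(-sigma ** 2 / 2.0)  # term that compensates the DC response
    kernel = []
    for y in range(-half_size, half_size + 1):
        row = []
        for x in range(-half_size, half_size + 1):
            envelope = (k2 / sigma ** 2) * math.exp(
                -k2 * (x * x + y * y) / (2.0 * sigma ** 2)
            )
            phase = kappa.real * x + kappa.imag * y  # inner product <kappa, zeta>
            row.append(envelope * (cmath.exp(1j * phase) - dc))
        kernel.append(row)
    return kernel
```

With $\nu = 3$ and $\sigma = 2\pi$ the prefactor $\|\kappa\|^2/\sigma^2$ equals 1/128, so the kernel value at the origin is very close to 0.0078. Convolving each body zone with the four kernels (one per $\mu$) would give the filtered images of equation (3).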
Third, generate feature vectors. Generally, Gabor feature extraction methods fall into two categories: direct sampling (referred to as GS below) and local smoothing (also known as histogram, denoted GH hereafter). The former divides the filtered body zone $J^i_{\mu,\nu}(\zeta)$ into 8 × 2 cells; the center point of each cell is selected and the points are combined to construct the $i$-th feature vector $o_i$. In the latter, $J^i_{\mu,\nu}(\zeta)$ is partitioned into 4 × 2 cells and two quantities are produced in each cell, which are concatenated into $o_i$:

i i ζ∈{Jμ,ν (ζ)>0/0/ 10*ha) then processed CC was assumed as frame/window.

Since lines, frames and noise are possible sources of error, they are discarded in the further steps of the evaluation. The remaining components were framed and these frames were filled to reach possible text areas. Fig. 1c shows the pre-energy map of the sample and Fig. 1d shows its energy map.

2.3 Character Components Interval Tracing
Text characters in words and sentences form a textural structure through the distances (pitches) between neighboring characters, which lie within a certain interval. In our model, we perform a novel component interval tracing process to exploit this property of text characters. Figure 1f shows the text obtained by applying character tracing to Fig. 1e, which shows the binarized document image masked with the energy map. The document image, binarized by Otsu’s method, was masked with the energy map to obtain probable text characters at the start of the character interval tracing method. Connected component analysis was performed on this masked image and
Text Area Detection in Digital Documents Images
Fig. 1. a) Document image sample from database, b) Gabor filtering normalized result of sample image, c) Pre-energy map of this sample, d) Energy map of this sample, e) Binarized document image that masked with energy map, f) Text areas extracted by proposed method
text characters were found, but some non-text components were still labeled as probable text characters. In the character components interval tracing process, each component is classified as text or non-text. As stated previously, the search regions for the component’s left and right neighbors are determined first, as shown in Fig. 2. In Fig. 2, a text character “A” is given, where h is the height of this character component and Xa, Yb are its coordinates in the x-y plane. The dotted square areas in Fig. 2 are the search regions for this component’s neighbors. Then the following rules are applied: a) If there is no component in the search regions of the processed component, then the processed component is labeled as negative; b) If there is a component in the search regions of the processed
I. Ar and M.E. Karsligil
Fig. 2. The search areas of probable text component’s neighbors
component and the neighbor component does not satisfy the height ratio given in (7), then the processed component is labeled as negative; c) If there is a component in the search regions of the processed component and the neighbor component satisfies the height ratio in (7), then both the processed component and the neighbor component are labeled as positive. At the end, negatively labeled components are erased from the masked image, and positively labeled components are accepted as text characters. The height ratio rule is given in (7), where $N_h$ is the height of the neighbor component and $P_h$ is the height of the processed component:

$$2 N_h \ge P_h \ge \frac{N_h}{2}. \qquad (7)$$

This height ratio checks whether the neighbor component is similar in size to the processed component.
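Rules a) to c) and the height ratio (7) can be sketched as follows. The names `height_ratio_ok` and `label_components` are hypothetical, and the search-region geometry of Fig. 2 is not reproduced: the sketch assumes the neighbors found in the search regions are already given. It also assumes that a component with several neighbors is accepted as soon as one neighbor satisfies (7), which the text does not specify.

```python
def height_ratio_ok(p_h, n_h):
    """Rule (7): the processed height lies between half and twice the neighbor height."""
    return n_h / 2 <= p_h <= 2 * n_h

def label_components(heights, neighbours):
    """Label each component as text (True) or non-text (False).

    heights:    dict mapping component id -> height in pixels
    neighbours: dict mapping component id -> ids found in its search regions
    """
    positive = set()
    for cid, h in heights.items():
        for nid in neighbours.get(cid, ()):
            if height_ratio_ok(h, heights[nid]):
                # Rule c): both the processed component and its neighbor are text.
                positive.update((cid, nid))
    # Rules a) and b): everything never marked positive stays negative.
    return {cid: cid in positive for cid in heights}
```

For example, a tall frame-like component next to a normal character fails (7) and is rejected, while two characters of similar height confirm each other.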
Fig. 3. Character component interval tracing result on sample image
In Fig. 3, the result of character component interval tracing is shown for part of a document image. The masked image in Fig. 3a contains some non-text components among the probable text components. After the character component interval tracing process, these non-text components are eliminated by the neighborhood and height ratio rules.
3 Experimental Results
We tested our method on a generated document database containing 40 different kinds of document images from books, magazines and newspapers. The results were quite promising for documents having:
Fig. 4. a) A document image taken from Media Team Database, b) Part of this sample, c) Energy map of part, d) Result of proposed method (text areas of part)
Fig. 5. a) A part of Russian document image, b) Result of masking energy map with black and white document image, c) Result of proposed method
pictures, non-vertical text, character sequences in different languages, and non-rectangular layouts. The proposed method sometimes fails when the document contains complex shapes such as caricatures, mathematical equations, musical notes or a pictorial background. Figure 4 shows a document composed of text areas and pictures, together with parts of it. In the energy map of the part, most of the image-region components and non-text components are eliminated, and character component interval tracing eliminates the remaining non-text components. Fig. 5 shows part of a document image containing Russian text characters. Nearly all of the elements in the image region were eliminated by masking with the energy map, but some non-text components remained. After character component interval tracing, the novelty of this work, only one non-text component remained.
4 Conclusion
The proposed method can be used for 256 grey level document images which are written in different languages. Our proposed method does not have restrictions like having a rectangular layout and training the system for known layout or
font types. The main advantage of the proposed system is the detection of text areas by determining textural features in two stages. First, candidate text region components were extracted by applying a Gabor filter. Then the textures of the remaining components were examined to see whether they were part of a word. This second stage improves the accuracy of text region detection.
References
1. Simon, A., Pret, J.-C., Johnson, A.P.: A Fast Algorithm for Bottom-Up Document Layout Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(3), 273–277 (1997)
2. Ha, J., Haralick, R., Phillips, I.: Document Page Decomposition by the Bounding-Box Projection Technique. In: Proc. Third Int’l Conf. Document Analysis and Recognition, Montreal, pp. 1119–1122 (1995)
3. Jain, A.K., Yu, B.: Document representation and its application to page decomposition. IEEE Trans. Pattern Analysis and Machine Intelligence 20, 294–308 (1998)
4. Chen, J.-L.: A simplified approach to the HMM based texture analysis and its application to document segmentation. Pattern Recognition Letters 18(10), 993–1007 (1997)
5. Yuan, Q., Tan, C.L.: Page segmentation and text extraction from gray-scale images in microfilm. SPIE Document Recognition and Retrieval VIII, pp. 323–332 (2001)
6. Raju, S.S., Pati, P.B., Ramakrishnan, A.G.: Gabor Filter Based Block Energy Analysis for Text Extraction from Digital Document Images. In: Proceedings First International Workshop on Document Image Analysis for Libraries (DIAL 2004), pp. 233–243 (2004)
7. Pati, P.B., Raju, S., Pati, N., Ramakrishnan, A.G.: Gabor filters for document analysis in Indian Bilingual Documents. In: Proc. of International Conference on Intelligent Sensing and Information Processing - 2004, Chennai, India, pp. 123–126 (2004)
8. (2007) http://www.mediateam.oulu.fi/downloads/MTDB/
9. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ (2002)
10. Young, I., van Vliet, L., van Ginkel, M.: Recursive Gabor filtering. IEEE Trans. Sig. Proc. 50(11), 2799–2805 (2002)
11. Lee, T.S.: Image representation using 2D Gabor wavelets. IEEE Trans. Pattern Anal. Machine Intell. 18, 959–971 (1996)
12. Otsu, N.: A Threshold Selection Method From Gray Level Histograms. IEEE Trans. Syst., Man, Cybern. SMC, 62–66 (1979)
Optimal Threshold Selection for Tomogram Segmentation by Reprojection of the Reconstructed Image
K. Joost Batenburg and Jan Sijbers
University of Antwerp, Vision Lab, Universiteitsplein 1, B-2610 Wilrijk, Belgium
Abstract. Grey value thresholding is a segmentation technique commonly applied to tomographic image reconstructions. Many procedures have been proposed to optimally select the grey value thresholds based on the image histogram. In this paper, a new method is presented that uses the tomographic projection data to determine optimal thresholds. The experimental results for phantom images show that our method obtains superior results compared to established histogram-based methods.
1 Introduction
Segmentation of tomographically reconstructed images (also called tomograms) is a well known problem in computer vision. It refers to the classification of image pixels into distinct classes that are characterized by a discrete set of grey values. Amongst all segmentation techniques, image thresholding is the simplest, yet often most effective segmentation method. Many algorithms have been proposed for selecting “optimal” thresholds with respect to various optimality measures [1]. Thresholds are typically selected from the histogram of the tomogram, such that the distance between the tomogram and the segmented image is minimized. To the authors’ knowledge, current segmentation techniques for tomographic images are applied to tomograms only and do not exploit the available projection data but are rather based on the image histogram [2,3,4]. Specifically, the tomogram histogram is often used as a basis for determining global thresholds by, for example, fitting multiple Gaussian distributions to the histogram [5] or applying a k-means clustering algorithm to it [6]. More recent thresholding techniques are based on a minimum variance criterion [7] or employ the variance and intensity contrast [8]. In practice however, the image histogram often lacks clear modes that would allow an intuitive selection of threshold values. Indeed, tomograms may be polluted by artifacts that tend to smear out the image histogram. Typical artifacts are streaking artifacts caused by highly absorbing object parts, blurring caused by object motion during scanning (or equivalently caused by small shifts of the detector during acquisition), bias fields, or artifacts caused by a limited field of view and/or a missing wedge. Hence, although threshold selection based on the image histogram can be made fully automatic, it lacks robustness in case clear grey value modes are absent.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 563–570, 2007.
© Springer-Verlag Berlin Heidelberg 2007
This paper presents a new approach to segmentation, which uses the information in the projection data instead of the histogram. Forward projections of the tomogram are computed and compared to the measured projection data. In the search for optimal thresholds, many computationally expensive forward projections must be performed. An important contribution of the current paper is the efficient implementation of the forward projection method, which makes using the original projection data as a segmentation criterion feasible. The proposed method does not suffer from subjectiveness of the user. Our simulation experiments demonstrate that our method is robust and clearly outperforms established histogram-based methods.
2 Grey Level Estimation
We restrict ourselves to the segmentation of a 2-dimensional image, which is represented on a rectangular grid of width w and height h. Hence, the total number of pixels is given by n = wh. The grey-scale image $x \in \mathbb{R}^n$ that we want to segment is a tomographic reconstruction of some physical object, of which projections were acquired using a tomographic scanner. Our method can be used for any scanning geometry, e.g., parallel beam, fan beam and, in 3D, cone beam. Projections are measured as sets of detector values for various angles, rotating around the object. Let m denote the total number of measured detector values (for all angles) and let $p \in \mathbb{R}^m$ denote the measured data. The physical projection process in tomography can be modeled as a linear operator W that maps the image x (representing the object) to the vector p of measured data:

$$W x = p. \qquad (1)$$
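To make the linear model of Eq. (1) concrete, the sketch below builds a toy projection matrix for only two parallel-beam angles (0 and 90 degrees) with unit weights, so that the projections are simply the row and column sums of the image. The function name `row_column_projector` and the unit weights are assumptions for illustration; a real scanner geometry uses many angles and fractional weights.

```python
import numpy as np

def row_column_projector(h, w):
    """Toy projection matrix W for an h-by-w image and two projection angles."""
    n = h * w
    W = np.zeros((h + w, n))
    for r in range(h):
        for c in range(w):
            j = r * w + c          # index of pixel (r, c) in the flattened image
            W[r, j] = 1.0          # horizontal ray through image row r
            W[h + c, j] = 1.0      # vertical ray through image column c
    return W
```

Each pixel contributes to exactly one detector value per angle here, so W has only 2n nonzero entries out of (h + w)·n.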
For parallel projection data, the operator W is a discretized version of the well-known Radon transform. We represent W by an m×n matrix $(w_{ij})$. Note that for each projection angle, every pixel j will only project onto a few detector pixels, so the matrix W is very sparse. Exploiting this sparsity forms an essential part of our method. The matrix representation of the projection operator is commonly used in algebraic reconstruction algorithms. We refer to Chapter 7 of [9] for details. From this point on, we assume that an image x has been computed that approximately satisfies Eq. (1). Our approach does not depend on the reconstruction algorithm that was used to compute x, e.g., Filtered Backprojection or ART. The image x now has to be segmented using global thresholding of the grey levels. The main motivation for using thresholding in general is that pixels representing the same “material” in the scanned object should have approximately the same grey values in the tomogram. We rely strongly on the assumption that the scanned object consists of only a few different materials, which is true for all segmentation methods based on thresholding. Our segmentation approach first assigns a real-valued grey level to each of the segmentation classes. Using these grey levels, the projections of the segmented image are then computed. The computed forward projections are compared to the measured
projection data, which provides a measure for the quality of the segmentation (along with the chosen grey levels). This quality measure can also be used for segmentation techniques other than thresholding. We first consider the problem of determining grey levels for each of the segmentation classes of a segmented image. We can consider a segmentation of an image into ℓ classes as a partition of the set of pixels, consisting of ℓ subsets. Let $S = \{S_1, \ldots, S_\ell\}$ be a partition of $\{1, \ldots, n\}$. We label each set by its index t: $S_t$. Each pixel j is contained in exactly one set $S_t \in S$, whose index is denoted by $s(j) \in \{1, \ldots, \ell\}$. To each set $S_t$, a grey level $\rho_t \in \mathbb{R}$ is assigned, which induces an assignment of grey levels to the pixels 1 ≤ j ≤ n, where pixel j is assigned the grey level $\rho_{s(j)}$. Define $r_S(\rho) = (\rho_{s(1)} \ldots \rho_{s(n)})^T$. The vector $r_S(\rho)$ contains, for each pixel j, the corresponding grey level of that pixel. Our goal is to determine “optimal” grey values ρ for the given partition S. The quality of a vector ρ is determined by computing the projections of the segmented image, using the grey levels from ρ, and comparing the computed projections to the measured projections p. More formally, we define the problem of finding optimal grey values for a given partition as follows:

Problem 1. Let $W \in \mathbb{R}^{m \times n}$ be a given projection matrix, let $S = \{S_1, \ldots, S_\ell\}$ be a partition of $\{1, \ldots, n\}$ and let $p \in \mathbb{R}^m$ be a vector of measured projection data. Find $\rho \in \mathbb{R}^\ell$ such that $|W r_S(\rho) - p|^2$ is minimal.

We will start by deriving the equations for solving Problem 1 for a fixed partition S. Subsequently, we describe how the optimal set of grey values can be recomputed efficiently, each time the partition S has been modified by moving a single pixel from one class to another class. This fast update computation allows us to design efficient algorithms that combine the search for a segmentation S and the corresponding grey values ρ, such that the projection distance is minimal.
Define $A = (a_{it}) \in \mathbb{R}^{m \times \ell}$ by

$$a_{it} = \sum_{j:\, s(j)=t} w_{ij}. \qquad (2)$$

The value $a_{it}$ equals the total area of pixels from the set $S_t$ that contribute to detector value i. We denote the i-th row of A, written as a column vector, by $a_i$. Clearly, we have

$$[W r_S(\rho)]_i = \sum_{t=1}^{\ell} a_{it}\, \rho_t. \qquad (3)$$

Define the projection difference $d \in \mathbb{R}^m$ by $d = W r_S(\rho) - p = A\rho - p$. Put $c_i = -2 p_i a_i$, $Q_i = a_i a_i^T$. The squared total projection difference is then

$$|d|^2 = |A\rho - p|^2 = \sum_{i=1}^{m} (a_i^T \rho - p_i)^2 = \sum_{i=1}^{m} (c_i^T \rho + \rho^T Q_i \rho + p_i^2).$$

Define $\bar{c} = \sum_{i=1}^{m} c_i$, $\bar{Q} = \sum_{i=1}^{m} Q_i$. Thus

$$|d|^2 = |p|^2 + \bar{c}^T \rho + \rho^T \bar{Q} \rho. \qquad (4)$$
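The identities in Eqs. (2) to (4) can be checked numerically with a few lines of NumPy. This is only a sketch: the helper names `class_matrix` and `squared_distance` are made up for illustration, and classes are indexed from 0 rather than 1.

```python
import numpy as np

def class_matrix(W, s, n_classes):
    """Eq. (2): column t of A sums the columns of W that belong to class t."""
    A = np.zeros((W.shape[0], n_classes))
    for t in range(n_classes):
        A[:, t] = W[:, s == t].sum(axis=1)
    return A

def squared_distance(A, p, rho):
    """Eq. (4): |d|^2 = |p|^2 + c_bar^T rho + rho^T Q_bar rho."""
    c_bar = -2.0 * A.T @ p     # sum over i of c_i = -2 p_i a_i
    Q_bar = A.T @ A            # sum over i of Q_i = a_i a_i^T
    return p @ p + c_bar @ rho + rho @ Q_bar @ rho
```

Here `s` is an integer array holding the class of each pixel, so `rho[s]` plays the role of $r_S(\rho)$ and `W @ rho[s]` should agree with `A @ rho`.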
Note that each of the terms $c_i$ and $Q_i$ depends only on $a_i$ and $p_i$, not on ρ. A vector ρ that minimizes the projection difference |d| can now be computed by setting the derivatives of $|d|^2$ with respect to $\rho_1, \ldots, \rho_\ell$ to 0, obtaining the system $2\bar{Q}\rho = -\bar{c}$, and solving for ρ. So far, we have assumed that the partition S was fixed. Suppose that we have computed $\bar{c}$ and $\bar{Q}$ for the partition S. We now change S into a new partition S′, by changing s(j) for a single pixel j. The only rows of A that are affected by this transition are the rows i for which $w_{ij} \neq 0$. This means that the new vector $\bar{c}'$ and matrix $\bar{Q}'$ can be computed by the following updates:

$$\bar{c}' = \bar{c} + \sum_{i:\, w_{ij} \neq 0} (c_i' - c_i) \qquad (5)$$

and

$$\bar{Q}' = \bar{Q} + \sum_{i:\, w_{ij} \neq 0} (Q_i' - Q_i). \qquad (6)$$

The fact that $\bar{c}'$ and $\bar{Q}'$ can be computed by applying updates for only a few of the terms $c_i$ and $Q_i$ respectively, means that the optimal grey values for the entire image can be recomputed efficiently. This property allows us to develop algorithms for simultaneous segmentation and grey level estimation. Any segmentation algorithm that moves pixels from one class into another class, one at a time, can efficiently keep track of the optimal grey levels. In the next section we demonstrate this approach for a simple thresholding algorithm, finding both the optimal thresholds and grey values in a single pass.
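A minimal sketch of the update step of Eqs. (5) and (6) is given below, assuming a dense NumPy array for W and 0-based class indices. The names `move_pixel` and `optimal_grey_levels` are hypothetical. Using `numpy.linalg.lstsq` for the system $2\bar{Q}\rho = -\bar{c}$ is a choice made here so that an empty class (singular $\bar{Q}$) does not raise an error.

```python
import numpy as np

def move_pixel(W, A, p, c_bar, Q_bar, j, t_old, t_new):
    """Move pixel j from class t_old to t_new; A is modified in place.

    Only rows i with w_ij != 0 are touched, as in Eqs. (5) and (6).
    Returns the updated (c_bar, Q_bar).
    """
    rows = np.nonzero(W[:, j])[0]
    a_old = A[rows].copy()
    A[rows, t_old] -= W[rows, j]
    A[rows, t_new] += W[rows, j]
    a_new = A[rows]
    # c_i = -2 p_i a_i, so sum(c_i' - c_i) = -2 (a_new - a_old)^T p[rows]
    c_bar = c_bar - 2.0 * (a_new - a_old).T @ p[rows]
    # Q_i = a_i a_i^T, so sum(Q_i' - Q_i) = a_new^T a_new - a_old^T a_old
    Q_bar = Q_bar + a_new.T @ a_new - a_old.T @ a_old
    return c_bar, Q_bar

def optimal_grey_levels(c_bar, Q_bar):
    """Solve 2 Q_bar rho = -c_bar in the least-squares sense."""
    return np.linalg.lstsq(2.0 * Q_bar, -c_bar, rcond=None)[0]
```

With a sparse matrix format the `rows` lookup would come directly from the stored column of W, which is where the efficiency described above comes from.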
3 Thresholding with Simultaneous Grey Level Estimation
The concepts in the previous section apply to any partition S of the pixels. We will now restrict ourselves to a specific type of partitions, namely those induced by a thresholding scheme. Our starting point is a grey level image $x \in \mathbb{R}^n$ that has been computed by any continuous tomographic reconstruction algorithm, e.g., Filtered Back Projection. The image x now needs to be segmented by means of thresholding, using a fixed number ℓ of classes for the pixels. Every pixel i is assigned a class according to a thresholding scheme using thresholds $\tau_1 < \tau_2 < \ldots < \tau_{\ell-1}$. Put $\tau = (\tau_1 \ldots \tau_{\ell-1})^T$. Define the threshold function by

$$s(i, \tau) = \begin{cases} 1 & (x_i < \tau_1) \\ 2 & (\tau_1 \le x_i < \tau_2) \\ \;\vdots \\ \ell & (\tau_{\ell-1} \le x_i) \end{cases} \qquad (7)$$

The threshold function induces a partition $S_\tau$ of the set $\{1, \ldots, n\}$. Define $r(\tau, \rho) = (\rho_{s(1,\tau)} \ldots \rho_{s(n,\tau)})^T$.
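The threshold function of Eq. (7) maps directly onto a binary search over the sorted thresholds. The following sketch uses Python's standard `bisect` module; the function names are illustrative only.

```python
from bisect import bisect_right

def threshold_class(x_i, taus):
    """Class index s(i, tau) in 1..l for grey value x_i (taus strictly increasing)."""
    # bisect_right counts the thresholds tau_t with tau_t <= x_i.
    return bisect_right(taus, x_i) + 1

def segmented_image(x, taus, rho):
    """The vector r(tau, rho): each pixel receives the grey level of its class."""
    return [rho[threshold_class(v, taus) - 1] for v in x]
```

A grey value equal to a threshold falls into the upper class, matching the "≤" on the left-hand side of each case in Eq. (7).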
Similar to Section 2, we define the quality of a set of thresholds τ and grey values ρ as the total squared projection distance for the corresponding segmentation. We now consider the problem of simultaneously finding thresholds and grey values, such that the total squared projection difference is minimal:

Problem 2. Let $W \in \mathbb{R}^{m \times n}$ be a given projection matrix, let $x \in \mathbb{R}^n$ be a grey scale image and let $p \in \mathbb{R}^m$ be a vector of measured projection data. Find $\tau \in \mathbb{R}^{\ell-1}$ with $\tau_1 < \ldots < \tau_{\ell-1}$ and $\rho \in \mathbb{R}^\ell$, such that $|W r(\tau, \rho) - p|^2$ is minimal.

The simplest case of Problem 2 occurs when the image x is segmented into two classes, using a single threshold τ. In that case, an exhaustive search over all possible thresholds can be performed efficiently: first, a list of all pixels is computed, sorted in ascending order of grey level in the continuous reconstruction x. The start segmentation is formed by setting the threshold τ at infinity, so all pixels will be in the segmentation class $S_1$. The threshold τ is now gradually decreased, each time moving pixels from $S_1$ to $S_2$. By using the update operation from Section 2, it is possible to keep track of the optimal grey values and the projection difference by applying only small update steps. Figure 1 shows the steps of the segmentation algorithm. For each pixel j, the time complexity of the update computation is $O(u(j)^2)$, where $u(j) = \#\{i : w_{ij} \neq 0\}$. Each pixel is moved from $S_1$ to $S_2$ only once. Therefore, the total time complexity of the algorithm is $O(U^2)$, where U denotes the total number of nonzero elements in the projection matrix W.

Make a list L containing all elements j ∈ {1, . . . , n}, sorted in ascending order of x_j;
τ := ∞; S1 := {1, . . . , n}; S2 := ∅; d_opt := ∞;
For i = 1, . . . , m: a_i1 := Σ_{j=1..n} w_ij; a_i2 := 0; compute c_i and Q_i;
Compute c̄ and Q̄;
k := n;
while k ≥ 1 do
begin
    τ := x_L(k);
    while (k ≥ 1) and (x_L(k) = τ) do
    begin
        j := L(k); S1 := S1 − {j}; S2 := S2 ∪ {j};
        for each i such that w_ij ≠ 0: update c_i, Q_i, c̄ and Q̄;
        k := k − 1;
    end
    Compute the minimizer ρ of |d|² = |p|² + c̄ᵀρ + ρᵀQ̄ρ;
    if (|d|² < d_opt) then d_opt := |d|²; ρ_opt := ρ; τ_opt := τ;
end
Fig. 1. Basic steps of the algorithm for solving Problem 2 in the case ℓ = 2
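The steps of Fig. 1 can be sketched in NumPy as follows. This is an illustrative reading of the algorithm and not the authors' implementation: W is taken as a dense array for brevity, classes are indexed 0 and 1, the function name `two_class_threshold` is made up, and the normal equations are solved with `numpy.linalg.lstsq` so that a singular $\bar{Q}$ (one class empty) is tolerated.

```python
import numpy as np

def two_class_threshold(W, x, p):
    """Exhaustive single-threshold search; returns (d_opt, rho_opt, tau_opt)."""
    m, n = W.shape
    order = np.argsort(x)                 # pixels in ascending grey value
    A = np.zeros((m, 2))
    A[:, 0] = W.sum(axis=1)               # tau = infinity: all pixels in class 1
    c_bar = -2.0 * A.T @ p
    Q_bar = A.T @ A
    p_sq = float(p @ p)
    best = (np.inf, None, None)
    k = n - 1
    while k >= 0:
        tau = x[order[k]]
        # Move every pixel whose grey value equals tau into class 2.
        while k >= 0 and x[order[k]] == tau:
            j = order[k]
            rows = np.nonzero(W[:, j])[0]
            a_old = A[rows].copy()
            A[rows, 0] -= W[rows, j]
            A[rows, 1] += W[rows, j]
            a_new = A[rows]
            c_bar = c_bar - 2.0 * (a_new - a_old).T @ p[rows]
            Q_bar = Q_bar + a_new.T @ a_new - a_old.T @ a_old
            k -= 1
        # Optimal grey levels and squared projection distance for this threshold.
        rho = np.linalg.lstsq(2.0 * Q_bar, -c_bar, rcond=None)[0]
        d_sq = p_sq + c_bar @ rho + rho @ Q_bar @ rho
        if d_sq < best[0]:
            best = (d_sq, rho, tau)
    return best
```

The returned threshold is one of the grey values of x, and pixels with $x_i \ge \tau_{opt}$ belong to the second class, in line with Eq. (7).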
3.1 More Than Two Grey Levels
If there are more than two segmentation classes, it is usually not possible to compute the total squared projection error for each possible set of candidate thresholds, due to the vast number of candidates. However, using the update operation from Section 2, it is possible to compute a local minimum of the projection error in reasonable time. A simple algorithm for this case is the following: first, determine initial thresholds $\tau^0$, possibly using another automated procedure, such as fitting Gaussian functions to the histogram. Next, compute A, $\bar{Q}$ and $\bar{c}$. In an iterative loop, compute for each threshold the effect of a small increase and a small decrease of that threshold on the total projection error. Among all these possible steps, select the one that results in the largest decrease of the total projection error. The algorithm terminates if no step can be found that decreases the total error.
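The local search described above can be sketched as a greedy loop over threshold perturbations. For clarity this sketch recomputes the projection error from scratch for every candidate instead of using the incremental updates of Section 2, so it shows the search logic only. The names `projection_error` and `refine_thresholds` and the fixed `step` size are assumptions.

```python
import numpy as np

def projection_error(W, x, p, taus):
    """Minimal squared projection distance for the partition induced by taus."""
    s = np.searchsorted(taus, x, side="right")          # 0-based class per pixel
    A = np.stack([W[:, s == t].sum(axis=1) for t in range(len(taus) + 1)], axis=1)
    rho = np.linalg.lstsq(A, p, rcond=None)[0]          # optimal grey levels
    return float(np.sum((A @ rho - p) ** 2))

def refine_thresholds(W, x, p, taus, step, max_iter=1000):
    """Greedy descent: apply the single +/- step that lowers the error most."""
    taus = np.asarray(taus, dtype=float)
    err = projection_error(W, x, p, taus)
    for _ in range(max_iter):
        best_err, best_taus = err, None
        for t in range(len(taus)):
            for delta in (-step, step):
                cand = taus.copy()
                cand[t] += delta
                if np.any(np.diff(cand) <= 0):
                    continue                            # keep thresholds ordered
                e = projection_error(W, x, p, cand)
                if e < best_err:
                    best_err, best_taus = e, cand
        if best_taus is None:
            break                                       # no step decreases the error
        taus, err = best_taus, best_err
    return taus, err
```

As with any local search, the result depends on the initial thresholds and on the step size, which is why a histogram-based initial estimate is suggested above.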
4 Results and Discussion
In order to validate our proposed threshold selection method, simulation experiments were set up. For this purpose, three phantom images of size 512×512 were constructed: a binary vessel image representing a vessel tree, a binary femur image, and a grey valued mouse leg image (see Fig 2). From these images, CT projections were simulated as follows. First, the Radon transform of the images was computed, resulting in a sinogram for which each data point represents the line integral of attenuation coefficients. Then, (noiseless) CT projection data were generated where a mono-energetic X-ray beam was assumed. The projections were then polluted with Poisson distributed noise where the number of photons per
Fig. 2. (a-c) Phantom images (vessel, femur, mouse leg); (d-f) Simulated CT reconstructions of the same three phantoms from 90 projections
Fig. 3. Thresholding results of two commonly applied thresholding methods and the proposed method for the femur image (cfr Fig. 2(b)). The numbers at the top of the figure indicate the least possible number of misclassified pixels for each case.
detector element was varied from $5 \times 10^2$ to $5 \times 10^4$. Next, the noisy sinogram of the attenuation coefficients was obtained by dividing the CT projection data by the maximum intensity and computing the negative logarithm. Finally, the simulated, noisy CT reconstructions were obtained by applying a SIRT algorithm. For each case, 10 independent noisy sinograms were generated. We refer to [9] for further details on image formation in CT and on the SIRT algorithm. The proposed tomogram thresholding technique was then compared to commonly applied thresholding methods [10]. First, a parametric optimal thresholding technique was implemented where the image histogram was fitted to a mixture probability density function (two Gaussian functions) from which an optimal threshold was derived [5]. Next, the commonly used iterative threshold selection scheme of Otsu was implemented [6]. This is also the default thresholding method used in Matlab. As a final method, k-means clustering was applied to the image histogram. For each simulated reconstruction, global thresholds were computed. Then, the number of misclassified pixels for each method, referred to as $N_m$, was compared to the number of misclassified pixels $N_{opt}$ of the ‘optimally thresholded image’. The latter image can be found by an exhaustive search over all possible threshold values and comparing the thresholded image to the original, noiseless image. From those numbers, a measure for the number of correctly classified pixels for each method is given by $R = 100 \cdot N_m / N_{opt}$, which will be referred to as the classification performance ratio. For each method, R is evaluated as a function of the number of photons per detector element employed during simulation of the CT sinograms.
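The noise simulation described above can be sketched as follows. The function name `noisy_sinogram` is hypothetical, the "maximum intensity" is taken to be the unattenuated photon count per detector element, and counts are clipped to at least one photon to avoid taking the logarithm of zero. The last two points are assumptions that the text does not state.

```python
import numpy as np

def noisy_sinogram(sinogram, photons, rng):
    """Add Poisson noise to a sinogram of attenuation line integrals."""
    expected = photons * np.exp(-sinogram)        # mono-energetic beam attenuation
    counts = rng.poisson(expected).astype(float)  # photon counting noise
    counts = np.maximum(counts, 1.0)              # assumption: avoid log(0)
    return -np.log(counts / photons)              # back to attenuation line integrals
```

Calling this with `photons` between 5e2 and 5e4 and ten different random generators would mimic the ten independent noisy sinograms per noise level mentioned above.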
Typical results of the simulation experiments are shown in Fig. 3. Since the classification performance ratio is based on Nopt , this number is denoted above the figure for each data point. The error bars around each point indicate the range of the results for the 10 independent experiments. From the figure, it is clear that the proposed projection based method outperforms conventional thresholding techniques with respect to the classification performance ratio. Similar results were obtained for the vessel and mouse-leg images. For the vessel image, the classification performance ratio of our proposed method was above 92% in all tests, whereas the best performing alternative method, k-means clustering, dropped to 80% in the test with the highest count per detector element. For all tests, the running time was around 10s on a standard desktop PC.
5 Conclusions
Global grey value thresholding is a trivial, yet often used segmentation technique. The search for the optimal grey level threshold is, however, far from trivial. Many procedures have been proposed to select the grey value threshold based on the image histogram. In our paper, we have presented an innovative approach to find the optimal threshold grey levels by exploiting the available projection data. The results on simulated CT-reconstructions show that our proposed method obtains superior results compared to established histogram-based methods.
References
1. Glasbey, C.A.: An analysis of histogram-based thresholding algorithms. Graphical Models and Image Processing 55, 532–537 (1993)
2. Seo, K.S.: Improved fully automatic liver segmentation using histogram tail threshold algorithms. vol. 3, pp. 822–825 (2005)
3. Sahoo, P.K., Arora, G.: Image thresholding using two-dimensional Tsallis-Havrda-Charvát entropy 27, 520–528 (2006)
4. Ng, H.F.: Automatic thresholding for defect detection. vol. 27, pp. 1644–1649 (2006)
5. Sonka, M., Fitzpatrick, J.M.: Handbook of Medical Image Processing and Analysis. SPIE Press (2004)
6. Otsu, N.: A threshold selection method from gray level histograms. IEEE Trans. Systems, Man and Cybernetics 9, 62–66 (1979)
7. Hou, Z., Hu, Q., Nowinski, W.: On minimum variance thresholding. 27, 1732–1743 (2007)
8. Qiao, Y., Hu, Q., Qian, G., Luo, S., Nowinski, W.L.: Thresholding based on variance and intensity contrast. 40, 596–608 (2007)
9. Kak, A.C., Slaney, M.: Principles of Computerized Tomographic Imaging. In: Volume Algorithms for reconstruction with non-diffracting sources, pp. 49–112. IEEE Press, New York, NY (1988)
10. Eichmann, M., Lüssi, M.: Efficient multilevel image thresholding. Master’s thesis, Hochschule für Technik Rapperswil, Switzerland (2005)
A Level Set Bridging Force for the Segmentation of Dendritic Spines
Karsten Rink and Klaus Tönnies
Department of Simulation and Graphics, University of Magdeburg, Germany
{karsten,klaus}@isg.cs.uni-magdeburg.de
Abstract. The paper focusses on a group of segmentation problems dealing with 3D data sets showing thin objects that appear disconnected in the data due to partial volume effects or a large spacing between neighbouring slices. We propose a modification of the speed function for the well-known level set method to bridge these discontinuities. This allows for the segmentation of the object as a whole. In this paper we are concerned with treelike structures, particularly dendrites in microscopic data sets, whose shape is unknown prior to segmentation. Using the modified speed function, our algorithm segments dendrites and their spines, even if parts of the object appear to be disconnected due to artifacts.
1 Motivation
A number of problems arise when an object in a 3D data set should be segmented. Depending on the modality there may be certain artifacts, e.g. magnetic field inhomogeneities in MR-images or metal artifacts in CT-images. An artifact common to almost all image acquisition techniques is the partial volume effect (PVE). In digital images the grey value of a pixel is the mean value of all the information that was measured in the area represented by this pixel. The result is a blurring of regions with large variances in grey values. This is especially evident at the borders of objects. It is amplified even more between neighbouring slices in data sets, as the slice thickness is often larger than the pixel spacing within a slice. Structures whose diameter is smaller than the width of a pixel will merge into the background and may even vanish entirely. When segmenting objects with a known shape, one can simply devise a model describing this shape and its variation to detect and/or segment the object within the data set. The location of small parts not visible in the data can then be estimated based on the parts of the object that have been segmented. Examples of such models are the Active Shape Model [2], Snakes [7] or Mass Spring Models [4]. With objects whose shape is very complex or not known prior to segmentation, partial volume effects may pose a big problem. It is difficult to determine the exact location of the borders of the object. Furthermore, parts of the structures that appear disconnected within or, more likely, between slices are usually not segmented at all.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 571–578, 2007.
© Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Illustration of the front propagation process: The left image shows an initial curve C and the level set function Φ at time t=0. The right image shows both functions at a later time.
Due to its implicit definition, the level set method is an appropriate means to segment objects whose shape is not known a priori or whose shape varies significantly between data sets. Therefore, we devise a solution to the problem mentioned above using a modification of the level set method. We propose a new force term for the level set speed function to bridge small gaps in the representation of objects, such as those resulting from PVE.
2 Level Sets Methods and Speed Functions
Level sets were introduced by Osher and Sethian [10] for the solution of surface motion problems. They are used to describe the propagation process of a closed curve $C \in \mathbb{R}^n$. This curve C is represented as the zero level set of a higher dimensional function $\Phi \in \mathbb{R}^{n+1}$, i.e.

$$C(t) = \{x \mid \Phi(x(t), t) = 0\}. \qquad (1)$$

C is moving in its normal direction over time. The speed of each point on C is given by a speed function F that is dependent on various internal and external forces, such as local curvature or image gradients, respectively. This leads to the level set equation

$$\Phi_t + F\,|\nabla\Phi| = 0, \qquad (2)$$
where |∇Φ| denotes the normalised gradients of the level set function and F is the speed function. The advantage of this representation is that Φ always remains a function even if C splits, merges or forms sharp corners. Also, this representation is independent of the number of dimensions of C. As Φ changes over time its zero level set Φ(x, t) = 0 always yields the propagating front, i.e. C(x) at time t. The speed function F governs the way the front is propagating. Usually F consists of a number of different forces that are combined in a meaningful way. Using level set methods for image processing applications, one can differentiate between external force terms derived from the underlying image and internal
force terms derived from the properties of the front itself. An example for a speed function of a level set function for an image processing task is given by F = F∇I (FA + Fκ ),
(3)
where FA is an advection term for expanding the front, Fκ is a force based on local curvature, usually given by Fκ = ∇
∇Φ , |∇Φ|
(4)
and $F_{\nabla I}$ is an external force based on image gradients. One way to define such a force is given by
$$F_{\nabla I}(x) = \frac{1}{1 + |\nabla G_\sigma * I(x)|}, \tag{5}$$
other approaches are described in [9] and [1]. In equation 5, Gσ ∗ I(x, y) denotes an image convolved with a Gaussian low pass filter with a standard deviation of σ. If many force terms are used, it is necessary to find meaningful weights for all forces to guarantee that the results of all forces have approximately the same order of magnitude. The definition of image-based forces is by no means limited to simple image features. Van Bemmel et al. [12] devise a force term based on the size of cross section areas and the eigenvalues of the Hessian matrix at each pixel for the segmentation of blood vessels in MRA images. The resulting speed term gives high values inside cylindrical objects and low values otherwise. Leventon et al. [8] proposed a method to use curvature and intensity profiles along the gradients of the propagating front to segment the corpus callosum and the femur in MR images. Such advanced speed terms containing model-based aspects often result in a more robust and reliable segmentation. Unfortunately, they are also often limited to the specific application they were designed for. Also, the independence of the level set method from the number of dimensions of the data is lost. A paper by Han et al. [5] introduces topology preserving level sets. This modification to the speed function prevents the propagating front from splitting or merging. In [5] this is used for the segmentation of grey matter in MR images of the human brain and the segmentation of bones in CT images. Like the speed term proposed in this paper, the modification in [5] is somewhat weaker in its contribution to the speed function, but it is consistent with the definition of level sets and not limited to a single application.
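The image-based force of equation 5 and the update implied by equation 2 can be illustrated with a short sketch. It is not the implementation used in this paper: the helper names (`gradient_magnitude`, `edge_stopping`, `level_set_step`) are hypothetical, the input image is assumed to have been smoothed with a Gaussian beforehand, and plain central differences with an explicit Euler step stand in for the upwind schemes normally used for level set propagation.

```python
import math


def gradient_magnitude(img):
    """Central-difference gradient magnitude of a 2D list of floats."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Neighbour indices are clamped at the border, so border pixels
            # use half of a one-sided difference.
            dy = (img[min(r + 1, h - 1)][c] - img[max(r - 1, 0)][c]) / 2.0
            dx = (img[r][min(c + 1, w - 1)] - img[r][max(c - 1, 0)]) / 2.0
            mag[r][c] = math.hypot(dx, dy)
    return mag


def edge_stopping(smoothed):
    """Equation 5 on an image that is assumed to be Gaussian-smoothed already."""
    return [[1.0 / (1.0 + m) for m in row] for row in gradient_magnitude(smoothed)]


def level_set_step(phi, speed, dt):
    """One explicit Euler step of equation 2: phi <- phi - dt * F * |grad phi|."""
    mag = gradient_magnitude(phi)
    return [[phi[r][c] - dt * speed[r][c] * mag[r][c]
             for c in range(len(phi[0]))]
            for r in range(len(phi))]
```

With a positive speed the values of Φ decrease, so the region Φ < 0 grows; this matches the convention used in section 3, where pixels ahead of the front satisfy Φ(y) > 0.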
3 Bridging Gaps in Data
The goal of this section is to define a speed term that allows the complete segmentation of objects, even if parts of those objects are not connected. As
Fig. 2. Image 2(a) is an example of one slice of a contrast-enhanced 3D microscopic image depicting a dendrite with various spines (see section 4 for details). The original image is convolved with a small Gaussian filter and segmented using the level set method. Image 2(b) depicts the result using a common speed function. While some spines have been segmented during this segmentation process, others have been missed. Image 2(c) shows the segmentation result using a modified speed function containing our new speed term. Almost all spines have been segmented. See the enlarged regions in the lower right corner for more details. The missing spines are barely visible (e.g. one spine marked by the white arrow in the enlarged section in image 2(c)) and cannot be segmented without either the incorporation of additional knowledge or an oversegmentation in other parts of the image.
described in section 1, such gaps may occur in medical imaging due to partial volume effects when thin objects are seemingly disconnected due to a large pixel spacing in the data set. To solve this problem we define a new speed term that allows the propagating front to “look ahead” of its current position. That is, we incorporate a term that for each pixel on the front decides if it should be moved even if the underlying image features in the data (grey values, gradients, etc.) suggest that it should be stopped. In a prior paper [11] we used the pixels along the surface normal to decide if the front should continue to propagate. This method is very fast, as the surface normal is easily derived from the level set representation. While this works well with most objects, problems occurred under certain circumstances. Because the surface normal is calculated from a discrete representation of the level set function, the approximation at places of high curvature tends to be very inexact. This is obviously a problem when dealing with thin objects. On the other hand, these objects are of special interest, as partial volume effects particularly affect the representation of thin objects within the data. The new definition of our “bridging force” doesn’t rely on the surface normal anymore. Instead, for each pixel x on the front all pixels y in a given neighbourhood with Φ(y) > 0 are checked. The best pixel is selected using various, easily computed conditions:
– y fulfills the image-based conditions of the level set speed function (i.e. the pixel would be segmented if it was connected to the object)
– the angle γ between the surface normal of x and the vector y − x is smaller than 90 degrees
– the distance between x and y is smaller than the distance between y and the neighbours of x on the propagating front.
Note that the image-based conditions (e.g. gradient length, grey values, etc.) need to be calculated only once before the start of the propagation process and are needed for the use of other image-based forces anyway. Also, instead of actually calculating γ we simply compute the scalar product n · (y − x), where n is the surface normal of the front at the position of x; this product is positive exactly when γ is smaller than 90 degrees. Furthermore, we only need to calculate this force term if the speed of the front at position x is very small. Otherwise the front is currently still moving at this position, so there cannot be any gap that has to be bridged. The speed function of the level set equation is now rewritten as
$$F(x) = \begin{cases} \alpha F_I(x)\,(F_A + \beta F_\kappa(x)), & \text{if } F > \varepsilon, \\ \alpha F_I(y)\,F_A, & \text{otherwise.} \end{cases} \tag{6}$$
$F_I$ denotes the combination of all image-based speed terms, $F_\kappa$ is the curvature-based speed term and α, β ∈ [0, 1] control the influence of image- and curvature-based forces, respectively. Using the speed function given in equation 6, the propagating front may cross areas of the image that would not be segmented using a common speed function such as the one given in equation 3. Still, the front will not leak into the background of the image if the speed function is defined appropriately, because the image-based speed terms represented by $F_I$ will prevent any propagation in these locations (see figure 4). While disconnected parts of the desired object are segmented in almost all cases, the path chosen by the front was not always the semantically correct one.
If two different parts of the front may potentially be connected to a given part of the object, it cannot be decided which connection would be the correct one (see figures 5(b) and 5(c)). In order to guarantee a correct decision, more semantic knowledge of the desired object would be required, which in turn would restrict the general use of our “bridging force” for various applications. Nevertheless, corrections of these semantically wrong “bridges” could be realised in a postprocessing step. Also, the size of the neighbourhood is important to consider. If the neighbourhood is too small, some parts of the desired object may not be found (see figure 5(a)). On the other hand, if the neighbourhood is too large, the number of unwanted bridges may increase, depending on the topology of the data.
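The three conditions and the case distinction of equation 6 can be sketched for a 2D image as follows. This is a minimal illustration, not the authors' code. The names `bridge_target` and `bridged_speed` are hypothetical, `accept` is assumed to be a precomputed boolean map of the image-based conditions, vectors are given in (row, column) order, and choosing the nearest admissible pixel as the "best" one is an assumption. Reading the condition F > ε as a test on the regular speed is likewise an assumption.

```python
import math


def bridge_target(x, normal, front_neighbours, phi, accept, radius):
    """Return the nearest pixel y that passes the three checks, or None."""
    xr, xc = x
    best, best_d = None, math.inf
    for r in range(max(0, xr - radius), min(len(phi), xr + radius + 1)):
        for c in range(max(0, xc - radius), min(len(phi[0]), xc + radius + 1)):
            # Only pixels ahead of the front that satisfy the image-based conditions.
            if (r, c) == x or phi[r][c] <= 0 or not accept[r][c]:
                continue
            vr, vc = r - xr, c - xc
            # Scalar product n . (y - x) must be positive (angle below 90 degrees).
            if normal[0] * vr + normal[1] * vc <= 0:
                continue
            d = math.hypot(vr, vc)
            if d > radius:
                continue
            # x must be strictly closer to y than every neighbouring front pixel.
            if any(math.hypot(r - nr, c - nc) <= d for nr, nc in front_neighbours):
                continue
            if d < best_d:
                best, best_d = (r, c), d
    return best


def bridged_speed(f_img_x, f_img_y, f_adv, f_curv, alpha, beta, eps):
    """Case distinction of equation 6; f_img_y is None when no target y was found."""
    regular = alpha * f_img_x * (f_adv + beta * f_curv)
    if regular > eps or f_img_y is None:
        return regular
    return alpha * f_img_y * f_adv
```

For example, a front pixel whose regular speed is zero but whose target y has an image-based term of 0.8 receives the speed α · 0.8 · F_A, so the front keeps moving across the gap.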
4 Experimental Results
The modified speed function was tested on images depicting branches of neurons, called dendrites, which have been injected with a luminescent dye. These
Fig. 3. Preprocessing steps: A slice of the original 8 bit image (figure 3(a)) is contrast-enhanced (figure 3(b)) and convolved with a 5 × 5 Gaussian filter (figure 3(c))
Fig. 4. Advantages of the bridging force: Figure 4(a) depicts a slice of the 3D segmentation of the dendrite from figure 3(a) using a common level set function as given in equation 3. In contrast, the result in figure 4(b) uses the additional bridging force introduced in section 3.
dendrites have been scanned using a confocal laser scanning microscope. We used five data sets with a minimum resolution of 512 × 512 × 112 voxels. The voxel size in all images is 0.1 μm³. A number of extensions are visible on the side of each dendrite. These dendritic spines are used for the transmission of signals between various neurons [6]. Figure 2(a) shows a slice from one of the data sets. It can be seen that there is often no visible connection between dendrites and their spines, and a segmentation using the level set function does not find these disconnected spines (see figure 2(b)). Prior to the segmentation, the original data has been contrast-enhanced to allow for a more reliable calculation of the image-based forces, and convolved with a 5 × 5 Gaussian filter kernel to reduce the image noise (see figure 3). As mentioned in section 3, the success of the segmentation depends on those image features that are used in the speed function and on the number of pixels that the bridging force is using to decide if a gap should be crossed.
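The two preprocessing steps can be sketched for a single slice as follows. The text does not state which contrast enhancement or which kernel weights were used, so both choices here are assumptions: a linear stretch to the 8 bit range and a separable 5 × 5 binomial kernel as a Gaussian approximation. The function names are hypothetical.

```python
KERNEL_1D = [1, 4, 6, 4, 1]  # binomial weights, sum 16 (assumed Gaussian approximation)


def stretch_contrast(img, lo=0.0, hi=255.0):
    """Linearly map the grey values of a 2D list onto [lo, hi]."""
    flat = [v for row in img for v in row]
    vmin, vmax = min(flat), max(flat)
    if vmax == vmin:
        return [[lo for _ in row] for row in img]
    scale = (hi - lo) / (vmax - vmin)
    return [[lo + (v - vmin) * scale for v in row] for row in img]


def _blur_1d(line):
    """Filter one row or column with the 5-tap kernel, clamping at the borders."""
    n = len(line)
    return [sum(k * line[min(max(i + o, 0), n - 1)]
                for o, k in zip(range(-2, 3), KERNEL_1D)) / 16.0
            for i in range(n)]


def smooth_5x5(img):
    """Separable 5 x 5 smoothing: horizontal pass, then vertical pass."""
    rows = [_blur_1d(row) for row in img]
    cols = [_blur_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row order
```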
Fig. 5. Drawbacks of the bridging force: Shown are three slices from the same data set. Region A is an example of a spine too similar to the background to be segmented. The bright part of the spine in figure 5(a) is too far from the dendrite to be detected. Region B shows a spine that has been detected. The level set function bridged over to the bright region from another spine, though, which is semantically not correct.
The algorithm was initialised with a single seed pixel within the dendrite. It segmented all spines with the specified image features that were either (visibly) connected to the dendrite or located within the neighbourhood specified by the bridging force. For the experiments shown in this paper we used a maximum look-ahead distance of six pixels. As shown in figure 5(a), a small number of spines were missed. Nevertheless, this was a good tradeoff, since a larger neighbourhood increased the number of semantically incorrect connections. In figure 2(c) we point out a number of hard-to-detect spines with image features similar to the background noise. Without the incorporation of additional semantic knowledge it is not possible to detect these spines. Since no ground truth for a correct segmentation exists, we compared our result with various other segmentation techniques. In figures 2(b) and 4(a) we depict results of a common level set segmentation. The results are comparable to other user-guided segmentation techniques like region growing, fast marching, image foresting transform, etc. These techniques cannot segment disconnected parts of the specified objects without incorporating a modification similar to our bridging force. As mentioned in section 1, model-based segmentation techniques are not applicable since there is no prior information on the shape of the object. Simple pixel-based techniques, e.g. thresholding methods, cannot connect disconnected spines to the dendrite, but we assume that it should be possible to achieve a correct segmentation using a Markov random field [3] with an appropriate neighbourhood configuration.
5 Conclusions
We defined a new speed term to connect parts of an object that appear disconnected in the data due to partial volume effects. We aimed at a general definition
applicable to various similar problems. This speed term was successfully applied to a number of microscopic images of neurons to segment dendrites and their dendritic spines. We found all spines given our specified image features, although not always the correct connection to the associated dendrite. We presented the advantages of our modified speed function over a common level set segmentation and also mentioned drawbacks due to the general definition of our new force term. To obtain a semantically correct segmentation, the integration of further problem-related knowledge will be necessary, although this will require a modification of our speed term adapted to the problem.
Acknowledgements. We would like to thank Andreas Herzog for the fruitful discussion on the data sets.
References
1. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic Active Contours. Int. J. Comp. Vis. 22(1), 61–79 (1997)
2. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models - Their Training and Application. Comput. Vis. Image Understand. 61(1), 38–59 (1995)
3. Geman, S., Geman, D.: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
4. Hamarneh, G., McInerney, T., Terzopoulos, D.: Deformable Organisms for Automatic Medical Image Analysis. In: Niessen, W.J., Viergever, M.A. (eds.) MICCAI 2001. LNCS, vol. 2208, pp. 66–76. Springer, Heidelberg (2001)
5. Han, X., Xu, C., Prince, J.L.: A topology preserving level set method for geometric deformable models. IEEE Trans. Pattern Anal. Mach. Intell. 25(6), 755–768 (2003)
6. Herzog, A.: Formrekonstruktion dendritischer Spines aus dreidimensionalen Mikroskopbildern unter Verwendung geometrischer Modelle. Dissertation. Otto-von-Guericke-Universität Magdeburg, Fakultät Elektrotechnik (2001)
7. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. Int. J. Comp. Vis. 1(4), 321–331 (1988)
8. Leventon, M.E., Faugeras, O., Grimson, W.E.L., Wells III, W.M.: Level Set Based Segmentation with Intensity and Curvature Priors. In: Workshop on Mathematical Methods in Biomedical Image Analysis (2000)
9. Malladi, R., Sethian, J.A., Vemuri, B.: Shape Modelling with Front Propagation: A Level Set Approach. IEEE Trans. Pattern Anal. Mach. Intell. 17(2), 158–175 (1995)
10. Osher, S., Sethian, J.A.: Fronts Propagating with Curvature Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulation. J. Comp. Phys. 79, 12–49 (1988)
11. Rink, K., Tönnies, K.: A modification of the level set speed function to bridge gaps in data. In: Franke, K., Müller, K.-R., Nickolay, B., Schäfer, R. (eds.) Pattern Recognition. LNCS, vol. 4174, pp. 152–161. Springer, Heidelberg (2006)
12. Van Bemmel, C.M., et al.: A Level-Set-Based Artery-Vein Separation in Blood-Pool Agent CE-MR Angiograms. IEEE Trans. Med. Imag. 22(10), 1224–1234 (2003)
Knowledge from Markers in Watershed Segmentation
Sébastien Lefèvre
LSIIT, CNRS / University Louis Pasteur - Strasbourg I, Parc d’Innovation, Bd Brant, BP 10413, 67412 Illkirch Cedex, France
Abstract. Due to its broad impact in many image analysis applications, the problem of image segmentation has been widely studied. However, there still does not exist any automatic segmentation procedure able to deal accurately with any kind of image. Thus semi-automatic segmentation methods may be seen as an appropriate alternative to solve the segmentation problem. Among these methods, the marker-based watershed has been successfully involved in various domains. In this algorithm, the user may locate the markers, which are used only as the initial starting positions of the regions to be segmented. We propose to base the segmentation process also on the contents of the markers through a supervised pixel classification, thus resulting in a knowledge-based watershed segmentation where the knowledge is built from the markers. Our contribution has been evaluated through some comparative tests with some state-of-the-art methods on the well-known Berkeley Segmentation Dataset.
Keywords: Marker-based Watershed. Supervised classification. Colour Segmentation.
1 Introduction
Analysis, processing and understanding of digital images often involve many different algorithms. Among them, segmentation (which consists in generating a set of meaningful regions from an input image) is certainly one of the most crucial steps, as the quality of the following procedures directly depends on the accuracy and relevance of the segmentation results. Many research works therefore address the problem of image segmentation, one of the main goals being to elaborate a method that is both automatic (without user assistance) and generic (able to deal with any kind of image). As this objective is still out of reach, the existing approaches are either automatic or generic. Semi-automatic segmentation techniques need not be dedicated to a given type of image. Indeed, the genericity property is ensured by the user intervention or setting. There are several ways the user can drive the segmentation procedure, among which we can cite the spatial initialisation of the algorithm, the definition of the class of interest in the images, and the respective influence of the different features to be involved (e.g. colour, texture, shape, etc.). The marker-based watershed [1], a widely used enhancement of the well-known watershed algorithm [2], may be driven by the user through a spatial initialisation (i.e. the location of the "markers"). Only the spatial position of the markers is used in the watershed algorithm, and the marker content is completely ignored. We propose in this paper to increase the marker-based watershed performance by using the information related to the marker content. More precisely, we use the markers as learning sets in a supervised pixel classification procedure which results in a probability map per marker. The relief necessary to the watershed algorithm is then built from these probability maps. The rest of this paper is organized as follows. In section 2 we will recall the marker-based watershed and present existing watershed segmentation methods relying on knowledge or prior information. Then we will describe the proposed solution in section 3 and illustrate its relevance through comparative tests in section 4. Finally, section 5 will be devoted to concluding remarks.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 579–586, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Marker-Based and Knowledge-Based Watershed
The watershed transform is a very popular segmentation method, as it is computationally efficient and does not require any parameter. However, it also has some drawbacks, such as sensitivity to noise and above all oversegmentation, where the result consists of a large number of irrelevant and undesired regions. To counter these limits and to increase the accuracy and relevance of the results, it is possible to consider prior knowledge. This knowledge often consists of the number and the positions of the regions through the definition of some markers, thus resulting in the marker-based watershed, which is recalled in this section. We will also present some other ways to involve knowledge in the watershed transform.
2.1 The Marker-Based Watershed
The marker-based watershed [1] is certainly the best known and most widely used enhancement of the watershed transform. Several definitions have been given in the literature, and we recall here the algorithmic definition given by Vincent and Soille [2], called simulated immersion, as it is the definition our method relies on. In this definition, the set of the catchment basins of the greyscale image function $f$ (with values in $[h_{\min}, h_{\max}]$) is equal to the set $X_{h_{\max}}$ obtained after the following recursion:
$$\begin{aligned} X_{h_{\min}} &= T_{h_{\min}}(f) \\ X_{h+1} &= \mathrm{MIN}_{h+1} \cup \mathrm{IZ}_{T_{h+1}(f)}(X_h), \qquad h_{\min} \le h < h_{\max} \end{aligned} \tag{1}$$
where $T_h$ is the threshold set at level $h$, $\mathrm{MIN}_h$ is the union of all regional minima at altitude $h$, and $\mathrm{IZ}_A(B)$ is the union of geodesic influence zones of connected components of $B$, defined with $B \subseteq A$ and the following equations:
$$\mathrm{IZ}_A(B) = \bigcup_{j=1}^{l} \mathrm{iz}_A(B_j) \tag{2}$$
$$\mathrm{iz}_A(B_j) = \{\, p \in A \mid \forall k \in [1, l] \setminus \{j\} : d_A(p, B_j) < d_A(p, B_k) \,\} \tag{3}$$
$$d_A(a, B) = \min_{b \in B} d_A(a, b) \tag{4}$$
where $d_A(a, b)$ represents the geodesic distance between $a$ and $b$ within $A$. The reader is referred to the original article [2] for a complete definition. From this definition, it is possible to impose some minima on the image function $f$ at some specific locations (i.e. the markers). Let us denote by $M$ the set of markers; we can then define a new image function $g$ as
$$g(p) = \begin{cases} h_{\min-1} & \text{if } p \in M \\ f(p) & \text{otherwise} \end{cases} \tag{5}$$
where $p$ represents pixel coordinates and $h_{\min-1}$ denotes a new value dedicated to the initial markers. The new recursion is then:
$$\begin{aligned} X_{h_{\min-1}} &= T_{h_{\min-1}}(g) \\ X_{h+1} &= \mathrm{IZ}_{T_{h+1}(g)}(X_h), \qquad h_{\min-1} \le h < h_{\max} \end{aligned} \tag{6}$$
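The geodesic influence zones of equations 2 to 4 and the marker imposition of equation 5 can be illustrated on a small grid. This is a sketch under stated assumptions rather than the algorithm of [2]. The geodesic distance is approximated by the 4-connected path length inside A, the connected components of B are assumed to be given as a label image, and the marker level is taken as h_min − 1. The function names are hypothetical.

```python
from collections import deque


def influence_zones(mask, labels):
    """mask[r][c] is True inside A; labels[r][c] = j >= 1 on component B_j, else 0.

    Returns a label image with j inside iz_A(B_j) and 0 elsewhere, including
    pixels that are equidistant from two components.
    """
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    zone = [[0] * w for _ in range(h)]
    queue = deque()
    for r in range(h):
        for c in range(w):
            if mask[r][c] and labels[r][c] > 0:
                dist[r][c], zone[r][c] = 0, labels[r][c]
                queue.append((r, c))
    # Multi-source breadth-first search: pixels are visited by increasing distance.
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < h and 0 <= nc < w) or not mask[nr][nc]:
                continue
            if dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                zone[nr][nc] = zone[r][c]
                queue.append((nr, nc))
            elif dist[nr][nc] == dist[r][c] + 1 and zone[nr][nc] != zone[r][c]:
                zone[nr][nc] = 0  # reached at equal distance from different zones
    return zone


def impose_markers(f, markers, h_min):
    """Equation 5, assuming the dedicated marker level is h_min - 1."""
    return [[h_min - 1 if markers[r][c] else f[r][c]
             for c in range(len(f[0]))]
            for r in range(len(f))]
```

On a one-row mask with components at both ends, the pixel in the middle is equidistant from both and stays unassigned, which corresponds to the strict inequality in equation 3.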
2.2 Knowledge-Based Watershed Methods
In [3], Beare considers spatial prior knowledge and introduces a way to constrain the growing of the markers through the use of structuring element-based distance functions. The proposed method is able to deal with noisy or incomplete object boundaries. Li and Hamarneh [4] consider a set of training images with expert-made segmentation to build shape histograms and appearance descriptors (mean and variance of the object pixel intensities). Their knowledge-based segmentation procedure relies on a classical watershed followed by a k-means clustering algorithm, but the shape and appearance information are not taken into account in the watershed segmentation step. In a previous work [5], we have proposed to involve knowledge as labelled pixels in the watershed segmentation of remotely sensed multispectral images. More precisely, a set of c predefined classes (e.g. building roofs, vegetation, and roads) is considered and sample pixels are given for each class. Then a supervised fuzzy pixel classification (based on spectral signatures) is used to generate probability maps gathered into a single c-band image. A morphological gradient is applied to this image and the Euclidean norm is considered to obtain a graylevel image, on which the watershed transform is finally applied. So the knowledge is only related to spectral information. To the best of the author’s knowledge, the only attempt to use the marker content as a knowledge source in the watershed segmentation was made by Grau et al. in [6], where a very restrictive assumption of a normal distribution is made for the objects in the image. Each marker or class being represented by the mean and variance of its pixel values, they consider a Bayesian framework and model local correlations between pixels through Markov Random Fields. Thus the overall marker-based watershed process is very time-consuming.
3 Proposed Method
In the definition of the marker-based watershed given previously, the set of markers M was used only to generate the initial set $X_{h_{\min-1}}$ involved in the recursive algorithm. In this paper, we propose to use not only the position of the markers but also their content.
Let us first modify the definition of the markers, considering from now on a collection $M = \{M_i\}_{1 \le i \le c}$ of $c$ markers. Each individual marker is a set of points $M_i = \{p\}_{1 \le p \le n}$, thus resulting in either one or several connected components. These points may be characterized by various features, such as intensity, colour, spectral signature, texture, etc. We associate with each marker $M_i$ a class $C_i$ and then apply a given supervised (soft or fuzzy) pixel classification, using $M_i$ as the learning set for the class $C_i$. The supervised classification procedure returns a set of probability values $\{w_i(p)\}$, where $w_i(p)$ represents the probability that a pixel $p$ belongs to the class $i$, with the constraint $\sum_{1 \le i \le c} w_i(p) = 1$. From the content of a given marker $M_i$, we have thus generated a new image $w_i$ where high values represent pixels which most probably belong to $M_i$. As the watershed paradigm considers an increasing level $h$, we define the functions $f_i = (1 - w_i) \cdot f$, where pixels with high probability $w_i$ have their relative input relief $f$ lowered whereas pixels with low probability are kept unchanged. It is also possible to consider only the probability maps (thus defining $f_i = 1 - w_i$), but we have observed poorer results in our experiments, as in this case the segmentation process relies too much on the classification step. The functions $f_i$ will be considered as the reliefs in the watershed process. The watershed algorithm (either standard or marker-based) relies on a grayscale image $f$. Here the supervised classification procedure results in a set of $c$ images $f_i$. A standard way to combine these images is to compute a given norm (such as the Euclidean norm) as in [5]. In the context of marker-based watershed segmentation, it is not necessary to merge all $f_i$ images into a single one, and we rather consider a different image $f_i$ for each marker $M_i$. So the usual algorithm from Vincent and Soille cannot be applied directly and should be adapted to our case.
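More precisely, we define the functions $g_i$ as
$$g_i(p) = \begin{cases} h_{\min-1} & \text{if } p \in M_i \\ f_i(p) & \text{otherwise} \end{cases} \tag{7}$$
and set $X = \{X^i\}_{1 \le i \le c}$, thus modifying the recursive scheme:
$$\begin{aligned} X^i_{h_{\min-1}} &= T_{h_{\min-1}}(g_i) \\ X^i_{h+1} &= \mathrm{IZ}^i_{\{T_{h+1}(g_i)\}_{1 \le i \le c}}(X_h), \qquad h_{\min-1} \le h < h_{\max} \end{aligned} \tag{8}$$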
and considering an adapted definition of the influence zones by the following equations:
$$\mathrm{IZ}^i_A(B) = \bigcup_{j=1}^{l} \mathrm{iz}^i_A(B^i_j) \tag{9}$$
$$\mathrm{iz}^i_A(B^i_j) = \mathrm{iz}_A(B^i_j) \cup \bigcup_{m=1}^{c} \{\, p \in A^i \cap A^m \mid \forall k \in [1, l] : d_{A^i}(p, B^i_j) < d_{A^m}(p, B^m_k) \,\} \tag{10}$$
In other words, each catchment basin is initially defined from a given marker and will grow relying mainly on the relief built from its related marker. Several relief functions $f_i$ will be involved only in the case of borderline pixels which could be assigned to different catchment basins.
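The construction of the reliefs $f_i = (1 - w_i) \cdot f$ and the idea of growing each basin on its own relief can be sketched as follows. This is only an illustration under simplifying assumptions. The names `reliefs` and `flood` are hypothetical, and the flooding is a plain priority-queue scheme (4-connected, lowest relief value first) instead of the recursion of equation 8 with the influence zones of equations 9 and 10.

```python
import heapq


def reliefs(f, w):
    """Build f_i = (1 - w_i) * f for every marker i; w[i] is a 2D probability map."""
    return [[[(1.0 - wi[r][c]) * f[r][c] for c in range(len(f[0]))]
             for r in range(len(f))]
            for wi in w]


def flood(fi, markers):
    """Grow labelled regions from the markers, each on its own relief.

    markers[r][c] is 0 for unlabelled pixels and i + 1 for pixels of marker M_i.
    A pixel receives the label of the first queue entry that reaches it.
    """
    h, w = len(markers), len(markers[0])
    labels = [row[:] for row in markers]
    heap = []

    def push_neighbours(r, c, lab):
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and labels[nr][nc] == 0:
                # Priority is the relief of the marker that proposes the label.
                heapq.heappush(heap, (fi[lab - 1][nr][nc], nr, nc, lab))

    for r in range(h):
        for c in range(w):
            if markers[r][c]:
                push_neighbours(r, c, markers[r][c])
    while heap:
        _, r, c, lab = heapq.heappop(heap)
        if labels[r][c] == 0:
            labels[r][c] = lab
            push_neighbours(r, c, lab)
    return labels
```

On a flat input relief, the probability maps alone decide where the two regions meet: pixels with a high probability for the first marker are cheap to flood from that marker and expensive to flood from the other one.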
4 Results
We have evaluated our method on the freely available Berkeley Segmentation Dataset [7], which contains mainly natural colour images (animals, landscape, etc.) of size 481 × 321 pixels. In order to underline the potential interest of the proposed solution against the state of the art, we have also considered two other well-known segmentation methods. First, as our method relies on a supervised classification using markers as learning sets, we have performed a segmentation directly based on the classification algorithm under consideration. Each pixel is then given the class for which it has the highest probability. Second, as our method relies on a marker-based watershed algorithm, we have compared it with the usual marker-based watershed segmentation, considering the same markers. Comparing our contribution with these two widely used approaches helps us to illustrate the interest of combining the marker-based watershed segmentation and the supervised pixel classification in a single procedure, thus benefiting from the advantages of both techniques.
Fig. 1. Comparative results (from left to right): input image with markers, classical supervised pixel classification, classical marker-based watershed segmentation, and proposed approach
Fig. 2. Comparison between [8] and our approach (from left to right): initial seeds and segmentation result for [8], markers and segmentation result for our approach
Fig. 3. Influence of the number of markers on the results (from left to right): input image with markers, classical supervised pixel classification, classical marker-based watershed segmentation, and proposed approach
Fig. 4. Segmentation results on various images (markers and results)
The proposed method may deal with any kind of image (grayscale, colour, multispectral) and may consider any kind of feature (spectral or colour signature, texture, etc.). For the sake of simplicity, we have limited ourselves here to the case of colour images where each pixel is represented by its tristimulus RGB (Red Green Blue) values. We could have used more appropriate features (texture, colour hue, etc.) but let us recall that our goal is to show the potential interest of the proposed method compared to classical marker-based watershed and supervised classification, rather than to obtain optimal segmentation results for a given dataset. However, the results given in this section are rather conclusive. Compared to other related experiments made recently on the considered dataset, for instance the seed-based approach from Micusik and Hanbury [8], our method performs well without requiring either a large number or a precise location of markers or seeds, as illustrated by figure 2. Moreover, let us notice that our method was tested here only with RGB values, whereas results from [8] benefit from more elaborate features (brightness, colour and texture).
There are only a few parameters used in the comparison process. A K-nearest neighbour algorithm is used for pixel classification (for both classification-based segmentation and our method), with the following parameters: K = 5, a distance weighting scheme, and a number of learning samples per class that is equal for all classes and less than or equal to 100. For the classical watershed algorithm, we use a morphological gradient (defined as the difference between a dilation and an erosion) with a square structuring element of size 3 × 3 pixels and a Euclidean norm in the RGB space to create the relief on which the watershed algorithm is applied. Figure 1 shows the segmentation results obtained for two different images with a common depth of field property. As we can notice on the left images, the two markers (for the object and the background) are rather small and far from the actual object edges. Thus the classical marker-based watershed is unable to segment the objects correctly. The two-class supervised classification brings interesting results, as the edges between objects are clearly visible, but it is hard to determine the appropriate object edges from the markers and the classification map. The proposed approach is far more accurate because it takes into account both the spatial positions of the markers (as initial positions of the catchment basins) and the colour content of the markers (as learning sets of the classification procedure). The number of markers is variable and in some cases it could be more relevant to use more than two markers. For instance, we compare in figure 3 the initialisation by two markers and by three markers, considering respectively only one or two markers for the background. In the latter case, using two markers for the background helps to increase the accuracy of the supervised classification procedure by considering two distinct classes instead of only one.
Moreover, we can see here the ability of the method to correctly separate the field and the wood. A complete set of segmentation results is given in figure 4. We can notice that the proposed approach often returns accurate results, except for the last two images, where the learning procedure returns poor results, thus resulting in an inaccurate segmentation using our method. The computational cost of the proposed approach is of course higher than that of the classical marker-based watershed, as it also involves a supervised pixel classification. However, it is still reasonable and has been measured at around 15 seconds with a Java-based implementation on a Pentium M 1.1 GHz / 1 GB RAM laptop.
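The two building blocks named in the parameter description, a distance-weighted K-nearest neighbour classifier on RGB values and a 3 × 3 morphological gradient combined with a Euclidean norm over the colour channels, can be sketched as follows. This is not the Java implementation used for the experiments. The function names, the inverse-distance weights and the per-channel computation of the gradient before taking the norm are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math


def knn_probabilities(pixel, samples, n_classes, k=5, eps=1e-9):
    """Distance-weighted K-NN on colour values.

    samples is a list of (rgb, class_index) pairs taken from the markers.
    Returns one weight share per class; the shares sum to 1.
    """
    nearest = sorted(samples, key=lambda s: math.dist(pixel, s[0]))[:k]
    weights = [0.0] * n_classes
    for rgb, cls in nearest:
        # Inverse-distance weighting (assumed scheme); eps avoids division by zero.
        weights[cls] += 1.0 / (math.dist(pixel, rgb) + eps)
    total = sum(weights)
    return [wt / total for wt in weights]


def morphological_gradient(img):
    """img[r][c] is an (R, G, B) tuple; returns a 2D list of gradient norms."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, h))
                      for cc in range(max(c - 1, 0), min(c + 2, w))]
            # Per-channel dilation (max) minus erosion (min) over the 3 x 3 window.
            diffs = [max(ch) - min(ch) for ch in zip(*window)]
            out[r][c] = math.sqrt(sum(d * d for d in diffs))
    return out
```

Applying `knn_probabilities` to every pixel yields one probability map per marker, which is the input required by the relief construction of section 3.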
5 Conclusion
In this paper we dealt with the problem of image segmentation, for which the marker-based watershed algorithm has been one of the most widely accepted solutions. This algorithm requires a spatial initialisation made either automatically or manually to define the initial position of the catchment basins (i.e. the markers). However, the use made of this very relevant information is rather limited. So we proposed here to use the markers as learning sets in a supervised pixel classification procedure. Thus we obtain, for each marker or class, the probability for each pixel to belong to the given class through a probability map. It is then possible to build from this map the relief necessary to the marker-based watershed algorithm, and to
make use of both the spatial positions and the contents of the markers. We made some conclusive tests on the Berkeley segmentation dataset. In order to better evaluate the segmentation results returned by our method, we plan to measure their relevance and accuracy using the user-made reference results from the Berkeley segmentation dataset [7]. Moreover, we will involve other features, related either to texture [9,10] or colour [11], and other classifiers (e.g. support vector machines). The robustness of the method to the initial location of the markers should also be evaluated in order to determine the minimal marker size. A major improvement of the method could finally be achieved by following the idea of Micusik and Hanbury, who have built a completely automatic segmentation solution [12] by iterating their semi-automatic segmentation method [8], thus removing the need for manual setting of the markers.
References
1. Rivest, J., Beucher, S., Delhomme, J.: Marker-controlled segmentation: an application to electrical borehole imaging. Journal of Electronic Imaging 1(2), 136–142 (1992)
2. Vincent, L., Soille, P.: Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(6), 583–598 (1991)
3. Beare, R.: A locally constrained watershed transform. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7), 1063–1074 (2006)
4. Li, X., Hamarneh, G.: Modeling prior shape and appearance knowledge in watershed segmentation. In: Canadian Conference on Computer Vision (2005)
5. Derivaux, S., Lefèvre, S., Wemmert, C., Korczak, J.: Watershed segmentation of remotely sensed images based on a supervised fuzzy pixel classification. In: IEEE International Geosciences And Remote Sensing Symposium, Denver, USA (July 2006)
6. Grau, V., Mewes, A., Alcaniz, M., Kikinis, R., Warfield, S.: Improved watershed transform for medical image segmentation using prior information. IEEE Transactions on Medical Imaging 23(4), 447–458 (2004)
7. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: IEEE International Conference on Computer Vision, vol. 2, pp. 416–423 (July 2001)
8. Micusik, B., Hanbury, A.: Steerable semi-automatic segmentation of textured images. In: Scandinavian Conference on Image Analysis (2005)
9. Aptoula, E., Lefèvre, S.: Spatial morphological covariance applied to texture classification. In: Gunsel, B., Jain, A.K., Tekalp, A.M., Sankur, B. (eds.) MRCS 2006. LNCS, vol. 4105, pp. 522–529. Springer, Heidelberg (2006)
10. Lefèvre, S.: Extending morphological signatures for visual pattern recognition. In: IAPR International Workshop on Pattern Recognition in Information Systems (June 2007)
11. Aptoula, E., Lefèvre, S.: A comparative study on multivariate mathematical morphology. Pattern Recognition (2007) (to appear) doi:10.1016/j.patcog.2007.02.004
12. Micusik, B., Hanbury, A.: Automatic image segmentation by positioning a seed. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 468–480. Springer, Heidelberg (2006)
Image Segmentation Using Topological Persistence David Letscher and Jason Fritts Saint Louis University Department of Mathematics and Computer Science {letscher, jfritts}@slu.edu
Abstract. This paper presents a new hybrid split-and-merge image segmentation method based on computational geometry and topology using persistent homology. The algorithm uses edge-directed topology to initially split the image into a set of regions based on the Delaunay triangulation of the points in the edge map. Persistent homology is used to generate three types of regions: p-persistent regions, p-transient regions, and d-triangles. The p-persistent regions correspond to core objects in the image, while p-transient regions and d-triangles are smaller regions that may be combined in the merge phase, either with p-persistent regions to refine the core or with other p-transient and d-triangle regions to potentially form new core objects. Performing image segmentation based on topology and persistent homology guarantees several nice properties, and initial results demonstrate high quality image segmentation.
1 Introduction
Image segmentation algorithms classically fall into one of two classes: region-based methods or edge-based methods. Region-based methods typically employ clustering, region growing, or split-and-merge methods to construct regions based on feature characteristics of the region(s). Conversely, edge-based methods use edge detection to find edges within the image, and then use some method to link the edges to form boundaries that define regions. Both classes of segmentation methods have their advantages and disadvantages, so research has also delved into hybrid methods that combine region- and edge-based segmentation [6]. This paper presents a new hybrid approach to image segmentation based on computational topology. The method is fundamentally based on the split-and-merge paradigm, using edge-directed topology to initially split the image into a set of regions, and then using region-based merging to combine select regions into the final segmentation. The algorithm uses a three-step approach:
1) Perform edge detection.
2) Split the image into regions using edge-directed computational topology, based on persistent homology.
3) Merge select regions with similar features in order of topological persistence.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 587–595, 2007. © Springer-Verlag Berlin Heidelberg 2007
More specifically, the algorithm first uses edge detection to produce an edge map, which serves as the guide for topological region splitting. Given this edge map, edges are linked to form regions based on the theory of persistent homology. The result of this topological splitting is an initial region segmentation that contains three types of regions: p-persistent regions, p-transient regions, and d-triangle regions. The p-persistent regions are larger regions corresponding to core objects in the image. The p-transient and d-triangle regions are smaller regions of the image, which correspond to either additional sections of the existing core objects or sub-sections of new core objects in the image. The p-transient and d-triangle regions are merged, in order of increasing persistence, either with p-persistent regions to further refine the existing core objects or with each other to potentially define new core objects in the final segmentation. The primary contribution of this paper is the novel use of persistent homology-based topology in the second step, and the corresponding order it imposes on region merging in the third step. Any edge detection method could be used in step one, and any feature characteristics could be used for region merging in step three. The second step is similar to the algorithm in [11]. However, the third step adds topological guarantees for the completed segmentation and incorporates region information to improve the quality of the final result. In the remainder of the paper, we discuss the proposed method in greater detail, with emphasis on the details of the topological region splitting and merge ordering in steps two and three, respectively. Section 2 presents the pertinent background in computational topology and persistent homology. Section 3 presents the proposed edge-directed topological segmentation algorithm. Section 4 presents preliminary results for this new algorithm. 
Section 5 presents related work, and Section 6 concludes the paper and discusses future directions.
2 Background in Topology
Since the key aspect of the proposed image segmentation algorithm is our approach to region splitting and merging based on computational topology, we first present the necessary background on topology and persistent homology. We present these topological concepts as they pertain to images and image segmentation, so we assume the existence of an image I with points in $\mathbb{R}^2$.
2.1 Alpha Complexes and Delaunay Triangulations
Suppose we have a set of points X = {x1, ..., xn} in I that have been identified by some edge detection algorithm. For any positive radius α, we can consider the set of all disks of radius α with centers in X (see Fig. 1). As α grows, larger regions of the image are covered and disks begin overlapping. The centers of the overlapping disks can be connected to form edges and triangles. The pattern in which these disks intersect defines a triangulation, with the vertices of the triangulation corresponding to the points in X. An edge in the triangulation is formed when two disks meet in a point that is not in the interior of another disk. Similarly, a triangle is formed when three disks meet in a single
Fig. 1. α-complexes for α = 1.2, 1.6, and 2.0, and the full Delaunay triangulation that is formed at α ≥ 4.6
point that is not in the interior of another disk. The triangulation constructed in this way is called the Delaunay triangulation and has many nice properties [2]. This triangulation can be found in O(n log n) time in two dimensions [2,5,7]. Figure 1 illustrates the alpha complexes for a set of points with α = 1.2, 1.6, 2.0, and 4.6. As visible in the figure, at α = 1.2, the disks are of radius 1.2 and fourteen pairs of disks intersect to form 14 edges. Some of these edges are connected with other edges, forming four connected components. When the radius α grows to 1.6, more edges are formed, and all of the edges are connected into a single connected component. Additionally, three triplets of disks intersect to form three Delaunay triangles. As the radius further grows to α = 2.0, 5 additional edges and 2 additional triangles are formed. Finally, when α grows to α = 4.6, the full Delaunay triangulation over all the vertices is formed. For a given value of α, the set of vertices, edges, and triangles defines the subcomplex, Kα. This complex accurately represents the topology generated by the union of disks with radius α centered at each vertex [3]. An important characteristic of complexes is that, for two radii α and α′ such that α < α′, the subcomplex of α is a subset of the subcomplex of α′, i.e. Kα ⊂ Kα′. Any sequence of complexes in which each one is a subset of the next is referred to as a filtration. At sufficiently large α, the complex becomes a full triangulation, in which all regions in the graph over X are covered by Delaunay triangles, as demonstrated in Fig. 1 for α ≥ 4.6. This complex is a superset of all other subcomplexes in X. The edges and triangles in Kα piece the points in X into a cohesive union. Recalling that we are defining X as the set of edge points generated from performing edge detection on I, one problem that occurs is that small regions may be left in the complement of the subcomplex. These regions would disappear for slightly larger α. 
However, increasing α to remove these regions may create new small regions. To counter this problem and effectively deal with small regions in the complement of the complex, we turn to persistent homology.
2.2 Persistent Homology
In algebraic topology, homology provides a way to measure the “complexity” of a space. In the study of complexes inside the plane we are concerned with the first homology group, H1 (K), and its corresponding Betti number, β1 . Fortunately, in this application, β1 can be readily defined without in-depth knowledge of algebraic topology. For a connected graph where V and E are the number of vertices and edges in a graph, respectively, the Betti number is defined as β1 = E − V + 1. If
we have a triangulated subcomplex in the plane, $\beta_1 = C - V + E - F$, where C is the number of connected components and F the number of faces (or triangles) in the complex. The full definitions are more complex (see [9] for a full treatment), but in our situation the theory reduces to the formulae above. As mentioned in the previous section, as α increases, the resulting sequence of subcomplexes is known as a filtration. As the triangulations change, the Betti number, $\beta_1(K_\alpha)$, for each subcomplex $K_\alpha$ in the filtration will change with α. As α increases, $\beta_1$ will decrease when a hole is filled in (i.e. a triangle is created), and may increase when a new edge is created (see Fig. 2). In other words, when a hole is filled, F increases by one. Conversely, when an edge forms, the number of edges E increases by one, and the number of connected components C may decrease by one (if the edge connects two previously separate components). Related to this, persistence refers to how long regions in the complement of $K_\alpha$ last until they are contained in $K_{\alpha+p}$ for some p > 0. The p-persistent Betti number of $K_\alpha$, $\beta_1^p(K_\alpha)$, is $\beta_1(K_\alpha)$ minus the number of regions in the complement of $K_\alpha$ that are contained in $K_{\alpha+p}$. In spirit, it is the Betti number of the complex obtained from $K_\alpha$ with “small” holes filled in. For formal definitions in a more general setting, and algorithms for their calculation, see [4]. Consider two subcomplexes in a filtration, $K_\alpha$ and $K_{\alpha+p}$, where p > 0. Recall that this means that $K_\alpha \subset K_{\alpha+p}$, so $K_{\alpha+p}$ contains the same vertices, edges, and triangles as $K_\alpha$, but may also contain new edges and triangles. Holes in $K_\alpha$ that are still holes in $K_{\alpha+p}$ are regions that are persistent in p. We shall refer to these regions as p-persistent regions. Holes in $K_\alpha$ that are filled in between $K_\alpha$ and $K_{\alpha+p}$ are regions that are not persistent in p. We shall refer to these as p-transient regions. 
And finally, the individual triangles that were already filled in $K_\alpha$ will be referred to as d-triangles. For subcomplexes of the plane, loops that do not persist bound regions that disappear completely in $K_{\alpha+p}$. An equivalent definition to the more general one is $\beta_1^p(K_\alpha) = C - V + E - F - R$, where R is the number of regions in the complement of $K_\alpha$ that are contained in $K_{\alpha+p}$. Note that this is not necessarily true in higher dimensions.
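The two counting formulae of this section can be checked on small examples. The following is a minimal sketch, not the authors' implementation: it assumes the complex is given as a vertex count, a list of edges, and a list of filled triangles whose sides are already in the edge list. The function names and the `filled_regions` argument (standing for the count R) are hypothetical.

```python
def count_components(num_vertices, edges):
    """Number of connected components C of the graph, via union-find."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = num_vertices
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return components


def betti_1(num_vertices, edges, triangles):
    """beta_1 = C - V + E - F for a triangulated subcomplex of the plane."""
    c = count_components(num_vertices, edges)
    return c - num_vertices + len(edges) - len(triangles)


def persistent_betti_1(num_vertices, edges, triangles, filled_regions):
    """beta_1^p = C - V + E - F - R, with R (filled_regions) supplied by the caller."""
    return betti_1(num_vertices, edges, triangles) - filled_regions


# A square with one diagonal: two triangular holes until a triangle is filled in.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(betti_1(4, square, []))                           # 1
print(betti_1(4, square + [(0, 2)], []))                # 2
print(betti_1(4, square + [(0, 2)], [(0, 1, 2)]))       # 1
print(persistent_betti_1(4, square + [(0, 2)], [], 1))  # 1
```

The same arithmetic applies to counts read off a figure: for C = 4, V = 18, E = 14 and F = 0, the formula gives 4 − 18 + 14 − 0 = 0.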
3 Segmentation Algorithm
The proposed segmentation algorithm is a hybrid split-and-merge segmentation algorithm that employs computational topology and persistent homology for image splitting and region feature characteristics for region merging. The algorithm first employs some edge detection algorithm, such as the Canny algorithm [1] or the wavelet-based detector that we used [8], to determine the set of edge points, X, in the image. Topological splitting is then performed over X to generate an initial segmentation with three types of regions: p-persistent regions, p-transient regions, and d-triangles. The topological splitting is controlled by two parameters, α and p, where α defines the radius of the disks and p indicates the persistence. In the final step, the algorithm merges the p-transient and
[Figure 2 residue. Axis labels: α, p, β1, β1^p. Annotations of the six complexes shown:
C = 4, V = 18, E = 14, F = 0, β1 = 0
C = 1, V = 18, E = 22, F = 0, β1 = 5
C = 1, V = 18, E = 27, F = 5, β1 = 5
C = 1, V = 18, E = 35, F = 10, β1 = 8
C = 1, V = 18, E = 37, F = 13, β1 = 7
C = 1, V = 18, E = 38, F = 18, β1 = 3]
Fig. 2. This figure demonstrates various complexes in a filtration, including their Betti numbers and persistent Betti numbers for each Kα and Kα+p
d-triangle regions with either the p-persistent regions or each other to generate the final segmentation. As discussed in the previous section, the p-persistent regions represent larger regions that remain (are not filled in) in $K_{\alpha+p}$. Conversely, p-transient regions are smaller regions that are filled in between $K_\alpha$ and $K_{\alpha+p}$, while d-triangles are the smallest regions, which already existed as triangles in $K_\alpha$. The d-triangles are akin to “thickened” edges separating regions in the initial segmentation. The p-persistent regions are crucial in that they define the core objects in the initial segmentation. There are two reasons to believe these regions are important to the segmentation: they are completely surrounded by edges, and they all contain a disk of radius α + p in their interiors. These initial regions can then be expanded upon to create a full segmentation. We perform this process in such a way that the persistent homology is respected. One approach would be to use the simplification algorithms presented in [4] to expand these regions and preserve this homological information. However, this method uses only the edge information, essentially expanding the regions so that the region boundaries consist of the shortest possible edges. This approach has the significant drawback that it does not incorporate any region information. An alternative strategy for expanding the persistent regions to a full segmentation is to merge neighboring regions based on their feature characteristics. In our case, we simply used the average color value for each region, but any desired feature vector could be used. Homological persistence dictates the merge order of the regions, with regions being ordered according to the disk radius α for which the triangle was formed, starting from the smallest. In order to preserve the topology of the segmentation, regions will only be merged if the persistent Betti number remains unchanged.
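The merge strategy described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: regions are assumed to arrive as a dictionary with a kind, a formation radius `alpha`, a mean colour and a pixel count, and the topological test is delegated to a hypothetical predicate `keeps_betti`, which a real implementation would replace with a recomputation of the persistent Betti number. All names are invented for the example.

```python
import math


def merge_in_persistence_order(regions, adjacency, keeps_betti=lambda a, b: True):
    """Merge non-persistent regions into their most similar neighbour.

    regions:   {id: {"kind": str, "alpha": float, "color": tuple, "size": int}}
    adjacency: {id: set of neighbouring region ids}
    keeps_betti(a, b): hypothetical predicate, True if merging segments a and b
                       leaves the persistent Betti number unchanged.
    Returns {region id: id of the segment it ends up in}.
    """
    parent = {r: r for r in regions}
    mean = {r: tuple(regions[r]["color"]) for r in regions}
    size = {r: regions[r]["size"] for r in regions}

    def find(r):
        while parent[r] != r:
            r = parent[r]
        return r

    # p-transient regions and d-triangles, smallest alpha first.
    queue = sorted(
        (r for r in regions if regions[r]["kind"] != "persistent"),
        key=lambda r: regions[r]["alpha"],
    )
    for sigma in queue:
        s = find(sigma)
        candidates = sorted({find(n) for n in adjacency[sigma]} - {s})
        candidates = [c for c in candidates if keeps_betti(s, c)]
        if not candidates:
            continue  # no admissible neighbour: sigma stays a segment of its own
        best = min(candidates, key=lambda c: math.dist(mean[s], mean[c]))
        total = size[s] + size[best]
        # Size-weighted update of the mean colour of the absorbing segment.
        mean[best] = tuple(
            (size[s] * a + size[best] * b) / total
            for a, b in zip(mean[s], mean[best])
        )
        size[best] = total
        parent[s] = best
    return {r: find(r) for r in regions}


regions = {
    0: {"kind": "persistent", "alpha": 0.0, "color": (200, 0, 0), "size": 100},
    1: {"kind": "persistent", "alpha": 0.0, "color": (0, 0, 200), "size": 100},
    2: {"kind": "transient", "alpha": 1.0, "color": (190, 10, 0), "size": 10},
    3: {"kind": "triangle", "alpha": 0.5, "color": (10, 0, 190), "size": 2},
}
adjacency = {0: {2}, 1: {2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(merge_in_persistence_order(regions, adjacency))  # {0: 0, 1: 1, 2: 0, 3: 1}
```

In this toy input the d-triangle (region 3) is processed first and joins the blue core region, after which the p-transient region 2 joins the red one, so the two p-persistent regions remain separate segments.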
Algorithm 1. Algorithm for edge-directed topological image segmentation
Identify set of edges, X, in the image.
Find the Delaunay triangulation of X.
Calculate $\beta_1^p(K_\alpha)$ and identify regions in $\mathbb{R}^2 - K_\alpha$ as p-persistent, p-transient, or d-triangle regions.
Let F be the set of regions ordered by increasing α values.
while (F is not empty) do
    Let σ be the first face in F, and let F = F − {σ}.
    Merge σ with the most similar adjacent region provided that $\beta_1^p$(merged regions) = $\beta_1^p(K_\alpha)$
end while
4 Results
Figures 3 and 4 demonstrate the results of the algorithm: Figure 3 shows the successive stages of the process for a sample image, and Figure 4 shows the final segmentations of several separate images. As is evident, each of the resulting segmentations effectively distinguishes the critical objects in the image. Figure 5 demonstrates how the choice of α and p affects the resulting segmentation. For small values of α, large regions are not separated from each other in the initial stages of the algorithm. For fixed α, larger values of p remove small regions from consideration.
Fig. 3. Major steps in the algorithm: (a) original image (b) complement of the α complex for α = 4, p = 0. (c) the persistent regions at α = 4, p = 3 (d) final segmentation.
Fig. 4. Example segmentations of a stop sign, a rose, and a baseball player
Number of regions in each panel of Fig. 5:
         p = 0        p = 2        p = 4
α = 2    23 regions   22 regions   13 regions
α = 4    47 regions   34 regions   21 regions
α = 6    43 regions   28 regions   22 regions
Fig. 5. Effect of the choice of parameters on the resulting segmentation
The segmentation algorithm has several nice guarantees. In particular, “large” regions in the complement of the edges will not be broken into separate regions. Specifically, any ball of radius p in the complement of the edges will not be divided between multiple segments. The choice of α will affect how well the points identified by the edge detection fill out to complete the boundary of regions. In particular, if the points identified as edges have neighbors within a distance of α/2 in each direction, then the boundaries of regions will be correctly reconstructed.
5 Related Work
An early image segmentation method using Delaunay triangulations was presented in [6]. Their method is distinct in that they do not use persistent homology, but instead start with a full Delaunay triangulation and generate the core objects in the final segmentation solely through region merging. A more recent image segmentation method based on Delaunay triangulation is presented in [11]. They use similar criteria for doing their initial segmentation. However, whereas we use Delaunay triangulation for defining different types of regions and then perform region merging to generate the final segmentation, they use Delaunay triangulation for defining boundaries, and use thinning to eliminate undesired edges in the final segmentation.
Another recent image segmentation method based on Delaunay triangulation is presented in [10]. They generate the full Delaunay triangulation first, followed by region merging. They merge regions by extracting a skeleton of the segmented image from the connected components graph of the triangulations. Each separate connected component in the skeleton defines one region in the final segmentation.
6 Conclusion and Future Directions
We have presented an image segmentation algorithm that uses both edge and region information and uses techniques from computational topology. The segmentations obtained clearly delineate key objects in the images and are good initial indicators of the effectiveness of the method. There are several directions for follow-up work on this algorithm. These include studying different orderings for merging the d-triangles and p-transient regions, and using alternate feature characteristics for region merging (as opposed to just average region color value). Significant improvement could also potentially be achieved by using machine learning methods to help identify appropriate values of α and p according to the characteristics of the image and edge features. Another improvement is to utilize more information from the edge detection algorithms. The current algorithm only uses the magnitude of the estimated gradient. The direction of the gradient can also be incorporated in finding an anisotropic Delaunay triangulation. These triangulations can better represent the regions divided by the edge information.
References
1. Canny, J.: A Computational Approach To Edge Detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679–714 (1986)
2. Edelsbrunner, H.: Algorithms in Combinatorial Geometry. Springer, New York (1987)
3. Edelsbrunner, H.: The Union of Balls and its Dual Shape. In: Proceedings of the Ninth Annual Symposium on Computational Geometry, pp. 218–231 (1993)
4. Edelsbrunner, H., Letscher, D., Zomorodian, A.: Topological Persistence and Simplification. Discrete and Computational Geometry 28, 511–533 (2002)
5. Fortune, S.: A Sweepline Algorithm for Voronoi Diagrams. Algorithmica 2, 153–174 (1987)
6. Gevers, T., Smeulders, A.W.M.: Combining Region Splitting and Edge Detection through Guided Delaunay Image Subdivision. In: Proc. of the 1997 International Conference on Computer Vision and Pattern Recognition, pp. 1021–1026 (1997)
7. Guibas, L., Knuth, D., Sharir, M.: Randomized Incremental Construction of Delaunay and Voronoi Diagrams. Algorithmica 7, 381–413 (1992)
8. Mallat, S., Zhong, S.: Characterization of signals from Multiscale Edges. IEEE Trans. Patt. Anal. and Mach. Intell. 14, 710–732 (1992)
9. Massey, W.: A Basic Course in Algebraic Topology. Springer, Heidelberg (1991)
10. Prasad, L., Skourikhine, A.N.: Vectorized Image Segmentation via Trixel Agglomeration. In: Brun, L., Vento, M. (eds.) GbRPR 2005. LNCS, vol. 3434, pp. 12–22. Springer, Heidelberg (2005)
11. Stelldinger, P., Ullrich, K., Meine, H.: Topologically Correct Image Segmentation Using Alpha Shapes. In: Kuba, A., Nyúl, L.G., Palágyi, K. (eds.) DGCI 2006. LNCS, vol. 4245, pp. 542–554. Springer, Heidelberg (2006)
Image Modeling and Segmentation Using Incremental Bayesian Mixture Models Constantinos Constantinopoulos and Aristidis Likas Department of Computer Science, University of Ioannina, GR 45110 Ioannina, Greece
[email protected], [email protected]
Abstract. Many image modeling and segmentation problems have been tackled using Gaussian Mixture Models (GMM). The two most important issues in image modeling using GMMs are the selection of the appropriate low level features and the specification of the appropriate number of GMM components. In this work we deal with the second issue and present an approach for GMM-based image modeling employing an incremental variational algorithm for Bayesian GMM training that automatically specifies the number of mixture components. Experimental results on natural and texture images indicate that the method yields reasonable models without requiring the a priori specification of the number of components.
1 Introduction
Several approaches have been proposed for statistical image representation, i.e. the modeling of an image based on the distribution of various features at pixel or window level. Simpler approaches focus on the use of histograms, while more sophisticated statistical modeling tools such as mixture models have also been considered [1]. More specifically, the assumption under the GMM framework is that an image is considered as a set of regions (segments) where each region is represented by a Gaussian distribution and the set of all regions in an image is represented by a GMM. To build the GMM for an image, first a dataset $X = \{f_n\}$ is constructed that contains one feature vector $f_n \in \mathbb{R}^d$ either for each image pixel n (as is the case in this work) or for appropriately selected image windows (patches). Typical low level features used are related to color and texture. Then an efficient training method is applied to the dataset X in order to obtain the final GMM for the image. Let g be a mixture with M Gaussian components
$$g(f) = \sum_{j=1}^{M} \pi_j \, \mathcal{N}(f; \mu_j, T_j) \qquad (1)$$
where $\pi = \{\pi_j\}$ are the mixing coefficients (priors), $\mu = \{\mu_j\}$ the means (centers) of the components, and $T = \{T_j\}$ the precision (inverse covariance) matrices.
This work was partially supported by Interreg IIIA (Greece-Italy) grant I2101005.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 596–603, 2007. © Springer-Verlag Berlin Heidelberg 2007
After training, it is possible to segment the image, i.e. to assign a group of pixels to each GMM component, by finding for the feature vector f of each pixel the component j with maximum posterior (i.e. with maximum $\pi_j \mathcal{N}(f; \mu_j, T_j)$). An issue to be considered is how to impose the requirement for spatial smoothness, i.e. that in most cases neighboring pixels should be assigned to the same component (cluster). Two approaches have been proposed. The first imposes an MRF prior on the posteriors [2]. The MRF approach is more difficult to handle from the learning point of view, and requires all image pixels to be included in the training set. It provides as outcome the posterior probabilities for each pixel. In addition, no effective method has been proposed for automatically determining the number of GMM components in the MRF framework. In this work we focus on the second way to impose spatial smoothness [1]. According to this approach, the spatial location (x, y) of a pixel is considered as an additional feature to be included in the feature vector f describing this pixel, for example $f_n = (L, a, b, x, y)_n$ for the n-th pixel with image coordinates (x, y) and color vector (L, a, b). Thus the spatial distance between two pixels contributes significantly to the total distance between the corresponding feature vectors. In this way, adjacent pixels tend to have similar cluster labels since their spatial distance is small, and spatial smoothing is achieved. On the other hand, this approach forces large image areas (with approximately the same color) that could be considered as one segment to be split into smaller segments, because spatially distant pixels cause the distance between the corresponding feature vectors to be large, so that they cannot be modeled by a single Gaussian component. However, this fragmentation problem can be easily resolved through a simple postprocessing stage that merges image regions corresponding to GMM components with similar mean feature values. 
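The postprocessing stage mentioned above is not specified further in the text, so the following is only one possible sketch of it. It assumes that component means are stored as (L, a, b, x, y) tuples, that similarity is measured on the colour entries alone, and that a user-chosen `threshold` decides when two components are merged; the function name, the threshold, and the transitive (single-linkage) grouping are all assumptions made for illustration.

```python
import math


def merge_similar_components(means, labels, threshold, color_dims=3):
    """Relabel pixels so that components with close mean colour share a label.

    means:      list of component means, e.g. (L, a, b, x, y)
    labels:     list of component indices, one per pixel
    threshold:  hypothetical maximum colour distance for two components to merge
    color_dims: number of leading entries of each mean that hold colour
    """
    parent = list(range(len(means)))

    def find(j):
        while parent[j] != j:
            parent[j] = parent[parent[j]]  # path halving
            j = parent[j]
        return j

    # Link every pair of components whose mean colours are close enough.
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            d = math.dist(means[i][:color_dims], means[j][:color_dims])
            if d <= threshold:
                parent[find(j)] = find(i)
    return [find(j) for j in labels]


means = [(50, 10, 10, 5, 5), (80, -20, 30, 40, 10), (51, 11, 9, 60, 60)]
labels = [0, 0, 1, 2, 2, 1]
print(merge_similar_components(means, labels, threshold=5))  # [0, 0, 1, 0, 0, 1]
```

Here components 0 and 2 have nearly the same mean colour but distant spatial centres, so their pixels end up with a common label while component 1 is left untouched.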
In addition, to further enforce spatial smoothness, it is possible to apply the MRF-based GMM model afterwards, starting from the GMM solution provided by the second approach. The advantage of using the second approach for spatial smoothness is that it results in a GMM that takes as input the image location (x, y) and can assign feature labels (for example color) to every location (x, y), independent of whether this pixel has been used for training or not. This has three interesting consequences. First, it is easy to obtain a model for a specific region of the image by considering only the mixture components that are active in this region. For example, we can easily derive a mixture model for the color density in an arbitrary image region. Second, it is straightforward to marginalize the (x, y) coordinates and obtain a GMM for the distribution of the low level features. Finally, it is easy to assign segment labels to image locations (x, y) not used for training, simply by computing the component j with highest posterior. This allows the mixture model to be trained using only a subset of the pixels. Despite those advantages, a major difficulty is introduced that relates to the specification of the number of mixture components. For example, it is possible that an image with four colors cannot be modeled using a GMM with four components, especially in the case where the image segments with the same color have large size or are disconnected. In this case the GMM requires more
components to accurately model the image. Therefore it is difficult to specify in advance the number of GMM components, thus it is essential to use GMM training methods that incorporate a built-in mechanism for automatically assessing the number of components. This is the case with our method [3] used in this work that is described next.
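The maximum-posterior labelling used for segmentation in this section can be sketched in a few lines. The sketch below is an illustration only: it assumes diagonal precision matrices, stored as tuples of diagonal entries, which is a simplification of the full matrices $T_j$ of equation (1), and all names and numerical values are hypothetical. Scores are compared in the log domain to avoid underflow.

```python
import math


def log_gaussian_diag(f, mean, precision):
    """log N(f; mean, T) for a diagonal precision T given by its diagonal."""
    d = len(f)
    log_det = sum(math.log(t) for t in precision)
    quad = sum(t * (x - m) ** 2 for x, m, t in zip(f, mean, precision))
    return 0.5 * (log_det - d * math.log(2.0 * math.pi) - quad)


def assign_component(f, weights, means, precisions):
    """Index j maximising pi_j * N(f; mu_j, T_j), computed via logarithms."""
    scores = [
        math.log(w) + log_gaussian_diag(f, m, t)
        for w, m, t in zip(weights, means, precisions)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Two hypothetical components over features (L, a, b, x, y).
weights = [0.6, 0.4]
means = [(50.0, 0.0, 0.0, 10.0, 10.0), (80.0, 20.0, 20.0, 40.0, 40.0)]
precisions = [(0.01, 0.01, 0.01, 0.01, 0.01)] * 2

print(assign_component((52.0, 1.0, -1.0, 12.0, 9.0), weights, means, precisions))    # 0
print(assign_component((78.0, 18.0, 22.0, 38.0, 41.0), weights, means, precisions))  # 1
```

Because the rule only needs a feature vector and the trained parameters, it can be applied to any image location, including pixels that were left out of the training subset.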
2 The Incremental Variational Bayesian Method
In this section we describe a Bayesian method for Gaussian mixture learning [3] that is deterministic, does not depend on the initialization, and adequately resolves the model selection problem, i.e. the specification of the number of components. The method is an incremental one: it starts with one component and progressively adds components to the model. The procedure for component addition is based on a splitting test applied to each of the existing mixture components. According to this test, a component is split into two sub-components and then variational Bayesian learning is applied to the specific pair of components, while the remaining components stay “fixed”. To apply this method, it is necessary to define a Bayesian Gaussian mixture model by imposing priors on the parameters $\pi$, $\mu_j$ and $T_j$. The effect of variational Bayesian learning is that a competition takes place between the two sub-components. If the data distribution in the region of the split component strongly suggests the existence of more than one cluster, then both sub-components will “survive” and the number of model components will be increased. Otherwise, the competition between the two components will cause one of them to be eliminated and the initial component will be recovered. This strategy of incremental component addition also facilitates the specification of the parameters of the priors, since it can be based on the parameters of the component to be split. In order to apply this idea, a modification of the typical Bayesian mixture model [4] is required, which is described in the graphical model of Figure 1 and is explained next. Assume that we wish to restrict the competition to a subset containing s of the GMM components while the other M − s components are “fixed”. The proposed modification is to impose a prior only on the M − s “fixed” mixing coefficients $\tilde{\pi}$. Let $X = \{f_n\}$ be the set of training points containing the feature vectors of the image pixels. 
The hidden variables $Z = \{z_{jn}\}$ capture the missing information of which component has generated a given data point $f_n$. More specifically, $z_{jn} = 1$ if component j is responsible for generating $f_n$, otherwise $z_{jn} = 0$. Therefore it holds that:
$$p(X \mid Z, \mu, T) = \prod_{n=1}^{N} \prod_{j=1}^{M} \left[ \mathcal{N}(f_n; \mu_j, T_j) \right]^{z_{jn}} \qquad (2)$$
The distribution of Z is a product of multinomials
$$p(Z \mid \pi, \tilde{\pi}) = \prod_{n=1}^{N} \left[ \prod_{j=1}^{s} \pi_j^{z_{jn}} \prod_{j=s+1}^{M} \tilde{\pi}_j^{z_{jn}} \right] \qquad (3)$$
Fig. 1. The graphical model
given the subset $\tilde{\pi} = \{\tilde{\pi}_j\}$ of “fixed” mixing coefficients and the subset $\pi = \{\pi_j\}$ of “free” mixing coefficients. For notational convenience and assuming M mixing components, we can always rearrange the indexes so that the first s components are the “free” ones. The typical Bayesian framework assumes conjugate Dirichlet priors over the entire set of mixing coefficients. Therefore, in order to define the modified Bayesian GMM, it is necessary to define the conditional joint distribution $p(\tilde{\pi} \mid \pi)$ of the “fixed” mixing coefficients given the “free” ones, which can be shown to be a nonstandard Dirichlet with parameters $\alpha_j$ ($j = s+1, \ldots, M$):
$$p(\tilde{\pi} \mid \pi) = \left( 1 - \sum_{j=1}^{s} \pi_j \right)^{-M+s} \frac{\Gamma\left( \sum_{j=s+1}^{M} \alpha_j \right)}{\prod_{j=s+1}^{M} \Gamma(\alpha_j)} \times \prod_{j=s+1}^{M} \left( \frac{\tilde{\pi}_j}{1 - \sum_{k=1}^{s} \pi_k} \right)^{\alpha_j - 1} \qquad (4)$$
It must be emphasized that we do not impose a prior on the “free” mixing coefficients. Completing the specification of the Bayesian model, we assume Gaussian and Wishart priors for μ and T respectively [4]:
$$p(\mu) = \prod_{j=1}^{s} \mathcal{N}(\mu_j \mid 0, \beta I) \tag{5}$$
$$p(T) = \prod_{j=1}^{s} \mathcal{W}(T_j \mid \nu, V). \tag{6}$$
Learning in the Bayesian framework can be achieved through maximization of the marginal likelihood of X given π, which is obtained by integrating out the hidden variables θ = {Z, μ, T, π̃} as follows:
$$p(X \mid \pi) = \sum_{Z} \int p(X, \theta \mid \pi) \, d\mu \, dT \, d\tilde{\pi}. \tag{7}$$
Since this integral is intractable, the Variational Bayes methodology is adopted, where we maximize a lower bound L of the logarithmic marginal likelihood log p(X|π):
C. Constantinopoulos and A. Likas
$$L[q, \pi] = \sum_{Z} \int q(\theta) \log \frac{p(X, \theta \mid \pi)}{q(\theta)} \, d\theta \tag{8}$$
where q is an arbitrary distribution that approximates the posterior distribution p(θ|X, π). The maximization of L is performed iteratively, where each iteration consists of two steps (in analogy to the EM approach): first maximization of the bound with respect to q, and subsequently maximization of the bound with respect to π. To implement the maximization with respect to q, the mean-field approximation [4] has been adopted, which assumes that q is constrained to be a product of the form q(θ) = qZ(Z) qμ(μ) qT(T) qπ̃(π̃). The resulting update equations for the parameters of the q distributions (E-step) and the parameters π (M-step) are omitted here; they are described in detail in [3].
Using the above idea, the incremental algorithm for constructing the GMM proceeds as follows. Mixture components are sequentially added to the mixture model using the following component splitting procedure: one of the mixture components is selected and appropriately split into two components. The resulting two components are considered “free” and the rest “fixed”, according to the terminology introduced previously. Next we set the precision prior p(T) based on the characteristics of the split component, and apply variational learning as described above. If the two components provide a much better fit to the data in their region, then both components are
Fig. 2. Four steps of the incremental training procedure. The expected covariance w.r.t. the Wishart prior is depicted with a dashed line. (a) An intermediate solution with 5 components. (b) One component is split into two. (c) The mixture after variational learning. (d) Another component is selected and split.
Fig. 3. Segmentation of (a) an artificial image, (b) and (c) natural images from BSDS. Top row: original images. Bottom row: segmented images.
retained in the mixture model; otherwise the update equations will eliminate one of them. The splitting test is applied sequentially to all components and the method terminates when all mixture components have been unsuccessfully tested for splitting. When a successful split is encountered, the number of mixture components increases and a new round of split tests for all components is initiated.
To illustrate the details of splitting, assume that some component ĵ with density N(f; μĵ, Tĵ) has to be split. In order to form the new mixture, we remove component ĵ and insert two new components with densities N(f; μĵ1, Tĵ1) and N(f; μĵ2, Tĵ2) respectively. We have chosen to place the centers of the two components along the principal axis of the covariance Tĵ⁻¹, in opposite directions with respect to the center μĵ. The mixing coefficients of the two components are set equal, πĵ1 = πĵ2 = πĵ/2, and their parameters are set according to μĵ1 = μĵ + √λ u, μĵ2 = μĵ − √λ u, Tĵ1 = Tĵ and Tĵ2 = Tĵ, where λ is the maximum eigenvalue of Tĵ⁻¹ and u the corresponding eigenvector. An important issue in the proposed method is the specification of the scale parameter V of the prior W(ν, V) over the precision matrices, based on the split component. We set ν = d (which is the minimum allowed value) and V = νλI, where λ is the highest eigenvalue of Tĵ⁻¹. The value of β was set to 10⁻¹⁰. An example of component splitting is illustrated in Figure 2.
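The splitting rule and the round-based split testing described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of [3]: the component is represented by its covariance Tĵ⁻¹ (passed as a list of lists), the dominant eigenpair is obtained by plain power iteration, and `try_split` is a hypothetical callback standing in for the variational learning step, which is not reproduced here.

```python
import math


def principal_axis(cov, iters=200):
    """Dominant eigenvalue and unit eigenvector of a symmetric PSD matrix.

    Plain power iteration; assumes the start vector is not orthogonal
    to the dominant eigenvector.
    """
    d = len(cov)
    u = [1.0 / math.sqrt(d)] * d
    lam = 0.0
    for _ in range(iters):
        v = [sum(cov[i][j] * u[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:  # degenerate (zero) covariance
            break
        u = [x / norm for x in v]
        lam = norm
    return lam, u


def split_component(pi, mu, cov):
    """Split (pi, mu, cov) into two components placed at mu +/- sqrt(lam) * u."""
    lam, u = principal_axis(cov)
    step = math.sqrt(lam)
    mu1 = [m + step * ui for m, ui in zip(mu, u)]
    mu2 = [m - step * ui for m, ui in zip(mu, u)]
    # Both halves keep the covariance (equivalently, the precision) of the parent.
    return (pi / 2, mu1, cov), (pi / 2, mu2, cov)


def incremental_rounds(components, try_split):
    """Apply split tests until every component has failed one in the same round.

    try_split(components, j) is a hypothetical callback: it returns the enlarged
    component list if both halves of component j survive, otherwise None.
    """
    j = 0
    while j < len(components):
        result = try_split(components, j)
        if result is None:
            j += 1
        else:
            components = result
            j = 0  # a successful split starts a new round of tests
    return components
```

For a component with covariance diag(4, 1) centered at (1, −1), the two new centers lie at distance √4 = 2 from the old center along the first axis.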
Fig. 4. Segmentation using texture feature: (a) and (b) artificial images, (c) BSDS image. Top row: original images. Bottom row: segmented images.
3 Experimental Results
To illustrate the performance of the proposed method, we have conducted experiments using both artificially generated and natural images. For each image the following steps were taken. First, the dataset containing the feature vectors for a subset of 5000 pixels was constructed; next, the feature vectors were preprocessed so that each feature has zero mean and unit standard deviation. The resulting dataset was then used to build the GMM for the image. Using the resulting GMM, the “segmented” image is produced. When color features are used, the segmented image is produced by assigning to each pixel (x, y) the mean color value of the corresponding GMM component. In the case of texture images we use an arbitrary different color for each segment. For segmentation using artificial color features we used the (L, a, b) representation along with the (x, y) coordinates. Figure 3(a) illustrates the segmentation result for an artificial color image. Figures 3(b) and 3(c) display segmentation results for two natural images from the Berkeley Segmentation Data Set (BSDS) [5]. For these images we also added a texture feature, namely the polarity feature pl [1], to the feature vector describing a pixel, i.e. fn = (L, a, b, pl, x, y)n. Finally, Figure 4 provides results for two artificial texture images and one BSDS image, using only the polarity feature, i.e. fn = (pl, x, y)n.
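The preprocessing described above (pixel subset, then per-feature standardization) might look as follows. This is a minimal sketch under assumptions that the text does not state: the subset is drawn uniformly at random with a fixed seed, the population standard deviation (division by the number of samples) is used, and constant features are only centered. The function names are hypothetical.

```python
import math
import random


def subsample(feature_vectors, k=5000, seed=0):
    """Return at most k feature vectors, drawn without replacement (assumed random)."""
    if len(feature_vectors) <= k:
        return list(feature_vectors)
    return random.Random(seed).sample(feature_vectors, k)


def standardize(features):
    """Scale each feature column to zero mean and unit standard deviation."""
    n = len(features)
    dims = len(features[0])
    means = [sum(row[d] for row in features) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((row[d] - means[d]) ** 2 for row in features) / n
        stds.append(math.sqrt(var) or 1.0)  # constant feature: only centre it
    scaled = [[(row[d] - means[d]) / stds[d] for d in range(dims)]
              for row in features]
    return scaled, means, stds
```

The returned means and standard deviations are kept so that pixels outside the training subset can be mapped into the same feature space before they are assigned to a GMM component.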
From the experimental results it is clear that, although only a subset of the pixels is used for GMM training, the method yields spatially smooth segmentations of satisfactory quality and estimates the number of image segments reasonably well, without producing significant oversegmentation.
4 Conclusions
We have presented an efficient approach for image modeling and segmentation using GMMs that is based on a recently proposed Bayesian technique for estimating the components of a GMM [3]. The approach is fully automatic, makes no assumptions regarding the required number of GMM components and provides solutions that are spatially smooth. Future work will focus on a more systematic testing of the method in the case where other texture-related features are used and in the case where feature vectors are defined at the patch level instead of the pixel level. Also we plan to test the efficiency of the method in image retrieval tasks where GMMs are used as models of the stored images and the query image.
References
1. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(8), 1026–1038 (2002)
2. Li, S.Z.: Markov Random Field Modelling in Computer Vision. Springer, Heidelberg (2001)
3. Constantinopoulos, C., Likas, A.: Bayesian gaussian mixture learning based on variational component splitting. IEEE Transactions on Neural Networks 18(3), 745–755 (2007)
4. Corduneanu, A., Bishop, C.M.: Variational Bayesian model selection for mixture distributions. In: Artificial Intelligence and Statistics 2001, pp. 27–34. Morgan Kaufmann, San Francisco (2001)
5. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. 8th Int’l Conf. Computer Vision, vol. 2, pp. 416–423 (2001)
Model-Based Segmentation of Multimodal Images Xin Hong, Sally McClean, Bryan Scotney, and Philip Morrow School of Computing and Information Engineering, University of Ulster Cromore Road, Coleraine, BT52 1SA, Northern Ireland, UK {x.hong, si.mcclean, bw.scotney, pj.morrow}@ulster.ac.uk
Abstract. This paper proposes a model-based method for intensity-based segmentation of images acquired from multiple modalities. Pixel intensity within a modality image is represented by a univariate Gaussian distribution mixture in which the components correspond to different segments. The proposed Multi-Modality Expectation-Maximization (MMEM) algorithm then estimates the probability of each segment along with the parameters of the Gaussian distributions for each modality by maximum likelihood using the Expectation-Maximization (EM) algorithm. Multimodal images are simultaneously involved in the iterative parameter estimation step. Pixel classes are determined by maximising the a posteriori probability contributed by all multimodal images. Experimental results show that the method exploits and fuses complementary information of multimodal images. Segmentation can thus be more precise than when using single-modality images.
Keywords: data fusion, multimodal images, model-based segmentation, Gaussian mixture, maximum likelihood, EM algorithm.
1 Introduction
Image segmentation is the process of separating an image into meaningful, homogeneous regions for further analysis. A single modality may provide only limited and possibly inaccurate information about the image scene. Multimodal images often provide richer and more reliable information about the environment in which the modalities operate [10], so the use of multimodal images can help improve segmentation accuracy. Data fusion is the process of combining information from different sources or modalities to obtain an optimal estimate of objects. A number of approaches [2,3,5,10,11] have been presented to combine multimodality imaging information based on a data fusion framework: the Dempster-Shafer theory of evidence (DS theory). The advantage of using DS theory in data fusion lies in its ability to deal with ignorance and imprecision. However, the derivation of the mass function for uncertainty measurements has attracted criticism regarding its practical use in applications such as image processing.
In this paper, we propose a model-based method for segmenting images of the same object acquired from multiple modalities. Pixel intensity within a modality image is modelled by a univariate Gaussian distribution mixture in which components correspond to different clusters or segments. Methods such as BIC [9] can be used to determine the number of segments contained in an image, and therefore we assume this to be known here. The Expectation-Maximization (EM) algorithm is adopted to estimate the parameters of normal distribution mixtures by maximum likelihood. The main advantage of our method is that pixels in all modality images are involved in estimating the class distributions, which in turn contribute to the modality-specific class parameter estimates (means and variances). Segment assignment also takes into account the contributions of all modalities. Our approach is based on well-established principles, namely probability models and maximum likelihood estimation, using the EM algorithm. As such it has the advantage of being underpinned by mathematical theory, as opposed to previous ad hoc approaches.
The remainder of this paper is organised as follows. In section 2, we present a univariate Gaussian mixture model for multimodal sensed images. In section 3, we propose the Multi-Modality Expectation-Maximization (MMEM) algorithm for segmentation of multimodal images. Results of experiments on synthetic images and simulated MS Lesion brain images are given in section 4. In section 5, we discuss the segmentation results of section 4 and consider the problems our approach may encounter. We also outline possible improvements for the MMEM algorithm. Finally, conclusions are presented in section 6.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 604–611, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Gaussian Distribution Model of Images
In image segmentation each pixel of an image is considered as an observation and pixel intensity is considered to be a feature component of the observation.
2.1 Univariate Gaussian Distribution Mixture Model
A univariate Gaussian distribution is modelled by two parameters: mean μ and variance σ². For an image x that consists of N one-dimensional observations, x = (x1, . . . , xN), following a Gaussian distribution, the Gaussian probability of pixel intensity xn, denoted as gθ(xn) (θ = {μ, σ²}), is given by
$$g_\theta(x_n) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \frac{(x_n - \mu)^2}{\sigma^2} \right\}$$
It is usual that pixels in an image are best represented by a mixture of two or more univariate Gaussian distributions rather than a single distribution, where the mixture components can be thought of as different clusters or segments. Let K be the number of Gaussian mixture distributions gk with mean μk and variance σk², k = 1, . . . , K. The probability distribution of xn, n = 1, . . . , N, denoted as gm(xn), is given by
$$\mathrm{gm}(x_n) = \sum_{k=1}^{K} \lambda_k \, g_k(x_n).$$
where λk are weighting parameters that can be thought of as prior probabilities for the Gaussian mixture, satisfying: (1) λk ≥ 0, k = 1, . . . , K; (2) $\sum_{k=1}^{K} \lambda_k = 1$.
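As a small illustration of the two definitions above, the following sketch evaluates a univariate Gaussian density and a mixture density, and checks the two constraints on the weights λk. The function names are hypothetical and the variance, rather than the standard deviation, is passed as a parameter.

```python
import math


def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Mixture density gm(x) = sum_k lambda_k * g_k(x)."""
    # Constraints (1) and (2) on the weighting parameters.
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

For instance, `gaussian_pdf(0.0, 0.0, 1.0)` is the standard normal density at its mode, 1/√(2π), and a mixture of identical components reduces to the density of a single component.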
The pixels in an image are commonly assumed to be independently and identically distributed. The probability of image x with pixel intensities (x1, . . . , xN) can then be computed as follows:
$$\mathrm{gm}(x) = \prod_{n=1}^{N} \mathrm{gm}(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \lambda_k \cdot g_k(x_n).$$
2.2 Joint Probability Distribution of Multimodal Images
Observing an image scene under M modalities gives rise to M ideal observed images, {x1, . . . , xM}. By ideal images we mean that all images are exactly spatially registered, i.e., corresponding pixel positions in all images refer to the same location in the real world. Let m be the index of the observed image provided by the mth modality, denoted as xm = {xmn : n = 1, . . . , N}, m = 1, . . . , M. Each of the M images follows a Gaussian mixture distribution gmm with parameters (λ, θm), where λ = {λk : k = 1, . . . , K} and θm = {θmk : k = 1, . . . , K} = {μmk, σ²mk : k = 1, . . . , K}. We assume that how the scene is observed by one modality is not affected by other modalities observing the same image scene. The probability that the observed images take on the value (x1, . . . , xM) given the parameters (λ, θ) = (λ, θ1, . . . , θM) can then be computed as
$$P(x_1, \ldots, x_M \mid \theta, \lambda) = \prod_{m=1}^{M} \mathrm{gm}_m(x_m).$$
This probability, viewed as a function of the parameters θ and λ, represents the likelihood of the parameters given the images x1, . . . , xM, conventionally denoted as L(θ, λ|x1, . . . , xM).
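Because the joint likelihood is a product over all pixels of all modalities, it quickly underflows in floating point, so in practice one usually works with ln L, which turns the products into sums. The following sketch computes it under assumed data layouts: `images[m]` is the list of pixel intensities of modality m, the class probabilities `weights` are shared by all modalities, and `params[m]` is a hypothetical pair (means, variances) for modality m.

```python
import math


def image_log_likelihood(pixels, weights, means, variances):
    """ln gm(x) for one image: sum over pixels of the log mixture density."""
    total = 0.0
    for x in pixels:
        density = sum(
            w * math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
            for w, mu, var in zip(weights, means, variances))
        # math.log raises ValueError if the density underflows to zero.
        total += math.log(density)
    return total


def joint_log_likelihood(images, weights, params):
    """ln L(theta, lambda | x_1..x_M): sum of the per-modality log-likelihoods."""
    return sum(image_log_likelihood(image, weights, means, variances)
               for image, (means, variances) in zip(images, params))
```

With a single class of unit variance and one pixel at the mean in each of two modalities, the result is 2 · ln(1/√(2π)) = −ln(2π).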
3 Model-Based Segmentation
Let an image scene contain K distinct classes, labelled {w1, . . . , wK}. Image segmentation can be considered as the process of assigning each pixel of the image to one of the K classes. Model-based segmentation is concerned with fitting to the data a mixture of normal distributions represented by the parameters mean μ, variance σ² and class probability λ. Since an unlabelled pixel carries no class label, the observed pixel intensity is said to be incomplete in representing the image pixel. The EM algorithm [4] is the most general and effective algorithm for maximum likelihood estimation of parameters (mean μ, variance σ² and class probability λ) when the observation data are incomplete. For examples of its application to single modality clustering, see [1,8,9]. Following this convention we propose the MultiModal EM algorithm (MMEM) that uses EM iterations to find the maximum joint likelihood estimate of the parameters of the original distributions from multimodal images. Within each iteration of the MMEM algorithm, the E-Step finds the posterior probability of each pixel belonging to a particular class given the current estimates of μ, σ² and λ, whereas the M-Step recalculates the parameters based on the posterior probabilities estimated in the E-Step and the joint likelihood. Iterations continue until a predefined convergence condition is satisfied. After the optimal parameters have been estimated, each pixel is assigned to the class with the greatest a posteriori probability. The details of the algorithm are given below. Parameters estimated at the tth iteration are marked by a superscript (t).
MMEM Algorithm:
Begin
1 Initialise the parameters.
2 E-Step: compute the posterior probabilities.
$$P(w_k \mid x_{mn}) = \frac{\lambda_k^{(t)} \cdot g_{mk}(x_{mn})}{\sum_{k'=1}^{K} \lambda_{k'}^{(t)} \, g_{mk'}(x_{mn})}$$
3 M-Step: update the parameters.
$$\lambda_k^{(t+1)} = \frac{1}{MN} \sum_{n=1}^{N} \sum_{m=1}^{M} P(w_k \mid x_{mn})$$
$$\mu_{mk}^{(t+1)} = \frac{\sum_{n=1}^{N} x_{mn} \, P(w_k \mid x_{mn})}{\sum_{n=1}^{N} P(w_k \mid x_{mn})}$$
$$(\sigma^2)_{mk}^{(t+1)} = \frac{\sum_{n=1}^{N} P(w_k \mid x_{mn}) \, (x_{mn} - \mu_{mk}^{(t+1)})^2}{\sum_{n=1}^{N} P(w_k \mid x_{mn})}$$
4 Repeat steps 2 and 3 until the change in ln L falls below the threshold, i.e. until convergence is reached.
5 Segmentation step: determine the class label for each pixel in the output image.
$$y_n = \arg\max_{w_k} \left( \frac{1}{M} \sum_{m=1}^{M} P(w_k \mid x_{mn}) \right)$$
End.
This algorithm generalises previous use of the EM algorithm for image segmentation to the multimodal problem. It allows us to combine data from different sources in an intuitive and principled manner, with resulting potential for classification improvement.
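The five steps of the MMEM algorithm can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation. The initial parameters are supplied by the caller, since the text does not specify how step 1 is carried out. The variance floor, the underflow guard and the iteration cap are safeguards added here as assumptions. `images[m][n]` holds the intensity of pixel n in modality m, and `means[m][k]`, `variances[m][k]` are the modality-specific class parameters.

```python
import math


def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step(images, weights, means, variances):
    """Step 2: post[m][n][k] = P(w_k | x_mn); also returns the joint ln L."""
    post, log_lik = [], 0.0
    for m, image in enumerate(images):
        rows = []
        for x in image:
            joint = [w * gaussian_pdf(x, mu, var)
                     for w, mu, var in zip(weights, means[m], variances[m])]
            total = max(sum(joint), 1e-300)  # underflow guard (assumption)
            rows.append([j / total for j in joint])
            log_lik += math.log(total)
        post.append(rows)
    return post, log_lik


def mmem(images, weights, means, variances,
         tol=1e-6, max_iter=200, var_floor=1e-6):
    """Return (weights, means, variances, labels) after running steps 2 to 5."""
    M, N, K = len(images), len(images[0]), len(weights)
    weights = list(weights)
    means = [list(row) for row in means]
    variances = [list(row) for row in variances]
    prev = None
    for _ in range(max_iter):
        post, log_lik = e_step(images, weights, means, variances)
        if prev is not None and abs(log_lik - prev) < tol:
            break  # step 4: change in ln L is below the threshold
        prev = log_lik
        for k in range(K):  # step 3: M-step
            weights[k] = sum(post[m][n][k]
                             for m in range(M) for n in range(N)) / (M * N)
            for m in range(M):
                resp = sum(post[m][n][k] for n in range(N))
                if resp == 0.0:
                    continue  # empty class in this modality: keep old values
                means[m][k] = sum(post[m][n][k] * images[m][n]
                                  for n in range(N)) / resp
                var = sum(post[m][n][k] * (images[m][n] - means[m][k]) ** 2
                          for n in range(N)) / resp
                variances[m][k] = max(var, var_floor)
    # Step 5: label each pixel by the posterior averaged over modalities
    # (dividing by M does not change the arg max, so the sum is used).
    post, _ = e_step(images, weights, means, variances)
    labels = [max(range(K), key=lambda k: sum(post[m][n][k] for m in range(M)))
              for n in range(N)]
    return weights, means, variances, labels
```

Note that the class probabilities are shared across modalities while means and variances are kept per modality, which is what allows one modality to compensate for classes that another modality cannot separate.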
4 Experiments
To evaluate the proposed method, experiments have been carried out on both synthetically generated images and simulated MS Lesion brain images.
4.1 Synthetic Images
Two images (Fig. 1a, 1b) were generated to simulate strong and weak X-ray acquisitions, each of which contains four regions: three steps of increasing thickness from top to bottom, on a uniform background. Thickness here represents the density of a step, where thick steps are more effective in blocking X-rays than thin ones, and hence weak X-rays are more easily blocked by a dense (thick) step than strong X-rays. In the strong X-ray image (Fig. 1a), the thinnest step is overexposed and confused with the background. However, the two thicker steps are well distinguished. In the weak X-ray image (Fig. 1b), the greater thickness steps are underexposed but the step of the smallest thickness is clearly distinguished from the background. The ground-truth image is shown in Fig. 1c. The two images (Fig. 1a, 1b) providing complementary information on the four regions are used as input to our MMEM algorithm. The output image (Fig. 1f) shows the four regions are well identified and clearly distinguishable. In comparison, Fig. 1d and 1e show the output images resulting from segmenting the two original images individually by the EM algorithm. Table 1 shows class and overall accuracy measurements when compared to the ground-truth for the segmented images Fig. 1d, 1e and 1f respectively. The results from Fig. 1 and Table 1 indicate that the complementary information provided by the two sensing modalities has been well exploited by the MMEM segmentation.
Fig. 1. First synthetic images: (a) Strong X-ray image (b) Weak X-ray image (c) Ground-truth image (d) EM segmentation of the strong X-ray image (e) EM segmentation of the weak X-ray image (f) MMEM segmentation when using the two images together
Table 1. The accuracy figures for three segmented images

Segmented image   Class accuracy                        Overall accuracy
                  C1       C2       C3       C4
Fig. 1d           60.99%   69.15%   97.17%   97.74%    81.26%
Fig. 1e           99.98%   99.83%   41.35%   72.87%    78.15%
Fig. 1f           95.94%   97.70%   96.01%   97.67%    96.83%

4.2 Simulated MS Lesion Brain Images
Magnetic resonance imaging (MRI) is one application of multimodality imaging in medicine. Although the same imaging modality is used, multispectral acquisitions (T1, T2, PD, etc) are performed under different operating conditions resulting in different images. The MR simulator of the Montreal Neurological Institute [6] generates 3D brain images simulating T1-, T2- and PD-weighted imaging. Eleven tissue classes are modelled: background, cerebrospinal fluid (CSF), grey matter, white matter, fat, muscle/skin, skin, skull, glial matter, connective, and MS lesion. By varying scanning parameters, tissue contrast can be altered and enhanced in various ways to demonstrate different features. Fig. 2a - 2c shows a slice of the multiple-sclerosis (MS) lesion brain images generated by T1-, T2-, and PD-weighted simulations. In MR imaging of the brain, T1-weighting causes white matter to appear white, grey matter to appear grey, and CSF to appear dark. The contrast of “white matter”, “grey matter” and “CSF” is reversed
Fig. 2. Simulated MS Lesion brain images. Top: (a)-(c) T1-, T2- and PD-weighted images, (d) phantom image (as the ground truth). Bottom: (e)-(g) segmentation of the T1-, T2- and PD-weighted images individually by the EM algorithm, (h) segmentation of the T1-, T2- and PD-weighted images together via the MMEM algorithm.
using T2-weighted imaging. Proton-density (PD) imaging provides little contrast in normal subjects. Using the three modality images together via our MMEM algorithm yields an overall segmentation accuracy (55.60%) that is higher than that obtained by segmenting the T1- and PD-weighted images individually with the EM algorithm (50.10% and 45.29% respectively), but slightly lower than that of the T2-weighted image segmentation (58.11%). However, unlike segmentation of the T2-weighted image, which is only able to identify some classes, the MMEM segmentation identifies all classes. Fig. 2e - 2h show the segmented images, with the ground-truth image given in Fig. 2d.
5 Discussion
From Table 1, we can see that the overall accuracy of segmenting two modality images together via our MMEM method is much higher than that of segmenting a single modality image by the EM algorithm. However we also observe that class accuracy of segmenting these two images together using our method is slightly lower than that of segmenting the strong x-ray image on classes C3 and C4, and of segmenting the weak x-ray image on C1 and C2. In the current methodology, the computation of segment probabilities inherently weights all modalities equally. It might however be desirable to weight credibility of modalities thus influencing their contributions to the segmentation. In the future, we will investigate a method of weighting modality credibility thus incorporating weights into parameter estimation and the segmentation decision rule. Although brain tissues such as white matter, grey matter and CSF usually have a characteristic intensity, pixels around the brain often show an MR intensity that is very similar to brain tissue [7]. This can result in erroneous classification of small regions surrounding the brain so segmentation may be poor. Additionally information provided by MR images may not be typical for independent multimodality which our method is especially developed to handle.
6 Conclusions
In this paper we have proposed a model-based method for segmentation of images acquired from multiple independent modalities. Model parameter estimation is based on the principle of Maximum Likelihood. Maximisation is iteratively performed by the EM algorithm. Experimental results using synthetic images and simulated MR brain images reveal that using multimodal images helps improve segmentation accuracy. It is also likely that averaging probabilities contributed by multimodal images in calculating the prior probability of clusters and determining segmentation class of a pixel has a trade off against performance of segmenting a single modality image. In the near future we plan to introduce credibility of a modality into parameter estimation and segmentation. We are also going to carry out further experiments on real world images to investigate how well our method is suited to different imaging environments.
References
1. Al Momani, B., Morrow, P., McClean, S.: Knowledge based semi-supervised satellite image classification. In: Proc. ISSPA 2007 (2007)
2. Bendjebbour, A., Delignon, Y., Fouque, L., Samson, V., Pieczynski, W.: Multisensor image segmentation using Dempster-Shafer fusion in markov fields context. IEEE Trans. Geo. Re. Sensing 39(8), 1789–1798 (2001)
3. Boudraa, A.-O., Bentabet, L., Salzenstein, F., Guillon, L.: Dempster-Shafer’s basic probability assignment based on fuzzy membership functions. Electronic Letters on Computer Vision and Image Analysis 4(1), 1–9 (2004)
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistics Society B(1), 1–38 (1977)
5. Hégarat-Mascle, S., Bloch, I., Vidal-Madjar, D.: Introduction of neighborhood information in evidence theory and application to data fusion of radar and optical images with partial cloud cover. Patt. Recog. 31(11), 1811–1823 (1998)
6. Kwan, R.K.-S., Evans, A.C., Pike, G.B.: MRI simulation-based evaluation of image-processing and classification methods. IEEE Trans. Med. Imaging 18(11), 1085–1097 (1999), MRI simulator: http://www.bic.mni.mcgill.ca/brainweb/
7. Leemput, K.V., Maes, F., Vandermeulen, D., Suetens, P.: Automated model-based tissue classification of MR images of the brain. IEEE Trans. Med. Imaging 18(10), 897–908 (1999)
8. McClean, S., Scotney, B., Morrow, P., Greer, K.: Knowledge discovery by probabilistic clustering of distributed databases. Data Knowl. Eng. 54(2), 189–210 (2005)
9. Murtagh, F., Raftery, A.E., Starck, J-L.: Bayesian inference for multiband image segmentation via model-based cluster trees. Img. and Vis. Comp. 23, 587–596 (2005)
10. Salzenstein, F., Boudraa, A.-O.: Iterative estimation of Dempster-Shafer’s basic probability assignment: application to multisensor image segmentation. Opt. Eng. 43(6), 1293–1299 (2004)
11. Zhu, Y.M., Bentabet, L., Dupuls, O., Kaftandjian, V., Babot, D., Rombaut, M.: Automatic determination of mass functions in Dempster-Shafer theory using fuzzy c-means and spatial neighborhood information for image segmentation. Opt. Eng. 41(4), 760–770 (2002)
Image Segmentation Based on Height Maps
Gabriele Peters¹ and Jochen Kerdels²
¹ University of Dortmund, Department of Computer Science - Computer Graphics, Otto-Hahn-Str. 16, D-44221 Dortmund, Germany
² DFKI - German Research Center for Artificial Intelligence, Robotics Lab, Robert Hooke Str. 5, D-28359 Bremen, Germany
Abstract. In this paper we introduce a new method for image segmentation. It is based on a height map generated from the input image. The height map characterizes the image content in such a way that the application of the watershed concept provides a proper segmentation of the image. The height map enables the watershed method to provide better segmentation results on difficult images, e.g., images of natural objects, than without the intermediate height map generation. Markers used for the watershed concept are generated automatically from the input data holding the advantage of a more autonomous segmentation. In addition, we introduce a new edge detector which has some advantages over the Canny edge detector. We demonstrate our methods by means of a number of segmentation examples. Keywords: segmentation, edge detection, watershed, height maps.
1 Introduction
In this paper we introduce a new method for image segmentation. It is based on a height map which is generated from the gray values of the input image. Among the different methods for image segmentation, morphological watersheds have some advantages [1]. They yield more stable results in comparison to other segmentation concepts such as detection of discontinuities, thresholding, or region processing. But they also have a drawback. Watersheds work on height level images, and the height map is usually taken directly from the input image. If an image is interpretable as a topographic image, such as images of cells under a microscope, watersheds perform well. To apply watershed segmentation to arbitrary images such as photographs of natural objects, we propose to generate a height map which characterizes the content of the image in an appropriate way. A simple interpretation of an arbitrary image, e.g., of a tree, as a height map would make no sense. We introduce the derivation of an appropriate height map from an edge-filtered version of the input image. For that purpose we propose a new edge detector related to the Canny edge detector, but endowed with some advantages. This enables us to apply the watershed concept to the segmentation of arbitrary images and to exploit its favorable properties also for difficult images such as those depicting natural objects. In addition, and in contrast to the general watershed concept, where markers are adopted to incorporate knowledge-based constraints in the segmentation process, we automatically generate markers from the height map and thus contribute to a more autonomous segmentation process. The segmentation based on height maps is derived in section 2, section 3 describes some results, and in section 4 we conclude the paper with a summary.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 612–619, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Image Segmentation Based on Height Maps, Overview. Processing chain shown in the figure: original, new edge detector, edges, height map, skeleton, watershed segmentation (based on height map and skeleton), segments. The contributions of this paper are highlighted in red.
2 Image Segmentation Based on Height Maps
In fig. 1 an overview of the proposed image segmentation method is given. We start with the original image to be segmented. An edge detector is applied to the gray values of this image, resulting in an edge image (subsection 2.2). From this binary edge image we generate a topographic image, the gray values of which can be interpreted as different height levels (subsection 2.3). Given this height map a skeleton image is derived, which is again a binary image (subsection 2.4). The height map is the source image for the watershed algorithm, which is applied utilizing the skeleton image as markers. This results in the final segmentation of the original image (subsection 2.5). The details of our approach, especially in a formalized version, are described in [2]. Here we restrict ourselves to the introduction of the basic principles of the proposed segmentation method. But first we start with a short review of the concept of morphological watersheds (subsection 2.1).
2.1 Segmentation by Watersheds Revisited
Segmentation by morphilogical watersheds [3,4] requests a gray value image as input and is based upon the interpretation of this image as a topographic image, where different gray values code for different heights in a third dimension. Segmentation is accomplished by a slow flooding of this imaginary landscape. Water appears first in the deepest valleys and reaches higher landmarks at a later point in time. Any time the water impends to slop over a mountain crest a dam is erected to prevent the water from overflowing. The algorithm terminates when the whole landscape is flooded and only dams are visible. These dams constitute the segmentation of the image. The fact that each segment has its origin in a
614
G. Peters and J. Kerdels
local minimum of the height map can lead to oversegmentation. A number of methods have been proposed to deal with this problem [5,6]. A suitable means of preventing oversegmentation is the use of so-called markers. They designate landmarks where water can emerge and thus restrict the number of segments in advance. Valleys, i.e., local minima, which are not labeled by a marker are not protected by dams. They are eventually flooded from one of the neighboring valleys and are thus assigned to the corresponding segment. Where to place the markers depends on the application at hand. Originally, markers have been regarded as an opportunity to incorporate knowledge-based constraints into the segmentation process [1]. In contrast to this view, we argue that an automatic generation of markers, as employed by our approach (and described in subsection 2.4), has the merit of a more automatic, independent, and self-contained segmentation process.
2.2 Edges
In this subsection we describe the derivation of a binary edge image from the input gray value image to be segmented. The edge image is the basis from which we derive the height map of the original image, which is needed as input for the watershed algorithm. As existing edge detectors display a series of disadvantages, we develop a new edge detector, which is similar to the Canny edge detector [7] but has some advantages over it.
Proposed Edge Detector. The proposed edge detection proceeds in 7 steps:
1. Simple Operator, Parameter d: We employ a simple edge operator, which is given in the left diagram of fig. 4 for one direction (horizontal only) and one intensity transition (dark to light only) as an example. The empty entries contain zeros. Parameter d determines the distance between 1 and −1 in the mask of the operator, and thus the size of the local window used. With a larger distance, salient edges in the image generate d-wide edges in the response of the operator. On the other hand, edges that are caused by noise or disturbed structures in the image, i.e., non-salient edges, result in thinner edges in the filter response.
2. Gaussian Blur, Parameter b: The preliminary edge image created in this way is filtered with a Gaussian blur, the magnitude of which is regulated by the parameter b. Whereas wide edges are diminished by the blur filter only at their ends, thin edges are degraded as a whole.
3. Potentiation, Parameter γ: Now the lowpass filtered, still preliminary edge image is raised to the power γ. As the values of the edge image lie in [0, 1], γ determines to which extent the values of the edge image are forced toward zero.
4. Thresholding, Threshold Parameter: Afterwards we eliminate values below a threshold and set values above or equal to the threshold to 1. This and the preceding steps result in a binary edge image with relatively wide edges.
5. Thinning: These wide edges are thinned to a width of one pixel.
6. Deletion, Parameter λ: In addition, those edges are deleted whose length is shorter than the average length of all edges multiplied by the factor λ.
An illustration of these six steps is given in fig. 2.
Fig. 2. Steps of the Edge Detector. The steps are illustrated for horizontal dark-to-light transitions only. Upper left: original image. Upper right: result after the first step of the algorithm. Lower left: result after carrying out steps 2 to 4. Lower right: resulting edge image after steps 5 and 6. The parameters used are {d = 2, b = 2, γ = 3.0, threshold = 0.01, λ = 1.0}.
7. Joining Directions and Transitions: In the final step the results from the edge detectors for the single directions and transitions are joined into the final edge image. Fig. 3 shows the complete edge image of the last example in the right image. It also visualizes the effects of different sets of parameters.
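Steps 1 to 4 can be sketched in a few lines of Python for a single direction and transition. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact mask of fig. 4 is not reproduced here, so the operator is assumed to compute I(x + d) − I(x) clipped at zero; b is assumed to be the standard deviation of the Gaussian; and all function names are hypothetical.

```python
import math

def simple_operator(img, d):
    """Step 1 (assumed mask): response I(x + d) - I(x), negatives clipped to 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w - d):
            out[y][x] = max(0.0, img[y][x + d] - img[y][x])
    return out

def gaussian_kernel(sigma):
    """Normalised 1D Gaussian kernel, truncated at about 3 sigma."""
    radius = max(1, int(round(3 * sigma)))
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def convolve_1d(values, kernel):
    """Convolve a list with a symmetric kernel; values outside count as 0."""
    r = len(kernel) // 2
    n = len(values)
    return [sum(kernel[j + r] * values[i + j]
                for j in range(-r, r + 1) if 0 <= i + j < n)
            for i in range(n)]

def gaussian_blur(img, sigma):
    """Step 2: separable blur, rows first, then columns."""
    k = gaussian_kernel(sigma)
    rows = [convolve_1d(row, k) for row in img]
    cols = [convolve_1d(list(col), k) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def wide_edges(img, d=2, b=2.0, gamma=3.0, threshold=0.01):
    """Steps 1 to 4 for one direction and transition; returns a 0/1 image."""
    blurred = gaussian_blur(simple_operator(img, d), b)
    # Step 3 raises each value to the power gamma, step 4 thresholds it.
    return [[1 if v ** gamma >= threshold else 0 for v in row]
            for row in blurred]
```

For a vertical dark-to-light step and d = 2, `simple_operator` returns a response that is exactly two pixels wide, in line with the d-wide edges described in step 1. Thinning, deletion of short edges, and the joining of directions (steps 5 to 7) are left out of the sketch.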
Fig. 3. Edge Images for Different Sets of Parameters. Left: {d = 1, b = 1, γ = 2.0, threshold = 0.01, λ = 1.0}. Middle: {d = 1, b = 1, γ = 2.5, threshold = 0.01, λ = 1.0}. Right: the same parameters as in fig. 2, i.e., {d = 2, b = 2, γ = 3.0, threshold = 0.01, λ = 1.0}.
Advantages in Comparison with the Canny Edge Detector. Most of the steps are also carried out by the Canny edge detector, though in a different order. Without going into detail, the principal steps of the Canny operator are Gaussian filtering of the image, edge detection, and thresholding. In particular, one difference between the proposed edge detector and the Canny detector is the potentiation of the blurred edge image in step 3. In combination with the following step (thresholding) this introduces an "exponential thresholding" in contrast to
Fig. 4. Simple Edge Operator and Advantage of the Proposed Edge Detector. Left: simple edge operator. Middle: original image. Right: resulting edge image. Salient edges in the original image are preserved even if they are of low contrast, such as the edges of the light, vertical bar on the left. On the other hand, edges of similar contrast are largely suppressed if they are not salient, such as the edges in the texture on the right.
Fig. 5. Edge Image and Height Map. Left: original gray value image. Middle: binary edge image. Right: gray value height map. Note the gap in the top of the tree outline in the edge image. The height map compensates perfectly for this deficiency.
the linear thresholding of the Canny detector. This kind of thresholding seems to be easier to tune. Experiments reported in [2] suggest a more robust behavior of the proposed edge detector: the same threshold value provides good results for a larger field of application than with the Canny edge detector. Another advantage is illustrated in fig. 4. In step 1 edges of different strength are generated, depending on the saliency (not the contrast) of the edges in the original image: the more salient an edge in the source image, the wider the corresponding edge in the (intermediate) edge image. The subsequent application of the blur filter diminishes the thin edges (derived from non-salient edges in the original image) and thus boosts the wide, i.e., salient edges. In addition, the results of our edge detector seem to have advantages similar to those of Canny operators augmented with surround suppression, such as described in [8], although our detector does not actively inhibit its surroundings. This can be seen in the edge image of fig. 7.
2.3 Height Map
As already mentioned we want to apply a watershed algorithm for segmentation, thus we need a height map. In this subsection we derive from the binary edge
image a gray value topographic image, which characterizes the original image in an adequate way. First we define a maximal distance dmax. It bounds the distance from a point in the edge image to the closest edge that is taken into account. Points with a larger distance from an edge are defined to be at the same height. This means that only gaps of a maximal size of 2dmax are tolerable in the edge image. For the images used, with a resolution of 640 × 480, dmax = 32 to 64 pixels provides good results. The idea is now that for each point in the edge image the minimal distance to the closest edge is determined. For that purpose we start a region growing from each point e of the edge image marked as an edge. For each image point p under consideration during this region growing we first verify whether a distance value has already been assigned to it and whether this distance value is smaller than the current distance dcur to the origin e of the region growing. If not, dmax − dcur is assigned to p. In addition, the neighboring points of p are stored as future candidates in the region growing. The region growing stops if either no more candidates for growing exist or the maximal distance dmax is reached. The right image of fig. 5 shows an example of a height map generated according to the preceding specification. Another example is shown in fig. 7.
2.4 Skeleton
As already pointed out in subsection 2.1, markers are used in combination with watershed algorithms to prevent the result from being oversegmented. The binary skeleton image, which we derive in this subsection, will be used as marker for the watershed algorithm we apply to the height map. By applying Laplace operators to the height map one obtains a kind of gray value skeleton image. The coarseness of these skeletons can be controlled by the size of the Laplace operator, as illustrated in fig. 6. For our examples we used a 3 × 3 filter. After the application of the Laplace filter we binarize the result with a threshold and, in addition, we delete those of the remaining edges that are thinner than
Fig. 6. Skeleton Images and Final Segmentation. Left: skeleton image of the same original image as depicted in fig. 5, generated with a 3 × 3 Laplace filter. Middle: skeleton image generated with an 11 × 11 Laplace filter. Right: parts of the segmentation after applying the watershed algorithm with the left skeleton image as marker.
Fig. 7. Example of Image Segmentation Based on Height Maps. First: original image. Second: edge image. Third: height map. Fourth: parts of the resulting segmentation.
half of the filter size. Every white pixel of the resulting binary skeleton image acts as a marker for the following application of the watershed algorithm.
2.5 Segmentation
In this final step we apply the watershed algorithm as described in subsection 2.1 to the height map derived in subsection 2.3, utilizing the skeleton image as markers. As even the usage of markers cannot always prevent the result from being oversegmented, similar neighboring segments are fused in a final postprocessing step. As a measure of similarity between two segments three values are compared: the average gray value, the variance of the gray values, and the entropy (calculated from the histogram of the gray values). Average, variance, and entropy of neighboring segments are compared, and only if these values are sufficiently similar are both segments fused. As thresholds for the differences we used 0.1 for the average, 0.05 for the variance, and 0.5 for the entropy. The right image of fig. 6 shows the interesting segments of the result after the last fusion step has been carried out.
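The fusion test can be illustrated with a short Python sketch. It rests on several assumptions that the text does not settle: gray values are taken to lie in [0, 1], the entropy is computed in bits from a 16-bin histogram, and two segments are fused only if all three differences stay below their thresholds. The function names are hypothetical.

```python
import math

def segment_stats(values, bins=16):
    """Mean, variance, and histogram entropy (bits) of gray values in [0, 1]."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    hist = [0] * bins
    for v in values:
        hist[min(bins - 1, int(v * bins))] += 1
    entropy = -sum((c / n) * math.log2(c / n) for c in hist if c)
    return mean, var, entropy

def should_fuse(seg_a, seg_b, t_mean=0.1, t_var=0.05, t_entropy=0.5):
    """Assumed rule: fuse only if all three differences are below threshold."""
    mean_a, var_a, ent_a = segment_stats(seg_a)
    mean_b, var_b, ent_b = segment_stats(seg_b)
    return (abs(mean_a - mean_b) < t_mean
            and abs(var_a - var_b) < t_var
            and abs(ent_a - ent_b) < t_entropy)
```

The default thresholds are the values 0.1, 0.05, and 0.5 quoted above. Two segments with similar, narrow gray value distributions pass the test, whereas a strongly bimodal segment is rejected by its variance and entropy even when its average matches.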
3 Results
For the evaluation of the proposed segmentation method we chose images depicting natural objects such as trees, as these are typically difficult to segment. A first example is depicted in figures 5 and 6. The height map compensates for gaps in the outline of the tree in the edge image. The resulting segmentation of the tree is fairly well accomplished, and the foreground tree is separated from the tree in the background although these image areas are quite similar in terms of texture. Regrettably, the final fusion of similar segments turned out to be the weak point of the whole segmentation process: it was difficult to choose the thresholds for average, variance, and entropy in such a way that neither oversegmentation nor undersegmentation occurred across a series of different examples. But in the second example displayed in fig. 7 our approach provided one segment for the tree and another for the dominant tuft
in the foreground which corresponds quite well to our human perception. The separation between the tuft and the surrounding image areas can be regarded as difficult, because they are quite similar in structure. In addition, here again a gap in the contour of the tree poses no challenge for the further processing. More examples are discussed in [2].
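The way the height map of subsection 2.3 bridges a gap in a contour can be reproduced on a toy image. The following Python sketch is only an illustration under assumptions: it measures Euclidean distance to the seed, grows over 4-neighbors, always keeps growing within dmax (the original description leaves this detail open), and assigns height 0 to points not reached. The function name is hypothetical.

```python
import math
from collections import deque

def height_map(edges, d_max):
    """Region growing from every edge pixel; height is d_max - distance."""
    h, w = len(edges), len(edges[0])
    dist = [[None] * w for _ in range(h)]  # smallest distance found so far
    for ey in range(h):
        for ex in range(w):
            if not edges[ey][ex]:
                continue
            queue = deque([(ey, ex)])
            seen = {(ey, ex)}
            while queue:
                y, x = queue.popleft()
                d_cur = math.hypot(y - ey, x - ex)
                if d_cur > d_max:
                    continue  # maximal distance reached, stop growing here
                if dist[y][x] is None or d_cur < dist[y][x]:
                    dist[y][x] = d_cur
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    # Points farther than d_max from every edge share the lowest height, 0.
    return [[0.0 if d is None else d_max - d for d in row] for row in dist]
```

For a horizontal edge with a one-pixel gap, the gap pixel receives a height that is lower than on the edge itself but still higher than at its neighbors above and below, so the ridge continues across the gap.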
4 Summary
We proposed a method to derive a height map from an arbitrary gray value image, which characterizes its content in such a way that the application of the watershed concept to the height map provides a proper segmentation of the image. This enables the application of the watershed segmentation also to images that are not similar to a topographic image. This means that the watershed algorithm can now provide better segmentation results on difficult images, e.g., images of natural objects, than without the intermediate height map generation. The markers, usually introduced to the watershed algorithm to incorporate knowledge-based constraints, are generated automatically from the height map. This has the merit of a more automatic, independent, and self-contained segmentation process. In addition, we introduced a new edge detector. Its advantages in comparison with the Canny edge detector are twofold. By applying the blur filter to an intermediate edge image instead of the original image (as the Canny detector does) we derive an edge image which better distinguishes between salient and non-salient edges. The second merit of the proposed edge detector is a thresholding that is easier to handle and seems to provide more robust results.
Acknowledgments. This research was funded by the German Research Association (DFG) under Grant PE 887/3-1.
References
1. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice-Hall, Englewood Cliffs (2002)
2. Kerdels, J.: Dynamisches Lernen von Nachbarschaften zwischen Merkmalsgruppen zum Zwecke der Objekterkennung. Diploma Thesis, University Dortmund (2006)
3. Serra, J.: Image Analysis and Mathematical Morphology. Ac. Press, NY (1988)
4. Beucher, S., Meyer, F.: The Morphological Approach of Segmentation: The Watershed Transformation. Mathematical Morphology in Image Processing. Marcel Dekker, New York (1992)
5. Najman, L., Schmitt, M.: Geodesic Saliency of Watershed Contours and Hierarchical Segmentation. IEEE PAMI 18(12), 1163–1173 (1996)
6. Bleau, A., Leon, J.: Watershed-Based Segmentation and Region Merging. Computer Vision and Image Understanding 77(3), 317–370 (2000)
7. Canny, J.: A Computational Approach to Edge Detection. IEEE PAMI 8(6) (1986)
8. Grigorescu, C., Petkov, N., Westenberg, M.A.: Contour and Boundary Detection Improved by Surround Suppression of Texture Edges. Image and Vision Computing 22(8), 609–622 (2004)
Measuring the Orientability of Shapes
Paul L. Rosin
School of Computer Science, Cardiff University, Cardiff CF24 3AA, Wales, UK
[email protected]
Abstract. An orientability measure determines how orientable a shape is; i.e. how reliable an estimate of its orientation is likely to be. This is valuable since many methods for computing orientation fail for certain shapes. In this paper several existing orientability measures are discussed and several new orientability measures are introduced. The measures are compared and tested on synthetic and real data.
1 Introduction
It is often useful to determine the orientation of a shape, for instance in robotics to locate a good grasping position. In computer vision, processing is often performed with respect to an object centred and oriented coordinate frame of reference. This can effectively provide descriptions invariant to certain geometric transforms of the object, and speeds up subsequent operations such as matching. Several schemes exist for estimating the orientation of a shape, but two problems arise. First, some shapes inherently have no well defined orientation, for example a circle. Second, even for shapes with a well defined orientation some methods fail to identify that orientation. This latter limitation applies to the most common method for estimating orientation, which is based on the line that minimises the integral of the squares of distances of the points (belonging to the shape) to the line [10]. This line passes through the centroid of the shape S, and so the squared distance function of interior points to the line at orientation θ is
$$F(\theta) = \iint_S (x \cdot \sin\theta - y \cdot \cos\theta)^2 \, dx \, dy \tag{1}$$
where S has been translated such that its centroid lies on the origin. F(θ) is minimised when
$$\tan 2\theta = \frac{2\mu_{11}}{\mu_{20} - \mu_{02}} \tag{2}$$
where μpq are the central moments of order p + q. However, under certain conditions, namely
$$\mu_{11} = \mu_{20} - \mu_{02} = 0 \tag{3}$$
F(θ) is a constant function, and the method fails. Not only does this hold for all n-fold rotationally symmetric shapes with n > 2 [12] but also for many more general shapes [11]. This can be demonstrated by constructing some examples. One way we have constructed a simple polygonal example of such an irregular asymmetric shape is to express the conditions (3) for six vertices using line
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 620–627, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Shapes whose orientation is undefined according to the standard method given by (2). The pre-normalised shapes are shown in outline. In all cases μ11 = μ20 −μ02 = 0.
moments [9]. After setting coordinate values for four of the vertices the remainder are determined by numerically solving the set of equations. For example, starting with the values {(−40, 0), (−40, 100), (0, 300), (300, 300)} yields the remaining values {(97.9, −437.2), (−550.6, 277.5)} (see figure 1a). A more general approach by Süße and Ditrich [11] enables an arbitrary shape to be normalised by a shearing and anisotropic scaling to yield a new shape that satisfies (3); some examples of such shapes are shown in figure 1b–d. This limitation of the standard method for orientation estimation has led to the development of methods that can cope in these circumstances. Several are based on higher powers of the distance function F(θ) and/or higher order moments (including Zernike and generalised complex as well as geometric moments) [8,11,12,14]. Nevertheless, these new orientation estimators are not ideal either, since they may still fail in some cases, and are typically sensitive to noise as a consequence of the higher orders of moments used. The problems associated with orientation estimation suggest that a measure of shape orientability, whose purpose is to describe the degree to which a shape has a distinct (but not necessarily unique) orientation, would be useful, although there is little previous work in this area [1,16]. Orientability quantifies the likely reliability and stability of orientation estimates. For instance, even minor changes in a shape due to digitization or noise effects can substantially alter orientation estimates for shapes with low orientability [16].
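The standard method of (2) and its failure case (3) can be made concrete with a small Python sketch. It assumes the shape is given as a list of pixel coordinates rather than a region with a continuous integral, and it uses a relative tolerance (a hypothetical parameter `tol`) to decide when condition (3) holds numerically. The function names are illustrative only.

```python
import math

def central_moments(points):
    """Second order central moments (mu20, mu02, mu11) of a point set."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mu20 = sum((x - cx) ** 2 for x, _ in points)
    mu02 = sum((y - cy) ** 2 for _, y in points)
    mu11 = sum((x - cx) * (y - cy) for x, y in points)
    return mu20, mu02, mu11

def standard_orientation(points, tol=1e-9):
    """Angle minimising F(theta), or None when condition (3) holds."""
    mu20, mu02, mu11 = central_moments(points)
    scale = mu20 + mu02
    if scale == 0 or (abs(mu11) <= tol * scale
                      and abs(mu20 - mu02) <= tol * scale):
        return None  # F(theta) is constant, so no orientation is defined
    # atan2 picks the solution of (2) that minimises rather than maximises F.
    return 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
```

A horizontal bar yields an angle of zero, a diagonal line yields π/4, and a square grid of points returns `None` because both μ11 and μ20 − μ02 vanish.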
2 Orientability Measures
In this section various schemes (new and old) are described for measuring orientability. There are certain basic properties that we expect all orientability measures to possess (which makes them easier to use and easier to compare):
a) The measured orientability is in the interval [0, 1] for any shape;
b) A circle has the measured orientability equal to 0;
c) The measured orientability is invariant with respect to similarity transformations.
In addition, it would be desirable if
d) The only shape with measured orientability equal to 0 is a circle;
e) The only shape with measured orientability equal to 1 is a straight line segment (i.e. the limit of a rectangle as its aspect ratio tends to infinity).
2.1 Moment Based Methods
Elongation. Consider the covariance matrix constructed from the second order central moments of the shape
$$C = \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix}.$$
The eigenvalues of C, denoted by I1 and I2, provide the variances of the shape along the major and minor principal axes, and can be used to form a measure of elongation [5], which in turn is an indication of orientability:
$$D_F = \frac{I_1 - I_2}{I_1 + I_2} = \frac{\sqrt{4\mu_{11}^2 + (\mu_{20} - \mu_{02})^2}}{\mu_{20} + \mu_{02}} = \frac{\sqrt{\Phi_2}}{\Phi_1}$$
where Φ1 and Φ2 are the first two rotation and translation moment invariants [5]. The same measure can be derived from the distance function F(θ) [16]:
$$D_F = 1 - \frac{\pi \cdot \min\{F(\theta) \mid \theta \in [0, \pi)\}}{\int_0^{\pi} F(\theta) \cdot d\theta}.$$
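That the moment form and the distance-function form of DF agree can be checked numerically. The Python sketch below assumes a shape given as a non-degenerate list of pixel coordinates and replaces the integral over [0, π) by a uniform sum over sampled angles; the function names and the sample count are hypothetical choices.

```python
import math

def _centred(points):
    """Translate the points so that their centroid lies on the origin."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]

def elongation_df(points):
    """D_F from the second order central moments."""
    pts = _centred(points)
    mu20 = sum(x * x for x, _ in pts)
    mu02 = sum(y * y for _, y in pts)
    mu11 = sum(x * y for x, y in pts)
    return math.sqrt(4.0 * mu11 ** 2 + (mu20 - mu02) ** 2) / (mu20 + mu02)

def elongation_from_distance(points, samples=360):
    """D_F = 1 - pi * min F / integral of F, with F sampled on [0, pi)."""
    pts = _centred(points)
    values = []
    for k in range(samples):
        theta = math.pi * k / samples
        s, c = math.sin(theta), math.cos(theta)
        values.append(sum((x * s - y * c) ** 2 for x, y in pts))
    integral = math.pi * sum(values) / samples  # uniform Riemann sum
    return 1.0 - math.pi * min(values) / integral
```

For a 10 × 2 block of pixels both functions give 160/170, and for a square grid the moment form gives zero. The sampled minimum is exact only when the minimising angle falls on the sampling grid, so in general the second function is an approximation.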
When conditions (3) hold, and consequently F(θ) is constant, then DF = 0. Thus, apart from the circle, which is the only truly unorientable shape, the elongation measure underestimates the orientability of all the other shapes that satisfy (3).
Higher Powers of Distance. One way to correctly determine the orientation of rotationally symmetric shapes is to use higher powers of distance in (1), i.e.
$$F(\theta, p) = \iint_S (x \cdot \sin\theta - y \cdot \cos\theta)^p \, dx \, dy.$$
For shapes with N fold rotational symmetry Tsai and Chou [12] used p = N, while Žunić et al. [14] used p = 2N. In a similar manner, a higher order elongation measure can be computed (and thereby an orientability measure). For instance
$$D_E(p) = 1 - \frac{\min\{F(\theta, p) \mid \theta \in [0, 2\pi]\}}{\max\{F(\theta, p) \mid \theta \in [0, 2\pi]\}}$$
is able to distinguish a circle from other rotationally symmetric shapes for sufficiently large values of p.
2.2 Geometric Methods
Bounding Rectangles. Žunić et al. [16] described a new measure of orientability based on bounding rectangles. For a given shape S let R(S, θ) be the minimal area rectangle whose edges make an angle θ with the coordinate axes and which includes S, and let A(R(S, θ)) be the area of R(S, θ). Let
$$A_{\min}(S) = \min_{\theta \in [0, \pi)} \{ A(R(S, \theta)) \} \quad \text{and} \quad A_{\max}(S) = \max_{\theta \in [0, \pi)} \{ A(R(S, \theta)) \}.$$
Fig. 2. The minimum area bounding rectangles (three are shown as examples) at all orientations for this shape have constant area but varying aspect ratio; Dα = 0
The orientability of the shape S is defined as:
$$D_\alpha(S) = 1 - \frac{A_{\min}(S) - \alpha \cdot A(S)}{A_{\max}(S) - \alpha \cdot A(S)}$$
where A(S) denotes the area of S, and the α · A(S) terms (α ∈ [0, 1]) have been included to enable shapes that have identical convex hulls to be differentiated. Dα satisfies the required properties (a), (b) and (c), but not property (d). In fact, Dα = 0 holds not only for the circle, but for all members of the class of curves of constant width (of which the circle is just one instance). The most famous example is the Reuleaux triangle. At all θ the minimum area bounding rectangles R(S, θ) will be identical squares. Furthermore, it is also possible to have other shapes with varying width that nevertheless have Dα = 0. In such cases the aspect ratio of R(S, θ) varies over θ, but A(R(S, θ)) remains constant, as shown in figure 2.
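A sampled approximation of Dα for polygons can be sketched in Python as follows. The sketch assumes a simple polygon given as a vertex list, approximates the minimum and maximum over θ by a fixed number of uniformly spaced angles (a hypothetical parameter `samples`), and returns 0 when the denominator vanishes. The function names are illustrative.

```python
import math

def polygon_area(poly):
    """Area of a simple polygon by the shoelace formula."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return 0.5 * abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs))

def bounding_area(poly, theta):
    """Area of the bounding rectangle whose edges make angle theta with the axes."""
    c, s = math.cos(theta), math.sin(theta)
    us = [x * c + y * s for x, y in poly]
    vs = [-x * s + y * c for x, y in poly]
    return (max(us) - min(us)) * (max(vs) - min(vs))

def orientability_bounding(poly, alpha=0.0, samples=180):
    """Sampled D_alpha = 1 - (A_min - alpha*A) / (A_max - alpha*A)."""
    areas = [bounding_area(poly, math.pi * k / samples) for k in range(samples)]
    a_shape = polygon_area(poly)
    denom = max(areas) - alpha * a_shape
    if denom <= 0:
        return 0.0
    return 1.0 - (min(areas) - alpha * a_shape) / denom
```

For a 4 × 1 rectangle the bounding area ranges from 4 to 12.5, which gives Dα = 0.68 for α = 0 and Dα = 1 for α = 1. A unit square gives 0.5 for α = 0.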
Fig. 3. This shape has orientability DP = 0
Projection of Edges. In place of (2) Žunić [13] suggested computing the orientation that maximises the squared projection of the shape's boundary. If the edges in polygon P have angles θi and lengths li, i = 1 . . . n, with respect to the x-axis, then the required orientation θ0 can be found as
$$\tan 2\theta_0 = \frac{\sum_{i=1}^{n} l_i^2 \sin 2\theta_i}{\sum_{i=1}^{n} l_i^2 \cos 2\theta_i}.$$
We then compute orientability using the squared projection values at θ0 and θ0 + π/2 as
$$D_P = 1 - \frac{\sum_{i=1}^{n} l_i^2 \sin^2(\theta_i - \theta_0)}{\sum_{i=1}^{n} l_i^2 \cos^2(\theta_i - \theta_0)}. \tag{4}$$
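The edge projection measure can be sketched for a polygon as follows. The Python code assumes that the squared projections of an edge onto θ0 and θ0 + π/2 are li² cos²(θi − θ0) and li² sin²(θi − θ0), and it uses `atan2` to select the maximising solution for θ0; the function name is hypothetical.

```python
import math

def edge_projection_orientability(poly):
    """Return (theta0, D_P) for a polygon given as a list of vertices."""
    edges = []
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        dx, dy = x2 - x1, y2 - y1
        edges.append((math.hypot(dx, dy), math.atan2(dy, dx)))
    num = sum(l * l * math.sin(2.0 * t) for l, t in edges)
    den = sum(l * l * math.cos(2.0 * t) for l, t in edges)
    theta0 = 0.5 * math.atan2(num, den)  # maximises the squared projection
    across = sum((l * math.sin(t - theta0)) ** 2 for l, t in edges)
    along = sum((l * math.cos(t - theta0)) ** 2 for l, t in edges)
    return theta0, 1.0 - across / along
```

For a 4 × 1 rectangle the squared projections are 32 along and 2 across the long side, so DP = 1 − 2/32 = 0.9375. For a square the two sums are equal for every θ0, and DP is zero up to rounding.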
A problem with this approach is that it can give unexpected results in some instances, see for example figure 3, which has DP = 0 since the edge angles are distributed evenly in two orthogonal directions like a square. Given the measure's sensitivity to local deviations in edge orientations, if the boundary is extracted from an image then it needs to have simplification (polygonal approximation) applied to eliminate these variations. However, this leads to another problem: variations in the sampling rate. For instance, consider an elongated rectangle from which points are selected along the boundary. Sampling at regular intervals will produce a high value of DP. However, the value of DP can be arbitrarily reduced by sampling the long side more densely than the short side. Since the projected lengths are squared, the effect in (4) of the segments from the long side will decrease. A solution to both these problems can be found in a technique previously employed for measuring rectilinearity [7]. The original data rather than its polygonal approximation is used, but is smoothed over a range of scales. The maximum orientability over scale is then selected. If a curve is smoothed using the geometric heat flow equation then it becomes more and more circular, eventually shrinking to a circular point in finite time [3]. Thus blurring will not increase orientability unless it reveals an elongated global structure that was effectively masked from the measure DP by local edges whose orientations are not aligned with the global structure.
Weighting of Edges. Duchêne et al. [1] describe a method for estimating orientation which also includes a confidence factor which can be used as an orientability measure. Orientations are considered for θ ∈ [0, π). The absolute difference in orientation φi of each edge from θ is taken modulo π. For each edge, if φi < δ then it contributes the following weight
$$w_i = l_i \left(1 - \frac{\phi_i}{\delta}\right).$$
Orientation is taken as the value of θ maximising $\sum_{i=1}^{n} w_i$. The same process is carried out to compute orientability as
$$D_W = \max_{\theta \in [0, \frac{\pi}{2})} \frac{\sum_{i=1}^{n} w_i}{\mathrm{perim}(P)}$$
except that φi are now taken modulo π/2. Note that if δ = π/n then the measure is over the interval [2/n, 1] rather than [0, 1). Duchêne et al. set δ = π/12, and in our experiments we have done likewise.
Circularity Measures. Since a circle should produce an extreme value of orientability it seems reasonable that some of the measures of circularity can be used to measure orientability. Let the circumscribed circle, inscribed (i.e. largest empty) circle and convex hull of polygon P be denoted by CC(P), IC(P) and CH(P). These geometric constructions are robust with respect to perturbations of the boundary, which makes them suitable for use as elements of shape descriptors. Moreover, the inscribed circle can be computed in O(n log n) time [6] and the other two can be computed in linear time [2,4]. Note that IC(P) ≤ CH(P) ≤ CC(P).
Only for the least orientable shape, a circle, does CC(P) = IC(P) = CH(P), and so they can be simply combined to form the following orientability measures, which can be based on either area or perimeter values:
$$C_1 = 1 - \frac{\mathrm{area}(CH(P))}{\mathrm{area}(CC(P))}; \qquad C_2 = 1 - \frac{\mathrm{area}(IC(P))}{\mathrm{area}(CH(P))};$$
$$C_3 = 1 - \frac{\pi}{\pi - 2}\left(\frac{\mathrm{perim}(CH(P))}{\mathrm{perim}(CC(P))} - \frac{2}{\pi}\right); \qquad C_4 = 1 - \frac{\mathrm{perim}(IC(P))}{\mathrm{perim}(CH(P))}.$$
Incorporating the circumscribed circle in the measure will ensure that protrusions are counted against circularity (and therefore in favour of orientability), while the inscribed circle will ensure that intrusions are treated likewise: counted against circularity and in favour of orientability. Further variants are possible, using for instance linear combinations of the above such as αC1 + (1 − α)C2 or different scale normalisations (e.g. using polygon diameter).
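As a quick sanity check on the constants in the perimeter-based measure, read here as C3 = 1 − π/(π − 2) · (perim(CH(P))/perim(CC(P)) − 2/π), the two extreme shapes can be worked through. The sketch assumes that a straight line segment of length L has a degenerate convex hull of perimeter 2L and a circumscribed circle of diameter L.

```latex
% Circle: CH(P) = CC(P), so the perimeter ratio is 1.
C_3 = 1 - \frac{\pi}{\pi - 2}\left(1 - \frac{2}{\pi}\right)
    = 1 - \frac{\pi}{\pi - 2}\cdot\frac{\pi - 2}{\pi} = 0 .

% Line segment of length L (assumed degenerate hull and circle of diameter L):
\frac{\mathrm{perim}(CH(P))}{\mathrm{perim}(CC(P))} = \frac{2L}{\pi L} = \frac{2}{\pi},
\qquad
C_3 = 1 - \frac{\pi}{\pi - 2}\left(\frac{2}{\pi} - \frac{2}{\pi}\right) = 1 .
```

Under these assumptions the offset 2/π and the factor π/(π − 2) rescale the perimeter ratio so that the circle maps to 0 and the line segment to 1, in line with properties (b) and (e).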
3 Experiments
To visualise the similarities and differences between the various orientability measures they have been applied to order some simple shapes as shown in figure 4. Overall we can see that the circle is always assigned a low orientability score while the rectangle receives a high score, but there are many differences in behaviour for intermediate shapes. As expected, the low order moments measure DF is unable to discriminate between the rotationally symmetric shapes, which are all assigned a value close to zero (the difference from zero being due to numerical and digitisation errors). In addition, two shapes, the rippled pear and the donkey, have been normalised according to Süße and Ditrich's scheme [11], and consequently also have approximately zero values. Using higher order powers overcomes this problem, as seen in the ranking by DE(50). Since this is still effectively an area based measure, small area features, such as the indentation in the circle (second shape from left), have little effect. As explained in section 2.2, Dα has a problem with certain shapes such as the irregular Reuleaux shape, fourth from the left for Dα=0.0, and the third shape from the left, also described in figure 2. The effect of increasing the value of α is demonstrated: the circle and rectangle have identical values to their modified versions with deep intrusions at α = 0, but are increasingly discriminated as α increases. Note however that Dα=1.0 assigns a square the same peak value of one as a rectangle since Amin = A. Putting the edge projection method in a multi-scale context (DP) enables it to successfully cope with the zigzag rectangle (second on the right). Also, being boundary based it is very sensitive to the deep (but narrow) indentations which make the modified circle and rectangle much more orientable. It also discriminates between the plain square and the modified version with the zigzag pattern on the top and bottom, but fails to distinguish between a square, a cross and a circle. As mentioned previously, Duchêne et al.'s method [1] (DW) does not reach the lower bound of zero, even for a circle. Another point to note is that all the
Measure   Values in ascending order (one per shape; shape thumbnails not reproduced)
DF        .000 .000 .000 .001 .003 .003 .004 .005 .009 .013 .025 .096 .405 .452 .886 .986
DE(50)    .005 .007 .044 .065 .121 .204 .221 .245 .245 .250 .287 .295 .306 .425 .717 .917
Dα=0.0    .007 .007 .008 .008 .072 .109 .149 .176 .338 .388 .500 .500 .500 .500 .563 .860
Dα=0.5    .011 .012 .013 .013 .094 .133 .185 .236 .397 .537 .593 .604 .658 .667 .668 .925
Dα=1.0    .028 .028 .033 .034 .135 .170 .243 .357 .481 .729 .762 .820 .869 .963 1.00 1.00
DP        .002 .005 .005 .005 .006 .021 .137 .427 .428 .443 .458 .503 .535 .699 .884 .984
DW        .183 .188 .224 .240 .269 .334 .405 .425 .582 1.00 1.00 1.00 1.00 1.00 1.00 1.00
C1        .007 .017 .113 .127 .200 .240 .360 .362 .365 .365 .371 .382 .452 .591 .656 .898
C2        .013 .095 .215 .331 .396 .442 .597 .718 .770 .815 .860 .884 .890 .926 .932 .936
C3        .008 .021 .092 .150 .156 .180 .272 .278 .280 .283 .393 .427 .493 .574 .674 .875
C4        .007 .058 .213 .235 .303 .395 .413 .483 .521 .618 .652 .705 .735 .753 .758 .881
Fig. 4. Shapes ranked according to various orientability measures
rectilinear shapes achieve a value of one, suggesting that this is actually more of a rectilinearity measure [15,7] than an orientability measure. The circularity measures that incorporate the inscribed circle (C2, C4) are more sensitive to intrusions than the remaining circularity measures, which makes the modified circle and rectangle more orientable. At least in these examples there does not appear to be a significant difference between the area and perimeter based versions. Note that several measures cannot distinguish the 'U' shape from a square: Dα=0.0, DW, C1, C3. Also, the shapes are more clearly ordered by some measures; e.g. half the shapes have almost identical DF values. Quantifying this for each measure by the median of the differences in the ordered values gives {.004, .024, .037, .051, .049, .015, .016, .040, .046, .058, .051}, confirming that Dα>0 and the circularity measures perform well in this respect.
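The median of the differences can be recomputed from the ordered values of figure 4. The short Python sketch below does this for two of the measures, DF and DW; the function name is hypothetical and the values are copied from the figure.

```python
from statistics import median

def median_gap(sorted_scores):
    """Median of the differences between consecutive ordered values."""
    gaps = [b - a for a, b in zip(sorted_scores, sorted_scores[1:])]
    return median(gaps)

# Ordered values of D_F and D_W as listed in figure 4.
d_f = [.000, .000, .000, .001, .003, .003, .004, .005,
       .009, .013, .025, .096, .405, .452, .886, .986]
d_w = [.183, .188, .224, .240, .269, .334, .405, .425,
       .582, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]

print(round(median_gap(d_f), 3))  # 0.004
print(round(median_gap(d_w), 3))  # 0.016
```

Both results match the first and seventh entries of the list quoted above, which suggests that the list follows the order of the rows in figure 4.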
4 Conclusions
Several orientability measures have been described and tested. While they all operate successfully for extremes of orientability – i.e. a circle and an elongated rectangle – performance is varied for intermediate shapes. Many of the measures cannot distinguish between dissimilar shapes, e.g. the class of rotationally symmetric shapes, constant width shapes, or squares versus circles. Future work will look at applying the orientability measures as shape descriptors, and evaluating their effectiveness for various object classification tasks.
References
1. Duchêne, C., Bard, S., Barillot, X., Ruas, A., Trévisan, J., Holzapfel, F.: Quantitative and qualitative description of building orientation. In: Workshop on Progress in Automated Map Generalisation (2003)
2. Gärtner, B.: Fast and robust smallest enclosing balls. In: Nešetřil, J. (ed.) ESA 1999. LNCS, vol. 1643, pp. 325–338. Springer, Heidelberg (1999)
3. Grayson, M.A.: The heat equation shrinks embedded plane curves to round points. Journal of Differential Geometry 26, 285–314 (1987)
4. McCallum, D., Avis, D.: A linear algorithm for finding the convex hull of a simple polygon. Inform. Process. Lett. 9, 201–206 (1979)
5. Mukundan, R., Ramakrishnan, K.R.: Moment Functions in Image Analysis – Theory and Applications. World Scientific, Singapore (1998)
6. Preparata, F.P., Shamos, M.I.: Computational Geometry. Springer, Heidelberg (1985)
7. Rosin, P.L., Žunić, J.: Measuring rectilinearity. Computer Vision and Image Understanding 99(2), 175–188 (2005)
8. Shen, D., Ip, H.H.S.: Optimal axes for defining the orientations of shapes. Electronics Letters 32(20), 1873–1874 (1996)
9. Singer, M.H.: A general approach to moment calculation for polygons and line segments. Pattern Recognition 26(7), 1019–1028 (1993)
10. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision. PWS (1998)
11. Süße, H., Ditrich, F.: Robust determination of rotation-angles for closed regions using moments. In: Int. Conf. Image Processing, vol. 1, pp. 337–340 (2005)
12. Tsai, W.H., Chou, S.L.: Detection of generalized principal axes in rotationally symmetric shapes. Pattern Recognition 24(1), 95–104 (1991)
13. Žunić, J.: Boundary based orientation of polygonal shapes. In: Chang, L.-W., Lie, W.-N. (eds.) PSIVT 2006. LNCS, vol. 4319, pp. 108–117. Springer, Heidelberg (2006)
14. Žunić, J., Kopanja, L., Fieldsend, J.E.: Notes on shape orientation where the standard method does not work. Pattern Recognition 39(5), 856–865 (2006)
15. Žunić, J., Rosin, P.L.: Rectilinearity measurements for polygons. IEEE Trans. on Patt. Anal. and Mach. Intell. 25(9), 1193–1200 (2003)
16. Žunić, J., Rosin, P.L., Kopanja, L.: On the orientability of shapes. IEEE Trans. on Image Processing 15(11), 3478–3487 (2006)
A 3–Subiteration Surface–Thinning Algorithm
Kálmán Palágyi
Department of Image Processing and Computer Graphics, University of Szeged, Hungary
[email protected]

Abstract. Thinning is an iterative layer-by-layer erosion for extracting the skeleton. This paper presents an efficient parallel 3D thinning algorithm which produces medial surfaces. A three–subiteration strategy is proposed: the thinning operation is changed from iteration to iteration with a period of three according to the three deletion directions.
1 Introduction
The skeleton is a region–based shape feature that is extracted from binary image data. A very illustrative definition of the skeleton is given using the prairie–fire analogy: the object boundary is set on fire and the skeleton is formed by the loci where the fire fronts meet and quench each other [4]. In discrete spaces, the thinning process is a frequently used method for producing an approximation to the skeleton in a topology–preserving way [7]. It is based on digital simulation of the fire front propagation: border points of a binary object that satisfy certain topological and geometric constraints are deleted in iteration steps. The entire process is repeated until only the “skeleton” is left. A simple point is an object point whose deletion does not alter the topology of the image [9]. Sequential thinning algorithms delete simple points which are not end points, since preserving end–points provides important information relative to the shape of the objects. Curve thinning (i.e., a thinning process for extracting the medial line) preserves line–end points while surface thinning (i.e., a thinning process for extracting the medial surface) does not delete surface–end points. Parallel thinning algorithms delete a set of simple points simultaneously. A possible approach to preserving topology is the directional approach (often referred to as the subiteration–based or border sequential strategy) [6]: the thinning operation is changed from iteration to iteration with a period of n (n ≥ 2); each iteration of a period is then called a subiteration, in which only border points of a certain kind can be deleted. Since there are six major directions in 3D images, 6–subiteration thinning algorithms were generally proposed [3,5,8,11,12,17]. Note that 3–, 8–, and 12–subiteration algorithms were also developed [13,14,15]. In this paper, a new non–conventional 3–subiteration surface–thinning algorithm is proposed. Experiments on synthetic objects demonstrate its effectiveness.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 628–635, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Basic Notions
Let p be a point in the 3D digital space $\mathbb{Z}^3$. Let us denote by $N_j(p)$ (for j = 6, 26) the set of points j–adjacent to point p (see Fig. 1). A binary image I is a mapping $I : \mathbb{Z}^3 \to \{0, 1\}$ that assigns value 1 to black (or object) points and value 0 to white points.

Fig. 1. The set $N_6(p)$ of the central point $p \in \mathbb{Z}^3$ contains the central point p and the 6 points marked U = u(p), N = n(p), E = e(p), S = s(p), W = w(p), and D = d(p). The set $N_{26}(p)$ contains $N_6(p)$ and the additional 18 points marked “•”.
We are dealing with (26, 6) images [7]: the objects of the given image are the equivalence classes of the set of black points induced by the transitive closure of the 26–adjacency, while the white components (the background and the cavities) are the equivalence classes of the set of white points induced by the transitive closure of the 6–adjacency. It is assumed that any image contains finitely many black points. A black point is called a border point if it is 6–adjacent to at least one white point. A border point p is called a U–border point in image I if I(u(p)) = 0 (see Fig. 1). We can define N–, E–, S–, W–, and D–border points in the same way. A black point p is called an interior point if it is not a border point (i.e., I(p) = I(u(p)) = I(n(p)) = I(e(p)) = I(s(p)) = I(w(p)) = I(d(p)) = 1). A black point is called a simple point if its deletion does not alter the topology of the image [7]. (Note that the simplicity of point p in a (26, 6) image is a local property; it can be decided by examining $N_{26}(p)$.) We propose a new surface–thinning algorithm for extracting medial surfaces from 3D (26, 6) images. The deletable points of the algorithm are border points of certain types that are not surface end–points (i.e., not extremities of surfaces). The proposed algorithm uses the following characterization of surface end–points: a black point is a surface end–point in an image if it is a border point and it is not 6–adjacent to any interior point. Note that the same characterization has been used by other authors [1,10].
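The three point classes defined above (border, interior, and surface end–point) translate directly into code. The following is a minimal sketch, assuming the image is stored as a Python set holding the coordinates of its finitely many black points; the function names and the assignment of U/D, N/S, E/W to coordinate axes are illustrative choices, not taken from the paper.

```python
# Offsets of the six 6-neighbours u(p), d(p), n(p), s(p), e(p), w(p).
# Which axis is called "up", "north" or "east" is an assumption of this sketch.
N6 = [(0, 0, 1), (0, 0, -1), (0, 1, 0), (0, -1, 0), (1, 0, 0), (-1, 0, 0)]


def neighbours6(p):
    """Return the six points that are 6-adjacent to p."""
    x, y, z = p
    return [(x + dx, y + dy, z + dz) for dx, dy, dz in N6]


def is_border(black, p):
    """Black point that is 6-adjacent to at least one white point."""
    return p in black and any(q not in black for q in neighbours6(p))


def is_interior(black, p):
    """Black point whose six 6-neighbours are all black."""
    return p in black and all(q in black for q in neighbours6(p))


def is_surface_end_point(black, p):
    """Border point that is not 6-adjacent to any interior point."""
    return is_border(black, p) and not any(
        is_interior(black, q) for q in neighbours6(p)
    )


# A solid 3x3x3 cube has exactly one interior point, its centre.
cube = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
# A one-voxel-thick 3x3 plate has no interior point at all.
plate = {(x, y, 0) for x in range(3) for y in range(3)}
```

In the cube, the face centre (1, 1, 0) touches the interior centre point and is therefore not a surface end–point, whereas every point of the plate is one, which is why a thin plate survives as part of a medial surface.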
3 The New Thinning Algorithm
Each conventional 6–subiteration 3D thinning algorithm uses the six deletion directions that can delete certain U–, D–, N–, E–, S–, and W–border points, respectively [3,5,8,11,12,17]. In our 3–subiteration approach, two kinds of border points can be deleted in each subiteration. The three deletion directions correspond to the three kinds of opposite pairs of points, and are denoted by UD, NS, and EW. The first subiteration, assigned to the deletion direction UD, can delete certain U– or D–border points; the second subiteration, associated with the deletion direction NS, attempts to delete N– or S–border points; and some E– or W–border points can be deleted by the third subiteration, corresponding to the deletion direction EW. The proposed algorithm is given as follows:

    Input: binary image A
    Output: binary image B
    3-subiteration thinning(A, B)
    begin
      B = A;
      repeat
        B = deletion from UD(B);   /* 1st subiteration */
        B = deletion from NS(B);   /* 2nd subiteration */
        B = deletion from EW(B);   /* 3rd subiteration */
      until no points are deleted;
    end.
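The control flow of the pseudocode can be sketched in Python as follows. This is only an illustration of the subiteration loop: the deletion rule is passed in as a hypothetical predicate `is_deletable(black, p, direction)`, and the `toy_rule` used in the example is a deliberately simple stand-in that does not reproduce the paper's templates and does not preserve topology in general.

```python
def thin(black, directions, is_deletable):
    """Directional parallel thinning of a set of black points.

    black        -- set of (x, y, z) coordinates of black points
    directions   -- deletion directions applied cyclically, e.g. ("UD", "NS", "EW")
    is_deletable -- hypothetical predicate(black, p, direction) -> bool that
                    stands in for the template matching of a subiteration
    """
    black = set(black)
    while True:
        deleted_in_period = 0
        for direction in directions:
            # Parallel step: all decisions are taken on the same image and the
            # selected points are removed together afterwards.
            to_delete = {p for p in black if is_deletable(black, p, direction)}
            black -= to_delete
            deleted_in_period += len(to_delete)
        if deleted_in_period == 0:  # "until no points are deleted"
            return black


# Axis assigned to each deletion direction (an assumption of this sketch).
AXIS = {"UD": (0, 0, 1), "NS": (0, 1, 0), "EW": (1, 0, 0)}


def toy_rule(black, p, direction):
    """Toy rule: delete p if exactly one of its two opposite neighbours
    along the current direction is black. Illustration only."""
    dx, dy, dz = AXIS[direction]
    ahead = (p[0] + dx, p[1] + dy, p[2] + dz)
    behind = (p[0] - dx, p[1] - dy, p[2] - dz)
    return (ahead in black) != (behind in black)


column = {(0, 0, z) for z in range(5)}
result = thin(column, ("UD", "NS", "EW"), toy_rule)
```

With the toy rule, the five-voxel column is eroded symmetrically from both ends during the UD subiterations and ends as its middle voxel. A two-voxel column would vanish entirely under the same rule, which illustrates why real deletion rules need additional conditions such as those encoded in the templates.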
The new value of a black point depends on the values of 28 additional points. The considered special neighbourhoods are presented in Fig. 2.

Fig. 2. The special local neighbourhoods assigned to the deletion directions UD, NS, and EW, respectively. The new value of a black point p depends on $N_{26}(p)$ and two additional points.
Deletable points in a subiteration are given by a set of matching templates. A black point is deletable if at least one template in the set of templates matches it. The set of templates $T_{UD}$ is given by Fig. 3. Note that Fig. 3 shows only the eight base templates TU1–TU4 and TD1–TD4. Additionally, all their rotations around the vertical axis belong to $T_{UD}$, where the rotation angles are 90°, 180°, and 270°. It is easy to see that the complete $T_{UD}$ contains 2 · (1 + 4 + 4 + 4) = 26 templates. This set of templates was constructed for deleting some simple points which are neither surface end–points nor extremities of surfaces. The deletable points of the other two subiterations (corresponding to deletion directions NS and EW) can be obtained by proper rotations of the templates in $T_{UD}$. Note that choosing another order of the deletion directions yields another algorithm. The proposed algorithm terminates when there are no more black points to be deleted. Since all considered input images are finite, it will terminate.

Fig. 3. Base templates TU1–TU4 and TD1–TD4 and their rotations around the vertical axis form the set of templates $T_{UD}$ assigned to the deletion direction UD. This set of templates belongs to the first subiteration. Notations: each position marked “p” or “•” matches a black point; each position marked “◦” matches a white point; each “·” (“don’t care”) matches either a black or a white point.

Fig. 4. Indices of the 25 Boolean variables (i.e., the considered points in $N_{26}(p)$) for the deletion directions UD, NS, and EW. Note that one point of $N_{26}(p)$ (specially marked in the figure) need not be investigated. Since the deletion rule of a subiteration can be derived from the deletion rule of the reference subiteration UD by the proper rotation, the indexing scheme of a subiteration corresponds to the proper permutation of positions assigned to the reference subiteration.
Fig. 5. Two synthetic images containing a 140 × 140 × 50 horse and a 45 × 45 × 45 cube (top); and their skeletons produced by the proposed surface–thinning algorithm (bottom)
Implementing the proposed algorithm may seem rather difficult and time consuming, but this is not the case. A border point p in image B is to be deleted from deletion direction UD if

    ( ( B(d(p)) = 1 and B(u(p)) = 0 and d(p) is an interior point ) or
      ( B(u(p)) = 1 and B(d(p)) = 0 and u(p) is an interior point ) )
    and f(x0, x1, . . . , x24) = 1,

where f is a Boolean function of 25 variables derived from the set of templates. Since f has 2^25 possible inputs and one bit suffices per input, f can be given by a pre-calculated 4 Mbyte (unit time access) look-up table. The considered 25 variables correspond to 25 points in $N_{26}(p)$ (see Fig. 4). More details concerning the efficient implementation of 3D thinning algorithms are presented in [16].
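The size of the look-up table follows from 2^25 configurations at one bit each, i.e. 2^25 / 8 = 4 194 304 bytes. A minimal sketch of such a bit-packed table is given below; the bit order (x0 as least significant bit) and the helper names are assumptions of this sketch, and filling the table from the templates is not shown.

```python
NUM_VARS = 25
TABLE_BITS = 1 << NUM_VARS      # one bit per configuration of x0..x24
TABLE_BYTES = TABLE_BITS // 8   # 4 194 304 bytes, i.e. 4 Mbyte


def pack_index(bits):
    """Pack the values x0..x24 (each 0 or 1) into one integer.
    x0 is taken as the least significant bit (an arbitrary choice)."""
    index = 0
    for k, bit in enumerate(bits):
        index |= (bit & 1) << k
    return index


def make_table():
    """Return an all-zero table: no configuration is deletable yet."""
    return bytearray(TABLE_BYTES)


def set_entry(table, index, deletable):
    """Store the pre-calculated value of f for one configuration."""
    byte, bit = divmod(index, 8)
    if deletable:
        table[byte] |= 1 << bit
    else:
        table[byte] &= ~(1 << bit) & 0xFF


def lookup(table, bits):
    """Evaluate f(x0, ..., x24) with a single table access."""
    byte, bit = divmod(pack_index(bits), 8)
    return (table[byte] >> bit) & 1
```

Once the table has been filled offline, each evaluation of f costs one index computation and one memory access, independent of the number of templates.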
4 Discussion and Results
Thinning algorithms have to take care of the following four aspects:

1. forcing the “skeleton” to retain the topology of the original object (i.e., topology is to be preserved);
2. providing “shape preservation” (i.e., significant features of the original object are to be produced);
3. forcing the “skeleton” to be in its geometrically correct position (i.e., in the “middle” of the object);
4. producing “maximal” thinning (i.e., the desired “width” of the “skeleton” is one point).

The topological correctness (the 1st requirement) can be seen by using a characterization of simple points [9] and a sufficient condition for parallel reduction operations of 3D (26, 6) images [13]. Shape preservation (the 2nd requirement) is a fairly important requirement, too. For example, an object like “b” cannot be thinned into an object like “o”. The aim of thinning is not to produce the topological kernel [2] of an object: thinning differs from shrinking. That is the reason why end–point criteria are used in thinning. It is easy to see that surface end–points are removed by none of our templates. Geometrical correctness (the 3rd requirement) of the extracted skeleton is mostly achieved by the subiteration (multi–directional) thinning approach: an object is to be shrunk uniformly from each direction.
Fig. 6. Three synthetic images containing a 45 × 45 × 45 cube with one, two, and three hole(s), respectively (top); and their skeletons produced by the proposed surface–thinning algorithm (bottom)
Table 1. Computation times for the considered five kinds of test objects. The implemented surface–thinning algorithm was run under Linux on an Intel Pentium 4 CPU 2.80 GHz PC. Due to the efficient implementation [16], the time complexity depends only on the number of object points and the compactness of the objects (i.e., volume to area ratio); but it does not depend on the size of the image.

    test object               size               number of object points   running time (sec.)
    horse                     140 × 140 × 50                      92 534                 0.125
    cube                      45 × 45 × 45                        91 125                 0.035
                              93 × 93 × 93                       804 357                 0.383
                              141 × 141 × 141                  2 803 221                 1.388
    cube with one hole        45 × 45 × 45                        81 000                 0.032
                              93 × 93 × 93                       714 984                 0.359
                              141 × 141 × 141                  2 491 752                 1.320
    cube with two holes       45 × 45 × 45                        74 250                 0.028
                              93 × 93 × 93                       655 402                 0.335
                              141 × 141 × 141                  2 284 106                 1.268
    cube with three holes     45 × 45 × 45                        67 500                 0.026
                              93 × 93 × 93                       595 820                 0.322
                              141 × 141 × 141                  2 076 460                 1.105
It is rather difficult to prove that the 4th requirement about maximal thinning is satisfied. Due to the surface end–point criterion used, the produced skeleton may contain 2–point thick surface patches [1,10]. This problem is easy to overcome (e.g., by applying the final thinning step proposed by Arcelli et al. [1]). Our algorithm has been tested on objects of different shapes. Here we present five examples (see Figs. 5–6). The computation time of a thinning process depends on the complexity of an iteration step and the required number of iteration steps. The 3–subiteration 3D thinning strategy has been compared with other subiteration–based approaches with periods of 6, 8, or 12. It has been shown that the 3–subiteration approach requires the smallest number of iterations [15]. If unit time access look-up tables (corresponding to the deletion rules of the considered algorithms) are used and our efficient implementation method [16] is applied, then the 3–subiteration algorithms are the fastest subiteration–based ones. The efficiency of the proposed method is illustrated in Table 1.
Acknowledgements The author is grateful to Stina Svensson (Centre for Image Analysis, Swedish University of Agricultural Sciences, Uppsala, Sweden) for supplying the horse image data (see Fig. 5).
References
1. Arcelli, C., Sanniti di Baja, G., Serino, L.: New removal operators for surface skeletonization. In: Kuba, A., Nyúl, L.G., Palágyi, K. (eds.) DGCI 2006. LNCS, vol. 4245, pp. 555–566. Springer, Heidelberg (2006)
2. Bertrand, G., Aktouf, Z.: A 3D thinning algorithms using subfields. In: Proc. SPIE Conf. on Vision Geometry III, vol. 2356, pp. 113–124 (1994)
3. Bertrand, G.: A parallel thinning algorithm for medial surfaces. Pattern Recognition Letters 16, 979–986 (1995)
4. Blum, H.: A transformation for extracting new descriptors of shape. Models for the Perception of Speech and Visual Form, pp. 362–380. MIT Press, Cambridge (1967)
5. Gong, W.X., Bertrand, G.: A simple parallel 3D thinning algorithm. In: Proc. 10th Int. Conf. on Pattern Recognition, pp. 188–190 (1990)
6. Hall, R.W.: Parallel connectivity–preserving thinning algorithms. In: Kong, T.Y., Rosenfeld, A. (eds.) Topological algorithms for digital image processing, pp. 145–179. Elsevier Science, Amsterdam (1996)
7. Kong, T.Y., Rosenfeld, A.: Digital topology: Introduction and survey. Computer Vision, Graphics, and Image Processing 48, 357–393 (1989)
8. Lee, T., Kashyap, R.L., Chu, C.: Building skeleton models via 3–D medial surface/axis thinning algorithms. CVGIP: Graphical Models and Image Processing 56, 462–478 (1994)
9. Malandain, G., Bertrand, G.: Fast characterization of 3D simple points. In: Proc. 11th IEEE Internat. Conf. on Pattern Recognition, pp. 232–235 (1992)
10. Manzanera, A., Bernard, T.M., Prêteux, F., Longuet, B.: Medial faces from a concise 3D thinning algorithm. In: Proc. 7th IEEE Internat. Conf. Computer Vision, ICCV’99, pp. 337–343 (1999)
11. Mukherjee, J., Das, P.P., Chatterjee, B.N.: On connectivity issues of ESPTA. Pattern Recognition Letters 11, 643–648 (1990)
12. Palágyi, K., Kuba, A.: A 3D 6–subiteration thinning algorithm for extracting medial lines. Pattern Recognition Letters 19, 613–627 (1998)
13. Palágyi, K., Kuba, A.: Directional 3D thinning using 8 subiterations. In: Bertrand, G., Couprie, M., Perroton, L. (eds.) DGCI 1999. LNCS, vol. 1568, pp. 325–336. Springer, Heidelberg (1999)
14. Palágyi, K., Kuba, A.: A parallel 3D 12–subiteration thinning algorithm. Graphical Models and Image Processing 61, 199–221 (1999)
15. Palágyi, K.: A 3-subiteration 3D thinning algorithm for extracting medial surfaces. Pattern Recognition Letters 23, 663–675 (2002)
16. Palágyi, K.: Efficient implementation of 3D thinning algorithms. In: Proc. 6th Conf. Hungarian Association for Image Processing and Pattern Recognition, pp. 266–274 (2007)
17. Tsao, Y.F., Fu, K.S.: A parallel thinning algorithm for 3–D pictures. Computer Graphics and Image Processing 17, 315–331 (1981)
Extraction of River Networks from Satellite Images by Combining Mathematical Morphology and Hydrology Pierre Soille and Jacopo Grazzini Spatial Data Infrastructures Unit Institute for Environment and Sustainability DG Joint Research Centre, European Commission, I-21020 Ispra, Italy {Pierre.Soille,Jacopo.Grazzini}@jrc.it
Abstract. In this paper, we propose a new methodology for extracting river networks from satellite images. It combines morphological generalised geodesic transformations with hydrological overland flow simulations. The method requires the prior generation of a geodesic mask and a marker image by applying a series of transformations to the original image. These images are then combined so as to produce a pseudo digital elevation model whose valleys match the desired networks. The performance of the methodology is demonstrated for the extraction of river networks from a single band of a Landsat image. The method is generic in the sense that it can be extended for the extraction of other types of arborescent networks such as blood vessels in medical images.
1 Introduction
Line networks [1], also called thin nets [2] or curvilinear structures [3], are found in numerous application fields. Probably the most studied ones are those encountered in medical images (e.g., blood vessels [4]) and satellite images of the Earth (e.g., roads [5,6] and rivers [7,8,9]). The extraction of line networks is often based on a two-step approach. Typically, a segmentation and/or classification leads to an initial detection of the desired network avoiding false positive detections but containing gaps. These gaps are then bridged using perceptual grouping techniques. In this paper, we focus on the extraction of river networks in satellite images and propose to exploit the arborescent nature of these networks (i.e., tree-like structure with a root). Similarly to watershed-based segmentation, this is achieved by combining concepts arising from mathematical morphology and hydrology. Morphological algorithms allow for the creation of a pseudo digital elevation model whose main valley lines match actual rivers. These valleys are then extracted using overland flow simulation procedures known in hydrology. The rest of the paper is organised as follows. Background morphological and hydrological concepts at the root of the proposed methodology are recalled in Sec. 2. The methodology itself is detailed in Sec. 3.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 636–644, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Background Concepts
This section briefly describes the two main concepts at the basis of the proposed methodology: the geodesic time function in mathematical morphology and contributing drainage areas in hydrology. Unless otherwise stated, we adopt the definitions and notations presented in [10].

2.1 Geodesic Time in Mathematical Morphology
The geodesic distance between two points of a connected set is defined as the length of the shortest path(s) linking these points and remaining in the set [11]. This idea has been generalised to grey tone geodesic masks in [12] by introducing the notion of geodesic time. The time necessary for travelling a path is defined as the sum of the intensity values of the points of the path. The geodesic time separating two points of a grey tone geodesic mask is then defined as the smallest amount of time of all paths linking these points and remaining within the definition domain of the geodesic mask. Formally, the time necessary to cover a discrete path $\mathcal{P}$ of length $l$ (i.e., $\mathcal{P} = (p_0, \ldots, p_l)$) defined on a discrete grey scale image $f$ equals the sum of the means of the values of $f$ taken two at a time along $\mathcal{P}$. It is denoted by $\tau_f(\mathcal{P})$:

$$\tau_f(\mathcal{P}) = \sum_{i=1}^{l} \frac{f(p_{i-1}) + f(p_i)}{2} = \frac{f(p_0)}{2} + \frac{f(p_l)}{2} + \sum_{i=1}^{l-1} f(p_i).$$

The geodesic time separating two points $p$ and $q$ in a grey scale image $f$ is denoted by $\tau_f(p, q)$:

$$\tau_f(p, q) = \min\{\tau_f(\mathcal{P}) \mid \mathcal{P} \text{ is a path linking } p \text{ to } q\}.$$

The geodesic time between a point $p$ and a reference set of points $Y$ of a geodesic mask image $f$ is the smallest amount of time allowing to link $p$ to any point $q$ of $Y$:

$$\tau_f(p, Y) = \min_{q \in Y} \tau_f(p, q).$$

If $p$ belongs to $Y$, the geodesic time from $p$ to $Y$ is zero. The geodesic time function from a reference set $Y$ within a geodesic mask image $f$ is defined by associating each point of the image definition domain with its geodesic time to $Y$. We denote by $T_f(Y)$ the resulting geodesic time function:

$$[T_f(Y)](p) = \tau_f(p, Y).$$

An efficient algorithm based on priority queue data structures is detailed in [10, pp. 237–238]. Note that the concept of geodesic time function is closely related to the notion of grey-weighted distances in image processing [13] and, more generally, cost functions in digital graphs.

2.2 Contributing Drainage Area in Hydrology
Digital elevation models (DEMs) can be viewed as grey tone images where the intensity value of each pixel represents the elevation of the terrain at the corresponding location. Since the availability of the first DEMs in the early 80’s, numerous methodologies have been proposed for extracting river networks from them. A recent survey on this topic can be found in [14]. The best available method relies on the simulation of the flow of water on the topographic surface. It assumes that (i) all spurious minima have been suppressed beforehand and (ii) every point except the stream outlets has at least one neighbour with a lower elevation than the considered point. The flow direction at each point is then defined as the direction of the 8-neighbour of the point corresponding to the steepest slope. In this sense, each point of the terrain belongs to the network of streams. This is a valid assumption since the notion of stream is dynamic and depends on the intensity of the rain over the terrain so that for heavy rainfall all points of the terrain are flowing. In practice, the flow directions allow for the computation of the number of pixels located upstream of each point of the DEM. This number is called the contributing drainage area. River network pixels are then defined as those pixels whose contributing drainage area exceeds some threshold value, see for example [15] and references therein.
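The flow simulation described above can be sketched in a few lines of Python. This is a simple illustration, not the fast algorithm of [15]: the DEM is a list of lists, the slope towards a diagonal neighbour is divided by √2, ties are broken by scan order, and the contributing drainage area counts upstream pixels without the pixel itself. All function names are illustrative.

```python
import math

# Offsets of the eight neighbours of a pixel.
D8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def flow_directions(dem):
    """For each pixel, return the (row, col) of its steepest-descent
    8-neighbour, or None if no neighbour is lower (outlet or pit)."""
    rows, cols = len(dem), len(dem[0])
    down = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best = 0.0
            for dr, dc in D8:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    slope = (dem[r][c] - dem[rr][cc]) / math.hypot(dr, dc)
                    if slope > best:  # first maximum wins (tie-breaking assumption)
                        best = slope
                        down[r][c] = (rr, cc)
    return down


def contributing_area(dem, down):
    """Number of pixels located upstream of each pixel."""
    rows, cols = len(dem), len(dem[0])
    area = [[0] * cols for _ in range(rows)]
    # Visit pixels from highest to lowest so that every pixel is complete
    # before its count is passed on to its downstream neighbour.
    order = sorted(
        ((dem[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )
    for _, r, c in order:
        if down[r][c] is not None:
            rr, cc = down[r][c]
            area[rr][cc] += area[r][c] + 1
    return area


dem = [[3, 2, 3],
       [3, 1, 3],
       [3, 0, 3]]
down = flow_directions(dem)
area = contributing_area(dem, down)
```

In this small valley, the three top pixels drain into the centre pixel and everything drains into the outlet at the bottom, so the outlet has all eight other pixels upstream.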
3 Methodology
Our methodology for extracting river networks from a single optical satellite image can be summarised as follows:

1. Define the outlets of the desired river network and a region of interest;
2. Enhance potential river stretches;
3. Generate a pseudo DEM by computing the geodesic time function with the outlets as reference set and the image with enhanced rivers as geodesic mask;
4. Calculate the local flow directions of the pseudo DEM and the subsequent contributing drainage areas;
5. Trim the resulting space filling network so as to obtain a network matching the target network.

The successive steps are detailed below and illustrated on the Landsat image displayed in Fig. 1a.

3.1 Outlet Definition and Region of Interest
Typically, the outlets of the target network correspond to the sea pixels of the input imagery. Hence, the sea pixels play the role of the reference set of the geodesic time function and will therefore correspond to the level 0 of the pseudo DEM generated in Sec. 3.3. If the watershed boundaries of the catchment basins of the processed area are available, all computations are restricted to a region of interest defined as the union of the catchments fully included in the image and flowing towards sea pixels. Alternatively, other water features such as large inland lakes can be used instead of the sea for the reference set. Finally, if no such features can be detected, the boundary of the image definition domain can be considered as a valid outlet in a first approximation.
Fig. 1. Geodesic mask creation from the fifth band of a Landsat image. (a) Input Landsat image (band 5): $f$. (b) Rank-min closing of $f$: $\phi_{B,\lambda}(f)$. (c) Top-hat by closing of $f$: $\mathrm{BTH}(f) = \phi_{B,\lambda}(f) - f$. (d) Geodesic mask: $g = f \cdot T_{[0,0]}(\mathrm{BTH}(f))$.
3.2 Generation of Geodesic Mask
The geodesic mask is obtained by modifying the input image in such a way that potential river pixels are set to low values. By doing so, the geodesic time function will propagate faster along potential rivers. Since satellite images are usually multispectral images, a combination of channels can be used. In this paper, we use a single spectral band having low reflectance values for water (alternatively, band combinations such as those defined by a water index could be used). Then, rather than exploiting knowledge about the reflectance values of water in the selected band, we translate knowledge about the shape, size, and local contrast of river stretches into a series of morphological operations. By doing so, we are able to enhance river stretches that could not be detected solely on the basis of their spectral values. Indeed, many rivers are detectable by a human operator owing to their shape and relative contrast since they appear as dark networks of thin lines. In mathematical morphology terms, the extraction of potential river pixels can be achieved by initially suppressing these networks with a closing by a disk shaped structuring element whose diameter slightly exceeds the width of the targeted network. However, a closing by a plain structuring element would enhance too many structures. A less restrictive closing consists in considering the intersection (i.e., point-wise minimum) of the closings by all subsets of the disk containing a fixed number of pixels, see Fig. 1b. This type of closing is known as a parametric closing or rank-min closing owing to its representation in terms of a rank filter followed by a point-wise minimum with the input image (see [16] and references therein). In Fig. 1b, taking into account the physical size of a pixel, we used a 3 × 3 square structuring element B and considered every subset of B containing 6 pixels. Although the closed image looks very similar to the input image, this operation closes many potential river stretches. This is highlighted by computing the difference between the closing and the input image, called top-hat by closing, see Fig. 1c. Notice that the top-hat by closing leads to many false positive detections. This is however not an issue because the subsequent steps cope with false positives provided that a higher detection rate is obtained for actual river stretches. We therefore consider as potential river pixels all pixels of this top-hat by closing that have a positive response. The geodesic mask is then formed by setting these pixels to zero, all other pixels retaining their original value, see Fig. 1d.

Fig. 2. From geodesic time function to contributing drainage areas (see also Fig. 1). (a) Geodesic time function $T_g(\mathrm{sea})$; this function can be seen as a pseudo DEM. (b) Shaded view of the lower complete transformation of $T_g(\mathrm{sea})$. (c) Flow directions of (b). (d) Contributing drainage areas of (c). In this example, the sea pixels fall outside the image frame.

3.3 Geodesic Time Function
The geodesic time function is then computed from the marker set (sea pixels) within the geodesic mask. Since potential river pixels are set to zero in the geodesic mask, the propagation goes faster along potential rivers so that the resulting geodesic time function mimics a digital elevation model of the studied terrain. However, the computed time values correspond to a sum of reflectance values so that they do not match actual elevations. For this reason we call this geodesic time function a pseudo DEM, see Fig. 2a.
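As an illustration of this step, the geodesic time function of Sec. 2.1 can be computed with a Dijkstra-style propagation driven by a priority queue. The sketch below is not necessarily the algorithm of [10]; it assumes non-negative mask values, 4-connected paths, and a mask stored as a list of lists, and the function name is illustrative.

```python
import heapq


def geodesic_time(mask, markers,
                  neighbours=((0, 1), (1, 0), (0, -1), (-1, 0))):
    """Geodesic time from the marker set within a grey tone geodesic mask.

    mask       -- 2-D list of non-negative values (the geodesic mask)
    markers    -- iterable of (row, col) pixels forming the reference set
    neighbours -- offsets defining the path connectivity (assumption: 4-connected)
    """
    rows, cols = len(mask), len(mask[0])
    inf = float("inf")
    time = [[inf] * cols for _ in range(rows)]
    heap = []
    for r, c in markers:
        time[r][c] = 0.0  # the geodesic time from a marker to the set is zero
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        t, r, c = heapq.heappop(heap)
        if t > time[r][c]:
            continue  # outdated queue entry
        for dr, dc in neighbours:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                # Cost of one step: mean of the two mask values it links.
                candidate = t + (mask[r][c] + mask[rr][cc]) / 2.0
                if candidate < time[rr][cc]:
                    time[rr][cc] = candidate
                    heapq.heappush(heap, (candidate, rr, cc))
    return time


# Zero-valued pixels play the role of potential river pixels.
g = [[0, 2, 4],
     [0, 9, 4],
     [0, 0, 0]]
pseudo_dem = geodesic_time(g, [(0, 0)])
```

In this toy mask the zero-valued pixels form an L-shaped "river": the propagation reaches the far corner at zero cost along it, so the resulting surface has a valley along the river and rises away from it.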
Fig. 3. River network resulting from thresholding the contributing drainage areas of the geodesic time function. (a) River network extracted by applying a fixed threshold of 23000 pixels to Fig. 2d. (b) Extracted network overlaid on the input image.
3.4 From Geodesic Time Function to Drainage Areas
We then compute the flow directions on the geodesic time function. The flow direction of each pixel is defined as the direction of the 8-neighbour producing the steepest slope. Ties are resolved either by randomly selecting a possible direction or by choosing the central direction when there are three adjacent possibilities. However, flow directions cannot be defined from a local neighbourhood on plateaus (resulting from the presence of zero valued pixels in the geodesic mask). Plateaus can be suppressed using geodesic distance computations from their descending border [17]. A shaded view of the resulting lower complete function is shown in Fig. 2b and the corresponding flow directions in Fig. 2c.
Fig. 4. Detection of blood vessels in an image of an eye retina using the methodology described in this paper (preliminary results). (a) Retina image. (b) Pseudo DEM. (c) Extracted network.
Contributing drainage areas are then calculated by simulating the flow of water on the pseudo DEM using the previously defined flow directions, see [15] for a fast algorithm. The contributing drainage areas of our sample image are displayed in Fig. 2d.

3.5 Network Extraction
The extraction of an actual network from the contributing drainage area image can be achieved by thresholding it with a fixed threshold value. This is illustrated in Fig. 3. Rather than using fixed thresholds, adaptive threshold values can be defined using a priori knowledge or further transformations of the input image. In that case, the river network is defined as the downstream of all pixels whose contributing drainage area exceeds its threshold value. This idea has already been used for the extraction of river networks from actual DEMs in [18].
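The difference between a fixed and an adaptive threshold can be made concrete with a small sketch. It assumes that the contributing drainage areas and the flow directions are already available as lists of lists (each flow direction being the downstream pixel or None at an outlet); the function name and data layout are illustrative, not taken from the paper or from [18].

```python
def extract_network(area, down, threshold):
    """Mark the downstream of every pixel whose contributing drainage
    area exceeds its own threshold value.

    area      -- 2-D list of contributing drainage areas
    down      -- 2-D list giving the downstream (row, col) of each pixel,
                 or None at outlets
    threshold -- 2-D list of per-pixel threshold values
    """
    rows, cols = len(area), len(area[0])
    network = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if area[r][c] > threshold[r][c]:
                p = (r, c)
                # Follow the flow path until an outlet or an already
                # marked pixel is reached.
                while p is not None and not network[p[0]][p[1]]:
                    network[p[0]][p[1]] = True
                    p = down[p[0]][p[1]]
    return network


# Drainage areas and flow directions of a small 3 x 3 valley.
area = [[0, 0, 0],
        [0, 3, 0],
        [0, 8, 0]]
down = [[(1, 1), (1, 1), (1, 1)],
        [(2, 1), (2, 1), (2, 1)],
        [(2, 1), None, (2, 1)]]

fixed = [[2] * 3 for _ in range(3)]
adaptive = [[2, 2, 2],
            [2, 2, 2],
            [2, 100, 2]]
net_fixed = extract_network(area, down, fixed)
net_adaptive = extract_network(area, down, adaptive)
```

With a fixed threshold the downstream tracing changes nothing, since drainage areas can only grow downstream. With adaptive thresholds it keeps the network connected to its outlet: here the outlet pixel does not exceed its own threshold of 100 but is still included because it lies downstream of a selected pixel.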
4 Conclusion and Perspectives
The originality of the method proposed in this paper lies in the combination of geodesic time computations with concepts related to overland flow simulation. The generation of the pseudo digital elevation model exploits information on the local contrast (river stretches are darker in the selected band) and shape (linear and thin structures) of the targeted network as well as its arborescent organisation. If necessary, physical image formation models and the use of several spectral bands rather than a single one could be used to constrain further the generation of the pseudo DEM and/or the extraction of the final network from this DEM. The proposed methodology is generic in the sense that it can be extended for the extraction of other arborescent networks such as blood vessels in medical images. An example of a preliminary result for the extraction of blood vessels in retina images is displayed in Fig. 4. In this example, the optic nerve plays the role of the ’sea’ pixels and has been set manually, see the yellow disk in Fig. 4b. All concepts are directly applicable to 3-dimensional images. This general framework will be detailed in a subsequent paper and will include the treatment of anastomoses (i.e., networks in which line features both branch out and reconnect such as for braided rivers). Multiscale concepts could also be considered for the enhancement of potential network stretches [19].
References
1. Geusebroek, J.M., Smeulders, A., Geerts, H.: A minimum cost approach for segmenting networks of lines. Int. J. of Computer Vision 43, 99–111 (2001)
2. Armande, N., Montesinos, P., Monga, O.: Thin nets extraction using a multi-scale approach. Computer Vision and Image Understanding 73, 285–295 (1999)
3. Steger, C.: An unbiased detector of curvilinear structures. IEEE Trans. on Pattern Analysis and Machine Intelligence 20, 113–125 (1998)
4. Zana, F., Klein, J.C.: Segmentation of vessel-like patterns using mathematical morphology and curvature evaluation. IEEE Trans. on Image Processing 10, 1010–1019 (2001)
5. Merlet, N., Zerubia, J.: New prospects in line detection by dynamic programming. IEEE Trans. on Pattern Analysis and Machine Intelligence 18, 426–431 (1996)
6. Mena, J.: State of the art on automatic road extraction for GIS update: A novel classification. Pattern Recognition Letters 24, 3037–3058 (2003)
7. Haralick, R., Wang, S., Shapiro, L., Campbell, J.: Extraction of drainage networks by using the consistent labeling technique. Remote Sensing of Environment 18, 163–175 (1985)
8. Ichoku, C., Karnieli, A., Meisels, A., Chorowicz, J.: Detection of drainage channel networks on digital satellite images. Int. J. of Remote Sensing 17, 1659–1678 (1996)
9. Dillabaugh, C., Niemann, O., Richardson, D.: Semi-automated extraction of rivers from digital imagery. GeoInformatica 6, 263–284 (2002)
10. Soille, P.: Morphological Image Analysis: Principles and Applications, 2nd edn. Springer, Heidelberg (2003)
11. Lantuéjoul, C., Maisonneuve, F.: Geodesic methods in image analysis. Pattern Recognition 17, 177–187 (1984)
12. Soille, P.: Generalized geodesy via geodesic time. Pattern Recognition Letters 15, 1235–1240 (1994) Download a PDF preprint: http://ams.jrc.it/soille/soille94.pdf
13. Verbeek, P., Verwer, B.: Shading from shape, the eikonal equation solved by grey-weighted distance transform. Pattern Recognition Letters 11, 681–690 (1990)
14. Soille, P., Vogt, J., Colombo, R.: Carving and adaptive drainage enforcement of grid digital elevation models. Water Resources Research 39, 1366 (2003)
15. Soille, P., Gratin, C.: An efficient algorithm for drainage networks extraction on DEMs. J. of Visual Communication and Image Representation 5, 181–189 (1994)
16. Soille, P.: On morphological operators based on rank filters. Pattern Recognition 35, 527–535 (2002)
17. Soille, P.: Advances in the analysis of topographic features on discrete images. In: Braquelaire, A., Lachaud, J.-O., Vialard, A. (eds.) DGCI 2002. LNCS, vol. 2301, pp. 175–186. Springer, Heidelberg (2002)
18. Colombo, R., Vogt, J., Soille, P., Paracchini, M., de Jager, A.: On the derivation of river networks and catchments at European scale from medium resolution digital elevation data. Catena (2007), Available online (November 22, 2006)
19. Grazzini, J., Chrysoulakis, N.: Extraction of surface properties from a high accuracy DEM using multiscale remote sensing techniques. In: Proc. of the 19th Conf. Informatics for Environmental Protection, Brno, Czech Republic, pp. 352–356 (2005)
Fractal Active Shape Models Polychronis Manousopoulos, Vassileios Drakopoulos, and Theoharis Theoharis Department of Informatics and Telecommunications, University of Athens, Panepistimioupolis, 157 84, Athens, Greece {polyman, vasilios, theotheo}@di.uoa.gr
Abstract. Active Shape Models often require a considerable number of training samples and landmark points on each sample, in order to be efficient in practice. We introduce the Fractal Active Shape Models, an extension of Active Shape Models using fractal interpolation, in order to surmount these limitations. They require a considerably smaller number of landmark points to be determined and a smaller number of variables for describing a shape, especially for irregular ones. Moreover, they are shown to be efficient when few training samples are available.
1 Introduction
Active Shape Models (ASMs, see e.g. [1]) are widely used for image segmentation and motion tracking; they have proven to be efficient and robust in many areas of application, especially biomedical ones (see e.g. [2], [3], [4]). Extensions of the original ASM formulation have also been proposed in order to increase their robustness, efficiency and applicability (see e.g. [5], [6], [7]). In ASM applications, a significant number of landmark points has to be used in each training sample in order to model the desired variability of the allowable shape space, resulting in a time-consuming labelling process. Although (semi-)automatic methods have been developed for assisting this, the labelling remains a difficult and error-prone task. Moreover, the number of available training samples is often relatively small in practice, resulting in a rather restricted space of allowable shapes that reduces the efficiency of the image search. Our motivation is to create an extension of the ASMs that requires a considerably smaller number of landmark points, uses fewer variables to describe shapes, even irregular ones, thus increasing the accuracy of representation, and is efficient even for a small number of training samples. To this end, we introduce the Fractal Active Shape Models, an extension of Active Shape Models using fractal interpolation.
2 Mathematical Background
2.1 Active Shape Models
Active Shape Models are statistical models of shape that can be applied to image search (see e.g. [1]). An ASM uses a Point Distribution Model (PDM) for the statistical modelling of shapes in a given training set of sample images. This model is then used for locating statistically allowable shapes in other images.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 645–652, 2007. © Springer-Verlag Berlin Heidelberg 2007

A shape is described by a set of landmark points. These correspond to particular shape features and their interpretation should be consistent among all shapes of a class. For example, the boundary of the leaf of Laurus nobilis depicted in Fig. 1(a) is represented by the 77 landmark points of Fig. 1(c). Moreover, the landmark points must be adequate for describing the shape variability of the training set as well as all other allowable shapes. The landmark points may be manually located on the training set images, but (semi-)automatic methods have also been developed, since this is often a time-consuming process.¹ The sets of landmark points located in the training images are aligned in a common coordinate frame, e.g. using the iterative algorithm of [1]. An aligned shape can be represented by the vector

$$\mathbf{x} = (x_1, \ldots, x_L, y_1, \ldots, y_L), \tag{1}$$

where $(x_i, y_i)$, $i = 1, \ldots, L$, are the aligned landmark points. The aligned shapes are then statistically modelled by applying Principal Component Analysis (PCA) to them; each shape $\mathbf{x}$ is represented as

$$\mathbf{x} = \bar{\mathbf{x}} + \mathbf{P}\mathbf{b}, \tag{2}$$

where $\bar{\mathbf{x}}$ is the mean shape, $\mathbf{P}$ is the matrix of eigenvectors of the covariance matrix of the shape vectors and $\mathbf{b}$ is a vector of parameters determining the deviation of $\mathbf{x}$ from the mean shape along each eigenvector. In practice, $\mathbf{P}$ contains only the eigenvectors of the largest eigenvalues; insignificant eigenvectors, which may correspond to noise in the training set, are eliminated. Equation (2) can be used for defining the space of allowable shapes by suitably constraining $\mathbf{b}$, e.g. $|b_k| \le 3\sqrt{\lambda_k}$ for all elements $b_k$ of $\mathbf{b}$, where $\lambda_k$ is the eigenvalue of the corresponding eigenvector.

This statistical shape model of the training set can be used for locating similar shapes within graphical objects such as images. Starting with a rough approximation of the shape in the image, each point is moved to a better candidate location nearby. The points are usually moved along the shape boundary normal and are attracted to image edges or matching profiles of local intensity. This procedure is executed iteratively until it converges. At each step, the shape resulting from moving all points is constrained to the space of statistically allowable shapes. Therefore, the result is compatible with both the image structure and the training set.

2.2 Fractal Interpolation
Fractal interpolation functions (see e.g. [8]) are based on the theory of iterated function systems (IFSs). An IFS, denoted by $\{X; w_n, n = 1, 2, \ldots, N\}$, consists of a complete metric space $(X, \rho)$, e.g. $(\mathbb{R}^n, \|\cdot\|)$, and a finite set of continuous mappings $w_n : X \to X$, $n = 1, 2, \ldots, N$. If the $w_n$ are contractions, then the IFS is termed hyperbolic and the transformation $W : \mathcal{H}(X) \to \mathcal{H}(X)$ with $W(B) = \bigcup_{n=1}^{N} w_n(B)$, where $\mathcal{H}(X)$ denotes the space of nonempty compact subsets of $X$, has a unique fixed point $A_\infty = W(A_\infty) = \lim_{n\to\infty} W^n(B)$, for every $B \in \mathcal{H}(X)$, which is called the attractor of the IFS.

¹ See Cootes T.F. and Taylor C.J., Statistical Models of Appearance for Computer Vision, http://www.isbe.man.ac.uk/∼bim, March 2004, pp. 79–82 for a survey on landmark placement methods.

Let us represent the given set of data points as $\{(u_m, v_m) \in \mathbb{R}^2 : m = 0, 1, \ldots, M\}$. In general, the interpolation is applied to a subset of them, the interpolation points, represented as $\{(x_i, y_i) \in \mathbb{R}^2 : i = 0, 1, \ldots, N\}$. Both sets are linearly ordered with respect to their abscissa, i.e. $u_0 < u_1 < \cdots < u_M$ and $u_0 = x_0 < x_1 < \cdots < x_N = u_M$. The interpolation points partition the set of data points into interpolation intervals and may be chosen equidistantly or not. Let $\{\mathbb{R}^2; w_n, n = 1, 2, \ldots, N\}$ be an IFS with affine transformations

$$w_n \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a_n & 0 \\ c_n & s_n \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} d_n \\ e_n \end{pmatrix}$$

constrained to satisfy

$$w_n \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} x_{n-1} \\ y_{n-1} \end{pmatrix} \quad \text{and} \quad w_n \begin{pmatrix} x_N \\ y_N \end{pmatrix} = \begin{pmatrix} x_n \\ y_n \end{pmatrix}$$

for every $n = 1, 2, \ldots, N$. Solving the above equations (see e.g. [8], p. 213) shows that the real numbers $a_n, d_n, c_n, e_n$ are completely determined by the interpolation points, while the $s_n$ are free parameters of the transformations satisfying $|s_n| < 1$, in order to guarantee that the IFS is hyperbolic with respect to an appropriate metric. The transformations $w_n$ are shear transformations and the $s_n$ are their respective vertical scaling (or contractivity) factors.

It is well known (see for example [8]) that the attractor $G = \bigcup_{n=1}^{N} w_n(G)$ of the aforementioned IFS is the graph of a continuous function $f : [x_0, x_N] \to \mathbb{R}$ that interpolates the points $(x_i, y_i)$, $i = 0, 1, \ldots, N$. This function is called the fractal interpolation function (FIF) corresponding to these points and is self-affine.

If the interpolation points define a curve rather than a function, i.e. they are not linearly ordered with respect to their abscissa, then the direct use of a fractal interpolation function is not possible. In order to construct an IFS whose attractor interpolates the given points, and is therefore a curve, we can transform or extend the original points such that the application of a FIF is possible. The result is then transformed or projected back to the plane to obtain a curve that interpolates the original points. Curve fitting using fractal interpolation is discussed in [9], where the Fractal Curve Fitting (FCF) method is introduced. The latter will be used in this paper, since ASMs are usually applied to objects with landmark points that define curves. The FCF method consists of transforming the data points $\{(u_m, v_m) \in \mathbb{R}^2 : m = 0, 1, \ldots, M\}$ by $T_1(u_m, v_m) = (u'_m, v'_m)$, $m = 0, 1, \ldots, M$, where

$$u'_m = u_0 + \sum_{j=1}^{m} \left(|u_j - u_{j-1}| + \varepsilon\right) = u'_{m-1} + \left(|u_m - u_{m-1}| + \varepsilon\right), \qquad v'_m = v_m,$$

and $\varepsilon > 0$ is an arbitrary constant, which is necessary when all data points in an interpolation interval have equal u-coordinates. A FIF is then constructed for $(u'_m, v'_m)$, $m = 0, 1, \ldots, M$, and this is finally transformed by $T_2(u', v') = (u, v)$, where

$$u = u_{m-1} + (u_m - u_{m-1}) \, \frac{u' - u'_{m-1}}{u'_m - u'_{m-1}}, \quad u' \in [u'_{m-1}, u'_m], \qquad v = v',$$

resulting in a fractal interpolation curve that interpolates the initial points and passes near the remaining data points. The advantage of the FCF method over other methods is that it uses a smaller number of total affine transformation parameters (specifically half). Therefore, it is more suitable to our purpose, as will become clearer in the next section.
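The construction of the affine maps can be illustrated with a minimal Python sketch. It assumes the standard closed-form solution of the two endpoint constraints (as given in [8]) and takes the vertical scaling factors as input rather than computing them. The function names `fif_maps`, `apply_map` and `attractor_points` are hypothetical and chosen only for this illustration; the attractor is approximated by applying W repeatedly to the interpolation points.

```python
def fif_maps(points, s):
    """Return one tuple (a, c, d, e, s_n) per affine map w_n.

    points: interpolation points (x_0, y_0), ..., (x_N, y_N) with increasing x.
    s:      vertical scaling factors s_1, ..., s_N, each with |s_n| < 1.
    """
    (x0, y0), (xN, yN) = points[0], points[-1]
    width = xN - x0
    maps = []
    for n in range(1, len(points)):
        (xp, yp), (xn, yn) = points[n - 1], points[n]
        sn = s[n - 1]
        if abs(sn) >= 1.0:
            raise ValueError("vertical scaling factors must satisfy |s_n| < 1")
        # Solution of w_n(x_0, y_0) = (x_{n-1}, y_{n-1}), w_n(x_N, y_N) = (x_n, y_n).
        a = (xn - xp) / width
        d = (xN * xp - x0 * xn) / width
        c = (yn - yp) / width - sn * (yN - y0) / width
        e = (xN * yp - x0 * yn) / width - sn * (xN * y0 - x0 * yN) / width
        maps.append((a, c, d, e, sn))
    return maps


def apply_map(m, x, y):
    """Apply w_n(x, y) = (a x + d, c x + s y + e)."""
    a, c, d, e, sn = m
    return a * x + d, c * x + sn * y + e


def attractor_points(points, s, iterations=4):
    """Approximate the attractor by iterating W on the interpolation points."""
    maps = fif_maps(points, s)
    current = list(points)
    for _ in range(iterations):
        current = sorted({apply_map(m, x, y) for m in maps for (x, y) in current})
    return current
```

With all scaling factors set to zero the computed points lie on the piecewise linear interpolant, which gives a simple sanity check of the coefficients.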
3 Fractal Active Shape Models
We now introduce the Fractal Active Shape Models (FASMs), an extension of the ASM using fractal interpolation. Within the FASM framework, a shape is represented by a fractal interpolation curve rather than a set of landmark points. The interpolation points for constructing the curve need not be all the landmark points of the respective ASM, but only a relatively small subset of them. Indeed, using fewer interpolation points we are able to construct a curve that accurately describes the shape while requiring a smaller total number of variables. Let us represent the set of all aligned shape boundary points, including the landmarks, extracted from the k-th training set image as $\{(u^k_m, v^k_m) \in \mathbb{R}^2 : m = 0, 1, \ldots, M\}$. The selected interpolation points, a subset of the landmark points, are represented accordingly as $\{(x^k_i, y^k_i) \in \mathbb{R}^2 : i = 0, 1, \ldots, N\}$. In the present work, the interpolation intervals are chosen with fixed length, i.e. every j-th data point is selected as interpolation point. A fractal interpolation curve is constructed for the shape points using the FCF method of Subsec. 2.2, with $\{\mathbb{R}^2; w^k_n, n = 1, 2, \ldots, N\}$ being the respective IFS. The fractal interpolation curve, and therefore the respective shape, is described by the $w^k_n$ transformation parameters, and is thus represented by the vector

$$\mathbf{f}^k = (a^k_1, \ldots, a^k_N, c^k_1, \ldots, c^k_N, s^k_1, \ldots, s^k_N, d^k_1, \ldots, d^k_N, e^k_1, \ldots, e^k_N). \tag{3}$$
If the shape consists of multiple disjoint parts, e.g. a face, then we model each part with a fractal interpolation curve and create an aggregate vector $\mathbf{f}^k$ of all transformation parameters for representing the shape. The parameters $a^k_n, c^k_n, d^k_n, e^k_n$ are determined by the interpolation points, while the parameters $s^k_n$ are determined by the remaining points. Thus, if we set $s^k_n = 0$ for every $n = 1, \ldots, N$, resulting in piecewise linear interpolation, we are taking into account only the interpolation points, i.e. the FASM is equivalent to the respective ASM. Therefore, the FASM can be viewed as an extension of the ASM.

Fig. 1. (a) A leaf of Laurus nobilis. (b) The leaf's boundary consisting of 930 points. (c) The 77 landmark points representing the leaf. (d) The fractal interpolation curve of 23 interpolation points representing the leaf.

For example, the boundary of the leaf of Laurus nobilis depicted in Fig. 1(a) can be represented either by the 77 landmark points of Fig. 1(c), or by the fractal interpolation curve of 23 interpolation points of Fig. 1(d). The curve is constructed using the FCF method. The interpolation intervals have been chosen with fixed length and the vertical scaling factors $s_n$ have been calculated with the analytic method of [10]. In both cases, the accuracy of representation is similar, but the FASM requires less than 30% of the landmark points of the ASM. Moreover, in the ASM the shape is represented by 154 (= 2 × 77) parameters, while in the FASM it is represented by 115 (= 5 × 23) parameters,² i.e. the FASM representation is more economical. This is because fractal interpolation exploits the presence of self-affinity in the data, thus allowing some degree of compression. The shape of this example is rather simple; for more complex and irregular shapes the FASM representation is expected to be even more profitable. This representation also has the advantage of describing the whole shape and not only a set of landmark points. Although the landmark points can capture the overall shape, the information of the remaining shape points is lost in the ASM. The FASM representation captures the whole shape using a smaller amount of data than the ASM.

² An additional copy of the first point is automatically appended to the interpolation points in order to obtain a closed curve, thus resulting in 23 affine transformations.

The statistical modelling of the shapes is achieved by applying PCA to the vectors $\mathbf{f}^k$, $k = 1, \ldots, K$. A shape can then be expressed as

$$\mathbf{f} = \bar{\mathbf{f}} + \mathbf{P}\mathbf{b}, \tag{4}$$

where $\bar{\mathbf{f}}$ represents the mean transformation, $\mathbf{P}$ the matrix of eigenvectors of the covariance matrix of the shape vectors and $\mathbf{b}$ the parameter vector. The space of allowable shapes is similarly defined by suitably constraining $\mathbf{b}$, e.g. $|b_k| \le 3\sqrt{\lambda_k}$ for all elements $b_k$ of $\mathbf{b}$, where $\lambda_k$ is the eigenvalue of the corresponding eigenvector. Note that in the allowable shape space it should always hold that $|s_n| < 1$, for $n = 1, 2, \ldots, N$. The feasibility of this approach stems from the fact that the attractor of an IFS, a fractal curve in our case, depends continuously on the parameters of the affine transformations $w^k_n$ ([8], p. 111). Therefore it is reasonable to apply the aforementioned PCA.

In order to locate an allowable shape in an image, we use the following algorithm, which is an extension of the ASM algorithm:

1. Initialize an approximate fit to the given image.
2. Calculate the fractal interpolation curve for the current parameter vector $\mathbf{b}$ and quantize it to pixel coordinates.
3. Move the curve points along the boundary normal towards the image edges. If duplicates exist, then remove them.
4. Calculate the parameter vector $\mathbf{b}$ for the adjusted points and limit it to the allowable space.
5. If the change in the shape is not negligible, go to Step 2.

The initialization in Step 1 of the algorithm can be done in various ways. For example, we may select a set of initial points as in the ASM case, calculate the respective affine transformation parameters $(a^k_n, c^k_n, d^k_n, e^k_n)$ and set the free parameters $(s_n)$ to zero.³ In Step 2, the number of calculated attractor points defines the balance between shape accuracy and execution time. Note that increasing the number of attractor points beyond a limit is unnecessary because of the quantization to pixel coordinates. In Step 3, the adjustment of the points is performed as in the original ASM algorithm; the image edges are calculated using a gradient operator, e.g. the Sobel operator. In Step 4, the parameter vector, i.e. the IFS, for the adjusted points is calculated as before with the FCF method; the interpolation points are the adjusted interpolation points of the previous step.
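The constraint applied in Step 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the eigenvectors (columns of P) are taken to be orthonormal and already computed, the layout of the shape vector follows Eq. (3), and the helper names `project`, `constrain`, `reconstruct` and `scaling_factors_valid` are hypothetical.

```python
import math


def project(f, mean, modes):
    """b = P^T (f - mean); modes[k] is the k-th (orthonormal) eigenvector."""
    diff = [fi - mi for fi, mi in zip(f, mean)]
    return [sum(m * d for m, d in zip(mode, diff)) for mode in modes]


def constrain(b, eigenvalues, limit=3.0):
    """Clamp every b_k to the interval [-limit*sqrt(lambda_k), +limit*sqrt(lambda_k)]."""
    clamped = []
    for bk, lam in zip(b, eigenvalues):
        bound = limit * math.sqrt(lam)
        clamped.append(max(-bound, min(bound, bk)))
    return clamped


def reconstruct(mean, modes, b):
    """f = mean + P b."""
    f = list(mean)
    for bk, mode in zip(b, modes):
        for i, m in enumerate(mode):
            f[i] += bk * m
    return f


def scaling_factors_valid(f, n_maps):
    """Check |s_n| < 1, assuming the layout (a_1..a_N, c_1..c_N, s_1..s_N, d_1..d_N, e_1..e_N)."""
    return all(abs(sn) < 1.0 for sn in f[2 * n_maps:3 * n_maps])
```

A constrained shape vector would be obtained as `reconstruct(mean, modes, constrain(project(f, mean, modes), eigenvalues))`, followed by a check of the scaling factors before the attractor is computed.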
4 Results

We have applied the FASM to a training set consisting of eight images of leaves of Laurus nobilis, such as the one depicted in Fig. 1. On each leaf, 23 landmark points have been placed and a fractal interpolation curve has been constructed for the whole leaf boundary using the FCF method. The vertical scaling factors of the affine transformations have been calculated using the analytic algorithm described in [10]. Each shape is represented by 115 (= 5 × 23) parameters; an example is given in Fig. 2, where the 115 shape parameters are depicted in five groups. Examination of the shape vector in the form of these groups can reveal useful information about the shape. For example, we can conclude from the $a_n$ and $d_n$ groups that the interpolation intervals are similar in x-length, or, from the $s_n$ group, that the leaf shape is most irregular, i.e. divergent from linear, in the first and eighth interval, where the magnitude of $s_n$ is the greatest.

³ The zero vertical scaling factors result in a piecewise linear attractor.
Fig. 2. The affine transformation parameters of the fractal interpolation curve of Fig. 1(d) divided into five groups (a_n, d_n, c_n, e_n, s_n), each consisting of 23 of the shape parameters
Fig. 3. (a) The initial position of the shape in the image. (b) The final position of the shape after convergence.
The result of an image search example is depicted in Fig. 3. Specifically, the initial estimate of the shape is depicted in Fig. 3(a), while its final position after convergence is depicted in Fig. 3(b). The leaf has been accurately detected in the image, with the FASM being able to capture its finer details with as few as 23 interpolation (i.e. landmark) points and 8 training samples.
5 Conclusions and Further Work
We have presented a novel extension of ASMs that uses fractal interpolation for representing shapes. The proposed FASMs have the advantage of requiring significantly fewer landmark points, even for irregular shapes, and of using fewer variables for representing a shape. Moreover, they are shown to be efficient, even when few training samples are available. Therefore, they are a competitive alternative to the original ASMs, offering an efficient and practical approach. Future work will focus on extending the FASMs to three dimensions and using piecewise self-affine fractal interpolation, which is expected to produce even better results.
Acknowledgements
This research project is co-financed by E.U.-European Social Fund (75%) and the Greek Ministry of Development-GSRT (25%), under the PENED programme 70/3/8405.
References
1. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models - Their training and application. Computer Vision and Image Understanding 61, 38–59 (1995)
2. Duta, N., Sonka, M.: Segmentation and interpretation of MR brain images: An improved Active Shape Model. IEEE Transactions On Medical Imaging 17, 1049–1062 (1998)
3. Paragios, N., Jolly, M.P., Taron, M., Ramaraj, R.: Active Shape Models and segmentation of the left ventricle in echocardiography. In: Kimmel, R., Sochen, N.A., Weickert, J. (eds.) Scale-Space 2005. LNCS, vol. 3459, pp. 131–142. Springer, Heidelberg (2005)
4. Smyth, P.P., Taylor, C.J., Adams, J.E.: Automatic measurement of vertebral shape using Active Shape Models. Image and Vision Computing 15, 575–581 (1997)
5. Davatzikos, C., Tao, X., Shen, D.: Hierarchical Active Shape Models, using the wavelet transform. IEEE Transactions On Medical Imaging 22, 414–423 (2003)
6. van Ginneken, B., Frangi, A.F., Staal, J.J., ter Haar Romeny, B.M., Viergever, M.A.: Active Shape Model segmentation with optimal features. IEEE Transactions On Medical Imaging 21, 924–933 (2002)
7. Twining, C., Taylor, C.J.: Kernel principal component analysis and the construction of non-linear active Shape Models. In: Proceedings of the British Machine Vision Conference, pp. 23–32 (2001)
8. Barnsley, M.F.: Fractals everywhere, 2nd edn. Academic Press Professional, San Diego (1993)
9. Manousopoulos, P., Drakopoulos, V., Theoharis, T.: Curve fitting by fractal interpolation. Transactions on Computational Science (to appear)
10. Mazel, D.S., Hayes, M.H.: Using iterated function systems to model discrete sequences. IEEE Trans. Signal Processing 40, 1724–1734 (1992)
Decomposition for Efficient Eccentricity Transform of Convex Shapes
Adrian Ion¹, Samuel Peltier¹, Yll Haxhimusa¹,², and Walter G. Kropatsch¹
¹ Pattern Recognition and Image Processing Group, Faculty of Informatics, Vienna University of Technology, Austria {ion,sam,yll,krw}@prip.tuwien.ac.at
² Department of Psychological Sciences, Purdue University, USA [email protected]

Abstract. The eccentricity transform associates to each point of a shape the shortest distance to the point farthest away from it. It is defined in any dimension, for open and closed manifolds. Top-down decomposition of the shape can be used to speed up the computation, with some partitions being better suited than others. We study basic convex shapes and their decomposition in the context of the continuous eccentricity transform. We show that these shapes can be decomposed for a more efficient computation. In particular, we provide a study regarding possible decompositions and their properties for the ellipse, the rectangle, and a class of elongated shapes.
1 Introduction
To extract the required information from a set of images, a frequently used pattern is to repeatedly transform the input image while gradually moving from the low abstraction level of the input data to the high abstraction level of the output data. The idea is to have a reduced amount of (important) data at these higher abstraction levels. A class of such transforms applied to 2D shapes associates to each point of the shape a value that characterizes in some way its relation to the rest of the shape. In many cases this value is a distance between important points, i.e. features. Examples include the distance transform [1], which associates to each point the length of the shortest path to the border, the Poisson equation [2], which can be used to associate to each point the average time to reach the border by a random path (an average of the random shortest paths), and the eccentricity transform [3], which associates to each point the longest of the shortest paths to any other point of the shape. In this way one tries to come up with an abstracted representation of the shape (e.g. the skeleton [4] or shock graph [5] built on the distance transform), which could be easily used in e.g. shape classification or retrieval. Minimal path computation [6] as well as the distance transform [7] are used in 2D and 3D image segmentation.
Supported by the Austrian Science Fund under grants P18716-N13 and S9103-N04.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 653–660, 2007. © Springer-Verlag Berlin Heidelberg 2007
The eccentricity transform (ECC) has its origins in the graph-based eccentricity [8,9]. Defined in the context of digital images in [3,10], where its properties and robustness have been shown, it was applied to shape matching in [11]. The ECC can be computed for shapes of any dimension, for discrete closed (e.g. 2D binary image) or open sets (surface of an ellipsoid), and for continuous ones (e.g. 3D ellipsoid or the 2D surface of the 3D ellipsoid, etc.). For discrete 2D shapes, a naive algorithm ($O(N^3)$ in the number of pixels) and a more efficient one for 2D shapes without holes have been presented in [3]. This paper presents a top-down approach for the efficient computation of the ECC of basic convex shapes using decomposition. Section 2 gives a short recall of the ECC. Section 3 shows its computation on an ellipse, a rectangle and an elongated shape. Section 4 concludes and gives an outlook on future work.
2 Recall ECC
In this section basic definitions and properties of the ECC are introduced following [3,11]. Let the shape $S$ be a closed set in $\mathbb{R}^2$ and $\partial S$ be its border.¹ A path $\pi$ is a continuous mapping from the interval $[0, 1]$ to $S$. Let $\Pi(p_1, p_2)$ be the set of all paths between two points $p_1, p_2 \in S$ within the set $S$ ($\Pi(p_1, p_2) \subset S$). The geodesic distance $d(p_1, p_2)$ between two points $p_1, p_2 \in S$ is defined as the length $\lambda$ of the shortest path $\pi(p_1, p_2)$ such that $\pi \subset S$; more formally,

$$d(p_1, p_2) = \min\{\lambda(\pi(p_1, p_2)) \mid \pi \in \Pi\} \quad \text{where} \quad \lambda(\pi(t)) = \int_0^1 |\dot{\pi}(t)| \, dt, \tag{1}$$

where $\pi(t)$ is a parametrization of the path from $p_1 = \pi(0)$ to $p_2 = \pi(1)$. The eccentricity transform of $S$ can be defined as, $\forall p \in S$,

$$ECC_S(p) = \max\{d(p, q) \mid q \in S\} = \max\{d(p, q) \mid q \in \partial S\}, \tag{2}$$

i.e. to each point $p$ it assigns the length of the shortest geodesic to the points farthest away from it. In [3] it is shown that this transformation is quasi-invariant to articulated motion and robust against salt and pepper noise [10] (which creates holes in the shape). An eccentric point of $p$ is then a point $q$ that reaches the maximum in Eq. 2. All the eccentric points lie on the border of $S$ [3].
3 ECC of Basic Shapes
In the case of simply connected convex shapes $S$, geodesic distance equals Euclidean distance, as no obstacles exist that have to be avoided. In the rest of the paper, $S$ denotes a simply connected convex shape, centered at the origin (center of gravity). We will show some properties of shapes $S$ regarding the eccentricity transform, which will help in designing new and faster algorithms. For clarity, we introduce the following notations: the set of points $(x, y) \in S$ such that $x > 0$ is denoted $S_r$ and called the right part of $S$, whereas the set of points $(x, y) \in S$ such that $x < 0$ is denoted $S_l$ and called the left part of $S$.

¹ This definition can be generalized to higher dimensions.
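For a convex shape, Eq. 2 therefore reduces to a maximum of Euclidean distances to border points. The following Python sketch evaluates it by brute force on a sampled border; it serves as a reference against which closed-form results can be compared and is not one of the algorithms of [3]. The function names and the sampling density are assumptions made for this illustration.

```python
import math


def sample_ellipse_border(a, b, n=3600):
    """Sample n points of the border of the ellipse x^2/a^2 + y^2/b^2 = 1."""
    return [(a * math.cos(2.0 * math.pi * k / n), b * math.sin(2.0 * math.pi * k / n))
            for k in range(n)]


def ecc_convex(p, border):
    """Return (ECC(p), eccentric point) for a convex shape given by sampled border points."""
    best_q, best_d = None, -1.0
    for q in border:
        d = math.hypot(q[0] - p[0], q[1] - p[1])  # geodesic = Euclidean in a convex shape
        if d > best_d:
            best_q, best_d = q, d
    return best_d, best_q
```

For a disk of radius 1 and p = (0.3, 0.4), the farthest border point lies diametrically opposite to p through the center, so the returned distance is close to |p| + 1 = 1.5, up to the sampling error.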
Fig. 1. The sets of eccentric points of the ellipse are shown with a thick line. (Figure labels: semi-axes a and b; border points (±x0, ±y0).)
3.1 Ellipse
Ellipse Recalls. The elliptical curve of points $(x, y)$ around the origin with parameters $a > 0$ and $b > 0$ is defined by

$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1. \tag{3}$$

At a point $(x, y)$ it has a tangent direction $(\dot{x}, \dot{y})$ satisfying

$$\frac{x \dot{x}}{a^2} + \frac{y \dot{y}}{b^2} = 0. \tag{4}$$
Bounding Extremal points. We consider an elliptical region $S$ around the origin, defined by $\frac{x^2}{a^2} + \frac{y^2}{b^2} \le 1$. In the following, we assume that $a > b$, thus the small axis of $S$ is the segment $[-b, b]$, and the long axis of $S$ is the segment $[-a, a]$. As mentioned before, it has been shown that the eccentric point(s) of any point $(x, y) \in S$ are on the border of $S$. We now provide two additional properties regarding eccentric points on an elliptical region.

Property 1. Let $(x_e, y_e)$ be an eccentric point for $(x, y) \in S$. Then, the tangent to the ellipse at the point $(x_e, y_e)$ is orthogonal to the line defined by $(x, y)$ and $(x_e, y_e)$.

Proof. Suppose that the tangent is not orthogonal to the line defined by $(x, y)$ and $(x_e, y_e)$. Then there exists a point $(x'_e, y'_e)$ in the neighborhood of $(x_e, y_e)$ which is farther away from $(x, y)$ than $(x_e, y_e)$. This would contradict the fact that $(x_e, y_e)$ is an eccentric point for $(x, y)$.

The following property is a general property for symmetric (having one axis of symmetry), simply connected and convex shapes.

Property 2. Let $S$ be a symmetric, simply connected and convex shape, with $A$ its axis of symmetry. Let $S_1$ and $S_2$ denote the two symmetric parts of $S$ delineated by $A$. Any point $p \in S_1$ has its eccentric point $p_e$ in $S_2$.
Proof. Let $p \in S_1 \setminus A$ and suppose $p_e \in S_1$. Then there exists the symmetric point $p'_e \in S_2$. The straight line connecting $p$ and $p'_e$ intersects $A$ in a point $q \in A$ having the same distance to $p_e$ and $p'_e$:

$$d(p, p'_e) = d(p, q) + d(q, p'_e) = d(p, q) + d(q, p_e) > d(p, p_e),$$

which shows, by the triangle inequality, that $p_e$ is not eccentric.

The ellipse is a simply connected convex shape and the smaller axis is an axis of symmetry. For any ellipse $S$, we can choose $S_1 = S_l$ and $S_2 = S_r$. We compute the eccentric points of $(0, b)$ and $(0, -b)$. This allows us to partition the ellipse into 4 subsegments alternating the property of being eccentric or not (see Fig. 1). Let us consider the line $l$ that goes through the point $(0, -b)$ and crosses the ellipse with orthogonal tangent at the point $(x_0, y_0)$, such that $x_0 \ge 0$ and $y_0 \ge 0$. This line $l(\tau)$ is defined by:

$$x = \tau \dot{y}_0, \qquad y = -b - \tau \dot{x}_0. \tag{5}$$

As $(x_0, y_0) \in l$, we can deduce from Eq. 5 that $\tau_0 = \frac{x_0}{\dot{y}_0}$ and $y_0 = -b - \frac{\dot{x}_0}{\dot{y}_0} x_0$. From Eq. 4, we obtain

$$-\frac{\dot{x}_0}{\dot{y}_0} = \frac{y_0 a^2}{x_0 b^2}. \tag{6}$$

Using Eq. 5, we obtain $y_0 = -b + x_0 \frac{y_0 a^2}{x_0 b^2} = -b + \frac{y_0 a^2}{b^2}$, so

$$y_0 = \frac{b^3}{a^2 - b^2}. \tag{7}$$

The x-coordinate is then determined using the ellipse formula:

$$x_0^2 = a^2 \left(1 - \frac{b^4}{(a^2 - b^2)^2}\right). \tag{8}$$

Similar calculations deliver the eccentric point for $(0, -b)$ with $x_0 < 0$ and $y_0 \ge 0$, and the two eccentric points for the point $(0, b)$. One can directly deduce that any point $(x_c, y_c)$ of the ellipse such that $x_c^2 < x_0^2$ has normals that do not intersect the segment $[-b, b]$. Thus, according to Prop. 2, none of these points can be an eccentric point for any point inside the ellipse. Hence the points $(x_0, y_0)$, $(x_0, -y_0)$, $(-x_0, -y_0)$, $(-x_0, y_0)$ partition the ellipse into 4 subsegments alternating the property of being extremal or not.

Fig. 2. Efficient computation of eccentricity transform based on decomposition. (Figure labels: p_l, p_sa, p_e.)

Fig. 3. Ellipse decomposition along the bigger axis: more than one line, orthogonal to the ellipse tangent at the point of intersection, can go through one point

Eccentric lines through the smaller axis. In this section, we show how to efficiently compute the eccentricity transform of an elliptical region $S$, by considering separately $S_l$ and $S_r$. Using Prop. 1, we first show how to compute the eccentricity of all the points of the small axis. Let $p = (0, \mu b)$, $-1 \le \mu \le 1$, be a point on the small axis, and let $p_e = (x_e, y_e)$ be its eccentric point in $S_r$. Using Prop. 1, the points $(x, y)$ of the line $l(\tau)$ defined by $(p, p_e)$ satisfy:

$$x = \tau \dot{y}_e, \qquad y = \mu b - \tau \dot{x}_e. \tag{9}$$

In particular, $p_e \in l$, so we have $x_e = \tau_e \dot{y}_e$ and $y_e = \mu b - \tau_e \dot{x}_e$. Thus we deduce that $\tau_e = \frac{x_e}{\dot{y}_e}$ and $y_e = \mu b - \frac{\dot{x}_e}{\dot{y}_e} x_e = \frac{\mu b^3}{b^2 - a^2}$. The x-coordinate is then determined using Eq. 3:

$$x_e^2 = a^2 \left(1 - \frac{b^4 \mu^2}{(b^2 - a^2)^2}\right). \tag{10}$$
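The closed form for points of the small axis can be evaluated directly, as in the following Python sketch. The guard on $|y_e| \le b$ is an assumption added here and not stated in the text: it rejects parameter combinations for which the foot of the normal would fall outside the ellipse. The function names are hypothetical.

```python
import math


def eccentric_point_small_axis(a, b, mu):
    """Eccentric point (x_e, y_e) in the right part for p = (0, mu*b), using Eq. 10."""
    if not (a > b > 0.0) or not (-1.0 <= mu <= 1.0):
        raise ValueError("expected a > b > 0 and -1 <= mu <= 1")
    y_e = mu * b ** 3 / (b ** 2 - a ** 2)
    if abs(y_e) > b:
        # Assumption of this sketch: the formula is only used when the
        # computed point actually lies on the ellipse.
        raise ValueError("closed form not applicable for these parameters")
    x_e = a * math.sqrt(1.0 - (b ** 4 * mu ** 2) / (b ** 2 - a ** 2) ** 2)
    return x_e, y_e


def ecc_small_axis(a, b, mu):
    """Eccentricity of p = (0, mu*b): Euclidean distance to its eccentric point."""
    x_e, y_e = eccentric_point_small_axis(a, b, mu)
    return math.hypot(x_e, y_e - mu * b)
```

For a = 2, b = 1 and mu = -1 this gives the point of Eqs. 7 and 8, i.e. y_e = 1/3 and x_e^2 = 32/9.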
So, the eccentricity of any point $(0, \mu b)$, $-1 \le \mu \le 1$, of the small axis can be computed directly from Eq. 10 using $a$, $b$ and $\mu$. The direction to its eccentric point is also known and can be stored at each point. As $S$ is a convex shape, the eccentric path from any point of an elliptic region to its eccentric point is a straight line. Moreover, from Prop. 2, we know that the eccentric point $p_e$ of any point $p_l \in S_l$ is in $S_r$. Thus, for computing the eccentricity of $p_l$, we just have to find the point $p_{sa}$ of the small axis such that the direction $(p_l, p_{sa})$ is the same as the direction stored in $p_{sa}$ (see Fig. 2).

Eccentric lines through the bigger axis. In the previous section, we have shown that it is possible to decompose an ellipse $S$ along its smaller axis to efficiently compute the eccentricity $ECC_S$. This is not the case when decomposing $S$ into $S_u$ and $S_d$ along the bigger axis $[-a, a]$ because

– the eccentric points $p_e$ of any point $p_{ba} \in [-a, a]$ are either $(-a, 0)$ or $(a, 0)$, which is obviously not helpful for deducing $ECC_S$;
– even if we associate to each point $p_{ba} \in [-a, a]$ the point $p' \in \partial S_u$ s.t. $(p, p')$ is orthogonal to the tangent at $p'$, $\exists p \in S_d$ with at least two points $p_1$ and $p_2$ in $\partial S_u$ s.t. $(p, p_1)$ and $(p, p_2)$ are orthogonal to the tangent at $p_1$ and $p_2$, respectively. So we cannot make a one-to-one mapping for a direct computation of $ECC_S(p)$, as $d(p, p_1)$ and $d(p, p_2)$ have to be compared. See Fig. 3.

Circle. In the special case where $a = b$, the shape $S$ becomes a circle. Let $r = a = b$ be the radius and $O$ the center. For all $p \in \partial S$ the line orthogonal to the tangent at $p$ contains the center $O$. Thus all the eccentric paths go through the center $O$ and the eccentricity of a point $p(x, y) \in S$ is $ECC_S(p) = \sqrt{x^2 + y^2} + r$. Note that all the points of $\partial S$ are eccentric points.

3.2 Rectangle

Fig. 4. Eccentric paths inside a rectangle
Even though the rectangle is a simple case, studying its decomposition in the context of eccentricity is of importance. Compared to the ellipse, a one-to-one association between points on the cut and eccentric paths cannot be made. In the case of the rectangle, two eccentric point candidates exist for each point on the cut.

Let S be a rectangle with side lengths 2w and 2h (see Fig. 4). The four corners of the rectangle v−w,−h, v−w,h, vw,−h, vw,h make up the set of eccentric points of S. The rectangle S can be decomposed into two subparts Sl and Sr (see Fig. 4), along the cut C = [(0, −h), (0, h)]. The corners v−w,−h, v−w,h respectively vw,−h, vw,h are the eccentric points of all the points p ∈ C.

The main difference compared to the ellipse is that in the case of the rectangle we cannot associate to each point of C a single pair made of a direction and a distance, because eccentric paths with more than one orientation can pass through the same point of the cut (see pc2 in Fig. 4). To solve this, to each point of C we associate two pairs of distances and directions, connecting it to the corners v−w,−h, v−w,h respectively vw,−h, vw,h. Thus, for any p ∈ Sl, ∃pc1, pc2 ∈ C s.t. pc1 ∈ (p, vw,h) and pc2 ∈ (p, vw,−h). Then ECCS(p) = max(d(p, pc1) + ECCS(pc1), d(p, pc2) + ECCS(pc2)).
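The decomposition along the cut can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the "stored" distance at each cut point is taken to be the distance to the corner the path is heading for (one of the two pairs associated with that cut point), recomputed on the fly instead of being looked up.

```python
import math

def rect_ecc_via_cut(p, w, h):
    """Eccentricity of p = (x, y) in the left part S_l (-w <= x < 0) of the
    rectangle [-w, w] x [-h, h], routed through the cut x = 0."""
    x, y = p
    best = 0.0
    for corner in ((w, h), (w, -h)):
        t = -x / (corner[0] - x)                  # where segment p -> corner meets x = 0
        pc = (0.0, y + t * (corner[1] - y))       # cut point p_c1 or p_c2
        stored = math.dist(pc, corner)            # would be precomputed per cut point
        best = max(best, math.dist(p, pc) + stored)
    return best

def rect_ecc_direct(p, w, h):
    """Reference value: largest distance from p to any of the four corners."""
    corners = ((w, h), (w, -h), (-w, h), (-w, -h))
    return max(math.dist(p, c) for c in corners)

p = (-1.0, 0.5)
print(rect_ecc_via_cut(p, 2.0, 1.0), rect_ecc_direct(p, 2.0, 1.0))
```

Both functions should agree for points of S_l, since the path through the cut point is a straight segment to the corner.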
Fig. 5. Elongated shape formed of two half ellipses and a rectangle
3.3 Elongated Shape

Let S be an elongated shape obtained by gluing two opposite sides of a rectangle R with two halves of ellipses El and Er. Let us assume that the width of R is 2w and its height is 2h; El is the half of the ellipse defined by the two parameters (al, h); Er is the half of the ellipse defined by the two parameters (ar, h) (see Fig. 5).

Symmetric Shape S: Let us assume that a = al = ar and decompose S along the cut C = [(0, −h), (0, h)]. If a ≥ h, we can extend Prop. 1 from Section 3.1 to elongated shapes. Thus, for any point p ∈ Sl, its eccentric point pe is in Sr and the line (p, pe) is orthogonal to the ellipse tangent at pe. Moreover, we can deduce that pe is the unique eccentric point of p, i.e. ep(p) = {pe}. As in Section 3.1, we can compute the eccentricity and the eccentric path orientation for all the points of C separately in Sl and Sr, and for all p ∈ Sl compute the eccentricity ECCS(p) = d(p, pc) + ECCSr(pc), where pc ∈ C and (p, pc) has the same direction as the eccentric path of pc in Sr. These results can directly be extended to the points of Sr. To find the eccentricity and eccentric path orientation for C in Sl and Sr, one can decompose Sl and Sr into the half-ellipse together with half the rectangle R, and directly use the formulas provided in Section 3.1. Note that if a = h then El and Er are half circles and all the eccentric paths will go through their centers.

If a < h, the ellipses El and Er correspond to ellipses that have been cut along their bigger axis (see Section 3.1). In this case, as mentioned in Section 3.1, there exist some points p ∈ Sl which have more than one associated point {e1, e2, . . . , ek} ∈ Sr such that (p, ei) is orthogonal to the ellipse tangent at ei. Thus, we cannot associate each point in Sl with a single direction and distance based only on the orthogonality with the ellipse tangent. We need to take the maximum of the distances.
Nonsymmetric Shape S: If al ≠ ar, S is no longer symmetric and it cannot be decomposed by a line segment s.t. for any point p ∈ S all its eccentric points are contained in the part not containing p. An easy way to overcome this problem is to take, for each point, the maximum between the eccentricity computed separately on the part containing the point and the one obtained by using the decomposition.
4 Outlook and Conclusion
In this paper we have studied top-down decomposition of basic shapes in order to speed up the computation of the eccentricity transform. Some partitions proved to be better suited than others. We showed that these shapes can be decomposed for a more efficient computation and also derived some properties that could be applied for more general shapes. In particular, we provide a study regarding possible decompositions and their properties for the ellipse, the rectangle and a class of elongated shapes. In the future we plan to extend this study to any 2D shape followed by a study for 3D and nD shapes.
References

1. Rosenfeld, A.: A note on 'geometric transforms' of digital sets. Pattern Recognition Letters 1(4), 223–225 (1983)
2. Gorelick, L., Galun, M., Sharon, E., Basri, R., Brandt, A.: Shape representation and classification using the Poisson equation. In: CVPR (2), pp. 61–67 (2004)
3. Kropatsch, W.G., Ion, A., Haxhimusa, Y., Flanitzer, T.: The eccentricity transform (of a digital shape). In: Kuba, A., Nyúl, L.G., Palágyi, K. (eds.) DGCI 2006. LNCS, vol. 4245, pp. 437–448. Springer, Heidelberg (2006)
4. Ogniewicz, R.L., Kübler, O.: Hierarchic Voronoi Skeletons. Pattern Recognition 28(3), 343–359 (1995)
5. Siddiqi, K., Shokoufandeh, A., Dickinson, S., Zucker, S.W.: Shock graphs and shape matching. International Journal of Computer Vision 30, 1–24 (1999)
6. Paragios, N., Chen, Y., Faugeras, O.: 6. In: Handbook of Mathematical Models in Computer Vision, pp. 97–111. Springer, Heidelberg (2006)
7. Soille, P.: Morphological Image Analysis. Springer, Heidelberg (1994)
8. Harary, F.: Graph Theory. Addison-Wesley, Reading (1969)
9. Diestel, R.: Graph Theory. Springer, New York (1997)
10. Klette, R., Rosenfeld, A.: Digital Geometry. Morgan Kaufmann, San Francisco (2004)
11. Ion, A., Peyré, G., Haxhimusa, Y., Peltier, S., Kropatsch, W.G., Cohen, L.: Shape matching using the geodesic eccentricity transform - a study. In: 31st OAGM/AAPR, Schloss Krumbach, Austria, May 2007. OCG (2007)
Euclidean Shortest Paths in Simple Cube Curves at a Glance

Fajie Li and Reinhard Klette

Computer Science Department, The University of Auckland, New Zealand
Abstract. This paper reports on the development of two provably correct approximate algorithms which calculate the Euclidean shortest path (ESP) within a given cube-curve with arbitrary accuracy, defined by ε > 0, and in time complexity κ(ε) · O(n), where κ(ε) is the length difference between the path used for initialization and the minimum-length path, divided by ε. A run-time diagram also illustrates this linear-time behavior of the implemented ESP algorithm.
1 Introduction
Euclidean shortest path (ESP) problems are defined by a (2D, 3D, ...) Euclidean space which contains (closed) polyhedral obstacles; the task is to compute a path which connects two given points in the space such that it does not intersect the interior of any obstacle and is of minimum Euclidean length. Examples are the ESP inside of a simple polygon, on the surface of a convex polytope, or inside of a simply-connected polyhedron, or problems such as touring polygons, parts cutting, safari or zookeeper, or the watchman route. Altogether, this defines a class of immensely important computational problems of huge impact in economy, science or technology.

For time complexities of algorithms in this area, we cite two examples. The general 3D ESP problem (e.g., path-planning in robotics) is NP-hard, see J. Canny and J. H. Reif [5]. For 2D ESP problems, there are linear-time but very complicated algorithms (e.g., algorithms for ESP calculation in a simple polygon, based on B. Chazelle's [6] triangulation of whole polygons), or linear-time and easy-to-implement algorithms (e.g., for the relative convex hull in the 2D grid, see [9]). See [13] for work of the authors on 2D or 3D ESP problems in general.

In this paper we consider ESPs in simple cube-curves. A simple cube-curve is a cyclic sequence of successively face-adjacent grid cubes (of the uniform orthogonal 3D grid, see digital geometry [11]) in which each cube is listed only once. T. Bülow and R. Klette published between 2000 and 2002 (see, e.g., [4]) a so-called rubberband algorithm (RBA) for the calculation of a Euclidean shortest path in a simple cube-curve. [4] stated two open problems: does this approximate RBA actually always converge (with the number of iterations) to the correct ESP, and is its time complexity actually linear, as all experiments indicated at that time?

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 661–668, 2007.
© Springer-Verlag Berlin Heidelberg 2007
This paper reports on the development of two approximate RBAs which always converge towards the ESP and have κ(ε) · O(n) time complexity. This paper is a first summarizing publication of results in the technical report [12] related to minimum-length polygon (MLP) calculation in simple cube-curves.
2 The Original RBA
Critical edges of a given cube-curve g are those grid edges which are incident with three cubes of the curve (see Figure 1). Critical edges are the only possible locations for vertices of an ESP [10]. A subset of those will define the step set of the RBA, which contains all those critical edges which contain exactly one ESP vertex each.
Fig. 1. Critical edges e1 , e2 , e3 , e4 , e5 , and e6
The original RBA, as published in [4,11], consists of two subprocesses: (i) an initialization process (e.g., from an endpoint of one critical edge to the closest endpoint of the subsequent critical edge, satisfying a “closed-path” constraint at the end), and (ii) an iterative process which contracts the path during each of its loops, using a break-off criterion Ln − Ln+1 < ε, where ε > 0 and Ln is the total length of the path after the nth loop. During each loop, the algorithm tries to shorten the path locally by checking three options, called OP1, OP2, and OP3. OP1 and OP2 find the step set of critical edges. OP3 optimizes the position of a vertex on its critical edge. These options are defined as follows:

OP1: delete vertex pi if the line segment pi−1 pi+1 is in the tube g, which is the union of all the grid cubes in the given simple cube-curve g;
OP2: calculate intersection points of the triangle pi−1 pi pi+1 with all critical edges (“between” pi−1 and pi+1) and replace the subsequence pi−1, pi, pi+1 by the resulting convex arc, defined by these intersection points;
OP3: move pi on its critical edge e into the optimum position pnew, with de(pnew, pi−1) + de(pi+1, pnew) = inf{de(p, pi−1) + de(pi+1, p) : p ∈ e}, where de denotes the Euclidean distance.

We continue with vertices pnew, pi+1, pi+2 of the path. At the end of each loop we compare the total length of the new path with that of the path at the end of the previous loop. See Figure 2 for OP2. Here, vertices on critical edges e11, e14 and e18 are replaced by a convex arc with vertices on critical edges e11, e13, e16, and e18, and (in general) it may be e11, e14 and e18 again within a subsequent loop; of course, for a reduced length of the calculated path at this stage.

Fig. 2. Illustration for the (original) Option 2

The situation with the original RBA in 2002 [4] was as follows: Even for very small values of ε, the measured time complexity indicated O(n), where n is the number of cubes in g. However, there was no proof for the asymptotic time complexity of the original RBA. For a small number of test examples, calculated paths seemed (!) to converge towards the ESP. However, no implemented algorithm for calculating the correct ESP was available, and (more generally) there was no proof that the path provided by the original RBA converges towards the ESP. Nevertheless, the algorithm has been in use since 2002 (e.g., in DNA research).
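OP3 above asks for the point on a critical edge that minimizes the summed distance to the two neighbouring path vertices. A minimal numerical sketch follows, assuming a ternary search over the edge parameter t ∈ [0, 1]; this search is a swapped-in technique chosen here because the objective is convex in t, not the formulation used in [4,11] or [12]. The function name and the tolerance are hypothetical, and no tube-containment test is performed.

```python
import math

def op3_optimal_point(e0, e1, p_prev, p_next, tol=1e-12):
    """Point on the segment e0-e1 minimising d(p, p_prev) + d(p, p_next).
    Ternary search over t in [0, 1]; valid because the cost is convex in t."""
    def point(t):
        return tuple(a + t * (b - a) for a, b in zip(e0, e1))

    def cost(t):
        p = point(t)
        return math.dist(p, p_prev) + math.dist(p, p_next)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cost(m1) < cost(m2):
            hi = m2            # minimum cannot lie in (m2, hi]
        else:
            lo = m1            # minimum cannot lie in [lo, m1)
    return point((lo + hi) / 2.0)

# Critical edge along the z-axis; by symmetry the optimum is at z = 0.5.
print(op3_optimal_point((0, 0, 0), (0, 0, 1), (1, 0, 0), (0, 1, 1)))
```

Because the cost is flat near its minimum, the recovered position is typically accurate only to about the square root of machine precision, which is consistent with the later remark that such positions can only be approximated.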
3 Non-existence of an Exact Arithmetic Algorithm
An arithmetic algorithm consists of a finite number of steps of arithmetic operations, possibly also using input parameters from the field of rational numbers, using only the following basic operators: +, −, ·, / or the kth root, for k ≥ 2. OP3 can be formalized by a system of three PDEs, involving parameters ti ∈ R for critical edges ei of the step set. The result ensures that pi(ti) is the optimum point on ei. Considering the situation illustrated in Figure 3, this is equivalent to the problem of finding the roots of $p(x) = 84x^6 - 228x^5 + 361x^4 + 20x^3 + 210x^2 + 200x + 25$ (see Chapter 7 in [12]). In fact, this problem is not solvable by radicals over the field of rationals; see [12]. (The proof uses a theorem by C. Bajaj [2] and the factorization algorithm by E. R. Berlekamp [3].)
Fig. 3. Calculation of t1 and t2 such that the polyline p0 (t0 )p1 (t1 )p2 (t2 )p3 (t3 ) is fully contained in g. Point p1 is on e1 , and p2 on e2 .
This example allows two corollaries. First, there is no exact arithmetic algorithm for calculating the roots of polynomials of degree ≥ 5 (theorem by E. Galois; B.L. van der Waerden's famous example is $p(x) = x^5 - x - 1$). Second, there is also no exact arithmetic algorithm for calculating 3D ESPs. C. Bajaj [1] showed this based on a polynomial of degree 20 for the general 3D ESP problem. As a new result, here we have a degree 6 polynomial, and the restricted ESP problem for simple cube-curves! Note that this is not just a “rounding number problem” but a fundamental non-existence of exact algorithms, no matter what kind of time complexity is allowed (see Section 4). There is a uniquely defined shortest path which passes through subsequent line segments e1, e2, . . . , ek in 3D space in this order; see, for example, [7]. Obviously, vertices of a shortest path can be at real division points, and even at those which cannot be represented by radicals over the field of rationals.
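To make the contrast between "no exact arithmetic algorithm" and "approximation to arbitrary accuracy" concrete, the sketch below approximates the real root of van der Waerden's example x^5 − x − 1 by bisection. The bracketing interval [1, 2] and the tolerance are assumptions chosen for this illustration; the root has no expression by radicals, but it can still be enclosed as tightly as floating-point arithmetic allows.

```python
def p(x):
    """Van der Waerden's example polynomial."""
    return x**5 - x - 1

def bisect_root(f, lo, hi, eps=1e-12):
    """Shrink a sign-changing bracket [lo, hi] until it is narrower than eps."""
    if f(lo) * f(hi) >= 0:
        raise ValueError("f must change sign on [lo, hi]")
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

root = bisect_root(p, 1.0, 2.0)
print(root, p(root))
```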
4 Approximate Algorithms
An algorithm is a (1 + ε)-approximation algorithm for a minimization problem P iff, for each input instance I of P, the algorithm delivers a solution that is at most (1 + ε) times the optimum solution [8]. The general 3D ESP problem can be solved in $O(n^4 [b + \log(n/\varepsilon)]^2 / \varepsilon^2)$ time by a (1 + ε)-approximation algorithm; see C. H. Papadimitriou [15].

An algorithm is κ-linear iff its time complexity is in κ(ε) · O(n), and function κ does not depend on the problem size n, for ε > 0. We use κ(ε) = (L0 − L)/ε, where L is the true length of the ESP, and L0 the initial length. A cube-curve is first-class iff each critical edge contains one ESP vertex. The original RBA is correct and κ-linear for first-class cube-curves [12].

[12] analyzed the following approximate graph-theoretical algorithm: Subdivide each critical edge by m uniformly-spaced vertices; connect each vertex with
Fig. 4. Weighted undirected graph for m = 3
those vertices such that the resulting edge is contained in the tube g. This defines a weighted undirected graph (see Figure 4). Calculate a shortest-length cycle, and use this as a (first-class!) input for the original RBA. The time complexity of the graph-theoretic algorithm (in our specification) equals $O(m^4 n^4 + \kappa(\varepsilon) \cdot n)$. It applies Dijkstra's algorithm repeatedly; possibly its time complexity can be reduced, but certainly not to be κ-linear. However, this (slow) algorithm allowed for the first time to evaluate results obtained by the original RBA.

Assume a simple cube-curve g and a triple of consecutive critical edges e1, e2, and e3 such that ei is orthogonal to ej, for i, j = 1, 2, 3 and i ≠ j. If e1 and e3 are also coplanar, then we say that e1, e2, and e3 form an end angle, and a middle angle otherwise. The following approximate numerical algorithm (see [12]) requires an input which is first-class and has at least one end angle; the cube-curve is split at end angles into one or several arcs. For each arc, one vertex on each critical edge can be calculated using the systems of PDEs briefly mentioned already above; variable ti determines the position of vertex pi on edge ei. This algorithm is provably correct and κ-linear for the assumed inputs.

An open problem in [11] (page 406) was stated as follows: Is there a simple cube-curve such that none of the vertices of its ESP is a grid vertex? The answer is “yes” [12], and any of those curves does not have any end angle; see Figure 5.¹ Thus, the provably correct approximate numerical algorithm cannot be used in general.

¹ Here are two new open problems: What is the smallest (say, in number of cubes or in number of critical edges - both are equivalent) simple cube-curve which does not have any end angle? What is the smallest (say, in number of cubes or in number of critical edges - both are equivalent) simple cube-curve which does not have any of its MLP vertices at a grid point location? We assume that the second problem is more difficult to solve.

This led us back to the initial two questions about the original RBA: is it correct? (We can use either the approximate graph-theoretical or the numerical algorithm for evaluation.) What is its time complexity in general? Indeed, corrections were in place:
Fig. 5. A simple cube-curve where the ESP does not have any grid-point vertex (and which has no end angle)
Fig. 6. The original Option 2 misses e5
OP2: if intersecting with the triangle pi−1 pi pi+1 and using the convex arc only, we may miss edges of the step set (see Figure 6 for such a situation) - more tests are needed, and this option was totally reformulated (for details, see [12] - the specifications require some technical preparations which cannot be given in this short paper).
OP3: the vertex pnew, found by optimization, may specify edges pi−1 pnew and pnew pi+1 such that one or both of them are not fully contained in the tube of the curve; an additional test is needed (a simple correction).

Therefore, those corrections define a provably correct (for any simple cube-curve) and κ-linear edge-based RBA [12].
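Returning to the approximate graph-theoretical algorithm described earlier in this section, the following sketch shows the subdivision idea in a heavily simplified form. It is not the algorithm analyzed in [12]: samples are connected only between consecutive critical edges, the tube-containment test is omitted (both are assumptions made here to keep the example short), and a layer-by-layer dynamic program replaces Dijkstra's algorithm because the simplified graph is layered. All function names are hypothetical.

```python
import math

def sample_edge(edge, m):
    """m uniformly spaced points on a segment given by its two endpoints (m >= 2)."""
    a, b = edge
    return [tuple(x + (k / (m - 1)) * (y - x) for x, y in zip(a, b)) for k in range(m)]

def shortest_sampled_cycle(edges, m):
    """Shortest closed polyline visiting one sample per edge, in the given cyclic order."""
    layers = [sample_edge(e, m) for e in edges]
    best_len, best_path = math.inf, None
    for s, start in enumerate(layers[0]):
        # dist[j]: length of the best partial path from `start` to sample j of the current layer
        dist = [0.0 if j == s else math.inf for j in range(m)]
        back = []
        for prev, cur in zip(layers, layers[1:]):
            new_dist, choice = [], []
            for q in cur:
                cands = [dist[i] + math.dist(p, q) for i, p in enumerate(prev)]
                i_best = min(range(m), key=cands.__getitem__)
                new_dist.append(cands[i_best])
                choice.append(i_best)
            dist = new_dist
            back.append(choice)
        closing = [dist[j] + math.dist(p, start) for j, p in enumerate(layers[-1])]
        j = min(range(m), key=closing.__getitem__)
        if closing[j] < best_len:
            best_len = closing[j]
            idx = [j]
            for choice in reversed(back):      # walk the back-pointers to layer 0
                idx.append(choice[idx[-1]])
            idx.reverse()
            best_path = [layers[k][i] for k, i in enumerate(idx)]
    return best_len, best_path

# Four vertical unit segments above the corners of a unit square.
edges = [((0, 0, 0), (0, 0, 1)), ((1, 0, 0), (1, 0, 1)),
         ((1, 1, 0), (1, 1, 1)), ((0, 1, 0), (0, 1, 1))]
length, path = shortest_sampled_cycle(edges, 3)
print(length, path)
```

The resulting cycle could then serve as an initial path for an iterative refinement step; the accuracy of the sampled solution depends on m.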
Fig. 7. Edge-based RBA implemented in Java, run under Matlab 7.0.4, Pentium 4, using $\varepsilon = 10^{-10}$
Fig. 8. Let ε be the maximum accuracy of the program, that means the smallest number for discriminating between Ln and Ln+1 . Still, the difference to the true value L might be δ > ε. The algorithm allows to obtain arbitrary accuracy (with respect to L) when continuing iterations, but this would require to reduce ε.
Instead of moving points along critical edges, we can also move points within critical faces (which contain one critical edge). Of course, the vertices will finally move onto or towards critical edges. This conceptually simpler (in its OP2) face-based RBA is also provably correct, but shows a slower convergence (within the limits of being κ-linear) towards the ESP.

See Figure 7 for some statistics about measured run time. Half of a simple cube-curve was generated randomly, and the second half was then generated using three straight arcs for closing the curve. The number of cubes in generated curves was between 10 and 630. The break-off criterion was defined by $\varepsilon = 10^{-10}$.

Figure 8 illustrates the meaning of the break-off criterion. The lengths Ln, for loops n = 1, 2, 3, . . . define a Cauchy sequence which converges towards the true length L. An in-depth study of this sequence may reveal whether we can assume δ < ε in general, or not.
5 Conclusions
This paper reported on the process of solving the minimum-length polygon problem for simple cube-curves. The developed methodology [i.e., define “critical” subsets, specify the step set such that each critical subset in this set contains exactly one (possibly redundant, such as collinear) vertex, apply OP3] can be applied to ESP problems as considered (e.g.) in [14]. A few RBA applications have been illustrated in [12,13]. For more details, see technical report [12].
References

1. Bajaj, C.: The algebraic complexity of shortest paths in polyhedral spaces. In: Proc. Allerton Conf. Commun. Control Comput., pp. 510–517 (1985)
2. Bajaj, C.: The algebraic degree of geometric optimization problems. Discrete Computational Geometry 3, 177–191 (1988)
3. Berlekamp, E.R.: Factoring polynomials over large finite fields. Math. Comp. 24, 713–735 (1970)
4. Bülow, T., Klette, R.: Digital curves in 3D space and a linear-time length estimation algorithm. IEEE Trans. Pattern Analysis Machine Intelligence 24, 962–970 (2002)
5. Canny, J., Reif, J.H.: New lower bound techniques for robot motion planning problems. In: Proc. IEEE Conf. Foundations Computer Science, pp. 49–60 (1987)
6. Chazelle, B.: Triangulating a simple polygon in linear time. Discrete Computational Geometry 6, 485–524 (1991)
7. Choi, J., Sellen, J., Yap, C.-K.: Approximate Euclidean shortest path in 3-space. In: Proc. ACM Conf. Computational Geometry, pp. 41–48. ACM Press, New York, NY, USA (1994)
8. Hochbaum, D.S.: Approximation Algorithms for NP-Hard Problems. PWS Pub. Co, Boston (1997)
9. Klette, R., Kovalevsky, V.V., Yip, B.: Length estimation of digital curves. In: Proc. Vision Geometry, SPIE 3811, pp. 117–129 (1999)
10. Klette, R., Bülow, T.: Critical edges in simple cube-curves. In: Nyström, I., Sanniti di Baja, G., Borgefors, G. (eds.) DGCI 2000. LNCS, vol. 1953, pp. 467–478. Springer, Heidelberg (2000)
11. Klette, R., Rosenfeld, A.: Digital Geometry. Morgan Kaufmann, San Francisco (2004)
12. Li, F., Klette, R.: Exact and approximate algorithms for the calculation of shortest paths. In: Platinum Jubilee Conference of The Indian Statistical Institute. IEEE Conference, Kolkata, Report 2141 on www.ima.umn.edu/preprints/oct2006
13. Li, F., Klette, R.: Rubberband algorithms for solving various 2D or 3D shortest path problems. In: Proc. Computing: Theory Applications, plenary talk, pp. 9–19 (2007)
14. Mitchell, J.S.B., Sharir, M.: New results on shortest paths in three dimensions. In: Proc. SCG, pp. 124–133 (2004)
15. Papadimitriou, C.H.: An algorithm for shortest path motion in three dimensions. Inform. Process. Lett. 20, 259–263 (1985)
16. Talbot, M.: A dynamical programming solution for shortest path itineraries in robotics. Electr. J. Undergrad. Math. 9, 21–35 (2004)
A Fast and Robust Ellipse Detection Algorithm Based on Pseudo-random Sample Consensus

Ge Song and Hong Wang

State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China

Abstract. The ellipse is one of the most common features appearing in images. Over years of research, real-time performance and robustness have been two very challenging aspects of ellipse detection. Aiming to tackle them both, we propose an ellipse detection algorithm based on pseudo-random sample consensus (PRANSAC). In PRANSAC we improve a contour-based ellipse detection algorithm (CBED), which was presented in our previous work. In addition, the parallel thinning algorithm is employed to eliminate useless feature points, which increases the time efficiency of our detection algorithm. In order to speed up detection further, a 3-point ellipse fitting method is introduced. In terms of robustness, a “robust candidate sequence” is proposed to improve the robustness of our detection algorithm. Compared with state-of-the-art ellipse detection algorithms, experimental results based on real application images show that significant improvements in time efficiency and robustness of the proposed algorithm have been achieved.

Keywords: Ellipse Detection, Parallel Thinning, Robust Candidate Sequence, PRANSAC.
1 Introduction

In recent years, fast and robust ellipse detection has been a popular research topic in the image processing field. Over the last two decades researchers have developed many approaches to detecting ellipses. A commonly used algorithm is the Hough transform [1]. The Hough transform is a usable method for ellipse detection but it costs a lot of time. The Fast Ellipse Hough Transform (FEHT) [2] method was proposed by Guil to decompose the parameter space into several subspaces of fewer dimensions. They introduced geometric features and the gradient vectors of the edge data to construct the lines passing through geometric centers. The Accumulator-Free Ellipse Hough Transform (RAF-EHT) [3], presented by Yu, is based on two main ideas, i.e., an improved measure function (IMF) for handling the partiality and the obliqueness of ellipses, and a new accumulator-free computation scheme for finding the top k peaks of the IMF without complex peak detection. The methods based on the Hough transform [1, 2, 3, 4, 5, 13] are the most popular for ellipse detection. They are robust against outliers and occlusion but computationally expensive.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 669–676, 2007.
© Springer-Verlag Berlin Heidelberg 2007
Cai proposed a fast contour-based approach to circle and ellipse detection in 2004 [9]. A chain-code-based algorithm was adopted to detect contours, and each contour was treated as a sequence of isolated inflexions. Then the RANSAC [10] algorithm was applied to estimate the parameters of candidate ellipses from each candidate contour. This algorithm was faster than others in detection speed, but its results are not stable. Therefore we propose a new mechanism to improve the method in this paper.

In this paper we propose a fast and robust ellipse detection algorithm, named PRANSAC. In PRANSAC we introduce a parallel thinning algorithm in image pre-processing to eliminate useless feature points, and we use the tangent direction to fit ellipses with 3 points. By doing so, we can reduce detection time. We also propose a “robust candidate sequence” for RANSAC to make the detection algorithm more robust.
2 Overview of PRANSAC

In order to make ellipse detection faster and more robust, we make several changes to a contour-based ellipse detection (CBED) algorithm that was proposed in our previous work [9].

In CBED, at least five points are used to determine an ellipse. Therefore, we have to solve an equation with five parameters, which costs much computation. PRANSAC introduces the tangent directions for candidate feature points in pre-processing. Thus, we can fit an ellipse with 3 points. This simplifies the fitting calculation and reduces the detection time significantly.

The parallel thinning algorithm is used to get rid of the useless feature points (the feature points are obtained by edge detection as candidates for ellipse detection). The parallel thinning algorithm reduces the original lines with width greater than one pixel to lines with a width of one pixel. Thus, PRANSAC can reduce the number of feature points effectively.

CBED uses the RANSAC method for detection. RANSAC is an algorithm for robust fitting of models in the presence of many outliers. It is fast and robust when there is much noise interference. However, the detection result from the RANSAC algorithm is not steady because it uses random candidate points to fit ellipses. We improve the original RANSAC and propose a “robust candidate sequence” for detection that yields steady detection results.

In the remaining sections of this paper, Section 3 presents CBED; Section 4 describes the improvements of PRANSAC; experimental results and a comparison with other methods are shown in Section 5; and Section 6 concludes the paper.
3 The Steps of CBED

The original CBED algorithm [9] mainly includes the following 3 steps.

3.1 Contour Detection

Most of the existing ellipse detection methods regard all the feature points in the edge image as a whole, so they have to deal with a larger feature point set, which costs a lot of computation and memory. To cope with this deficiency, we exploit an obvious but essential feature of detected objects, namely a connectivity constraint, which emphasizes that the edge points of a target object should be continuous. Relying on this constraint, the huge feature point set can be partitioned into several smaller subsets. Thus, the problem is simplified into detection of the targets in each subset. One example of contours is illustrated in Fig. 1.
(a) Image
(b) Contours
Fig. 1. The contours of the image (Each contour is represented by different color.)
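The connectivity constraint of Section 3.1 can be illustrated by grouping feature points into 8-connected components. The sketch below uses a breadth-first flood fill; this is an assumption made for illustration, since the text states that CBED itself relies on a chain-code-based contour detector [9]. The function name is hypothetical.

```python
from collections import deque

def split_into_contours(points):
    """Partition integer pixel coordinates into 8-connected subsets."""
    remaining = set(points)
    contours = []
    while remaining:
        seed = remaining.pop()
        queue, contour = deque([seed]), [seed]
        while queue:
            x, y = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (x + dx, y + dy)
                    if n in remaining:        # unvisited neighbouring feature point
                        remaining.remove(n)
                        queue.append(n)
                        contour.append(n)
        contours.append(contour)
    return contours

edge_points = [(0, 0), (1, 1), (2, 1), (10, 10), (11, 10)]
print(split_into_contours(edge_points))
```

Each returned subset can then be processed independently, which is the simplification the connectivity constraint is meant to provide.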
3.2 RANSAC

For each candidate contour, many methods can be used to extract a circle or an ellipse. We introduce RANSAC [10] to estimate the parameters of an ellipse from data with outliers.

3.3 Ellipse Fitting

Since five parameters determine an ellipse, at least five points are needed to determine an ellipse (no three of which are collinear). The conic is written as:

$$a_0 x^2 + 2 a_1 x y + a_2 y^2 + 2 a_3 x + 2 a_4 y = 1. \qquad (1)$$
We use five points to fit the ellipse and utilize the parameters in (1) to compute the geometric parameters of an ellipse. Then we vote and confirm the ellipse. To avoid the influence of small arcs and to avoid the case that several targets overlap in the real application, the 'fitting factor' of each possible figure is introduced to evaluate the estimation of parameters. We present the algorithm for calculating the fitting factor as follows:

    Begin
      Count := 0;
      For I := 0 to 359 do   /* for each angle, in degrees */
        Compute the corresponding point P of I;
        If P is a feature point in the image then
          Count := Count + 1;
      Count := Count / 360;
    End;
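The fitting-factor procedure above can be written out as a short runnable sketch. The parametrization of the ellipse by center, semi-axes and rotation angle, the rounding to the nearest pixel, and the function names are assumptions made for this illustration; the text does not specify how the point corresponding to an angle is mapped to a pixel.

```python
import math

def ellipse_pixel(cx, cy, a, b, theta, deg):
    """Pixel nearest to the ellipse point at parameter angle `deg` (degrees)."""
    t = math.radians(deg)
    x = cx + a * math.cos(t) * math.cos(theta) - b * math.sin(t) * math.sin(theta)
    y = cy + a * math.cos(t) * math.sin(theta) + b * math.sin(t) * math.cos(theta)
    return (round(x), round(y))

def fitting_factor(ellipse, feature_points):
    """Fraction of the 360 sampled ellipse points that hit a feature point.
    `ellipse` is (cx, cy, a, b, theta); `feature_points` is a set of pixels."""
    count = sum(1 for deg in range(360)
                if ellipse_pixel(*ellipse, deg) in feature_points)
    return count / 360

ellipse = (50.0, 40.0, 30.0, 15.0, 0.3)
full_outline = {ellipse_pixel(*ellipse, d) for d in range(360)}
print(fitting_factor(ellipse, full_outline))
```

A candidate supported only by a short arc receives a low factor, which matches the stated purpose of suppressing small arcs.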
4 The Improvements of PRANSAC

4.1 The Parallel Thinning Algorithm

Assume there are P points on the edge. The computational complexity of CBED is O(P), as introduced in [9]. So reducing the number of feature points is an effective way of reducing detection time. PRANSAC improves CBED by using the parallel thinning algorithm [12] in image pre-processing. The parallel thinning algorithm reduces the original lines, whose width is greater than one pixel, to lines with a width of one pixel. An 8-connected skeleton can be achieved while avoiding excessive erosion that loses useful image information. The feature points are reduced effectively but the contours are conserved. The thinning result of this algorithm is shown in Fig. 2.
(a) The original image (10296 feature points)
(b) The result after the parallel thinning algorithm (6933 feature points)
Fig. 2. Parallel Thinning Algorithm effectively reduces the number of feature points, which speeds up the proposed ellipse detection algorithm
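For readers unfamiliar with parallel thinning, the sketch below implements the well-known Zhang-Suen scheme on a set of foreground pixels. It is used here only as a stand-in: the text cites [12] for its thinning algorithm without describing it, so the method actually used in PRANSAC may differ. The function name and the set-based image representation are assumptions.

```python
def zhang_suen_thin(pixels):
    """Thin a set of (row, col) foreground pixels with the Zhang-Suen scheme."""
    img = set(pixels)

    def neighbours(r, c):
        # P2..P9, clockwise, starting with the pixel directly above
        return [(r - 1, c), (r - 1, c + 1), (r, c + 1), (r + 1, c + 1),
                (r + 1, c), (r + 1, c - 1), (r, c - 1), (r - 1, c - 1)]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):                      # the two parallel sub-iterations
            to_delete = []
            for (r, c) in img:
                n = [q in img for q in neighbours(r, c)]
                b = sum(n)                       # number of foreground neighbours
                a = sum((not n[i]) and n[(i + 1) % 8] for i in range(8))  # 0 -> 1 transitions
                p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                if step == 0:
                    cond = not (p2 and p4 and p6) and not (p4 and p6 and p8)
                else:
                    cond = not (p2 and p4 and p8) and not (p2 and p6 and p8)
                if 2 <= b <= 6 and a == 1 and cond:
                    to_delete.append((r, c))
            if to_delete:                        # delete only after the full pass
                img.difference_update(to_delete)
                changed = True
    return img

bar = {(r, c) for r in range(3) for c in range(10)}   # a 3-pixel-wide stroke
thin = zhang_suen_thin(bar)
print(len(bar), len(thin))
```

All deletions of a sub-iteration are decided on the same image state and applied together, which is what makes the scheme parallel.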
4.2 Ellipse Fitting with 3 Points As we known, circles are special ellipses. However, circle detection is much faster than ellipse detection because of its simplicity. Three parameters, that is, the center and radius are required to decide a circle and only 3 points are needed for fitting circles. Therefore, we want to fit ellipses with three points like fitting circles. The original CBED needs five points to determine an ellipse. It wastes lots of time on fitting ellipse and voting. We introduce the tangent direction to fit ellipses with 3 points. In order to fit ellipses with 3 points, we calculate the tangent directions of these points. For every feature point, we get a 5*5 area whose center is this point and fit a line by a least-square solution for the 5*5 area, shown in Fig. 3. The direction of this line is considered as the tangent direction of this point. Then we obtain the tangent directions of the candidate points. We choose three points Pi ( xi , yi ) ( i=0,1, 2) which are not collinear, assuming they are on an ellipse. The
Pi is Li , which is calculated by (2). On the assumption that the midpoint of P1 and P2 is K and the cross point of L1 and L2 is M, the center of ellipse is on the
tangent of
A Fast and Robust Ellipse Detection Algorithm Based on PRANSAC
673
Fig. 4. Location of center
Fig. 3. Tangent direction (calculated by a least-square solution)
line of KM. Also the center is on another line calculated by P2 and P3 . According to this ellipse geometric property (Fig. 4), we confirm the center of ellipse ( x0 , y0 ) by three points and their tangents. Then we transform the reference frame as follows:
$$x' = x - x_0, \qquad y' = y - y_0. \tag{2}$$
The new center of the ellipse is the origin of the new reference frame, so the conic can be written as
$$a_0 x'^2 + 2 a_1 x' y' + a_2 y'^2 = 1. \tag{3}$$
The transformed points are $P_i'$ ($i = 1, 2, 3$), so we can obtain the new parameters as follows:
$$A = X^{-1} B, \qquad A = [a_0 \;\; a_1 \;\; a_2]^T, \qquad B = [1 \;\; 1 \;\; 1]^T, \qquad X = \begin{bmatrix} x_1'^2 & 2 x_1' y_1' & y_1'^2 \\ x_2'^2 & 2 x_2' y_2' & y_2'^2 \\ x_3'^2 & 2 x_3' y_3' & y_3'^2 \end{bmatrix}. \tag{4}$$
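The three-point construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' C++ implementation: the function names (`center_from_tangents`, `fit_centered_conic`, `ellipse_parameters`) are hypothetical, tangents are assumed to be given as direction vectors, and the recovery of the axes from the eigenvalues of the conic matrix is a standard step that the paper does not spell out.

```python
import math

def cross(a, b):
    """2D cross product (z-component)."""
    return a[0] * b[1] - a[1] * b[0]

def intersect(p, d, q, e):
    """Intersection of the lines p + s*d and q + t*e."""
    det = cross(d, e)
    if abs(det) < 1e-12:
        raise ValueError("lines are (nearly) parallel")
    s = cross((q[0] - p[0], q[1] - p[1]), e) / det
    return (p[0] + s * d[0], p[1] + s * d[1])

def center_from_tangents(points, tangents):
    """Center as the intersection of the lines K-M built from (P1, P2) and (P2, P3)."""
    lines = []
    for i, j in ((0, 1), (1, 2)):
        # K: midpoint of the chord, M: intersection of the two tangents.
        k = ((points[i][0] + points[j][0]) / 2, (points[i][1] + points[j][1]) / 2)
        m = intersect(points[i], tangents[i], points[j], tangents[j])
        lines.append((k, (m[0] - k[0], m[1] - k[1])))
    (k1, d1), (k2, d2) = lines
    return intersect(k1, d1, k2, d2)

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_centered_conic(center, points):
    """Solve X A = B of Eq. (4) by Cramer's rule; returns (a0, a1, a2)."""
    rows = []
    for x, y in points:
        xp, yp = x - center[0], y - center[1]      # Eq. (2)
        rows.append([xp * xp, 2 * xp * yp, yp * yp])
    d = det3(rows)
    if abs(d) < 1e-12:
        raise ValueError("degenerate point configuration")
    coeffs = []
    for col in range(3):
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = 1.0                        # right-hand side B = [1 1 1]^T
        coeffs.append(det3(m) / d)
    return tuple(coeffs)

def ellipse_parameters(a0, a1, a2):
    """Semi-major axis, semi-minor axis and major-axis angle of Eq. (3)."""
    mean = (a0 + a2) / 2
    dev = math.hypot((a0 - a2) / 2, a1)
    lam_min, lam_max = mean - dev, mean + dev      # eigenvalues of [[a0, a1], [a1, a2]]
    if lam_min <= 0:
        raise ValueError("the conic is not an ellipse")
    angle = (0.5 * math.atan2(2 * a1, a0 - a2) + math.pi / 2) % math.pi
    return 1 / math.sqrt(lam_min), 1 / math.sqrt(lam_max), angle
```

A typical call chain is `center_from_tangents(points, tangents)`, then `fit_centered_conic(center, points)`, then `ellipse_parameters(a0, a1, a2)`. The sketch raises an error when two tangents are parallel or when the three translated points do not determine the conic, cases in which a different candidate triple would be needed.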
From (4) we can calculate the major axis, the minor axis and the rotation angle. We thus obtain all ellipse parameters from 3 points.
4.3 Robust Candidate Sequence
CBED uses the RANSAC method to estimate the parameters of circles and ellipses from the data, which is fast and robust against outliers. However, RANSAC chooses the candidates randomly and can therefore give different results when the same image is processed several times. One example is shown in Fig. 5. Fig. 5(a) shows a good result from the first detection run. When the same image is processed a second time, Ellipse 4 is not detected, as shown in Fig. 5(b).
Fig. 5. Different detection results by original RANSAC: (a) first detection, (b) second detection
Fig. 6. Robust detection result by PRANSAC
To get stable results, PRANSAC introduces a robust candidate sequence for every contour. We generate a pseudo-random sequence of points as the candidates used by RANSAC to detect ellipses. A pseudo-random sequence looks random but is in fact deterministic. The seed of the pseudo-random generator can be set to the number of points on the contour, so the sequence supplies the same series of points in every detection run. In many experiments, the robust candidate sequence has produced stable and reasonable candidates for ellipse detection and has given the same result over repeated detections; see Fig. 6.
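The idea of a seeded, reproducible candidate sequence can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `candidate_triples` is hypothetical, Python's `random.Random` stands in for whatever generator the authors used, and only the seeding rule (seed = number of contour points) is taken from the text.

```python
import random

def candidate_triples(contour, n_candidates):
    """Draw the same sequence of 3-point candidates for a contour on every run.

    The generator is seeded with the number of contour points, so repeated
    detections on the same contour see identical candidates.
    """
    rng = random.Random(len(contour))   # deterministic seed, as described in the text
    return [tuple(rng.sample(contour, 3)) for _ in range(n_candidates)]
```

Because the generator is local to the call, other uses of randomness in a program do not disturb the sequence. Two contours with the same number of points share the same index pattern, which may or may not be desirable in practice.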
5 Experiments and Comparison
The method proposed in this paper has been implemented in C++ on an Intel Pentium 4 CPU at 3.06 GHz with 512 MB of memory, running Windows XP. The performance is shown in Table 1 and compared with CBED. In image pre-processing, we remove useless points with the parallel thinning algorithm and calculate the tangent directions for all the feature points. Compared to CBED, the image pre-processing time is therefore longer, but the total detection time is reduced by about 50% on average. We also compare PRANSAC with RTEHT [13] and show the result in Table 2.

Table 1. Performance comparison with CBED (times in ms; total detection time includes image pre-processing)
Test image     Resolution   Pre-processing time     Total detection time
                            CBED      PRANSAC       CBED      PRANSAC
Test image 1   300*300      7         15            109       39
Test image 2   640*400      24        50            161       94
Test image 3   560*420      25        39            170       100
Test image 4   640*480      30        60            220       150
Test image 5   480*360      30        45            379       80

Table 2. Performance comparison with RTEHT (times in ms)
Test image     Resolution   RTEHT (Pentium III 500MHz, 256M memory)   PRANSAC (Pentium III 700MHz, 256M memory)
Test image 6   320*240      145                                       84
Test image 7   320*240      150                                       78

Fig. 7. Ellipse detection
Fig. 8. Ellipse instrument detection
Fig. 9. Ellipse work piece detection
Fig. 10. Test images in [13]
(Test images 1 to 5 are shown in Figs. 7 to 9, Test images 6 and 7 in Fig. 10.)
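As a quick arithmetic check of the reported reduction, the total detection times of Table 1 can be turned into relative savings. The snippet below is only a worked calculation on the published numbers; the variable names are illustrative.

```python
# Total detection time in ms from Table 1: test image -> (CBED, PRANSAC).
total_ms = {1: (109, 39), 2: (161, 94), 3: (170, 100), 4: (220, 150), 5: (379, 80)}

# Relative reduction of the total detection time for each test image.
reductions = {img: 1 - pransac / cbed for img, (cbed, pransac) in total_ms.items()}
mean_reduction = sum(reductions.values()) / len(reductions)

for img, red in sorted(reductions.items()):
    print(f"Test image {img}: {100 * red:.1f}% less time than CBED")
print(f"Mean reduction: {100 * mean_reduction:.1f}%")
```

The per-image savings vary considerably, so the 50% figure is best read as an average over the five test images rather than a guarantee for any single image.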
6 Conclusion
A fast and robust ellipse detection algorithm has been proposed in this paper. We introduce the parallel thinning algorithm to remove useless feature points effectively while preserving useful information. We also make use of a geometric property of the ellipse and introduce tangent directions to fit ellipses with 3 points, which reduces detection time by about 50%. To avoid unstable results, we propose the “robust candidate sequence” as the source of candidates for RANSAC. In many experiments on different scenes, PRANSAC has demonstrated high performance.
Acknowledgement. The research is supported by the National Natural Science Foundation of China (60433030).
References
[1] Nair, P.S., Saunders, A.T.: Hough transform based ellipse detection algorithm. Pattern Recognition Letters 17, 777–784 (1996)
[2] Guil, N., Zapata, E.L.: Lower Order Circle and Ellipse Hough Transform. Pattern Recognition 30(10), 1729–1744 (1997)
[3] Yu, X., Leong, H.W., Xu, C., Tian, Q.: A robust and accumulator-free ellipse Hough transform. In: ACM MM04, pp. 256–259 (2004)
[4] Ser, P.-K., Siu, W.-C.: Novel detection of conics using 2-D Hough planes. IEE Proceedings: Vision, Image and Signal Processing 142(5), 262–270 (1995)
[5] Muammar, H., Nixon, M.: Approaches to extending the Hough transform. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP-89, vol. 3, pp. 1556–1559 (May 23-26, 1989)
[6] Kawaguchi, T., Nagata, R.-I.: Ellipse detection using a genetic algorithm. In: Proceedings, Fourteenth International Conference on Pattern Recognition, vol. 1, pp. 141–145 (August 16-20, 1998)
[7] Frigui, H., Krishnapuram, R.: A comparison of fuzzy shell-clustering methods for the detection of ellipses. IEEE Transactions on Fuzzy Systems 4(2), 193–199 (1996)
[8] Kim, E., Haseyama, M., Kitajima, H.: Fast line extraction from digital images using line segments. IEICE Trans. 84(8), 1566–1579 (2001)
[9] Cai, W., Yu, Q., Wang, H.: A Fast Contour-based Approach to Circle and Ellipse Detection. In: Proceedings of IEEE 5th World Conference on Intelligent Control and Automation (WCICA 2004), June 15-19, 2004, Hangzhou, China, pp. 4686–4690 (2004)
[10] Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of the ACM 24, 381–395 (1981)
[11] Xie, Y., Ji, Q.: A New Efficient Ellipse Detection Method. In: International Conference on Pattern Recognition, pp. 957–960 (2002)
[12] Guo, Z., Hall, R.W.: Parallel thinning with two-subiteration algorithms. Communications of the ACM 32(3), 359–373 (1989)
[13] Zhang, S.-C., Liu, Z.-Q.: A robust real-time ellipse detector. Pattern Recognition 38, 273–287 (2004)
A Definition for Orientation for Multiple Component Shapes
Joviša Žunić¹ and Paul L. Rosin²
¹ Computer Science Department, Exeter University, Exeter EX4 4QF, UK
[email protected]
² School of Computer Science, Cardiff University, Cardiff CF24 3AA, Wales, UK
[email protected]
Abstract. In this paper we introduce a new method for computing the orientation for compound shapes. If the method is applied to single component shapes the computed orientation is consistent with the shape orientation defined by the axis of the least second moment of inertia. If the new method is applied to compound shapes this is not the case, and consequently the presented method is both new and different.
Keywords: Shape, orientation, image normalization, early vision.
1 Introduction
Determining the orientation of a shape is often performed in computer vision so as to enable subsequent analysis to be carried out in the shape’s local frame of reference (thereby simplifying that analysis). While for many shapes their orientations are obvious and can be computed easily, the orientation of other shapes may be ambiguous, subtle, or ill defined. The difficulty of the task can be seen from the multiplicity of mechanisms used in human perception, in which orientation can be determined by axes of symmetry and elongation [1], as well as cues from local contour, texture, and context [9].
The most common computational method for determining a shape’s orientation is based on the axis of the least second moment of inertia [4,6]. Although straightforward and efficient to compute, it breaks down in some circumstances; for example, problems arise when working with symmetric shapes [11,12]. This has encouraged the development of other competing methods [2,3,4,5,6,7,10,11]. The suitability of those methods strongly depends on the particular situation in which they are applied, as they each have their relative strengths and weaknesses (e.g. relating to robustness to noise, classes of shape that can be oriented, number of parameters, computational efficiency).
In this paper we focus on the orientation of shapes consisting of several components. We introduce a new approach to the shape orientation problem and then extend the method to shapes that consist of several components. We consider a line that maximises the integral of the squared lengths of the projections, onto this line, of the line segments whose end points belong to the shape. Then we define the orientation of the shape by the slope of such a line. It turns out that, if applied to single component shapes, such a method for computing shape orientation is consistent with the standard method based on the line of the least second moment of inertia. The new approach leads to a natural definition of the orientation of compound shapes. The orientation of a compound shape computed in this way differs from the orientation computed by the standard method. In some situations the orientation computed by the new method is more appropriate than the orientation computed by the standard method.
The paper is organized as follows. In the next section we give a short sketch of the standard method for computing shape orientation. In Section 3 we introduce a new approach for orienting shapes. Section 4 adapts the newly introduced approach to the orientation of compound shapes, discusses the properties of the new method and gives some illustrative examples. Section 5 gives concluding comments.
J. Žunić is also with the Mathematical Institute - SANU, Belgrade.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 677–685, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 The Standard Method
The standard method for computing shape orientation is based on the axis of the least second moment of inertia. The axis of the least second moment of inertia ([4,6]) is the line which minimises the integral of the squared distances from the points of the shape to the line. The integral is
$$I(\alpha, S, \rho) = \iint_S r^2(x, y, \alpha, \rho)\, dx\, dy \tag{1}$$
where $r(x, y, \alpha, \rho)$ is the perpendicular distance from the point $(x, y)$ to the line given in the form $X \cdot \sin\alpha - Y \cdot \cos\alpha = \rho$.
It is well known that the axis of the least second moment of inertia passes through the shape centroid. The centroid is computed as $\left(\frac{m_{1,0}(S)}{m_{0,0}(S)}, \frac{m_{0,1}(S)}{m_{0,0}(S)}\right)$, where $m_{0,0}(S)$ is the zeroth order moment of $S$ and $m_{1,0}(S)$ and $m_{0,1}(S)$ are the first order moments of $S$. In general, the moment $m_{p,q}(S)$ is defined as $m_{p,q}(S) = \iint_S x^p y^q\, dx\, dy$ and has order $p + q$. Thus, if the shape $S$ is translated by the vector $-\left(\frac{m_{1,0}(S)}{m_{0,0}(S)}, \frac{m_{0,1}(S)}{m_{0,0}(S)}\right)$ (so that the centroid of $S$ coincides with the origin) then it is possible to set $\rho = 0$. Since the squared distance of a point $(x, y)$ to the line $X \cdot \sin\alpha - Y \cdot \cos\alpha = 0$ is $(x \cdot \sin\alpha - y \cdot \cos\alpha)^2$, the function $F(\alpha, S)$ that should be minimised in order to compute the orientation of $S$ can be expressed as
$$\begin{aligned} F(\alpha, S) &= \iint_S \left( \left(x - \frac{m_{1,0}(S)}{m_{0,0}(S)}\right) \sin\alpha - \left(y - \frac{m_{0,1}(S)}{m_{0,0}(S)}\right) \cos\alpha \right)^2 dx\, dy \\ &= \left( m_{2,0}(S) - \frac{(m_{1,0}(S))^2}{m_{0,0}(S)} \right) \sin^2\alpha + \left( m_{0,2}(S) - \frac{(m_{0,1}(S))^2}{m_{0,0}(S)} \right) \cos^2\alpha \\ &\quad - \left( m_{1,1}(S) - \frac{m_{1,0}(S) \cdot m_{0,1}(S)}{m_{0,0}(S)} \right) \sin(2\alpha). \end{aligned} \tag{2}$$
Definition 1. The orientation of a given shape $S$ is defined by the angle $\alpha$ for which the function $F(\alpha, S)$ reaches the minimum.
Elementary mathematics says that the angle $\alpha$ which defines the orientation of $S$ (as given by Definition 1) satisfies the following equation:
$$\frac{\sin(2\alpha)}{\cos(2\alpha)} = \frac{2 \cdot \overline{m}_{1,1}(S)}{\overline{m}_{2,0}(S) - \overline{m}_{0,2}(S)}, \tag{3}$$
where $\overline{m}_{p,q}(S) = \iint_S \left(x - \frac{m_{1,0}(S)}{m_{0,0}(S)}\right)^p \left(y - \frac{m_{0,1}(S)}{m_{0,0}(S)}\right)^q dx\, dy$ are centralised moments.
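For a digitised shape given as a list of pixel coordinates, Definition 1 and equation (3) can be sketched as below. This is an illustrative sketch only: the integrals are replaced by sums over the pixels, the function names are hypothetical, and `atan2` is used to select the solution of (3) at which F is minimal rather than maximal.

```python
import math

def central_moments(points):
    """Discrete centralised second-order moments of a set of pixel coordinates."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in points)
    mu20 = sum((x - cx) ** 2 for x, _ in points)
    mu02 = sum((y - cy) ** 2 for _, y in points)
    return mu11, mu20, mu02

def standard_orientation(points):
    """Angle alpha (radians) that minimises F, or None if F is constant."""
    mu11, mu20, mu02 = central_moments(points)
    if abs(mu11) < 1e-12 and abs(mu20 - mu02) < 1e-12:
        return None          # the shape is not orientable by this method
    # Solves tan(2*alpha) = 2*mu11 / (mu20 - mu02) on the branch where F is minimal.
    return 0.5 * math.atan2(2 * mu11, mu20 - mu02)
```

For example, a long horizontal bar of pixels yields an angle of 0, a diagonal line of pixels yields π/4, and a square block yields `None` because both the numerator and the denominator of (3) vanish.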
3 An Alternative Approach
In this section we start with a different motivation for the definition of shape orientation. It will turn out that this approach in its basic variant (when single component shapes are considered) leads to the same shape orientation as the one computed from the axis of the least second moment of inertia, but the application to compound shapes leads to an essentially new method.
Let $S$ be a given shape, and consider the line segments $[AB]$ whose end points $A$ and $B$ are points from $S$. Let $\vec{a}$ denote the unit vector in the direction $\alpha$, i.e., $\vec{a} = (\cos\alpha, \sin\alpha)$. Also, let $pr_{\vec{a}}[AB]$ be the projection of the line segment $[AB]$ onto a line that makes an angle $\alpha$ with the x-axis, while $|pr_{\vec{a}}[AB]|$ denotes the length of such a projection. Then it seems very natural to define the orientation of the shape $S$ by the direction that maximises the integral $\iint\!\!\iint_{A \in S,\, B \in S} |pr_{\vec{a}}[AB]|^2\, dx\, dy\, du\, dv$ of the squared lengths of the projections of such line segments onto a line having this direction. Thus we give the following definition.
Definition 2. The orientation of a given shape $S$ is defined by the angle $\alpha$ where the function
$$G(\alpha, S) = \iint\!\!\iint_{\substack{A = (x, y) \in S \\ B = (u, v) \in S}} |pr_{\vec{a}}[AB]|^2\, dx\, dy\, du\, dv \tag{4}$$
reaches its maximum.
Fig. 1. Projections of all the line segments whose endpoints lie in S are considered, irrespective of whether the line segment intersects the boundary of S
Even though Definition 1 and Definition 2 come from different motivations, it turns out that they are equivalent. Theorem 1 shows that the sum $G(\alpha, S) + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S)$ depends only on the shape $S$ but not on the angle $\alpha$. This implies that the maximum of $G(\alpha, S)$ and the minimum of $F(\alpha, S)$ are reached at the same angle. In other words, the orientations computed by Definition 1 and Definition 2 are consistent.
Theorem 1. The following equality holds:
$$G(\alpha, S) + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S) = 2 \cdot \left( (m_{2,0}(S) + m_{0,2}(S)) \cdot m_{0,0}(S) - (m_{1,0}(S))^2 - (m_{0,1}(S))^2 \right). \tag{5}$$
Proof. Let $A = (x, y)$ and $B = (u, v)$. The proof follows easily from the trivial equalities
$$|pr_{\vec{a}}[AB]|^2 = |(x - u, y - v) \cdot (\cos\alpha, \sin\alpha)|^2 \quad \text{and} \quad \iint\!\!\iint_{S \times S} x^p y^q u^r v^t\, dx\, dy\, du\, dv = m_{p,q}(S) \cdot m_{r,t}(S).$$
Indeed,
$$\begin{aligned} G(\alpha, S) + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S) &= \iint\!\!\iint_{S \times S} |pr_{\vec{a}}[AB]|^2\, dx\, dy\, du\, dv + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S) \\ &= \iint\!\!\iint_{S \times S} ((x - u) \cdot \cos\alpha + (y - v) \cdot \sin\alpha)^2\, dx\, dy\, du\, dv + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S) \\ &= 2 (m_{2,0}(S) + m_{0,2}(S)) \cdot m_{0,0}(S) - 2 (m_{1,0}(S))^2 - 2 (m_{0,1}(S))^2. \end{aligned}$$
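The relation between F and G can be checked numerically on a small point set. The sketch below is an illustration, not part of the paper: integrals are replaced by sums over points (so the zeroth order moment is the number of points), and the helper names are hypothetical. It evaluates the combination G + 2·m0,0·F, which is the combination that is independent of α, and compares it with the right-hand side of (5).

```python
import math

def moment(points, p, q):
    """Discrete raw moment m_{p,q}: sum of x^p * y^q over the points."""
    return sum(x ** p * y ** q for x, y in points)

def F(alpha, points):
    """Sum of squared distances to the line through the centroid with direction alpha."""
    m00 = moment(points, 0, 0)
    cx, cy = moment(points, 1, 0) / m00, moment(points, 0, 1) / m00
    s, c = math.sin(alpha), math.cos(alpha)
    return sum(((x - cx) * s - (y - cy) * c) ** 2 for x, y in points)

def G(alpha, points):
    """Sum over all point pairs of the squared projected length of segment [AB]."""
    s, c = math.sin(alpha), math.cos(alpha)
    return sum(((x - u) * c + (y - v) * s) ** 2
               for x, y in points for u, v in points)

def alpha_independent_value(points):
    """Right-hand side of Eq. (5), which does not depend on alpha."""
    m = lambda p, q: moment(points, p, q)
    return 2 * ((m(2, 0) + m(0, 2)) * m(0, 0) - m(1, 0) ** 2 - m(0, 1) ** 2)

points = [(0, 0), (4, 1), (7, 3), (2, 5), (9, 9)]
for alpha in (0.0, 0.4, 1.1, 2.5):
    combined = G(alpha, points) + 2 * moment(points, 0, 0) * F(alpha, points)
    print(round(combined, 6), alpha_independent_value(points))
```

Since the combination is constant in α, any angle that maximises G also minimises F, which is the consistency statement of this section.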
4 Orientation of Compound Objects
Image analysis often deals with groups of shapes or with shapes that are composed of several parts. The desired properties of the computed orientation of such compound shapes can vary. Sometimes it is reasonable that the computed orientation is derived from the orientations of the components of the compound shape, whereas in other cases it is preferable that the orientation is a global property of the whole compound object. In this section we introduce a new definition for computing the orientation of such compound shapes and give some examples of computed orientations. The definition of such an orientation follows naturally from Definition 2.
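Definition 3. Let $S$ be a compound object which consists of $m$ disjoint shapes $S_1, S_2, \ldots, S_m$. Then the orientation of $S$ is defined by the angle that maximises the function $G_{comp}(\alpha, S)$ defined by
$$G_{comp}(\alpha, S) = \sum_{i=1}^{m} \iint\!\!\iint_{\substack{A = (x, y) \in S_i \\ B = (u, v) \in S_i}} |pr_{\vec{a}}[AB]|^2\, dx\, dy\, du\, dv. \tag{6}$$
The above definition allows an easy computation of the defined orientation.
Theorem 2. The angle $\alpha$ where the function $G_{comp}(\alpha, S)$ reaches the maximum satisfies the following equation (here $\overline{m}_{p,q}$ denotes the centralised moments of Section 2):
$$\frac{\sin(2\alpha)}{\cos(2\alpha)} = \frac{2 \cdot \sum_{i=1}^{m} \left( m_{1,1}(S_i) \cdot m_{0,0}(S_i) - m_{1,0}(S_i) \cdot m_{0,1}(S_i) \right)}{\sum_{i=1}^{m} \left( (m_{2,0}(S_i) - m_{0,2}(S_i)) \cdot m_{0,0}(S_i) + (m_{0,1}(S_i))^2 - (m_{1,0}(S_i))^2 \right)} = \frac{2 \cdot \sum_{i=1}^{m} \overline{m}_{1,1}(S_i) \cdot m_{0,0}(S_i)}{\sum_{i=1}^{m} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) \cdot m_{0,0}(S_i)}. \tag{7}$$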
Proof. As in the proof of Theorem 1, the statement follows easily by setting $dG_{comp}(\alpha, S)/d\alpha = 0$.
The following notes are given to point out some properties of $G_{comp}(\alpha, S)$.
Note 1. The computed orientation of a single component object (based on $F(\alpha, S)$ and $G(\alpha, S)$) breaks down when the centralised moments satisfy $\overline{m}_{1,1}(S) = \overline{m}_{2,0}(S) - \overline{m}_{0,2}(S) = 0$, because under these conditions $F(\alpha, S)$ and $G(\alpha, S)$ become constant functions (see (2) and (5)). Consequently no direction can be selected as a shape orientation. Analogously, when $\sum_{i=1}^{m} \overline{m}_{1,1}(S_i) = \sum_{i=1}^{m} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) = 0$ holds, the orientation of the compound shape $S = S_1 \cup \ldots \cup S_m$ cannot be computed by $G_{comp}(\alpha, S)$.
Note 2. Any component $S_i$ of a compound shape $S = S_1 \cup \ldots \cup S_m$ that is considered unorientable by $G(\alpha, S_i)$ (i.e. $G(\alpha, S_i) = const.$) will not contribute to (7), and is therefore ignored in the computation of $G_{comp}(\alpha, S)$. That is because $G(\alpha, S_i) = const.$ implies $\overline{m}_{1,1}(S_i) = 0$ and $\overline{m}_{2,0}(S_i) = \overline{m}_{0,2}(S_i)$.
Note 3. If all components $S_i$ of a given shape $S$ have identical orientation according to $G(\alpha, S_i)$ then this same orientation is also computed by $G_{comp}(\alpha, S)$.
Note 4. From (7) it can be seen that components of $S$ contribute a weight proportional to $m_{0,0}(S_i)^3$.
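Equation (7) and Notes 2 and 3 can be illustrated on digitised components. The following is a minimal sketch, assuming each component is given as a list of pixel coordinates and that sums over pixels replace the integrals; the function names are hypothetical. It uses the first (raw moment) form of (7).

```python
import math

def orientation_terms(points):
    """Contribution of one component to the numerator and denominator of Eq. (7)."""
    m00 = len(points)
    m10 = sum(x for x, _ in points)
    m01 = sum(y for _, y in points)
    m20 = sum(x * x for x, _ in points)
    m02 = sum(y * y for _, y in points)
    m11 = sum(x * y for x, y in points)
    num = 2 * (m11 * m00 - m10 * m01)
    den = (m20 - m02) * m00 + m01 ** 2 - m10 ** 2
    return num, den

def compound_orientation(components):
    """Angle maximising G_comp, or None when numerator and denominator both vanish."""
    terms = [orientation_terms(c) for c in components]
    num = sum(t[0] for t in terms)
    den = sum(t[1] for t in terms)
    if num == 0 and den == 0:
        return None
    return 0.5 * math.atan2(num, den)

# Two horizontal bars stacked far apart, plus a small square.
bar1 = [(x, 0) for x in range(10)]
bar2 = [(x, 50) for x in range(10)]
square = [(x + 20, y + 20) for x in range(3) for y in range(3)]

print(compound_orientation([bar1, bar2]))          # both bars are horizontal
print(compound_orientation([bar1, bar2, square]))  # the square adds nothing (Note 2)
print(compound_orientation([bar1 + bar2]))         # union treated as one shape
```

Treating the two bars as separate components keeps the horizontal orientation they share (Note 3), whereas treating their union as a single shape yields a vertical orientation because the vertical spread between the bars dominates the second-order moments.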
The cubic weighting given to components in (7) seems excessive since it tends to cause the larger components to strongly dominate the computed orientation. For instance, let a compound object $S$ consist of a shape $S_1$ and a shape $S_2$ which is the dilation of a shape $S_2'$ by a factor $r$, i.e. $S_2 = r \cdot S_2' = \{(r \cdot x, r \cdot y) \mid (x, y) \in S_2'\}$, so that the size of $S_2$ increases as $r$ increases. Then, for the area and the centralised second-order moments, $m_{0,0}(S_2) = r^2 \cdot m_{0,0}(S_2')$, $\overline{m}_{1,1}(S_2) = r^4 \cdot \overline{m}_{1,1}(S_2')$, $\overline{m}_{2,0}(S_2) = r^4 \cdot \overline{m}_{2,0}(S_2')$ and $\overline{m}_{0,2}(S_2) = r^4 \cdot \overline{m}_{0,2}(S_2')$, so each product of a second-order moment with the area in (7) scales by $r^6$. Entering these estimates into (7) we obtain that $\sin(2\alpha)/\cos(2\alpha)$ equals
$$\frac{2 \cdot \overline{m}_{1,1}(S_1) \cdot m_{0,0}(S_1) + 2 \cdot r^6 \cdot \overline{m}_{1,1}(S_2') \cdot m_{0,0}(S_2')}{(\overline{m}_{2,0}(S_1) - \overline{m}_{0,2}(S_1)) \cdot m_{0,0}(S_1) + r^6 \cdot (\overline{m}_{2,0}(S_2') - \overline{m}_{0,2}(S_2')) \cdot m_{0,0}(S_2')}$$
and obviously the influence of $S_2$ on the computed orientation of $S$ can be very large if the dilation factor $r$ is much bigger than 1.
This suggests a modification of (7) to enforce instead a linear weighting by area, namely:
$$\frac{\sin(2\alpha)}{\cos(2\alpha)} = \frac{2 \cdot \sum_{i=1}^{m} \overline{m}_{1,1}(S_i) / m_{0,0}(S_i)}{\sum_{i=1}^{m} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) / m_{0,0}(S_i)}. \tag{8}$$
If the orientation $\alpha$ of $S = S_1 \cup S_2 = S_1 \cup r \cdot S_2'$ is computed by (8) then $\sin(2\alpha)/\cos(2\alpha)$ equals
$$\frac{2 \cdot \overline{m}_{1,1}(S_1) / m_{0,0}(S_1) + 2 \cdot r^2 \cdot \overline{m}_{1,1}(S_2') / m_{0,0}(S_2')}{(\overline{m}_{2,0}(S_1) - \overline{m}_{0,2}(S_1)) / m_{0,0}(S_1) + r^2 \cdot (\overline{m}_{2,0}(S_2') - \overline{m}_{0,2}(S_2')) / m_{0,0}(S_2')}.$$
It is not difficult to imagine situations where a change in the size of the components of a compound object should have no effect on the computed orientation. For instance, objects may be the same size in nature, but appear to have different sizes in the image due to varying distances from the camera. If we would like to avoid any impact of the size of the components on the computed orientation of a compound object then we can use the formula:
$$\frac{\sin(2\alpha)}{\cos(2\alpha)} = \frac{2 \cdot \sum_{i=1}^{m} \overline{m}_{1,1}(S_i) / (m_{0,0}(S_i))^2}{\sum_{i=1}^{m} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) / (m_{0,0}(S_i))^2}.$$
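The effect of the three weightings can be compared on a toy example. The sketch below is illustrative and makes several assumptions that are not in the paper: components are idealised rectangles whose moments are known in closed form, the helper names are hypothetical, and a single `exponent` parameter selects the weighting (1 for (7), -1 for (8), -2 for the size-independent formula).

```python
import math

def rect_moments(a, b, theta):
    """Area and centralised second-order moments of an a-by-b rectangle rotated by theta."""
    m00 = a * b
    mu20_own = a ** 3 * b / 12.0          # moments in the rectangle's own frame
    mu02_own = a * b ** 3 / 12.0
    c, s = math.cos(theta), math.sin(theta)
    mu20 = mu20_own * c * c + mu02_own * s * s
    mu02 = mu20_own * s * s + mu02_own * c * c
    mu11 = (mu20_own - mu02_own) * s * c
    return m00, mu11, mu20, mu02

def compound_orientation(components, exponent):
    """Orientation from centralised moments weighted by m00**exponent."""
    num = sum(2 * mu11 * m00 ** exponent for m00, mu11, _, _ in components)
    den = sum((mu20 - mu02) * m00 ** exponent for m00, _, mu20, mu02 in components)
    return 0.5 * math.atan2(num, den)

def example(r, exponent):
    """A horizontal 4x1 bar plus a 2x0.5 bar tilted by 60 degrees and dilated by r."""
    s1 = rect_moments(4.0, 1.0, 0.0)
    s2 = rect_moments(2.0 * r, 0.5 * r, math.pi / 3)
    return compound_orientation([s1, s2], exponent)

for r in (1, 3, 10):
    print(r, [round(math.degrees(example(r, e)), 2) for e in (1, -1, -2)])
```

Under these assumptions the orientation from (7) moves quickly towards the tilt of the dilated component as r grows, (8) moves more slowly, and the size-independent formula returns the same angle for every r.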
Following on from Note 2, any component that is almost unorientable by $G(\alpha, S_i)$ will have little effect on the computation of $G_{comp}(\alpha, S)$. Thus we can say that $G_{comp}(\alpha, S)$ is not the same as computing a simple circular mean¹ of the orientations produced by $G(\alpha, S_i)$, since $G_{comp}(\alpha, S)$ weights the contributions of components according to both their area and their orientability.
The new methods (given by (7) and (8)) for computing orientation are demonstrated on some trademarks in figure 2. In figure 2a-c the orientations computed from $G_{comp}(\alpha, S)$ are different from, and preferable to, those based on $F(\alpha, S)$ (in which the trademarks are considered as single component objects).
¹ The circular mean $\mu$ of a set of $n$ orientation samples $\theta_i$ is defined as [8]: $\mu = \mu'$ if $S > 0$ and $C > 0$; $\mu = \mu' + \pi$ if $C < 0$; $\mu = \mu' + 2\pi$ if $S < 0$ and $C > 0$, where $\mu' = \tan^{-1}\frac{S}{C}$, $C = \frac{1}{n}\sum_{i=1}^{n} \cos\theta_i$, $S = \frac{1}{n}\sum_{i=1}^{n} \sin\theta_i$.
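For comparison, the circular mean of the footnote can be computed as follows. This is a small sketch, assuming angles in radians; the function name is hypothetical, and `atan2` followed by a reduction modulo 2π is used as a compact equivalent of the case distinction on the signs of S and C.

```python
import math

def circular_mean(angles):
    """Circular mean in [0, 2*pi), or None when the mean direction is undefined."""
    n = len(angles)
    c = sum(math.cos(t) for t in angles) / n
    s = sum(math.sin(t) for t in angles) / n
    if math.hypot(c, s) < 1e-12:
        return None                     # samples cancel out, no mean direction
    return math.atan2(s, c) % (2 * math.pi)
```

Unlike the compound orientation, this mean gives every sample the same weight, regardless of the area or the orientability of the component it came from.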
Fig. 2. Orientations of trademarks are shown computed by (7), (8), and (3)
Panel   (7)        (8)        (3)
(a)     61.9°      61.9°      20.5°
(b)     66.8°      67.0°      32.2°
(c)     50.6°      49.6°      −17.8°
(d)     −44.4°     −44.4°     −40.3°
(e)     88.5°      88.6°      −0.9°
(f)     −64.2°     −71.1°     −28.1°
(g)     −65.54°    89.94°     −4.4°
(h)     24.0°      −4.8°      13.9°

Fig. 3. Orientations computed by (8) of multiple components are overlaid
The computed orientations for the shape in figure 2d are very close to one another and nearly coincide with one of the shape’s symmetry axes. The small inconsistency is caused by the discretization process. The trademark in figure 2e has reflective symmetry, and so the orientation given by $G_{comp}(\alpha, S)$ is along the symmetry axis, in contrast to the standard method for estimating orientation. The individual orientations of the components are {−63.7°, 61.7°}. When one component is reduced in size (figure 2f) the larger component dominates (7), whereas this effect is reduced by (8). In figure 2g four quarter-area components combine to have the same effect as a single full-size component, and the orientation given by (8) is close to 90° again. The larger component still strongly dominates (7).
Figure 2h gives an example of a shape that is unorientable by (7) and (8); see Note 1. The individual component orientations of {65.5°, −54.7°, 5.6°} are well defined; the values of $|\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)| + |\overline{m}_{1,1}(S_i)|$ are around 400. However, $\left|\sum_{i=1}^{3} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i))\right| + \left|\sum_{i=1}^{3} \overline{m}_{1,1}(S_i)\right|$ is only equal to 6. This explains the inconsistency between the orientation estimates from (7) and (8) despite the component areas being almost equal.
The new method is further demonstrated on several natural scenes in figure 3. The overall orientations 47.16°, 87.68°, and 82.87° computed by (8) are appropriate and seem to be more acceptable than the orientations −15.6°, −8.1° and −4.3° computed by minimizing $F(\alpha, S)$. Note that in the last example the tilted blob has only a minor impact on the overall orientation estimated by (8).
5 Conclusion
In this paper we have considered an alternative approach for the computation of shape orientation. One benefit of the new approach is that it leads to a new method for computing the orientation of compound objects which consist of several components. It was shown that the compound shape orientation defined in this way has several attractive properties. The computed orientations are given for several examples and they are reasonable. If applied to single component objects, the newly defined shape orientation is consistent with the standard method for shape orientation computation, but the methods differ when applied to shapes consisting of several components. It is not possible to say that the orientations computed by one of the methods are better than the orientations computed by the other. A final judgement can be given only with knowledge of the particular application in which the methods are used.
References
1. Boutsen, L., Marendaz, C.: Detection of shape orientation depends on salient axes of symmetry and elongation: Evidence from visual search. Perception & Psychophysics 63(3), 404–422 (2001)
2. Cortadellas, J., Amat, J., De la Torre, F.: Robust normalization of silhouettes for recognition applications. Pattern Recognition Letters 25(5), 591–601 (2004)
3. Ha, V.H.S., Moura, J.M.F.: Affine-permutation invariance of 2-d shapes. IEEE Trans. on Image Processing 14(11), 1687–1700 (2005)
4. Jain, R., Kasturi, R., Schunck, B.G.: Machine Vision. McGraw-Hill, New York (1995)
5. Kim, W.Y., Kim, Y.S.: Robust rotation angle estimator. IEEE Trans. on Patt. Anal. and Mach. Intell. 21(8), 768–773 (1999)
6. Klette, R., Rosenfeld, A.: Digital Geometry. Morgan Kaufmann, San Francisco (2004)
7. Lin, J.-C.: The family of universal axes. Pattern Recognition 29(3), 477–485 (1996)
8. Mardia, K.V., Jupp, P.E.: Directional Statistics. John Wiley, New York, NY (1999)
9. Palmer, S.E.: Vision Science: Photons to Phenomenology. MIT Press, Cambridge (1999)
10. Shen, D., Ip, H.H.S.: Optimal axes for defining the orientations of shapes. Electronics Letters 32(20), 1873–1874 (1996)
11. Tsai, W.H., Chou, S.L.: Detection of generalized principal axes in rotationally symmetric shapes. Pattern Recognition 24(1), 95–104 (1991)
12. Žunić, J., Kopanja, L., Fieldsend, J.E.: Notes on shape orientation where the standard method does not work. Pattern Recognition 39(5), 856–865 (2006)
Definition of a Model-Based Detector of Curvilinear Regions
Cédric Lemaître¹, Johel Miteran¹, and Jiří Matas²
¹ LE2I Faculté Mirande Aile H. Université de Bourgogne BP 47870 21078 Dijon
² Center for Machine Perception, Dept. of Cybernetics, CTU Prague, Karlovo nam 13, CZ 12 135
{cedric.lemaitre,miteranj}@u-bourgogne.fr, [email protected]
Abstract. This paper describes a new approach for the detection of curvilinear regions. The detection of such features can be useful for any matching-based algorithm, such as stereoscopic vision. Our detector is based on a curvilinear structure model, defined by observing the real world. We then propose a multi-scale search algorithm for curvilinear regions and report some preliminary results.
1 Introduction
Detecting specific features has been shown to be useful in many computer vision applications such as stereoscopic vision [1, 2], object recognition [3] or image retrieval [4]. A number of feature detectors [1, 2, 5-9] have been proposed in the literature. As these detectors can be combined in order to improve recognition performance, it can be useful to add new features. In this paper we propose a new model of such features, based on the extraction of curvilinear regions, often simply called line extraction.
Curvilinear structure extraction is used in a huge number of application fields. In photogrammetric and remote sensing tasks, it is used to extract roads, railroads or rivers from satellite or low-resolution aerial imagery [10-12], which can be used for the capture or update of data for geographic information systems. Moreover, it is useful in medical imaging for the extraction of anatomical features or simply for the segmentation of medical images from CT or MR devices [13-15].
A few models [12, 14, 16-18] have been described for curvilinear structure detection, often application specific (for roads, vessels). Our aim here is to define a superset of these models, which can be used in the general case of object recognition and allows an affine-invariant detector to be built. Steger [18] classified previous work in curvilinear detection into three families. The first approach detects lines by considering only the gray values of the image and uses purely local criteria. The second approach regards lines as objects having parallel edges. The last family regards the image as a function Z(x,y) and extracts lines from it by using differential geometric properties. The basic idea behind these algorithms is to locate the positions of ridges and ravines in the image function. Steger described in his paper an explicit model for lines and line profile models. A scale space analysis is carried out for each of the models. This analysis is used to derive an algorithm in which lines and their widths can be extracted with subpixel accuracy. The algorithm uses a modification of the differential approach to detect lines and their corresponding edges.
In this paper, we propose an extension of Steger’s model which allows lines to be detected using the criteria of all three families described above. Based on observations of the real world, Sections 2 and 3 define the curvilinear structure. Section 4 presents the detector, which can be seen as a minimization problem, and details the components of the cost function. Following the approach of Steger, we define two main steps of detection: the profile detection and the detection of the whole curvilinear structure. The main new criteria we introduce are texture attributes inside and around the shape, and information about line profile symmetry and local curvature. In the last section, we present some results concerning the profile detection and some preliminary results concerning the whole detector.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 686–693, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Curvilinear Structure Definition
2.1 Geometry
We define a curvilinear region as a set of pixels delimited by left and right boundaries $\vec{l}(t)$ and $\vec{r}(t)$. In the continuous case, it can be described by $\{\vec{a}(t), w(t)\}$, as shown in Fig. 1. In the discrete case, it can be described by the set $c = \{(\vec{a}_i, w_i),\ i = 0, \ldots, L\}$, where $L$ is the length of the curvilinear region, $\vec{a}(t) = \frac{\vec{l}(t) + \vec{r}(t)}{2}$ is a vector defining the axis between the boundaries, and $w(t) = \|\vec{l}(t) - \vec{r}(t)\|$ is the width.
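In the discrete case the axis and width follow directly from sampled boundary points. The snippet below is a minimal sketch, assuming the left and right boundaries are given as equally long lists of corresponding 2D points; the function name is hypothetical and the width is taken as the Euclidean distance between corresponding boundary points.

```python
import math

def axis_and_width(left, right):
    """Return the discrete region c = [(a_i, w_i)] from corresponding boundary points."""
    region = []
    for (lx, ly), (rx, ry) in zip(left, right):
        axis_point = ((lx + rx) / 2, (ly + ry) / 2)     # a_i = (l_i + r_i) / 2
        width = math.hypot(lx - rx, ly - ry)            # w_i = |l_i - r_i|
        region.append((axis_point, width))
    return region
```

For instance, `axis_and_width([(0, 0), (0, 1)], [(4, 0), (2, 1)])` describes a region whose axis runs through (2, 0) and (1, 1) and whose width shrinks from 4 to 2.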
3 Appearance
In a 1D view, a curvilinear region can be described by $\vec{v}(t)$ (the “crossection”), the vector of pixel values on a linear segment centered at $\vec{a}(t)$, perpendicular to $\vec{a}'(t)$ and of length $w(t)$. $\vec{v}_l(t)$ (resp. $\vec{v}_r(t)$) is the vector of pixel values on a linear segment perpendicular to $\vec{a}'(t)$ and of length $k(t)$, with $k(t) < w(t)$.
Fig. 1. Geometric and appearance attributes
4 Detection of Lines in 1D
The problem of curvilinear detection using the previous model can be posed as a minimization problem: we define a cost function and search for its local minima over the space of all $(\vec{a}, w)$. The cost function is written $J(I, \vec{a}(t), w(t))$, where $I$ is the input image. The output is a set $S$ of $N$ curvilinear regions defined by $S = \{c_j,\ j = 0, \ldots, N\}$ and $c_j = \{(\vec{a}_i^{\,j}, w_i^j),\ i = 0, \ldots, L_j\}$, where $L_j$ is the length of the $j$-th curvilinear region.
4.1 Cost Function Components
Our cost function is composed of several constraints, chosen from observations of real pictures (Fig. 2). Observing a 1D profile (Fig. 3), we see that a curvilinear structure shows a large difference between the exterior (left and right) of the region and the “crossection”. This constraint can be defined as follows:
$$\left[1 - m(\vec{V}, \vec{V}_l)\right] \left[1 - m(\vec{V}, \vec{V}_r)\right] \approx 0,$$
where $m$ is, for example, the normalized Euclidean distance computed in Fourier space, defined by
$$m(f_l(t), f_r(t)) = \frac{1}{m_{max}} \sqrt{\sum_u \left| \hat{f}_l(u) - \hat{f}_r(u) \right|^2}.$$
This distance takes the local texture into account. A simpler alternative is the normalized modulus of the luminance gradient. The values on the left and right sides should be similar, which gives the relation
$$m(\vec{V}_l, \vec{V}_r) \approx 0.$$
Moreover, we can observe (Fig. 2 and Fig. 3) that the texture in the crossection is constant along the curvilinear region:
$$m(\vec{V}(t), \vec{V}(t + dt)) \approx 0.$$
Observing the 2D signal (Fig. 3), we can add some 2D constraints on width and curvature. The width of the curvilinear region is constant along the axis, implying
$$\frac{dw(t)}{dt} \approx 0.$$
Moreover, the local curvature $\gamma(t)$ should also be constant along the axis, and the boundaries are locally parallel and symmetric about the central axis:
$$\frac{d\gamma(t)}{dt} \approx 0 \quad \text{and} \quad \frac{d\vec{r}(t)}{dt} \approx \frac{d\vec{l}(t)}{dt}.$$
Fig. 2. Example of images containing curvilinear regions
Fig. 3. Observations on 1D section (annotations: similar borders, 2 high gradients, homogeneous section, curvilinear region)
4.2 1D Search Algorithm
Along the 1D profile of length imax, we have to minimize Q:
$$Q = 1 - \left( m(\vec{V}, \vec{V}_l)\; m(\vec{V}, \vec{V}_r) \left( 1 - m(\vec{V}_l, \vec{V}_r) \right) \right) \tag{1}$$
Since we want to detect the crossection at different scales, we have to compute Q for all possible crossection widths. In the case where there is only one curvilinear crossection along the profile, the multi-scale algorithm works as follows (pseudo code):
For i1 = 0 to imax
    For i2 = i1 + Δ to imax
        Compute the FFT of the left and right vectors of width Δ near i1 and i2.
        Compute the complex distances in Fourier space.
        Normalize the distances.
        Compute Q(i1, i2) according to Eq. 1.
Find the minimum: Qmin(Ia) = min(Q(i1, i2)) for i2: i1 -> imax and for i1: 0 -> imax, with Ia = (i1 + i2)/2
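The search loop can be sketched in plain Python as follows. This is only one possible reading of the pseudo code, under several assumptions that the paper does not state: windows of width Δ just outside and just inside each candidate boundary stand in for V_l, V_r and V, the distance m is the Euclidean distance between the complex DFTs of two windows, it is normalised by Δ·v_max (its largest possible value for samples in [0, v_max]), and the profile is assumed to contain at least 3Δ samples. All function names are hypothetical, and a naive DFT replaces the FFT to keep the sketch dependency-free.

```python
import cmath
import math

def dft(values):
    """Naive discrete Fourier transform (sufficient for short windows)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * u * k / n) for k, v in enumerate(values))
            for u in range(n)]

def m(a, b, v_max):
    """Normalised Euclidean distance between the DFTs of two equal-length windows."""
    dist = math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(dft(a), dft(b))))
    return dist / (len(a) * v_max)

def search_crossection(f, delta, v_max):
    """Return (Q_min, axis position Ia, width) for a single crossection in profile f."""
    n = len(f)
    best = None
    for i1 in range(delta, n - 2 * delta + 1):
        out_l, in_l = f[i1 - delta:i1], f[i1:i1 + delta]
        for i2 in range(i1 + delta, n - delta + 1):
            in_r, out_r = f[i2 - delta:i2], f[i2:i2 + delta]
            # Eq. (1): strong contrast at both borders, similar exteriors.
            q = 1 - (m(in_l, out_l, v_max) * m(in_r, out_r, v_max)
                     * (1 - m(out_l, out_r, v_max)))
            if best is None or q < best[0]:
                best = (q, i1, i2)
    q_min, i1, i2 = best
    return q_min, (i1 + i2) / 2, i2 - i1
```

On an ideal "gate" profile such as `[0.0] * 20 + [1.0] * 10 + [0.0] * 20` with `delta = 4`, the minimum is found at the two edges of the gate, so the returned axis position is the centre of the gate. The double loop is quadratic in the profile length, which is acceptable for short profiles but would call for cached window transforms on long ones.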
C. Lemaître, J. Miteran, and J. Matas
5 Experiments
5.1 Results of Cross-Section Detection
To validate the 1D model, we applied the algorithm presented above to artificial and real signals. The following figures show the original signal and the value of Q'(i) = 1 − Qmin(i). This value should be maximal when i = Ia, where Ia is the abscissa of the axis. The first signal simulates two textures of different frequencies and magnitudes. The detector response is maximal at the centre of the central texture (Fig. 4). We compared the result of our detector with that of the usual correlation with a “gate” signal¹. The maximum of Q'(i) is localized more precisely by our detector, since the correlation function gives multiple maxima in the area of the theoretical maximum.
Fig. 4. Original textured signal and detector response
The detector response for a 1D selection from a retina-vessel image is depicted in Fig. 5. This signal has a low contrast: all original values are between 95 and 127. Moreover, the edges of the vessel are not clearly defined. Despite these difficulties, the maximum of our detector is obtained at the centre of the vessel, with a high value close to the one obtained for the standard “gate” signal.
Fig. 5. Detector response in the case of retina vessel (curves: f(i) and Q'(i); horizontal axis: i)
¹ Of the central texture width.
The last example, presented in Fig. 6, is a selection of a road in an aerial picture. The selection was made inside a part of a forest. Despite the woods around the road, the value of the maximum is high (0.57) and is well localised on the centre of the road.
Fig. 6. Detector response in the case of road detection (curves: f(i) and Q'(i); horizontal axis: i)
5.2 1D Repeatability Study
In order to validate our detector, we carried out a repeatability study. The robustness is evaluated against point of view change (4 sets of images, each made of 14 points of view of the same scene) and against natural noise introduced by the camera sensor when increasing the sensitivity from 25 to 1600 ISO (one set of 20 images of one scene). Some examples are depicted in Fig. 7. For each image, 2 sections were analyzed by a human expert and by our algorithm in order to determine the centre position and the width of the section. The errors introduced by our method (absolute value of the differences between human measurements and filter responses) are presented in Table 1 and Table 2. The error measured on the axis position and on the curvilinear region width is from 1 to 2 pixels in many cases. The width values ranged from 8 to 180 pixels.
Fig. 7. Examples of images used for the repeatability study
Table 1. Repeatability study: robustness against point of view change
Scene | Absolute axis position error (pixels) | Absolute width error (pixels)
Exterior scene | 1.5 | 1.69
White cable on textured flat | 1.46 | 3
Grey cable on textured flat | 1.46 | 1.61
White cable on woodmake table | 2.04 | 1.92
Table 2. Repeatability study: robustness against natural noise
Scene | Absolute axis position error (pixels) | Absolute width error (pixels)
Interior scene | 1.5 | 4.08
5.3 2D Detection Preliminary Results
We applied a preliminary version of the 2D algorithm to real images. In the following figures, the segments of curvilinear regions are depicted using colors, and the axis is depicted in white. In the case of the retina picture (Fig. 8), it is possible to select a few families of regions, depending on parameters tuned by the user (for example, the width and the minimal length of the curvilinear region). As shown in Fig. 9, it is also possible to use the algorithm for matching. The two original pictures are taken from nearly opposite points of view. However, some common parts are detected in both cases (branches, parts of leaves), and could thus be used for a stereo matching process.
Fig. 8. Blood vessel segmentation (retina)
Fig. 9. Example of use of detector for stereo analysis
6 Conclusion
In this paper, we proposed a model for curvilinear structure detection, useful for object recognition, matching or image retrieval. Our approach is based on an extension of Steger’s model, introducing texture-based attributes and new constraints on the shape of the region to be detected. After the segmentation step, it is also possible to use the features of the model to classify the curvilinear regions found in the images. We proposed a first filter that detects the profile of curvilinear regions while taking textures into account, and we validated this filter using real images. However, the full detector requires a high computing time. We therefore plan to implement a simplified version of the method, in which the cost function will be computed only for a sampled set of pixels, for example those belonging to the edges.
References
[1] Tuytelaars, T., Gool, L.V.: Matching widely separated views based on affine invariant regions. International Journal of Computer Vision 59, 61–85 (2004)
[2] Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: British Machine Vision Conference (BMVC), pp. 384–393 (2002)
[3] Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 509–522 (2002)
[4] Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1349–1380 (2000)
[5] Kadir, T., Zisserman, A., Brady, M.: An affine invariant salient region detector. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 228–241. Springer, Heidelberg (2004)
[6] Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference, pp. 147–151 (1988)
[7] Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Computer Vision 65, 43–72 (2005)
[8] Moreels, P., Perona, P.: Evaluation of features detectors and descriptors based on 3D objects. In: International Conference on Computer Vision, pp. 800–807 (2006)
[9] Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1615–1630 (2005)
[10] Baumgartner, A., Steger, C., Mayer, H., Eckstein, W.: MultiResolution semantic and context for road extraction. In: Verlag, S. (ed.) Semantic Modeling for the Acquisition of Topographic Information from Images and Maps, Basel, Switzerland, pp. 140–156 (1997)
[11] Fischler, M.A., Tenenbaum, J.M., Wolf, H.C.: Detection of roads and linear structures in low-resolution aerial imagery using a multisource knowledge integration technique. Computer Graphics and Image Processing 15, 201–223 (1981)
[12] Geman, D., Jedynak, B.: An active testing model for tracking roads in satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 1–14 (1996)
[13] Walter, Klein, Massin, Zana: Automatic segmentation and registration of retinal fluorescing angiographies. In: CAFIA (2000)
[14] Géraud, T.: Segmentation of curvilinear objects using watershed-based curve adjacency graph. In: Perales, F.J., Campilho, A., Pérez, N., Sanfeliu, A. (eds.) IBPRIA 2003. LNCS, vol. 2652, pp. 279–286. Springer, Heidelberg (2003)
[15] Martinez-Perez, M.E., Hughes, A.D., Stanton, A.V., Thom, S.A., Bharath, A.T., Parker, K.H.: Segmentation of retinal blood vessels based on the second derivative and region growing. In: International Conference on Image Processing, Kobe, Japan (1999)
[16] Deschênes, F., Ziou, D.: Detection of line junctions in gray-level images. In: International Conference on Pattern Recognition (ICPR) (2000)
[17] Ziou, D.: Optimal line detector. In: International Conference on Pattern Recognition (ICPR), pp. 3762 (2000)
[18] Steger, C.: An unbiased detector of curvilinear structures. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 113–125 (1998)
A Method for Interactive Shape Detection in Cattle Images Using Genetic Algorithms
Horacio M. González-Velasco, Carlos J. García-Orellana, Miguel Macías-Macías, Ramón Gallardo-Caballero, and Fernando J. Álvarez-Franco
CAPI Research Group, University of Extremadura, Polytechnic School, Av. de la Universidad, s/n, 10071 Cáceres, Spain
[email protected] http://capi.unex.es
Abstract. Segmentation methods based on deformable models have proved to be successful with difficult images, particularly those using genetic algorithms to minimize the energy function. Nevertheless, they are normally conceived as fully automatic, and do not always generate satisfactory results. In this work, a method to include the information of fixed points within a contour detection system using point distribution models and genetic algorithms is presented. Also, an interactive scheme is proposed to take advantage of this technique. The method has been tested against a database of 93 cattle images, with a significant improvement in the success rate of the detections, from 61% up to 95%.
1 Introduction
Segmentation is one of the most important tasks in digital image analysis [1], as a step prior to other tasks related to the concrete application of an artificial vision system: feature extraction for classification, area and volume measurement, shape analysis, etc. Among all segmentation methods, those based on deformable models stand out. These methods have become very popular since their introduction in the eighties [2], and have been successfully applied to different segmentation tasks, particularly when dealing with very difficult images (noisy or with many structures and objects), such as medical images [3]. Deformable models normally involve the fitting of a contour to some structure or object within the image. This fitting is typically carried out through the minimization of a particular energy function, which is built from concrete characteristics extracted from the image. Though the calculus of variations was used for that minimization at the beginning, several authors have proposed the use of genetic algorithms (GA) for this task, both with “traditional” deformable models [4,5] and with statistically based active shape models [6,7] or similar approaches. Here we can also cite our previous work [8], where we proposed a method similar to Hill’s [7] for the extraction of cow contours in cattle images taken outdoors (see Fig. 1).
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 694–701, 2007. © Springer-Verlag Berlin Heidelberg 2007
However, fully automatic methods do not always generate completely satisfactory results. Sometimes the contour adjustment fails (even with completely wrong fittings), and these failures cannot be detected without human intervention. For instance, in our application [8] we obtained many errors due to failures in the adjustment of the limbs, because our modelled contour only considers the limbs in the foreground, and the animal is often positioned so that the limbs in both the foreground and the background can be observed, with a very similar shape. This supports the need for interactive approaches in which an operator can “guide” the segmentation process, as described in [9,10], and more recently in [11], where the same technique as ours is used to model the shapes (point distribution models, PDM [12]). In this work, we describe how the information of fixed points can be used in the contour search with Hill’s approach [7], which we applied in [8], so that this technique can be used in an interactive scheme. To quantify the improvement obtained in the search with fixed points, we carried out successive experiments over our database of 93 cattle images, considering two, one and no fixed points. The results obtained with fixed points are notably better. In section 2, the method used to fit contours with GA is presented, describing the shape modelling, the objective function and the GA configuration. We then explain in section 3 how the information of fixed points can be used to reduce our search space. The results obtained by applying the search with fixed points to our database are presented in section 4, and conclusions and possible improvements of our work are described in section 5.
2 Shape Detection Using GAs
The shape detection problem can be converted into an optimization problem in two steps: the parametrization of the shape we want to search for, and the definition of a cost function that quantitatively determines whether or not a contour is adjusted to the object. For the contour modelling and parametrization, the PDM technique [12] has been applied, which lets us restrict all the parameters and precisely define the search space. The proposed objective function, in turn, is designed to place the contour over areas of the image where edges corresponding to the object are detected.
2.1 Shape Modelling and Search Space
The first step of our method consists in appropriately representing the shape we want to search for, along with its possible variations. In some previous works [7,8,11] the general method known as PDM has been used. It consists of deformable models that represent the contours by means of ordered groups of points (Fig. 1) located at specific positions of the object, and that are constructed statistically from a set of examples. As a result of the process, the average shape of our object is obtained, and each contour can be represented mathematically by a vector x such that
Fig. 1. Figure (a) illustrates the images dealt with. Figure (b) shows the model of the cow contour in lateral position.
\[ x = x_m + P \cdot b \qquad (1) \]
where xm is the average shape, P is the matrix of eigenvectors of the covariance matrix and b is a vector containing the weights for each eigenvector, which properly defines the contour in our description. Fortunately, considering only a few eigenvectors corresponding to the largest eigenvalues of the covariance matrix (modes of variation, using the terminology in [12]), practically all the variations taking place in the training set can be described. This representation (model) of the contour is, however, in the normalized space. To project instances of this model into the space of our image, a transformation that preserves the shape is required (translation t = (tx, ty), rotation θ and scaling s). Thus, every permitted contour in the image can be represented by a reduced set {s, θ, t, b} of parameters. In order to have the search space precisely defined, the limits for these parameters are required. As stated in [12], to maintain a shape similar to those in the training set, the bi parameters must be restricted to values −√λi ≤ bi ≤ √λi, where λi are the eigenvalues of the covariance matrix. The transformation parameters can be limited considering that we approximately know the position of the object within the image: normally the animal is in horizontal position (angle restriction), not arbitrarily small (restriction in scale) and more or less centered (restriction in t). In our specific case (lateral position), the cow contour is described using a set of 126 points (Fig. 1), 35 of which are significant. The model was constructed using a training set of 45 photographs, previously selected from our database of 138 images, trying to cover the variations in position as much as possible. For those 45 photographs the contours of the animals were extracted manually, placing the points of Fig. 1 by hand. Once the calculations were made, 10 modes of variation were enough to represent virtually every possible variation (more than 90%).
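As an illustration of how a parameter set {s, θ, t, b} turns into image points, the following Python sketch clamps the weights to the stated limits, evaluates equation 1 and applies the shape-preserving transformation. It is a minimal sketch under stated assumptions: the function names and the toy two-point model are hypothetical, and P is stored as a list of mode vectors rather than as a matrix.

```python
import math


def clamp_weights(b, eigenvalues):
    """Restrict each b_i to the interval [-sqrt(lambda_i), +sqrt(lambda_i)]."""
    return [
        max(-math.sqrt(lam), min(math.sqrt(lam), bi))
        for bi, lam in zip(b, eigenvalues)
    ]


def model_shape(x_mean, modes, b):
    """Equation 1: x = x_mean + P b, with P given as a list of mode vectors."""
    x = list(x_mean)
    for weight, mode in zip(b, modes):
        for k, component in enumerate(mode):
            x[k] += weight * component
    return x


def to_image(x, s, theta, tx, ty):
    """Scale, rotate and translate (x0, y0, x1, y1, ...) into image space."""
    ax, ay = s * math.cos(theta), s * math.sin(theta)
    projected = []
    for xj, yj in zip(x[0::2], x[1::2]):
        projected += [tx + ax * xj - ay * yj, ty + ay * xj + ax * yj]
    return projected


# Toy model (hypothetical): two points and one mode moving the second point.
x_mean = [0.0, 0.0, 1.0, 0.0]
modes = [[0.0, 0.0, 1.0, 0.0]]
eigenvalues = [0.25]

b = clamp_weights([2.0], eigenvalues)  # 2.0 is outside the limits, so it is clamped
contour = to_image(model_shape(x_mean, modes, b), s=2.0, theta=math.pi / 2,
                   tx=10.0, ty=20.0)
print(contour)
```

Clamping is one simple way to keep a chromosome inside the search space; the source only states the limits, not how they are enforced.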
2.2 Objective Function and GA Configuration
Once the shape is parameterized, the main step of this method is the definition of the objective function. In several consulted works [6,7] this function is built so that it reaches a minimum when strong edges of similar magnitude are located near the points of a certain instance of the model in the image. We use a similar approach, generating an intermediate grayscale image that acts as a potential image (inverted, so we try to maximize the “energy”). This potential image could be obtained using a conventional edge detector. However, to avoid many local maxima in the objective function, it is important that the edge map does not contain many more edges than those corresponding to the searched object. This is not always possible because of background objects. In our method, a system for edge selection based on a multilayer perceptron neural network is applied, as described in [8]. With this potential image, we propose the following objective function to be maximized by the GA:
\[ f_L(X) = \left( \sum_{j=0}^{n-2} K_j \right)^{-1} \cdot \sum_{j=0}^{n-2} \left( \sum_{(X,Y) \in r_j} D_g(X, Y) \right) \qquad (2) \]
where X = (X0, Y0; X1, Y1; ...; Xn−1, Yn−1) is the set of points (image coordinates, integer values) that form one instance of our model, and can be easily obtained from the parameters using equation 1 and a transformation; rj is the set of points (pixels of the image) that form the straight line joining (Xj, Yj) and (Xj+1, Yj+1); Kj is the number of points in that set; and Dg(X, Y) is a Gaussian function of the distance from the point (X, Y) to the nearest edge in the potential image. This function is defined to have a maximum value Imax at distance 0 and must be decreasing, so that a value less than or equal to Imax/100 is obtained at a given reach distance DA, in our case 20 pixels. Regarding the GA, two aspects must be configured: the general structure (coding, selection method, etc.) and the values of the parameters that determine the algorithm (population size, mutation rate, etc.). To obtain the most suitable configuration for our problem, a methodology also based on GA was used. As a result, the best performance was obtained with a population of 2000 individuals, real coding, binary tournament selection (using linear normalization to calculate fitness), two-point crossover with Pc = 0.45, probability of mutation Pm = 0.4, and steady-state replacement with S = 60%.
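Equation 2 is the mean of Dg over all pixels of the segments joining consecutive contour points. The Python sketch below shows one way to evaluate it. It is a hedged illustration, not the authors' code: the value Imax = 255, the simple line-stepping routine and the `distance_to_edge` callable (standing in for a lookup in a precomputed distance map of the edge image) are all assumptions. The Gaussian width is derived from the stated condition Dg(DA) = Imax/100, which gives σ² = DA² / (2 ln 100).

```python
import math


def gaussian_potential(distance, i_max=255.0, reach=20.0):
    """D_g: i_max at distance 0, falling to i_max / 100 at `reach` pixels."""
    sigma_sq = reach ** 2 / (2.0 * math.log(100.0))
    return i_max * math.exp(-distance ** 2 / (2.0 * sigma_sq))


def segment_pixels(p, q):
    """Integer pixels on the straight line from p to q (simple DDA stepping)."""
    (x0, y0), (x1, y1) = p, q
    steps = max(abs(x1 - x0), abs(y1 - y0))
    if steps == 0:
        return [(x0, y0)]
    return [
        (round(x0 + (x1 - x0) * k / steps), round(y0 + (y1 - y0) * k / steps))
        for k in range(steps + 1)
    ]


def objective(points, distance_to_edge):
    """Equation 2: sum of D_g over all segment pixels divided by their count."""
    total, count = 0.0, 0
    for p, q in zip(points, points[1:]):  # segments j = 0 .. n-2
        pixels = segment_pixels(p, q)
        count += len(pixels)  # K_j
        total += sum(gaussian_potential(distance_to_edge(x, y)) for x, y in pixels)
    return total / count


# Hypothetical edge map: a single horizontal edge along y = 0.
def distance_to_edge(x, y):
    return abs(y)


print(objective([(0, 0), (10, 0), (20, 0)], distance_to_edge))
print(objective([(0, 5), (10, 5), (20, 5)], distance_to_edge))
```

A contour lying on the edge scores Imax, and the score decays smoothly as the contour moves away, which is what gives the GA a usable gradient of fitness around the true outline.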
3 Fixing Points to Drive the Search: Method
In the method described above, we encode the parameters {s, θ, t, b} in the chromosome and, in the decoding process, we calculate the projection of the contour into the image using those parameters. Let us denote by x = (x0, y0, x1, y1, ..., xn−1, yn−1)^T the vector with the points of the contour in the normalized space, and by X = (X0, Y0, X1, Y1, ..., Xn−1, Yn−1)^T the vector with the points of the contour projected into the image. The vector x can be calculated using equation 1, with the parameters b. In order to obtain the vector X, all the points must be transformed into the image space:
\[ \begin{pmatrix} X_j \\ Y_j \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + \begin{pmatrix} s \cos\theta & -s \sin\theta \\ s \sin\theta & s \cos\theta \end{pmatrix} \cdot \begin{pmatrix} x_j \\ y_j \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + \begin{pmatrix} a_x & -a_y \\ a_y & a_x \end{pmatrix} \cdot \begin{pmatrix} x_j \\ y_j \end{pmatrix} \qquad (3) \]
To include the information of fixed points, our proposal is not to encode some of the parameters in the chromosome, but to calculate them using the equations described above. We thereby reduce the dimensionality of the search space, making it easier for the algorithm to find a good solution.
Fixing one point. Consider that we know the position of the point p, (Xp, Yp). Then we can leave the translation parameters (tx, ty) out of the chromosome and calculate them using equations 1 and 3. With equation 1 we calculate the vector x, and applying equation 3 we obtain
\[ \begin{cases} X_p = t_x + a_x x_p - a_y y_p \\ Y_p = t_y + a_y x_p + a_x y_p \end{cases} \quad \text{and then} \quad \begin{cases} t_x = X_p - a_x x_p + a_y y_p \\ t_y = Y_p - a_y x_p - a_x y_p \end{cases} \qquad (4) \]
Fixing two points. Consider now that we know the position of the points p and q, (Xp, Yp) and (Xq, Yq). In this case, we can leave two more parameters out of the chromosome, s and θ, apart from (tx, ty). With equation 1 we again calculate the vector x, and applying equation 3 we obtain
\[ \begin{cases} X_p = t_x + a_x x_p - a_y y_p \\ Y_p = t_y + a_y x_p + a_x y_p \\ X_q = t_x + a_x x_q - a_y y_q \\ Y_q = t_y + a_y x_q + a_x y_q \end{cases} \qquad (5) \]
This is a system of linear equations in which tx, ty, ax and ay are the unknowns, and solving it is quite straightforward.
Fixing more than two points. If we know the position of more than two points, we have to leave some of the shape parameters (components of the vector b) out of the chromosome, thus restricting the allowable shapes that can satisfy the fixed points. Mathematically, this problem is more difficult, because applying equations 1 and 3 when m > 2 points are fixed, we obtain a non-linear system of 2m equations with 2m unknowns. It must be noted that this system has to be solved in the decoding process of the chromosome, before each calculation of the objective function. For this reason, the method used to solve the system should not take very much time, or the approach would be useless. Due to its complexity, we have not considered this situation in this work, leaving it as a future improvement of the system.
Interactive scheme. We propose to use an approach similar to [11]. First we run the GA search without fixed points. Then the user decides whether the result is acceptable or not. If not, the user also has to decide whether to fix two, one or zero points depending on the previous result, fix the points and run the algorithm again. This process can be repeated until an acceptable fitting is reached.
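The one-point and two-point cases of this section have closed-form solutions, sketched below in Python. The function names are hypothetical and the code is an illustration of equations 4 and 5 rather than the authors' implementation. For two fixed points, subtracting the equations of p from those of q eliminates the translation and leaves ΔX = ax Δx − ay Δy and ΔY = ay Δx + ax Δy, which can be solved directly for ax and ay; the translation then follows from equation 4.

```python
import math


def translation_from_one_point(model_p, image_p, ax, ay):
    """Equation 4: recover (tx, ty) when the image position of p is fixed."""
    (xp, yp), (Xp, Yp) = model_p, image_p
    return Xp - ax * xp + ay * yp, Yp - ay * xp - ax * yp


def pose_from_two_points(model_p, model_q, image_p, image_q):
    """Solve system 5 for (tx, ty, ax, ay) when p and q are both fixed."""
    dx, dy = model_q[0] - model_p[0], model_q[1] - model_p[1]
    dX, dY = image_q[0] - image_p[0], image_q[1] - image_p[1]
    norm_sq = dx * dx + dy * dy
    if norm_sq == 0:
        raise ValueError("the two model points coincide")
    ax = (dX * dx + dY * dy) / norm_sq
    ay = (dY * dx - dX * dy) / norm_sq
    tx, ty = translation_from_one_point(model_p, image_p, ax, ay)
    return tx, ty, ax, ay


def scale_and_rotation(ax, ay):
    """Recover s and theta from ax = s cos(theta) and ay = s sin(theta)."""
    return math.hypot(ax, ay), math.atan2(ay, ax)


# Model points in the normalized space and their fixed image positions.
tx, ty, ax, ay = pose_from_two_points((0.0, 0.0), (1.0, 0.0),
                                      (10.0, 20.0), (10.0, 22.0))
s, theta = scale_and_rotation(ax, ay)
print(tx, ty, s, math.degrees(theta))
```

Because the model points depend on b through equation 1, this computation has to be repeated for every chromosome during decoding; it is cheap enough that this adds little cost.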
4 Experiments and Results
We have tested our method with the 93 images (640x480 pixels) from our database that were not used in the construction of the shape model. Our primary goal was to test the performance of the system with and without fixed points and to compare the results. Besides, we were interested in determining whether we could solve the problems with the limbs in our application, described in the introduction and in [8] (see Fig. 2, b). In order to evaluate a resulting contour quantitatively, the mean distance (in pixels) between its points and their proper position has been proposed as a figure of merit. In our case, the distance was estimated using only seven very significant points of the contour. The points used were {20, 25, 40, 50, 78, 99, 114}, and are represented in Fig. 1. Those points were precisely located for all the images in the database. To test the performance of the method we considered the significant points referred to above and ran the GA search, without fixed points and with all the possible combinations of one and two fixed points (1 + 7 + 21 = 29 cases), over the 93 images. The results obtained are presented in Table 1 and in Fig. 3. Table 1 shows the best and worst results for zero, one and two fixed points. In order to define “success”, we applied a hard limit of ten pixels to the distance, which is enough for the morphological assessment application in which these contours are to be used [8]. More information can be found in the graph, where the percentage of successes is plotted against the distance used to define success. Looking at the best cases, we can see that fixing points is clearly helpful, because the results fixing the point 78 and the points 78 and 114 are significantly better than those obtained in the None case. However, the selection of the points to fix is very important, because the performance in the worst cases is very similar to the original (without fixed points). We can also observe that the best cases are obtained by fixing the points corresponding to the limbs (78 and 114, see Fig. 1), indicating that the main problem of the automatic approach was the location of those limbs, and that this problem can be solved almost completely with this technique. Finally, we have to remark that the results presented have been reached without human intervention, considering the same fixed points for all the images in each execution. However, human interaction can be decisive in selecting suitable points for each image, thus increasing the performance.
Fig. 2. Examples of the GA search without fixed points, with the final mean distance (in pixels) to the correct position: (a) Dist. 4.6; (b) Dist. 16.6
Fig. 3. Percentage of successes versus the distance used to define success, for different fixed points (curves: None; 78; 78, 114; 40; 50, 99)
Table 1. Number of successes and mean distance obtained by the method when it is applied to our database of 93 images
Fixed points | Successes | Mean distance
None | 57 (61.3 %) | 10.03
78 | 80 (86.0 %) | 6.52
40 | 52 (55.9 %) | 10.49
78, 114 | 89 (95.7 %) | 5.42
50, 99 | 54 (58.1 %) | 10.33
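The evaluation protocol of this section can be summarized in a short Python sketch: the figure of merit is the mean distance over the seven significant points, success is a threshold on that distance, and the 29 experimental cases are the choices of zero, one or two fixed points. The function names and the dictionary layout (point index mapped to coordinates) are assumptions made for illustration, as is treating a distance of exactly ten pixels as a success.

```python
import math
from itertools import combinations

SIGNIFICANT_POINTS = (20, 25, 40, 50, 78, 99, 114)


def mean_distance(found, reference):
    """Mean Euclidean distance (pixels) over the evaluated contour points."""
    return sum(math.dist(found[k], reference[k]) for k in reference) / len(reference)


def is_success(found, reference, limit=10.0):
    """A detection counts as a success when the mean distance is within `limit`."""
    return mean_distance(found, reference) <= limit


def fixed_point_cases(points=SIGNIFICANT_POINTS):
    """All choices of zero, one or two fixed points among the given points."""
    cases = [()]
    cases += list(combinations(points, 1))
    cases += list(combinations(points, 2))
    return cases


print(len(fixed_point_cases()))           # number of experimental cases
print(round(100.0 * 89 / 93, 1))          # success rate for 89 of 93 images
```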
5 Conclusions and Future Work
In this work, we have shown how to use the information of up to two fixed points in the detection of modelled shapes within images, using genetic algorithms. The method has been tested against a database of 138 cattle images, 45 of which were used to model the shape, and 93 to test the performance. As a result, more than 95 % of the detections were successful in the best case with two fixed points. In order to improve the method, two lines of work are currently being considered. First, we are approaching the problem of fixing more than two points, trying to approximately solve the resulting non-linear system at an affordable computational cost. Second, we are trying to improve the objective function through the generation of better potential images.
Acknowledgements This work has been supported in part by the Junta de Extremadura through project PRI06A227.
References
1. Jain, A.K.: Fundamentals of digital image processing. Prentice Hall, Englewood Cliffs (1989)
2. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. International Journal of Computer Vision 1, 321–331 (1988)
3. McInerney, T., Terzopoulos, D.: Deformable models in medical image analysis. Medical Image Analysis 1, 91–108 (1996)
4. Ballerini, L.: Genetic snakes for color image segmentation. In: Boers, E.J.W., Gottlieb, J., Lanzi, P.L., Smith, R.E., Cagnoni, S., Hart, E., Raidl, G.R., Tijink, H. (eds.) EvoIASP 2001, EvoWorkshops 2001, EvoFlight 2001, EvoSTIM 2001, EvoCOP 2001, and EvoLearn 2001. LNCS, vol. 2037, pp. 268–277. Springer, Heidelberg (2001)
5. MacEachern, L., Manku, T.: Genetic algorithms for active contour optimization. In: IEEE Proc. of the Int. Symp. on Circuit and Systems 4, 229–232 (1998)
6. Hill, A., Taylor, C.J.: Model-based image interpretation using genetic algorithms. Image and Vision Computing 10, 295–300 (1992)
7. Hill, A., Cootes, T.F., Taylor, C.J., Lindley, K.: Medical image interpretation: a generic approach using def. templates. Jour. of Med. Informatics 19, 47–59 (1994)
8. González, H., García, C., Macías, M., Gallardo, R., Acevedo, M.: Application of repeated GA to deformable template matching in cattle images. In: Perales, F.J., Draper, B.A. (eds.) AMDO 2004. LNCS, vol. 3179, Springer, Heidelberg (2004)
9. Barrett, W., Mortensen, E.: Interactive live-wire boundary extraction. Medical Image Analysis 1, 331–341 (1997)
10. Liang, J., McInerney, T., Terzopoulos, D.: Interactive medical image segmentation with united snakes. In: Taylor, C., Colchester, A. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI’99. LNCS, vol. 1679, pp. 116–127. Springer, Heidelberg (1999)
11. Ginneken, B.v., Bruijne, M.d., Loog, M., Viergever, M.: Interactive Shape Models. In: Proceedings of SPIE, vol. 5032, pp. 1206–1216 (2003)
12. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models – their training and application. Comp. Vision and Image Understanding 61, 38–59 (1995)
A New Phase Field Model of a ‘Gas of Circles’ for Tree Crown Extraction from Aerial Images
Peter Horváth¹,² and Ian H. Jermyn²
¹ University of Szeged, Institute of Informatics, P.O. Box 652, H-6701 Szeged, Hungary, Fax: +36 62 546 397
² Ariana (INRIA/I3S), INRIA, B.P. 93, 06902 Sophia Antipolis, France, Fax: +33 4 92 38 76 43
Abstract. We describe a model for tree crown extraction from aerial images, a problem of great practical importance for the forestry industry. The novelty lies in the prior model of the region occupied by tree crowns in the image, which is a phase field version of the higher-order active contour inflection point ‘gas of circles’ model. The model combines the strengths of the inflection point model with those of the phase field framework: it removes the ‘phantom circles’ produced by the original ‘gas of circles’ model, while executing two orders of magnitude faster than the contour-based inflection point model. The model has many other areas of application, e.g., to imagery in nanotechnology, biology, and physics.
1 Introduction
Due to the high cost of field studies, forestry services are increasingly turning to image processing to gather the information they need. An important problem in this context is the extraction of the region in the domain of any given image corresponding to tree crowns. Using this information, forestry services can compute or estimate, for example, the mean diameter of the crowns, the biomass, and so forth. If tree crown extraction from aerial images could be performed automatically, then, in addition, expensive semiautomatic and manual image processing could be avoided. In [1], we addressed this problem using ‘higher-order active contours’ (HOACs) [2], a new generation of active contours [3] allowing the incorporation of non-trivial prior knowledge about region geometry. Unlike most methods for incorporating prior geometric knowledge into active contours [4,5,6], HOACs do not necessarily constrain region topology, thereby allowing the detection of multiple instances of a single entity at no extra cost, a critical requirement for the current application. To extract tree crowns, the HOAC model was analysed theoretically to find parameter values favouring regions composed of a number of approximate circles of approximately a given radius, by making such regions minima of the energy. The result was the ‘gas of circles’ model [1]. When combined with a likelihood energy, the model works well, but suffers from some drawbacks: computation time is long, and the model can create ‘phantom circles’ in homogeneous areas of the image. In [7], the parameters were further fixed so that circles are energy inflection points: they are not in themselves stable, but even a small amount of appropriate image information can render them stable. This removed the second drawback but not the first. Rochery et al. [8] introduced an alternative formulation of HOACs, based on the ‘phase field’ framework used in physics. The basic phase field model is, to a good approximation, equivalent to a classical active contour, with energy given by boundary length. By adding a nonlocal term, phase field models equivalent to HOACs can be created. Phase fields have several advantages over other region modelling frameworks: a neutral initialization, obviating the need to choose an initial region; gradient descent based solely on the PDE arising from the model energy, thereby simplifying implementation, especially for HOACs; and enhanced topological freedom. The purpose of the present paper is to construct the phase field version of the model of [7], thereby benefitting from all the advantages of the phase field framework, and solving both drawbacks of the model in [1]. In particular, we will see a gain of two orders of magnitude in execution time compared to the contour formulation. In section 2, we give an overview of the original HOAC ‘gas of circles’ model. In section 3, we describe the phase field version. In section 3.2, we describe the phase field version of the HOAC inflection point ‘gas of circles’ model. In section 4, we apply the model to the tree crown extraction problem, and we conclude in section 5.
This work was partially supported by EU project MUSCLE (FP6-507752), Egide PAI Balaton, OTKA T-046805, and a HAS Janos Bolyai Research Fellowship. We thank the French National Forest Inventory (IFN) for the data.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 702–709, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Higher-Order Active Contours and the ‘Gas of Circles’ Model

HOACs can incorporate not only the local differential-geometric information included in classical active contours, but also more complex knowledge carried by long-range dependencies between tuples of contour points. These dependencies are expressed by multiple integrals over the contour. Combined with the classical region area and boundary length terms, one of the forms of Euclidean invariant quadratic HOAC energy is

$$E_{C,G}(R) = \lambda_C L(\partial R) + \alpha_C A(R) - \frac{\beta_C}{2} \iint dp\, dp'\; \dot\gamma(p) \cdot \dot\gamma(p')\, \Psi(r(p, p'), d) , \tag{2.1}$$

where $\partial R$ is the boundary of region $R$; $\gamma$ is an embedding defining $\partial R$, and $\dot\gamma$ is its derivative; $p$ is a coordinate on the domain of $\gamma$; $L$ is the region boundary length functional; $A$ is the region area functional; $r(p, p') = |\mathbf{r}(p, p')|$, where $\mathbf{r}(p, p') = \gamma(p) - \gamma(p')$; and $\Psi$ is an interaction function, described in [1], parameterized by $d \in \mathbb{R}^+$, that determines the geometric content of the model.

2.1 The HOAC ‘Gas of Circles’ Model

For certain parameter ranges, circles, with radius $r_0$ depending on the parameters, are stable configurations of $E_{C,G}$. These parameter ranges can be found by expanding $E_{C,G}$ to second order in a functional Taylor series about a circle [1]. The series is most simply expressed in terms of the Fourier components of the perturbation: the first order term $E_1$ is zero for non-zero frequency $k \neq 0$, while the second order term $E_2$ is diagonalized by the Fourier basis. Thus we require $E_1(r_0) = 0$ (energy extremum) and $E_2(k, r_0) > 0$ for all $k$ (energy minimum). These two quantities are also functions of the model parameters $\lambda_C$, $\alpha_C$, $\beta_C$, and $d$. The $E_1$ constraint determines $\beta_C$ in terms of the other parameters, while the $E_2$ constraint restricts the range of $\alpha_C$ for given $\lambda_C$ and $d$.

P. Horváth and I.H. Jermyn

2.2 The HOAC Inflection Point ‘Gas of Circles’ Model

The above model has a disadvantage when minimized using gradient descent. Circles of the stable radius, once created, cannot disappear again because they are local minima, even if they have no support from the data. This problem was solved in [7] by further constraining the parameters so that the energy function has an inflection point at $r_0$ with respect to changes of radius (i.e. $E_2(0, r_0) = 0$), rather than a minimum. In this case, the effect of the data can easily create an energy minimum, but otherwise the circle will vanish. This second constraint fixes both $\alpha_C$ and $\beta_C$ in terms of $r_0$, $d$, and $\lambda_C$. Since $r_0$ is fixed by the application, the only remaining parametric degrees of freedom are the value of $d$, and the overall strength of the prior term, represented by $\lambda_C$.

We also require $\alpha_C$ and $\beta_C$ to be positive. When combined with the other constraints [7], this means that there is only a small range of permissible values of $\hat r = r_0/d$, to wit $0.6897 \le \hat r \le 0.7827$. This effectively enables $d$ to be fixed, once $r_0$ is known. Thus despite the initial complexity of the model, we are able to fix all except one parameter, $\lambda_C$, representing the overall strength of the prior term.
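As a small illustration of how the permissible range of r̂ pins down d once r0 is known, the following sketch may help. The function names are hypothetical and not part of the original work; only the two bounds are taken from the text.

```python
# Bounds on r_hat = r0 / d quoted in section 2.2.
R_HAT_MIN, R_HAT_MAX = 0.6897, 0.7827


def interaction_range(r0, r_hat):
    """Return d = r0 / r_hat after checking that r_hat is permissible."""
    if not (R_HAT_MIN <= r_hat <= R_HAT_MAX):
        raise ValueError("r_hat must lie in [0.6897, 0.7827]")
    return r0 / r_hat


def permissible_d_interval(r0):
    """Return the (smallest, largest) d compatible with a desired radius r0."""
    return r0 / R_HAT_MAX, r0 / R_HAT_MIN
```

For a desired radius of 10 pixels, `permissible_d_interval(10.0)` gives a narrow interval of roughly 12.8 to 14.5, which is the sense in which d is "effectively fixed".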
3 Phase Fields and the ‘Gas of Circles’ Model

A phase field $\phi$ is a real-valued function on the image domain $\Omega$. Given a threshold $z$, a phase field determines a region by the map $\zeta_z(\phi) = \{x \in \Omega : \phi(x) > z\}$. Thus phase fields are a level set representation. The difference from the usual distance function level set representation is that the functions are not constrained: the set of possible $\phi$, $\Phi$, is a linear space. The simplest phase field energy is

$$E_0(\phi) = \int_\Omega dx \left\{ \frac{D}{2}\, \partial\phi \cdot \partial\phi + \lambda \left( \frac{1}{4}\phi^4 - \frac{1}{2}\phi^2 \right) + \alpha \left( \phi - \frac{1}{3}\phi^3 \right) \right\} .$$

With $D = 0$, $\phi_R \triangleq \arg\min_{\phi:\, \zeta_z(\phi) = R} E_0(\phi)$, i.e. the minimizing phase field for a given fixed region, would take the value 1 inside $R$ and −1 outside. The effect of $D \neq 0$ is to smooth $\phi_R$ so that it has an interface of finite width around $\partial R$. The phase field model is approximately¹ equivalent to a classical active contour in the sense that

$$E_0(\phi_R) \approx \lambda_C L(\partial R) + \alpha_C A(R) \triangleq E_{C,0}(R) . \tag{3.1}$$

Equation (3.1) means that gradient descent on $\phi$ using $E_0$ will duplicate gradient descent on $\gamma$ using $E_{C,0}$. The contour parameters are given by

$$\alpha_C = 4\alpha/3 , \quad \lambda_C^2 = 16 D \lambda K / 15 , \quad K = 1 + 5(\alpha/\lambda)^2 , \quad w^2 = 15 D / (\lambda K) . \tag{3.2}$$

¹ For a detailed discussion of why the approximation error does not matter, see [8].
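To make the phase field energy concrete, here is a minimal numerical sketch of E0 on a pixel grid, together with the thresholding map that turns a phase field into a region. It assumes central finite differences (`numpy.gradient`), a uniform grid spacing `h`, and a default threshold z = 0; none of these choices are specified in the text, and the function names are illustrative only.

```python
import numpy as np


def region(phi, z=0.0):
    """Region determined by the phase field: all pixels where phi > z."""
    return phi > z


def e0_energy(phi, D, lam, alpha, h=1.0):
    """Discretized E0: sum over pixels of the energy density times pixel area.

    Assumes a 2D array phi, central differences for the gradient and
    a uniform grid spacing h (an illustrative choice).
    """
    gy, gx = np.gradient(phi, h)
    grad_sq = gx ** 2 + gy ** 2
    density = (0.5 * D * grad_sq
               + lam * (phi ** 4 / 4.0 - phi ** 2 / 2.0)
               + alpha * (phi - phi ** 3 / 3.0))
    return float(density.sum() * h ** 2)
```

One quick consistency check: for the constant fields φ = +1 and φ = −1 the gradient term vanishes, and the energy difference per unit area is 4α/3, which is exactly the contour area weight α_C of (3.2).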
Rochery et al. [8] added the following nonlocal term to $E_0$:

$$E_{NL}(\phi) = -\frac{\beta}{2} \iint_{\Omega^2} dx\, dx'\; \partial\phi(x) \cdot \partial\phi(x')\, G(x - x') , \tag{3.3}$$
where $G(x - x') = \Psi(|x - x'|/d)$. With $\beta_C = 4\beta$, $E_g = E_0 + E_{NL}$ is equivalent (as usual, to a good approximation) to the HOAC model $E_{C,G}$, and can be used in its place, thus allowing the incorporation of non-trivial prior knowledge about region geometry while still profiting from all the advantages of the phase field framework.

3.1 The Phase Field ‘Gas of Circles’ Model

Using the relations between the phase field and contour parameters, we can now create a phase field ‘gas of circles’ model equivalent to the HOAC model described in section 2.1 (units are chosen so that $d = 1$):

1. Choose $w$. It cannot be too small, or a subpixel discretization will be needed, and it cannot be too large, or the phase field model will not be a good approximation to the HOAC model [8]. We have found that $w = 3$ or $w = 4$ works well.
2. Choose $\tilde\alpha_C \le \sqrt{5}/(2w)$. This constraint arises from inverting equations (3.2).
3. Determine the $\tilde\beta_C$ parameter corresponding to $r_0$ and $\tilde\alpha_C$ using the method in [1].
4. Set $\tilde\lambda = \frac{15}{8w}\left[1 + \sqrt{1 - 4\tilde\alpha_C^2 w^2/5}\right]$, $\tilde\alpha = 3\tilde\alpha_C/4$, $\tilde\beta = \tilde\beta_C/4$, and $\tilde D = w/4$.
5. Choose $\lambda_C$ and multiply $\tilde D$, $\tilde\lambda$, $\tilde\alpha$, and $\tilde\beta$ by it to get $D$, $\lambda$, $\alpha$, and $\beta$.

3.2 The Phase Field Inflection Point ‘Gas of Circles’ Model

We now want to combine the phase field ‘gas of circles’ model with the constraint that the circle energy have an inflection point rather than a minimum at the desired radius. This is a non-trivial requirement. The relations between the phase field and contour parameters were derived using an approximate ansatz for $\phi_R$ [8]. For the ‘gas of circles’ model, the approximations are not expected to be important, since small errors in the parameters will produce small changes in behaviour. However, an inflection point represents a set of measure zero in the parameter space. It is important to see whether these approximate parameter relations preserve the inflection point behaviour.

In the previous section, we chose $\tilde\alpha_C$ according to the constraint $\tilde\alpha_C \le \sqrt{5}/(2w)$. In section 2, we described how a given value of $\hat r$ fixes $\tilde\alpha_C$ and $\tilde\beta_C$. To satisfy both these constraints we need to choose $w \le \sqrt{5}/(2\tilde\alpha_C)$. To determine the parameters of the new phase field inflection point ‘gas of circles’ model, we therefore take the following steps:

1. Choose an $\hat r$ value satisfying $0.6897 \le \hat r \le 0.7827$. This fixes $\tilde\alpha_C$ and $\tilde\beta_C$.
2. Choose $w$ using the above criterion.
3. Determine $\tilde\lambda$, $\tilde\alpha$, $\tilde\beta$, and $\tilde D$ as before.
4. Multiply these parameters by $\lambda_C$.
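The parameter translation in the steps above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it takes the constraint as α̃_C ≤ √5/(2w), which is what inverting (3.2) gives, it assumes α̃_C and β̃_C have already been obtained by the methods of [1] and [7] (not reproduced here), and all function and key names are hypothetical. The second function applies (3.2) and β_C = 4β in the forward direction so that the translation can be checked by a round trip.

```python
import math


def phase_field_parameters(alpha_c_tilde, beta_c_tilde, w, lambda_c=1.0):
    """Translate unit-strength contour parameters into phase field parameters.

    alpha_c_tilde and beta_c_tilde are assumed to be given; lambda_c scales
    the overall strength of the prior.
    """
    if alpha_c_tilde > math.sqrt(5.0) / (2.0 * w):
        raise ValueError("need alpha_c_tilde <= sqrt(5) / (2 w)")
    disc = max(0.0, 1.0 - 4.0 * alpha_c_tilde ** 2 * w ** 2 / 5.0)
    tilde = {
        "D": w / 4.0,
        "lam": 15.0 / (8.0 * w) * (1.0 + math.sqrt(disc)),
        "alpha": 3.0 * alpha_c_tilde / 4.0,
        "beta": beta_c_tilde / 4.0,
    }
    # Final step: multiply every tilde parameter by lambda_C.
    return {name: lambda_c * value for name, value in tilde.items()}


def contour_parameters(D, lam, alpha, beta):
    """Forward map: equations (3.2) plus beta_C = 4 beta."""
    K = 1.0 + 5.0 * (alpha / lam) ** 2
    return {
        "alpha_C": 4.0 * alpha / 3.0,
        "lambda_C": math.sqrt(16.0 * D * lam * K / 15.0),
        "beta_C": 4.0 * beta,
        "w": math.sqrt(15.0 * D / (lam * K)),
    }
```

Feeding the output of `phase_field_parameters` back into `contour_parameters` should recover w, λ_C, and the contour weights scaled by λ_C, which is a convenient sanity check on the formulas in step 4.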
To test that the parameter transformations work in practice, we fixed a set of contour parameters corresponding to an inflection point at $r_0 = 10$, and then translated these parameters to give an equivalent phase field model. We then performed three gradient descent experiments using $E_g$. One used the value of $\beta$ corresponding to the inflection point, $\beta^*$, while the other two used $\beta$ values 4% above and below $\beta^*$. The region was initialized to a circle of radius $r_0$. Figure 1 shows the results. In the first row, with $\beta < \beta^*$, the region shrinks and disappears. In the second row, with $\beta > \beta^*$, the region grows until it reaches the corresponding energy minimum, at a radius of 12 pixels. In the third row, with $\beta = \beta^*$, the circle grows only very slightly, to a radius of 10.5 pixels. The inflection point behaviour is therefore preserved to a very good approximation.

Fig. 1. Preservation of the inflection point when translating from contour to phase field: gradient descent evolutions using $E_g$, starting from a circle of radius $r_0 = 10$, for values of $\beta$ close to $\beta^*$, the value giving an inflection point at $r_0 = 10$. Row 1: evolution using $\beta = 0.96\beta^*$; row 2: evolution using $\beta = 1.04\beta^*$; row 3: evolution using $\beta = \beta^*$.
4 Tree Crown Extraction

The tree crown extraction problem has been studied in several papers. Gougeon [9] uses a valley following method to delineate the crowns, while Larsen [10] introduces a species-specific method based on the matching of 3D tree templates. Both these, and other similar methods, look for local maxima of certain features. Perrin et al. [11] use a global method, modelling tree crown configurations as a marked point process. This has the advantage over our method that overlapping trees are easily handled, but the disadvantage that trees are represented by ellipses: their outlines are not found.

Our model for tree crown extraction is the sum of two terms: a prior energy $E_g$, as described above, and a likelihood energy, $E_i$, which we will now describe. We will model the image $I$, both in $R$ and in the background $\bar R$, using Gaussian distributions.² We add a term that predicts high gradients along the boundary $\partial R$:

$$E_i(I, R) = \int_\Omega dx \left\{ \lambda_i\, \partial I \cdot \partial\phi + \alpha_i \left[ \frac{(I - \mu)^2}{2\sigma^2}\, \phi_+ + \frac{(I - \bar\mu)^2}{2\bar\sigma^2}\, \phi_- \right] \right\} ,$$

where $\phi_\pm = (1 \pm \phi)/2$. Note that to facilitate comparison of parameters in the prior energy, we set $\lambda = 1$ in $E_g$ and introduce a weight $\alpha_i$ in $E_i$.

² We ignore the normalization constant $Z(R) = \int DI\, e^{-E_i(I,R)}$ since in our case it merely changes $\lambda$ and $\alpha$, and we are interested in stability of the posterior in the absence of image-dependent terms.
Fig. 2. (a) An image of poplars © IFN (0.6, 0.06, 0.33, 0.19, 2.5); (b) the result with the HOAC ‘gas of circles’ model with inflection point (1160, 0.7); (c) result with the phase field ‘gas of circles’ model (1111, 278, 177.2, 55.8, 30, 4); (d) result with the new model (1160, 0.7, 4)
4.1 Energy Minimization

We minimize $E = E_g + E_i$ using gradient descent. The functional derivative of $E_{NL}$ is

$$\frac{\delta E_{NL}}{\delta\phi}(x) = \beta \int_\Omega dx'\; \partial^2 G(x - x')\, \phi(x') . \tag{4.1}$$

The nonlocal force arising from the HOAC term makes gradient descent using $E_{C,G}$ complex to implement. One must extract the contour, perform numerical contour integrations, and extend the force off-contour [2]. In contrast, the force (4.1) arising from the nonlocal phase field term is simply a convolution. Implementation is thereby made much simpler, and execution much faster. The time for one iteration in the HOAC formulation scales as the square of the boundary length, which scales as the number of trees, which in turn scales as the number of pixels. Thus execution time for the HOAC formulation can be expected to scale as the number of pixels squared. In contrast, execution time for the phase field formulation scales linearly with the number of pixels.

4.2 Experimental Results

Our data are ∼0.5 m resolution colour infrared aerial images of the ‘Saône et Loire’ region in France provided by the French National Forest Inventory. The images show regularly and irregularly planted poplar forests. We compared the new phase field inflection point ‘gas of circles’ model (section 3.2) with the HOAC inflection point ‘gas of circles’ model (section 2.2) and with the phase field ‘gas of circles’ model with a minimum rather than an inflection point (section 3.1). The parameters $\mu$, $\sigma$, $\bar\mu$, and $\bar\sigma$ are learned from examples using maximum likelihood, and then fixed.³

Figure 2(a) shows an image (200 × 200) of a regularly planted poplar forest. In the top right and bottom left there are fields, while in the middle there are two different sizes of poplars. The aim is to extract the larger trees. Figure 2(b) shows the result with the HOAC inflection point ‘gas of circles’ model. The result is good, but the method found two small trees, and there is another false positive in the bottom left of the image. The execution time was 89 minutes. Figure 2(c) shows the result using the phase field ‘gas of circles’ model. There were no false negatives, but the model found false positives in the fields. Figure 2(d) shows the result with the new phase field inflection point ‘gas of circles’ model. All but two somewhat smaller circles were successfully found. For the phase field models, the execution time was less than 3 minutes.

Figure 3(a) shows an image (133 × 271) of a planted forest. Figure 3(b) shows the result with the phase field ‘gas of circles’ model. The result is good, with only a few joined tree crowns. Figure 3(c) shows the result with the new model. There are fewer joined tree crowns. Both results were obtained in less than 2 minutes.

Figure 4(a) shows a difficult image (129 × 139) to analyse. It has two fields with different intensities on the right. The result with the phase field ‘gas of circles’ model is shown in figure 4(b). This result clearly demonstrates the disadvantage of the non-inflection point model: phantom objects are created in the homogeneous areas. Figure 4(c) shows the result with the new model. The result is very good, with only one false positive. Both results were obtained in less than 1 minute.

³ In the figure captions, the image parameters are $(\mu, \sigma, \bar\mu, \bar\sigma, r_0)$; in the HOAC ‘gas of circles’ case, the prior parameters are $(\alpha_i, \hat r)$; in the phase field ‘gas of circles’ case they are $(\alpha_i, D, \lambda, \alpha, \beta, w)$; while in the case of the new model they are $(\alpha_i, \hat r, w)$.

Fig. 3. (a) An image of regularly planted poplar stands with less regularly planted trees in the upper part © IFN (0.8, 0.06, 0.43, 0.2, 3.5); (b) result with the phase field ‘gas of circles’ model (500, 50, 34.2, 9.3, 5.2); (c) result with the new model (500, 0.74, 4)

Fig. 4. (a) An image of regularly planted poplars with different fields in the right part © IFN (0.8, 0.06, 0.43, 0.2, 3.5); (b) result with the phase field ‘gas of circles’ model (500, 50, 34.2, 9.3, 5.2); (c) result with the new model (500, 0.74, 4)
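The remark in section 4.1 that the nonlocal force (4.1) is simply a convolution can be illustrated with an FFT-based sketch. Several things here are assumptions rather than the authors' implementation: periodic boundary conditions, unit pixel spacing, a spectral Laplacian, and above all the kernel, which is a Gaussian placeholder standing in for G because the interaction function Ψ of [1] is not reproduced in this paper. All names are hypothetical.

```python
import numpy as np


def placeholder_kernel(shape, d):
    """Hypothetical stand-in for G(x) = Psi(|x|/d): an isotropic Gaussian.

    This is NOT the interaction function of [1]; it only gives the
    convolution something to act on. The kernel is centred at the
    pixel (ny // 2, nx // 2).
    """
    ny, nx = shape
    y = np.arange(ny) - ny // 2
    x = np.arange(nx) - nx // 2
    r2 = y[:, None] ** 2 + x[None, :] ** 2
    return np.exp(-r2 / (2.0 * d ** 2))


def nonlocal_force(phi, beta, kernel):
    """beta * (Laplacian(G) convolved with phi), evaluated with FFTs.

    Assumes periodic boundaries and unit pixel spacing; the Laplacian
    is applied in Fourier space as multiplication by -|k|^2.
    """
    ny, nx = phi.shape
    ky = 2.0 * np.pi * np.fft.fftfreq(ny)
    kx = 2.0 * np.pi * np.fft.fftfreq(nx)
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    # Move the kernel centre to index (0, 0) before transforming.
    g_hat = np.fft.fft2(np.fft.ifftshift(kernel))
    force_hat = -k2 * g_hat * np.fft.fft2(phi)
    return beta * np.real(np.fft.ifft2(force_hat))
```

Since the transform of the kernel needs computing only once, each gradient descent iteration costs a forward and an inverse FFT of the phase field. That is close to linear in the number of pixels (up to a logarithmic factor for this FFT variant), in line with the scaling argument in section 4.1.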
5 Conclusion

We have described a phase field version of the HOAC inflection point ‘gas of circles’ model described in [7]. The model was applied to the problem of tree crown extraction from aerial images, a problem of great practical importance for the forestry industry. The model combines the strengths of the inflection point model with those of the phase field framework: it removes the ‘phantom circles’ produced by the original ‘gas of circles’ model, while executing two orders of magnitude faster than the contour model.
References

1. Horváth, P., Jermyn, I.H., Kato, Z., Zerubia, J.: A higher-order active contour model for tree detection. In: Proc. International Conference on Pattern Recognition (ICPR), Hong Kong, China (2006)
2. Rochery, M., Jermyn, I.H., Zerubia, J.: Higher-order active contours. International Journal of Computer Vision 69, 27–42 (2006)
3. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1, 321–331 (1988)
4. Cremers, D., Kohlberger, T., Schnörr, C.: Shape statistics in kernel space for variational image segmentation. Pattern Recognition 36, 1929–1943 (2003)
5. Foulonneau, A., Charbonnier, P., Heitz, F.: Geometric shape priors for region-based active contours. In: Proc. IEEE International Conference on Image Processing (ICIP), vol. 3, pp. 413–416 (2003)
6. Leventon, M., Grimson, W., Faugeras, O.: Statistical shape influence in geodesic active contours. In: Proc. IEEE Computer Vision and Pattern Recognition (CVPR), Hilton Head Island, SC, USA, pp. 316–322 (2000)
7. Horváth, P., Jermyn, I.H., Kato, Z., Zerubia, J.: An improved ‘gas of circles’ higher-order active contour model and its application to tree crown extraction. In: Kalra, P., Peleg, S. (eds.) ICVGIP 2006. LNCS, vol. 4338, Springer, Heidelberg (2006)
8. Rochery, M., Jermyn, I., Zerubia, J.: Phase field models and higher-order active contours. In: Proc. IEEE International Conference on Computer Vision (ICCV), Beijing, China (2005)
9. Gougeon, F.A.: Automatic individual tree crown delineation using a valley-following algorithm and rule-based system. In: Hill, D., Leckie, D. (eds.) Proc. International Forum on Automated Interpretation of High Spatial Resolution Digital Imagery for Forestry, Victoria, British Columbia, Canada, pp. 11–23 (1998)
10. Larsen, M.: Finding an optimal match window for Spruce top detection based on an optical tree model. In: Hill, D., Leckie, D. (eds.) Proc. International Forum on Automated Interpretation of High Spatial Resolution Digital Imagery for Forestry, Victoria, British Columbia, Canada, pp. 55–66 (1998)
11. Perrin, G., Descombes, X., Zerubia, J.: A marked point process model for tree crown extraction in plantations. In: Proc. IEEE International Conference on Image Processing (ICIP), Genova, Italy (2005)
Shape Signature Matching for Object Identification Invariant to Image Transformations and Occlusion Stamatia Giannarou and Tania Stathaki Communications and Signal Processing Group Imperial College London {stamatia.giannarou,t.stathaki}@imperial.ac.uk
Abstract. This paper introduces a novel shape matching approach for the automatic identification of real world objects in complex scenes. The identification process is applied to isolated objects and requires the segmentation of the image into separate objects, followed by the extraction of representative shape features and the similarity estimation of pairs of objects. In order to enable an efficient object representation, a novel boundary-based shape descriptor is introduced, formed by a set of one dimensional signals called shape signatures. During identification, the cross-correlation metric is used in a novel fashion to gauge the degree of similarity between objects. The invariance of the method to uniform scaling and partial occlusion is achieved by considering both cases as possible scenarios when correlating shape signatures. The proposed vision system is robust to ambient conditions (partial occlusion) and image transformations (scaling, rotation, translation). The performance of the identifier has been examined on a wide range of complex images and prototype objects.
1 Introduction

Recently there has been, within the image processing industry, an increasing emphasis on being able to extract from an image an object’s shape and characteristics. For example, in military applications it is essential to be able to extract an object’s shape for comparison with stored prototype shapes for target characterization. This work is concerned with the investigation of the fundamental principles and methods needed for an autonomous system which is able to recognize objects in digital images. Automatic object recognition is a hard image processing task comprising a great number of difficulties and requirements. A robust object recognition system should succeed in identifying objects which are shifted, rotated, distorted or partially occluded in the scene. In addition, the identification system should recognize objects regardless of their scale. The fact that the shape of objects in the real world varies depending on the scale of observation has significant implications for describing them. The aim of this work is to introduce a novel object identification system that meets the above invariance requirements.

In this contribution, we present a novel vision system for the recognition of real world objects in complex scenes which is robust to ambient conditions (partial occlusion) and image transformations (scaling, rotation, translation). The identification process is based on the comparison of assemblies of image regions with a previously stored view of a known prototype object. The method has two steps: pre-processing and recognition. The first involves the extraction of significant features from the image (such as object boundaries), the segmentation of the image into separate objects and the tracing of object contours. The second step takes into account a novel boundary-based shape descriptor formed by a set of one dimensional signals called shape signatures, to enable an efficient object representation. During identification, the cross-correlation and related metrics of object signatures are used to gauge the degree of similarity between two shapes, yielding a set of similarity terms. A test object is of no interest to us as long as its degree of similarity to the prototype remains below a suitable acceptance threshold.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 710–717, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Prior Work on Shape Recognition

The problem of object recognition is one of the most familiar and fundamental problems in image processing. There is a great variety of shape descriptors and associated similarity measures which emphasize different image properties such as pixel intensities, color, texture and edges. Comprehensive surveys on methods for shape analysis have been reported [9, 12, 14]. In general, object identification techniques can be classified into the two distinct categories presented below.

The first category includes techniques based on so-called interest point operators. In these approaches, image features which mainly indicate local saliency are generated by applying local shape descriptors to image regions, as for example with the Harris-Laplace detector [10]. The extracted features are used as input to a matching system that yields the final identification result. Representative interest point operators include the Scale Invariant Feature Transform (SIFT) approach developed by Lowe [10], Lindeberg’s blob detector [8] and the epipolar alignment of images taken from different viewpoints proposed in [15].

Techniques that belong to the second category are based on shape matching and are applied to isolated objects or pre-segmented images. They involve the segmentation of the image into separate objects, followed by shape representation and estimation of the similarity between pairs of objects. Common shape representation techniques of this category use a specific type of the so-called moment invariants, first introduced by Hu [7]. A more efficient descriptor is the generic Fourier descriptor (GFD) [13]. It is obtained by applying the two-dimensional Fourier transform to a polar-raster sampled shape image. A recent approach to shape representation is the Shape Context descriptor [2]. This descriptor is applied to edge point locations and yields a 2D histogram describing the edge distribution in the surrounding region of each contour point. Another representative example is the shape signature approach. The signature is a one dimensional function derived from the boundary points of the shape. Common shape signatures include centroid distance, tangent angle and area [5, 11].

In this work, the object representation is based on shape signatures extracted from the object’s boundary. The paper is organized as follows. A new boundary-based object descriptor is introduced in the next section. A novel identification method is proposed in Section 4. Experimental results of applying the identifier to complex images and prototype objects are presented in Section 5. We discuss the performance of the identifier and conclude with Section 6.
3 Shape Representation Using a Set of Signatures

In this section, we focus on 2D shape representation and, based on shape signatures, we propose a novel descriptor that captures the object’s boundary information under the constraint of translation, rotation and uniform-scaling invariance properties. A number of preprocessing steps applied to the complex and the prototype shape image are prerequisite for the object representation. To begin with, the recovery of the shape descriptor requires the extraction of the edge map (collection of edge pixels) of the image of interest. This can be easily obtained by applying an edge detection process (for example the Canny operator [3]) to the image. Since the proposed descriptor is applied to isolated objects, the so-called "Connected-Component Labeling" technique described in [6] is employed to segment the image into separate objects. The detected connected components represent separate objects in the image. In order to arrange the contour points of an object as an ordered sequence of adjacent pixels, each connected component becomes input to a contour tracer. The boundary tracing approach employed in this work is based on the one proposed in [4]. The output of the algorithm is an array containing the co-ordinates of the boundary traced in a clockwise direction.

Let an object $P$ be represented by a set of traced contour (edge) points (pixels) $\{(x_1, y_1) \ldots (x_n, y_n)\}$, where $x$ and $y$ stand for the x and y-coordinates of the points on the image plane. The notation $P \equiv \{(x_1, y_1) \ldots (x_n, y_n)\}$ may be used. Prior to the analysis of the proposed shape descriptor, we introduce the shape signature $s_k$, defined as the set of the distances between the points of the traced boundary sequence and a fixed point on the contour. Mathematically, the above signature is expressed as:

$$s_k(i) = \sqrt{(x_i - x_k)^2 + (y_i - y_k)^2} , \quad \text{for } i = 1 \ldots n \tag{1}$$

From the above definition it is clear that the signature $s_k$ depends on the reference point $(x_k, y_k)$; a different reference point will result in a different signature for the same object. In order to enable an accurate shape representation, we propose a novel shape descriptor which takes into consideration all the signatures yielded from an object’s contour. For the object $P$, the proposed shape descriptor $S$ is defined as:

$$S = \{s_k , \ \text{for } k = 1 \ldots n\}$$

where $s_k$ is the distance function defined by (1). According to this representation, each object will be described by a set, $S$, of shape signatures generated by calculating the distance functions $s_k$ with respect to different reference points. The length of the signature set $S$ will be equal to the length of the traced boundary sequence of the object.

Since the distances in (1) are calculated with respect to points on the shape, the descriptor is invariant to changes in the location of the test object in the image. In addition, since all the contour points are used as reference points, the shape descriptor of a rotated object will include the same signatures as the descriptor of the original one. It can be proven that uniform scaling of an object does not affect the shape of the signature $s_k$; only its amplitude and its length are multiplied by a factor proportional to the scaling factor. Therefore, as will be shown later on, when combined with a suitable similarity measure, the proposed shape descriptor is also invariant to uniform scaling of the test object.

The above descriptor is also robust to ambient conditions, such as partial occlusion. Assume an object $P \equiv \{(x_1, y_1) \ldots (x_n, y_n)\}$ represented by the signature set $\{s^p_l , \ \text{for } l = 1 \ldots n\}$ and an occluded version of it, $C \equiv \{(x_1, y_1) \ldots (x_m, y_m)\}$, described by the set $\{s^c_k , \ \text{for } k = 1 \ldots m\}$, where $m < n$. Since $C$ is an occluded version of $P$, the former’s contour will be part of the latter’s. That means a signature $s^c_i$ from the set $\{s^c_k , \ \text{for } k = 1 \ldots m\}$ will be part of another one, $s^p_j$, from the set $\{s^p_l , \ \text{for } l = 1 \ldots n\}$. Thus, during retrieval $s^c_i$ and $s^p_j$ will exhibit a good partial (local) match.
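As a minimal sketch of the descriptor, the whole signature set can be computed as a matrix of pairwise distances between traced contour points, with row k holding the signature for reference point k. The function name is illustrative, and the contour is assumed to be already traced and ordered (edge detection, labelling and tracing are not shown).

```python
import numpy as np


def signature_set(contour):
    """Return an (n, n) array whose row k is the signature s_k.

    contour: sequence of n traced boundary points (x, y), in order.
    Entry [k, i] is the Euclidean distance between point i and the
    reference point k, following the signature definition in (1).
    """
    pts = np.asarray(contour, dtype=float)
    diff = pts[None, :, :] - pts[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Because only differences between contour points enter, translating or rotating the points leaves the matrix unchanged, while multiplying the coordinates by a factor multiplies every entry by that factor. In a real image a scaled object also has a longer traced contour, which this coordinate-only check does not capture.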
4 Shape Similarity Evaluation

As we explained in the previous section, invariance to rotation and translation is intrinsic to the proposed shape descriptor. When measuring shape similarity, special treatment is given to achieve invariance to uniform scaling and partial occlusion as well. In both of the aforementioned cases (scaling and occlusion) the length of the signature changes. Since the shape of a signature is not seriously affected by changes in the object’s size, uniform scaling could be handled in similarity estimation by normalizing the compared signatures and making them of the same length. On the other hand, we could deal with partial occlusion by comparing the shorter signature with parts of the longer one. However, the proposed vision system for automatic object recognition is blind and it is not possible to determine in advance the transformation applied to the test object, i.e., whether the object is scaled or occluded. For that reason, if the lengths of the compared shape signatures are different, the proposed method for similarity estimation evaluates both cases. For the final decision, the outcomes from both cases are considered and compared.

Let a prototype object $P$ consisting of $N$ boundary points be represented by the set of shape signatures $S_P \equiv \{S_{prot_i} , \ \text{for } i = 1 \ldots N\}$, where $S_{prot_i}$ is the distance function with reference to the contour point $(x_i, y_i)$ on the boundary of $P$. Similarly, let a test object $T$ of $M$ boundary points be described by the set $S_T \equiv \{S_{test_j} , \ \text{for } j = 1 \ldots M\}$. Without loss of generality, let us assume that $M < N$. The similarity between $P$ and $T$ is estimated by comparing the $N \times M$ signature pairs $(S_{prot_i}, S_{test_j})$.

In the first step of similarity extraction, the smaller of the compared objects, $T$, is treated as a scaled version of the bigger one, $P$, and the maximum of the Normalized Circular Cross-Correlation (NCCC) is employed to gauge the degree of similarity between each signature pair $(S_{prot_i}, S_{test_j})$. Since the above measure requires the compared signatures to be of the same length, interpolation is employed to stretch the $S_{test_j}$ signatures and make them of size $N$, yielding the signatures $SI_{test_j}$. The Normalized Circular Cross-Correlation of a signature pair $(S_{prot_i}, SI_{test_j})$ is defined as:

$$NCCC_{ij}(k) = \frac{\sum_n \left(S_{prot_i}(n) - \bar S_{prot_i}\right)\left(SI_{test_j}(n - k) - \overline{SI}_{test_j}\right)}{\sqrt{\sum_n \left(S_{prot_i}(n) - \bar S_{prot_i}\right)^2 \sum_n \left(SI_{test_j}(n - k) - \overline{SI}_{test_j}\right)^2}}$$

for $k = -N + 1 \ldots N - 1$, $i = 1 \ldots N$, $j = 1 \ldots M$, where $n = 1 \ldots N$ is the signature point index and $\bar S_{prot_i}$, $\overline{SI}_{test_j}$ stand for the mean values of $S_{prot_i}$ and $SI_{test_j}$, respectively. The similarity of a signature pair $(S_{prot_i}, SI_{test_j})$ is given by:

$$C_{ij} = C(S_{prot_i}, SI_{test_j}) = \max_k \{NCCC_{ij}(k)\}$$
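The scaled-object comparison can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the text only says that "interpolation" is used to stretch the shorter signature, so linear interpolation via `numpy.interp` is an assumed choice; the circular correlation is evaluated with FFTs for brevity; and the function names are hypothetical.

```python
import numpy as np


def stretch(signature, length):
    """Resample a signature to the given length (linear interpolation assumed)."""
    s = np.asarray(signature, dtype=float)
    old_positions = np.arange(len(s))
    new_positions = np.linspace(0.0, len(s) - 1.0, length)
    return np.interp(new_positions, old_positions, s)


def nccc_max(s_prot, s_test):
    """Maximum over circular shifts of the normalized cross-correlation.

    The test signature is first stretched to the prototype length, both
    signatures are mean-subtracted, and the circular cross-correlation
    is computed for all shifts at once with FFTs.
    """
    a = np.asarray(s_prot, dtype=float)
    b = stretch(s_test, len(a))
    a0 = a - a.mean()
    b0 = b - b.mean()
    denom = np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum())
    if denom == 0.0:
        return 0.0  # a constant signature carries no shape information
    corr = np.fft.ifft(np.fft.fft(a0) * np.conj(np.fft.fft(b0))).real
    return float(corr.max() / denom)
```

With circular indexing, shifts k and k ± N coincide, so evaluating the N distinct shifts covers the range of k given in the definition. A signature compared with a circularly shifted, amplitude-scaled copy of itself yields a value of 1, which is the invariance the measure is designed to provide.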
The use of circular cross-correlation eliminates the effect of the unknown shift between the starting points of the two contours, while the normalization makes the similarity measure invariant to differences in the signatures’ amplitude due to scaling. According to the above object comparison, for each contour point on the smaller object associated with a signature $S_{test_j}$, there will be a point on the bigger object associated with a signature $S_{prot_i}$ which bears the highest similarity to $S_{test_j}$ among all the signatures corresponding to the bigger object’s contour points. Therefore, each contour point $j$ on the smaller object is assigned a value (equal to $\max_i\{C_{ij}\}$) indicating the maximum similarity of the two objects with reference to point $j$.

Our main idea for similarity estimation stems from the fact that, when comparing two similar objects differing only in size, the $\max_i\{C_{ij}\}$ values will be high (ideally equal to 1) for all the contour points. On the other hand, when comparing dissimilar objects, even if a few signature pairs are highly correlated, the majority of contour point pairs are associated with dissimilar signatures, yielding low $\max_i\{C_{ij}\}$ values. In order to make the similarity estimation more robust, the degree of similarity between two shapes is expressed with respect to two terms, namely the maximum and minimum similarity values assigned to the smaller object’s contour points. Therefore, the similarity between the objects $P$ and $T$, when the latter is treated as a scaled version of the former, is determined by the terms:

$$V_{w\,max} = \max_j\{\max_i\{C_{ij}\}\} \quad \text{and} \quad V_{w\,min} = \min_j\{\max_i\{C_{ij}\}\} \tag{2}$$
When comparing similar objects, both the $V_{w_{\max}}$ and $V_{w_{\min}}$ terms should be high. For that reason, as explained later on, the product of these terms is considered for the final decision on the objects' similarity. The second step of similarity evaluation treats the smaller object as an occluded version of the larger one. In that case, we accept that if objects P and T match, one of the signatures $\{S_{test_j},\ j = 1 \ldots M\}$ will be part of another one from the set $\{S_{prot_i},\ i = 1 \ldots N\}$. Note that this postulation is not valid if the test object has undergone both uniform scaling and partial occlusion. The signature comparison is achieved by cross-correlating each $S_{test_j}$ signature with frames of length $M$, $\{FS^f_{prot_i},\ f = 1 \ldots (N - M)\}$, of each $S_{prot_i}$ signature. So, the degree of similarity, $D_{ij}$, for the pair $(S_{prot_i}, S_{test_j})$ is defined as the maximum of the Normalized Cross-Correlation (NCC) values across all the frames. That is:

$$D_{ij} = \max_f \{NCC(FS^f_{prot_i}, S_{test_j})\}, \quad i = 1 \ldots N,\ j = 1 \ldots M,\ f = 1 \ldots (N - M)$$
where $NCC(r, q)$ represents the maximum value of the Normalized Cross-Correlation between the signatures $r$ and $q$. Following the previous analysis, the object similarity for the case of treating the smaller object as an occluded version of the larger is defined by the terms:

$$V_{o_{\max}} = \max_j \{\max_i \{D_{ij}\}\} \quad \text{and} \quad V_{o_{\min}} = \min_j \{\max_i \{D_{ij}\}\} \qquad (3)$$
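The frame-based comparison behind $D_{ij}$ can be sketched as follows. This is a simplified reading, not the authors' implementation: each frame is compared with the test signature at zero lag with mean removal, and every frame that fits inside the prototype signature is tried. The names `ncc` and `occlusion_similarity` are hypothetical.

```python
import math

def ncc(r, q):
    """Zero-lag normalized cross-correlation of two equal-length sequences."""
    n = len(r)
    mean_r, mean_q = sum(r) / n, sum(q) / n
    r0 = [x - mean_r for x in r]
    q0 = [x - mean_q for x in q]
    denom = math.sqrt(sum(x * x for x in r0) * sum(x * x for x in q0))
    return sum(x * y for x, y in zip(r0, q0)) / denom if denom > 0.0 else 0.0

def occlusion_similarity(s_prot, s_test):
    """Best NCC of s_test against every frame of len(s_test) taken from s_prot."""
    n, m = len(s_prot), len(s_test)
    if m == 0 or m > n:
        raise ValueError("the test signature must be non-empty and no longer than the prototype")
    # Slide a window of length m along the prototype signature.
    return max(ncc(s_prot[f:f + m], s_test) for f in range(n - m + 1))
```

For a test signature that is an amplitude-scaled excerpt of the prototype signature, one frame matches exactly and the function returns a value of 1 up to rounding.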
Shape Signature Matching for Object Identification Invariant
[Figure 1, panels (a)–(e): images and shape-signature plots. Legend entries: Original Object, Scaled Object, Original Object Signature, Stretched Signature of Scaled Object.]
Fig. 1. (a) Complex and (b) prototype shape images. (c) The most similar shape signatures for the airplane objects. Comparison of the airplanes' shape signatures when treating the smaller object as (d) a scaled and (e) an occluded version of the bigger one.
Having defined the similarity terms for a pair of objects in equations (2) and (3), a decision has to be made on whether the two objects are similar or not. This decision is based on the ratio of the similarity terms, defined as:

$$R_s = \frac{V_{o_{\max}} \cdot V_{o_{\min}}}{V_{w_{\max}} \cdot V_{w_{\min}}} \qquad (4)$$
The value of $R_s$ is indicative of the transformation (uniform scaling or partial occlusion) that has been applied to the test object. If $R_s$ is greater than 1, the test object is taken to be an occluded version of the prototype, provided that the $V_{o_{\min}}$ term exceeds a predefined threshold, $Thr_o$. Likewise, if $R_s$ is less than 1, the value of $V_{w_{\min}}$ is compared with the threshold $Thr_w$ to determine whether the test object is a scaled version of the prototype or not. In either case, if the examined similarity term survives the thresholding process, the target object is present in the complex scene at a location indicated by the x- and y-coordinates of the contour points of T; otherwise the test object is of no interest to us.
5 Experimental Results

Generally, the signature correlation between two shapes tends to exhibit high values, even if the shapes correspond to different objects. This observation is even more pronounced for man-made objects (for example, aircraft), the type of objects we are mainly
S. Giannarou and T. Stathaki
interested in in this work. An obvious reason is that a large part of the contour of a man-made object consists of straight lines or low-curvature parts. Therefore, when estimating the signature correlation between two man-made objects, these parts will most probably be present in both objects, raising the value of the signature correlation. Moreover, when an object is treated as an occluded version of another object using the technique presented in this work, it is again likely that the signature of the occluded object will be highly correlated with a signature frame of the original non-occluded object, even if the two objects are dissimilar. The reason is that a part of a man-made object is likely to exhibit considerable similarity with a part of any other man-made object. This effect is stronger if the occluded object contains fewer salient features, i.e., high activity points. However, in the proposed system we observed that although signature correlations are generally high, the correct scenario consistently provides better results (higher correlations) than the false scenario. The similarity thresholds were experimentally set to $Thr_w = 0.95$ and $Thr_o = 0.85$. In this work, we have applied the proposed object identification algorithm to a wide range of complex image and prototype object selections [1]. A representative simulation is illustrated in Figure 1 for the identification of a Harrier airplane in a synthetic image containing three different objects: a bird, a cat and a scaled Harrier airplane (half the size of the target). In Figure 1(d) we present the comparison of the signature pair $(S_{prot_i}, S_{test_j})$ that gives the maximum similarity value ($V_{w_{\max}}$), when treating the smaller object as a scaled version of the bigger one.
Figure 1(e) illustrates the comparison of the signature pair that gives the maximum similarity value ($V_{o_{\max}}$), when treating the smaller object as an occluded version of the bigger one. The comparison of the shape signatures yielded a specific signature pair with the highest similarity, illustrated in Figure 1(c), corresponding to the Harrier prototype and the scaled Harrier test objects, respectively. The two signatures have different lengths but the same shape. The results of evaluating the similarity terms ($V_{w_{\max}}$, $V_{w_{\min}}$, $V_{o_{\max}}$, $V_{o_{\min}}$) for each pair of test-prototype objects are: for Bird-Prototype (0.98, 0.80, 0.96, 0.78), for Cat-Prototype (0.96, 0.70, 0.95, 0.81) and for Harrier-Prototype (0.99, 0.96, 0.97, 0.89). The ratio of similarity terms in the case of the Bird-Harrier pair is equal to $R_{s_B} = 0.95 < 1$ and, since $V_{w_{\min}} < Thr_w$, the two objects are not similar. Likewise, for the Cat-Harrier pair $R_{s_C} = 1.14 > 1$ and $V_{o_{\min}} < Thr_o$, so these objects are not similar either. The similarity term evaluation for the two Harrier objects results in $R_{s_H} = 0.9 < 1$ and $V_{w_{\min}} > Thr_w$, verifying that the compared objects are similar and differ only in size, which agrees with the actual scenario. The above results emphasize the ability of the proposed approach to distinguish between similar and dissimilar objects and to determine the transformation (uniform scaling or occlusion) that has been applied to the test objects.
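The decision rule of equation (4) can be checked against the reported similarity terms with a short sketch. The function name `classify_pair` and the label strings are hypothetical; the default thresholds are the experimentally chosen values quoted above, and the behaviour for a ratio of exactly 1 is not specified in the text, so it is returned as "undecided" here.

```python
def classify_pair(vw_max, vw_min, vo_max, vo_min, thr_w=0.95, thr_o=0.85):
    """Return (R_s, label) for one test-prototype pair, following equation (4)."""
    rs = (vo_max * vo_min) / (vw_max * vw_min)
    if rs > 1.0:
        # Occlusion hypothesis: the weakest occlusion match must pass Thr_o.
        label = "occluded" if vo_min > thr_o else "no match"
    elif rs < 1.0:
        # Scaling hypothesis: the weakest scaling match must pass Thr_w.
        label = "scaled" if vw_min > thr_w else "no match"
    else:
        label = "undecided"  # not covered by the text; an assumption of this sketch
    return rs, label

# Similarity terms (V_w max, V_w min, V_o max, V_o min) as reported above.
REPORTED = {
    "Bird": (0.98, 0.80, 0.96, 0.78),
    "Cat": (0.96, 0.70, 0.95, 0.81),
    "Harrier": (0.99, 0.96, 0.97, 0.89),
}
```

Calling `classify_pair(*REPORTED["Harrier"])` gives a ratio below 1 together with the label "scaled", while the Bird and Cat pairs are rejected. The computed ratios agree with the reported values of 0.95, 1.14 and 0.9 to within the rounding of the two-decimal inputs.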
6 Conclusions

A new system for the identification of two-dimensional objects in complex scenes has been presented in this work. The identification method is a prototype-based one, where prior knowledge of the target is given by a prototype template consisting of a set of representative features. In order to make the proposed system invariant to rotation and
translation, a novel shape descriptor is introduced as the set of shape signatures extracted from the object’s contour and defined as the distance of the boundary points with reference to a fixed point on the contour. The similarity between two objects is estimated by cross-correlating signature pairs included in their shape descriptors. During the comparison process, two distinct cases are considered in order to achieve invariance of the identifier to uniform-scaling and partial occlusion. The robustness of the identifier is verified by the presented experimental results.
Acknowledgements The investigations which are the subject of this paper were initiated by Dstl under the auspices of the United Kingdom Ministry of Defence Systems Engineering for Autonomous Systems Defence Technology Centre.
References
1. http://www.lems.brown.edu/vision/software
2. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Machine Intell. 24(4), 509–522 (2002)
3. Canny, J.F.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Machine Intell. 8(6), 679–698 (1986)
4. Carter, J.R.: Boundary tracing method and system. European Patent Application EP341819A3 (1989)
5. Davies, E.R.: Machine Vision: Theory, Algorithms, Practicalities. Academic Press, London (1997)
6. Haralick, R., Shapiro, L.G. (eds.): Computer and Robot Vision, vol. I, pp. 28–48. Addison-Wesley, London, UK (1992)
7. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Transactions on Information Theory IT 8, 179–187 (1962)
8. Lindeberg, T.: Feature detection with automatic scale selection. International Journal of Computer Vision 30(2), 77–116 (1998)
9. Loncaric, S.: A survey of shape analysis techniques. Pattern Recognition 31(8), 983–1001 (1998)
10. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
11. Van Otterloo, P.J.: A contour-oriented approach to shape analysis. Prentice-Hall, Englewood Cliffs (1991)
12. Veltkamp, R.C., Hagedoorn, M.: State of the art in shape matching. Principles of visual information retrieval, pp. 87–119 (2001)
13. Zhang, D., Lu, G.: Shape based image retrieval using generic Fourier descriptors. Signal Processing: Image Communication 17(10), 825–848 (2002)
14. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37(1), 1–19 (2004)
15. Zhang, Z., Deriche, R., Faugeras, O.D., Luong, Q.T.: A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence 78(1–2), 87–119 (1995)
Junction Detection and Multi-orientation Analysis Using Streamlines Frank G.A. Faas and Lucas J. van Vliet Quantitative Imaging Group, Delft University of Technology, The Netherlands
[email protected]

Abstract. We present a novel method to detect multimodal regions composed of linear structures and to measure the orientations in these regions, i.e. at line X-ings, T-junctions and Y-forks. In such complex regions an orientation detector should unmix the contributions of the unimodal structures. In our approach we first define a (streamline) divergence metric and apply it to our streamline field to detect junctions. After this step we select all streamlines that intersect a circle of radius r around the junction twice, cluster the intersection points and compute the direction per cluster. This yields a multimodal descriptor of the local orientations in the vicinity of the detected junctions. The method is suited for global analysis and has moderate memory requirements.
1 Introduction

Several well known tools exist for the detection, localization and characterization of unimodal linear structures based on the structure tensor and the Hessian matrix. For multimodal structures the detection can still rely on the structure tensor [7], as demonstrated by the Harris-Stephens corner detector [8]. However, these methods cannot unmix the responses from the composing structures. Most detectors have a response which decreases strongly with decreasing angular separation between the composing structures. The filterbank approach [4,6,3] is a well known method for the analysis of multimodal regions. However, its excellent angular selectivity is combined with poor localization. The key observation behind our method is the following: any X-ing, T-junction or Y-fork is composed of a few unimodal structures (line or edge segments). At X, T and Y transitions a single measure for orientation, e.g. by means of the structure tensor, yields a weighted sum of the orientations of the constituent elements. Moving away from the center of the junction toward one of the unimodal structures, the measure will approach that of a unimodal structure. In this paper we present a streamline based method which connects the unimodal regions to multimodal regions, e.g. the arms of a junction to the junction itself. The streamlines are constructed from a vector field that represents the local structure. This method allows us to accurately locate junctions and crossings and at the same time measures the attributes of the underlying unimodal structures. The method offers a good angular selectivity within a relatively small analysis window.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 718–725, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. (a) Artificial Y-junction with orientation field (gray overlay) and three streamlines (green). (b) Orientation field with a streamline starting from the black dot in the center. (c) Sketch of a Y-junction (dashed lines) with some streamlines around the junction center. The circle labeled with r denotes the distance from the center at which the orientation field is sampled and the labeling of the streamlines is performed. The letters denote the (double) labeling of the streamlines.
2 Method

Our method comprises five steps. The first step is to find a suitable vector field which describes the local structure in a consistent way, see Fig. 1(a-b). Second, this vector field is used to calculate streamlines at each pixel position. Third, a streamline divergence metric is introduced which yields a high value for multimodal structures and a low value on simple linear structures. Fourth, a suitable threshold is applied to detect the multimodal structures. After merging fragmented responses, the center point of the junction is determined. Fifth, we proceed with the analysis of the streamlines in a region around the junction center, see Fig. 1(c). To that end we define a circle with radius r centered at the junction. A suitable streamline should cross this circle twice, i.e. once for each of the unimodal structures it connects. The points where the streamlines intersect the circle define the new endpoints. The set of all endpoints will give rise to clusters, i.e. one cluster for each unimodal structure in the junction. These clusters are analyzed and each streamline is assigned two labels, one for each unimodal structure it belongs to. For each cluster we compute the average direction over the endpoints of the streamlines that belong to the same cluster.

Step One: Vector Field

A suitable vector field in this framework follows the underlying local structure in the image and is smooth and continuous. The vector field should be well defined on both even and odd structures, i.e. respectively along the bottom of the valley and along the edge of linear structures. We assume, without loss of generality, a white background and dark foreground throughout this paper. The vector field in this paper is created by means of a set of rotated quadrature filters which are combined into a tensor representation. An eigensystem
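[Figure 2, panels (a)–(c): (a) dendrogram with vertical axis "distance", horizontal axis "streamline labels" and threshold $d_{sep}$; (b) streamlines P and Q with the points P(−s), P(0), P(s), Q(−s), Q(0), Q(s) and the distance $d_{pq}$; (c) two lines of width $w$ crossing at angle $\alpha$, with $d_{max}$ expressed in terms of $w$ and $\sin \frac{1}{2}\alpha$.]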
Fig. 2. (a) Typical dendrogram for an X-ing, depicting the cluster distance measure (vertical) composed of individual streamlines (horizontal). (b) Sketch of the distance measure $d_{pq}$ between the streamlines P and Q at a distance s from the respective starting points P(0) and Q(0). (c) Sketch of the intrinsic scale (here denoted by $d_{max}$) of the intersection of two lines of width w.
analysis gives the vector field. The tensor is constructed from quadrature filters as described by [9]. The quadrature filter is constructed by means of the generalized Hilbert transform

$$F_i = G(|u|, \sigma_F)\, H(u \cdot n_i)\, (u \cdot n_i)^2 \qquad (1)$$

with $G$ an isotropic Gaussian transfer function, $H$ the Heaviside function and the last factor a quadratic cosine term to select the radial polynomial as well as the angular response. The orientation vectors $n_i$ are defined as $n_i = [\cos(\frac{\pi}{3} i), \sin(\frac{\pi}{3} i)]^T$ with $i = 0, 1, 2$. The 2D tensor $T$ is defined as

$$T = \sum_{i=1,2,3} |q_i| (n_i n_i^T - I) \qquad (2)$$

with $q_i$ the response of the $i$th quadrature filter in the spatial domain. Now the vector field is given by the eigenvector belonging to the smallest eigenvalue.

Step Two: Streamlines

A streamline is a line which is everywhere tangent to the local flow field. Mathematically this can be stated as

$$\frac{dx}{ds} \propto u(x) \quad \text{and} \quad x(s_0) = x_0 \qquad (3)$$

with $u$ the vector field and $x(s)$ the parameterized streamline. Further, point $x_0$ denotes a point on the streamline, i.e. the point of origin. Here a streamline is defined as a curve which is tangent to the local structure. As even structures have no direction, the flow vector is equally well described by two antipodal vectors, creating possible phase jumps between neighboring points. To solve this phase jump problem locally, i.e. at position $x(s)$ on the streamline, the flow field is
given by the one of the two antipodal vectors which produces the smallest angle with respect to the tangent vector of the streamline at $x(s)$. As such the flow field is defined with respect to a local point of origin and the propagation direction of the streamline. Further, the start direction, i.e. $u(s_0)$, of the streamline is ambiguous as there is no history to determine the tangent vector. This is not a problem as the curve of interest is centered around the point $x_0$ and extends along the structure in both directions for a distance $l$, resulting in a streamline of length $2l$ with parameter $s$ running from $-l$ to $l$ along the curve, i.e. from one end to the other end. Here $s$ is the position along the contour measured from the center point. The streamlines are implemented by means of a first order method.

Step Three: Streamline Divergence Metric

Two streamlines originating from points close together, on say one arm of the junction, can end up in two different arms separated by a significant distance, $d_{pq}$, see Fig. 2(b). Our junction detector is based on this observation. For each pixel the streamlines in a neighborhood are analyzed and the maximum separation between the reference streamline and the streamlines in the neighborhood is determined. We compute the distance between two streamlines at a distance $l_d$ (measured along the streamline) away from the pixel to be inspected. Summing over the line distances in the neighborhood yields a high value in the proximity of a junction. The streamline, P, through point p has to be compared to the test streamline, Q, through point q, where q lies in the neighborhood of p. The first step in computing the maximum separation between two streamlines is to align them with respect to the streamline at the point of interest p. This is necessary because the 'positive' direction of the streamline is ambiguous along line structures in the image. After alignment the two streamlines converge in the negative direction and diverge in the positive direction. Now the distance between the two streams P and Q is defined as the distance between the points on the streams at a distance $l_d$ from their respective origins, measured in the aligned positive direction:

$$d_{pq} = |(P(l_d) - Q(l_d)) - (P(0) - Q(0))| \qquad (4)$$

where the second term is the translation vector between the streamline centers, see Fig. 2(b). Now we sum the distances between the streamline at p and those in the local neighborhood N to get a 'streamline divergence' measure $d_p$ at p:

$$d_p = \sum_{\forall q \in N} d_{pq} \qquad (5)$$

in which the size of the neighborhood N reflects the intrinsic scale of the constituting unimodal structures. The intrinsic scale is defined by the width of the line segments or the length of the edge slope. It is by definition larger than or equal to the support of the overall Point Spread Function of optics and filters and can be measured by the method of Dijk [2].
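Steps Two and Three can be sketched together as follows. The sketch assumes that the orientation field is available as a hypothetical callable `field(x, y)` returning a unit vector of arbitrary sign, uses a first-order (Euler) integrator, and reads the pairwise alignment as "flip Q's start direction to agree with P's, then take the direction in which the two streamlines separate most". The last point is an interpretation of the text, not a confirmed detail of the authors' implementation.

```python
import math

def trace(field, start, length, step=1.0, sign=1.0):
    """Euler streamline of arc length `length` from `start`; returns its points."""
    x, y = start
    ux, uy = field(x, y)
    tx, ty = sign * ux, sign * uy  # initial tangent, chosen by the caller
    points = [(x, y)]
    for _ in range(int(round(length / step))):
        ux, uy = field(x, y)
        if ux * tx + uy * ty < 0.0:
            ux, uy = -ux, -uy  # pick the antipodal vector closest to the tangent
        x, y = x + step * ux, y + step * uy
        tx, ty = ux, uy
        points.append((x, y))
    return points

def streamline_divergence(field, p, neighbours, l_d, step=1.0):
    """Sum over q of d_pq, equations (4) and (5), for the pixel p."""
    up = field(*p)
    total = 0.0
    for q in neighbours:
        uq = field(*q)
        # Align Q with P: flip Q's start direction if it opposes P's.
        sq = 1.0 if up[0] * uq[0] + up[1] * uq[1] >= 0.0 else -1.0
        best = 0.0
        for sp in (1.0, -1.0):
            pe = trace(field, p, l_d, step, sp)[-1]
            qe = trace(field, q, l_d, step, sp * sq)[-1]
            # |(P(l_d) - Q(l_d)) - (P(0) - Q(0))|
            d = math.hypot((pe[0] - qe[0]) - (p[0] - q[0]),
                           (pe[1] - qe[1]) - (p[1] - q[1]))
            best = max(best, d)  # the diverging side is taken as 'positive'
        total += best
    return total
```

On a field of parallel orientations the measure is zero even when the sign of the orientation vectors flips from pixel to pixel, whereas a fanning field produces a positive value.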
Step Four: Detection of Junctions

The aforementioned 'streamline divergence' measure also gives reasonably high responses in the (noisy) background. We suppress these points by multiplying with the certainty of the Hessian matrix, which is close to zero in the background. A measure based on the tensor T is rejected as its response is not selective and not localized enough due to the quadrature filters. Therefore we use the more localized certainty measure based on the Hessian matrix as presented in [1] to select our regions of interest, basically the valley regions. This certainty measure can be stated as

$$c = |f_{20}, f_{21}, f_{22}| = \sqrt{f_{xx}^2 + f_{yy}^2 - \tfrac{2}{3} f_{xx} f_{yy} + \tfrac{8}{3} f_{xy}^2} \qquad (6)$$

where $f_{2i}$, with $i \in \{0, 1, 2\}$, denote the spherical harmonics of second order and $\{f_{xx}, f_{xy}, f_{yy}\}$ the second order spatial derivatives. This certainty deviates from the norm of the second order derivatives, i.e. $c = |f_{xx}, f_{xy}, f_{yy}|$, as it corrects for the fact that the second order spatial derivatives do not form an orthogonal basis [1] while the spherical harmonics of the second order do. Now an isodata (two-means) threshold [10] on the measure c is performed, resulting in regions with a pronounced second order structure. As we are only interested in the valleys, the following restriction is introduced: $|\lambda_1| > |\lambda_2| \wedge \lambda_1 > 0$, where $\lambda_i$ is the $i$th eigenvalue of the Hessian matrix with $\lambda_1 > \lambda_2$. The regions for which one of the restrictions fails are set to zero in the response image. Of course, other nonlinear weightings can also be applied to suppress the background. Now we can apply a simple threshold to the cleaned response image (just above the distinct background peak of the corresponding histogram) to get points where the streamlines show a high degree of divergence. Right at the junctions eddies may occur due to a lack of translational invariance. Hence we get a relatively low response at the centers of these junctions. However, close to these vortices the divergence of the streamlines is very high. Therefore we include a step in which we merge the divergence responses in a neighborhood equal to the intrinsic scale of the underlying structure. Having identified the junctions, we use the center of gravity of those regions as the position of the junctions. The localization can be further refined by finding the points of minimum distance to the found center point on all streamlines in the junction region. From these points a new center of gravity can be calculated and if necessary this can be iterated. In our implementation the last step of refining the localization is not performed as the localization was good enough for our aim of demonstrating the algorithm.

Step Five: Direction Estimation

In the metric subsection we aligned streamlines on a pair by pair basis. In this paragraph we will cluster them such that they are labeled with the two labels of the two unimodal structures they connect. This is done by a simple hierarchical clustering method and applying a threshold at $d_{sep}$ to the dendrogram, see Fig. 2(a). In Fig. 1(c) a junction is shown with a number of streams each with two labels, i.e. A, B or C. We define a circle with radius r centered around the
junction center. Now from all streamlines the intersections with this circle are determined, i.e. each streamline should cross this circle twice. Based on these points we will label our streamlines. The radius r should be chosen such that the circle is larger than the size of the mixed zone and on the other hand as small as possible to avoid mixing with neighboring junctions. The longest 'diameter', $d_{max}$, of the mixed zone depends on the intrinsic scale of the constituent unimodal structures and the smallest angle $\alpha$ between them, see Fig. 2(c). Note that we have directional information due to the fact that the orientation "vectors" can be oriented such that they point away from the junction center. These vectors can therefore be averaged in each cluster, i.e. no structure tensor is necessary as all vectors are mapped in the same direction, i.e. away from the junction. This average is used as the direction estimate. The directional information of the streamlines is sampled at r, see Fig. 1(c).
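The clustering and averaging of Step Five can be sketched as follows. The text only speaks of "a simple hierarchical clustering method" cut at $d_{sep}$, so single linkage is an assumption here, as are the function names and the input format: a list of `(point_on_circle, tangent)` pairs, one per intersection of a streamline with the circle.

```python
import math

def single_linkage(points, d_sep):
    """Cluster labels obtained by cutting a single-linkage dendrogram at d_sep."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            if math.hypot(dx, dy) < d_sep:
                parent[find(i)] = find(j)  # merge the two clusters
    return [find(i) for i in range(len(points))]

def arm_directions(center, samples, d_sep):
    """Mean outward direction (angle in radians) of each cluster, sorted."""
    labels = single_linkage([p for p, _ in samples], d_sep)
    sums = {}
    for (p, t), label in zip(samples, labels):
        ox, oy = p[0] - center[0], p[1] - center[1]
        if t[0] * ox + t[1] * oy < 0.0:
            t = (-t[0], -t[1])  # orient the vector away from the junction center
        norm = math.hypot(t[0], t[1])
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + t[0] / norm, sy + t[1] / norm)
    return sorted(math.atan2(sy, sx) for sx, sy in sums.values())
```

Because every tangent is first flipped to point away from the center, plain vector averaging is sufficient and no double-angle or tensor representation is needed, which mirrors the remark in the paragraph above.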
3 Results

Fig. 3(a) and (b) show respectively a small part of the vascular system in the retina of an eye and a deformed miniaturized claydike model with a superimposed grid. The overlays in both images show the locations of the detected crossings (red crosses), the circles on which the directions are measured and the measured orientations (colored lines). The minimum distance between clusters is set to $d_{sep} = 1$, which means that clusters separated by a distance smaller than $d_{sep}$ in the dendrogram are merged. Further we set $r$ and $l_d$ to identical values, i.e. $r = l_d = 7$ and $r = l_d = 8$ for the retina and claydike images, respectively. These values roughly correspond to the scale of the crossings in the images. Further, at a distance $l_d$ from the junction center the streamlines are in general parallel to the unimodal structures. As can be seen, all crossings are detected. The direction estimate for one of the arms of the left middle junction in the retina image is a bit off. This is a consequence of the very low contrast, i.e. the influence of the noise on the vector field is more severe if the contrast is low. Fig. 3(c-d) show respectively the 'streamline divergence' before and after background suppression using the second order certainty measure.
4 Discussion and Conclusions

Preliminary testing of our method shows excellent detection and characterization for a modest window size. The junction detection is dependent on the local structure and as such relatively independent of the local contrast. Furthermore, the angular selectivity can be increased by increasing the labelling distance r. Another nice aspect of the method is that it automatically selects a mode, i.e. it selects the number of unimodal structures at the junction. The implementation used can easily be improved, e.g. by means of higher order streamline methods. This should decrease the influence of noise and should refine the streamlines. Furthermore, one could think of averaging the orientation information over a larger area to obtain a more robust estimate, i.e. at this
Fig. 3. (a) Blowup of a retina image (courtesy National Eye Institute, U.S. National Institute of Health). (b) Image of a deformed miniaturized claydike model with a superimposed grid. Courtesy of GeoDelft, The Netherlands. In overlay: the dashed circles denote the analysis window and the colored lines at the circle denote the measured orientations at their intersection with the circle, with $r = l_d = 7$ and $r = l_d = 8$ for images (a) and (b), respectively. (c) The distance metric of the image in (b). (d) The distance metric for the image in (b) after setting the points where the Hessian is uncertain to zero.
moment the streams are only sampled where they cross a circle of radius r, which makes the orientation susceptible to noise. We state that the labeling of the streamlines can be performed at approximately the same distance from the junction center as the detection of the junctions, i.e. $l_d \approx r$, because both parameters are directly related to the intrinsic size of the junctions. As shown in Fig. 2(c) we can relate the minimum angle of separation $\alpha$ to the width $w$ of the unimodal structures. As such, as a rule
of thumb we state that $r \le \frac{2w}{\sin \frac{1}{2}\alpha}$, which ensures that the entire mixed zone is enclosed. As such the parameters are basically reduced to an angular resolution parameter $\alpha$ and the width of the unimodal structures. Optionally one could perform the clustering and the direction measurements at different positions along the streamlines. As the intrinsic scale of junctions can differ over the image, one could make $r$ and $l_d$ position dependent [5]. This could also minimize the interaction of structures in close proximity to each other. After inspection of the individual steps of the algorithm we estimate that the time complexity of the algorithm is approximately a factor of 10 higher than that of an eigensystem analysis of the gradient structure tensor. Further, the data complexity is low and only widely accepted tools are used.
References
1. Danielsson, P.-E., Lin, Q., Ye, Q.-Z.: Efficient detection of second-degree variations in 2D and 3D images. J. Vis. Comm. Image Represent. 12, 255–305 (2001)
2. Dijk, J., van Ginkel, M., van Asselt, R.J., van Vliet, L.J., Verbeek, P.W.: A new sharpness measure based on Gaussian lines and edges. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 149–156. Springer, Heidelberg (2003)
3. Faas, F.G.A., van Vliet, L.J.: 3D-orientation space; filters and sampling. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 36–42. Springer, Heidelberg (2003)
4. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. Patt. Anal. Mach. Intell. 13(9), 891–906 (1991)
5. Garding, J., Lindeberg, T.: Direct computation of shape cues using scale-adapted spatial derivative operators. Int. J. Comput. Vis. 17, 163–191 (1996)
6. van Ginkel, M., Verbeek, P.W., van Vliet, L.J.: Orientation selectivity for orientation estimation. In: Proceedings of the 10th Scandinavian Conference on Image Analysis, Lappeenranta, Finland, vol. I, pp. 533–537, June 9-11 (1997)
7. Granlund, G.H.: In search of a general picture processing operator. Computer Graphics and Image Processing 8, 155–173 (1978)
8. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. of the fourth Alvey Vision Conference, pp. 147–151 (August 31–September 2 1988)
9. Knutsson, H.: Representing local structure using tensors. In: The 6th Scandinavian Conference in Image Analysis, pp. 244–251, Oulu, Finland, June 19-22 (1989)
10. Ridler, T.W., Calvard, S.: Picture thresholding using an iterative selection method. IEEE Trans. Syst. Man. Cyb. 8, 630–632 (1978)
Decomposing a Simple Polygon into Trapezoids Fajie Li and Reinhard Klette Computer Science Department, The University of Auckland Auckland, New Zealand
Abstract. Chazelle’s triangulation [1] forms today the common basis for linear-time Euclidean shortest path (ESP) calculations (where start and end point are given within a simple polygon). This paper provides an alternative method for subdividing a simple polygon into “basic shapes”, using trapezoids instead of triangles. The authors consider the presented method as being substantially simpler than the linear-time triangulation method. However, it requires a sorting step (of a subset of vertices of the given simple polygon); all the other subprocesses are linear time. Keywords: computational geometry, simple polygon, Euclidean shortest path, rubberband algorithm.
1 Introduction

This paper proposes a decomposition of any simple polygon into trapezoids. Subsequent algorithms, such as calculating a Euclidean shortest path (ESP), or further processing for pattern recognition or shape analysis purposes, may benefit from this. In [2] the authors claimed to have an $O(n \log n)$ rubberband algorithm for calculating ESPs in simple polygons. Actually, this needs to be corrected: the algorithm given in [2] has worst-case complexity $O(n^2)$. However, using the trapezoid decomposition given in this paper it is possible to calculate step sets more efficiently [which allows the algorithm given in [2] to be modified into one of $O(n \log n)$ time complexity]. Section 2 provides necessary definitions and theorems. Section 3 presents the trapezoid decomposition algorithm together with examples and a time analysis. Section 4 concludes the paper.
2 Basics

Let $\Pi^\bullet$ be the (topological) closure of a simple polygon $\Pi$, $\partial\Pi$ be its frontier, $V(\Pi)$ the set of vertices of $\Pi$, and $E(\Pi)$ the set of edges of $\Pi$. For $v \in \partial\Pi$, $v_x$ and $v_y$ are the x- and y-coordinates of $v$, respectively. Let $v_1, v_2, v_3 \in \partial\Pi$. Vertices are ordered either by a clockwise or counter-clockwise scan through $\partial\Pi$. Let $\rho(v_1, \Pi, v_2)$ be the polygonal path from $v_1$ to $v_2$, contained in $\partial\Pi$. Let $\rho(v_1, \Pi, v_2, \Pi, v_3)$ be the polygonal path from $v_1$ to $v_2$ and then to $v_3$ around $\partial\Pi$. Let $u, v, w \in \partial\Pi$ such that $u_y = v_y = w_y$; consider $\rho(u, \Pi, v, \Pi, w) \subset \partial\Pi$. Let $\Pi'$ be a simple polygon obtained by adding a 'closing edge' $uw$ to $\rho(u, \Pi, v, \Pi, w)$,

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 726–733, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Illustration of the given definitions
denoted by $\Pi' = uw + \rho(u, \Pi, v, \Pi, w)$. $\Pi'$ is called an up- (down-) polygon with respect to $v$ if, for each $p \in \Pi'^\bullet$, $p_y \ge (\le)\ v_y$. For example, in Figure 1, $\Pi_l$ and $\Pi_r$ are up-polygons with respect to $v_1$. Π is a down-polygon with respect to $v_1$. Let $\Pi$, $\Pi'$ be two simple polygons with $\Pi'^\bullet \subset \Pi^\bullet$; let $\{v_0, v_1, v_2, \ldots, v_{n-1}\}$ and $\{v'_0, v'_1, v'_2, \ldots, v'_{n-1}\}$ be such orders of the vertices of $\Pi$ and $\Pi'$, respectively, that the following becomes true: for a sufficiently small $\varepsilon > 0$, we have $d_e(v_0, v'_0) = \varepsilon$ for the Euclidean distance between $v_0$ and $v'_0$, edge $v'_i v'_{i+1}$ is parallel to edge $v_i v_{i+1}$, and $v_i v'_i$ bisects the angle $v_{i-1} v_i v_{i+1}$, where $i = 0, 1, 2, \ldots, n$, and the addition or subtraction of indices is taken mod $n$. In this case, $\Pi'$ is called an inner simple polygon of $\Pi$, denoted by $\Pi(v_0, \varepsilon)$. In Figure 2, $u_0 u_1 u_2 u_3 u_4 u_5 u_0$ is an inner polygon of $v_0 v_1 v_2 v_3 v_4 v_5 v_0$. Let $v_{i-1}$, $v_i$, $v_{i+1}$ and $v_{i+2}$ be four consecutive vertices of $\Pi$. If $(v_{i-1})_y < (v_i)_y$, $(v_i)_y = (v_{i+1})_y$ and $(v_i)_y > (v_{i+2})_y$ [or $(v_{i-1})_y > (v_i)_y$, $(v_i)_y = (v_{i+1})_y$ and $(v_i)_y < (v_{i+2})_y$], and the point $(0.5 \cdot ((v_i)_x + (v_{i+1})_x), (v_i)_y + \varepsilon)$ [or $(0.5 \cdot ((v_i)_x + (v_{i+1})_x), (v_i)_y - \varepsilon)$] is inside $\Pi$, then $v_i v_{i+1}$ is called an up- (down-) stable edge of $\Pi$ with respect to the xy-coordinate system used for representing $\Pi$. If $(v_{i-1})_y < (v_i)_y$, $(v_i)_y = (v_{i+1})_y$ and $(v_i)_y > (v_{i+2})_y$, and the point $(0.5 \cdot ((v_i)_x + (v_{i+1})_x), (v_i)_y - \varepsilon)$ is in $\Pi$, then $v_i v_{i+1}$ is called a maximal edge of $\Pi$ with respect to the chosen xy-coordinate system. Let $v_{i-1}$, $v_i$ and $v_{i+1}$ be three consecutive vertices of $\Pi$. If $(v_{i-1})_y < (v_i)_y$, $(v_i)_y > (v_{i+1})_y$ [or $(v_{i-1})_y > (v_i)_y$, $(v_i)_y < (v_{i+1})_y$], and there exist points $u_i \in V(\Pi(v_0, \varepsilon))$ [this is the set of vertices of $\Pi(v_0, \varepsilon)$] such that the triangle $v_{i-1} v_i v_{i+1}$ does not (!) contain $u_i$, then $v_i$ is called an up- (down-) stable point of $\Pi$ with respect to the xy-coordinate system used for representing $\Pi$.
F. Li and R. Klette
Fig. 2. Example of an inner polygon
In Figure 1, v_2, v_3, v_5, v_6, v_7 and v_8 are up-stable points; v_1, v_4, v_9, v_10, v_11, v_12 and v_13 are down-stable points. If (v_{i−1})_y < (v_i)_y, (v_i)_y > (v_{i+1})_y [or (v_{i−1})_y > (v_i)_y, (v_i)_y < (v_{i+1})_y] and there exist points u_i ∈ V(Π(v_0, ε)) [this is the set of vertices of Π(v_0, ε)] such that the triangle v_{i−1} v_i v_{i+1} does (!) contain u_i, then v_i is called a maximal (minimal) point of Π with respect to the chosen xy-coordinate system. In Figure 1, u_1 is a maximal point and u_3 is a minimal point.

As is common, a simple polygon is called monotonous if it has both a unique up- and down-stable point or edge. In Figure 1, u_1 u_2 u_3 u_4 u_5 u_6 u_7 u_8 u_1 is a monotonous simple polygon. Since the discussion of our algorithm in the case of an up- or down-stable edge is analogous to that of an up- or down-stable point, we will just detail the case of up- or down-stable points.

Let v be an up-stable point of Π. Let S_v be the set of minimal points u of Π such that there exists a polygonal path from v to u around ∂Π and u_y < v_y. Let u′ ∈ S_v such that u′_y = max{u_y : u ∈ S_v}. If there exists a point w ∈ ∂Π such that the segment u′w ⊂ Π• and w_y = u′_y, then u′w is called a cut edge of Π. The polygonal path ρ(v, Π, u′) is called a decreasing polygonal path from v to u′w. v is called an up-stable point with respect to the cut edge u′w, and u′w is called a cut edge with respect to the up-stable point v. u′ is called a closest minimal point with respect to v around the frontier of Π. In Figure 1, v_8 v_9 u_9 u_3 is a decreasing polygonal path from v_8 to u_3 u_4, and u_3 is a closest minimal point with respect to v_8.

If u, w ∈ ∂Π such that u_y = v_y = w_y, u_x < v_x < w_x, and uv, vw ⊂ Π•, then u and w are called the left and right intersection points of v, respectively. In Figure 1, v_1l and v_1r are the left and right intersection points of v_1, respectively.

Let v be a maximal point of Π. Let S_v be the set of down-stable points u of Π such that there exists a polygonal path from v to u around ∂Π and u_y < v_y. Let u′ ∈ S_v such that u′_y = max{u_y : u ∈ S_v}. u′ is called a closest down-stable point with respect to v around the frontier of Π. In Figure 1, v_9 is a closest down-stable point with respect to the maximal point above the segment v_9 u_9.
3 The Algorithm

This section describes the decomposition algorithm and its three subroutines, Procedures 1–3. Procedure 1 updates the left or right intersection points of up-stable points. It is applied by Procedure 2 to remove up-stable points. Procedure 3 applies Procedures 1 and 2 to compute the left and right intersection points of all up-stable points, from bottom to top. Procedure 3 is called in Step 3 of the main algorithm, which processes both up- and down-stable points, from top to bottom.

In Procedure 1, let Π be a simple polygon such that V(Π) does not have any up-stable point.

Procedure 1: update left or right intersection points
1. Decompose Π into a set of trapezoids, denoted by T_Π (note: straightforward, because there is no up-stable point).
2. For each edge e ∈ E(Π):
2.1. If e is a cut edge, then do the following:
2.1.1. Let v_e be the up-stable point with respect to e.
2.1.2. Find a trapezoid t ∈ T_Π such that the bottom edge of t and edge e have a common end point in V(Π).
2.1.3. Let ρ_e be the decreasing polygonal path from v_e to e.
2.1.4. Find a stack of trapezoids, denoted by T_e (⊆ T_Π), such that for each t ∈ T_e, t• ∩ ρ_e ≠ ∅.
2.1.5. Let t_e be the topmost trapezoid in T_e.
2.1.6. Compute the intersection points of the line y = (v_e)_y with the two edges on the left and right sides of t_e.
2.1.7. Update the left and right intersection points of v_e by comparing the results of Step 2.1.6 with the initial left and right intersection points of v_e.

Let I be an interval of real numbers, and M_I be a subset of maximal points of Π such that for each element v ∈ M_I, v_y ∈ I. Suppose M_I is sorted by decreasing y-coordinate.

Procedure 2: remove up-stable points
1. Let M_I = {v_1, v_2, ..., v_k}.
2. For each i ∈ {1, 2, ..., k}, do the following:
2.1. Find a closest down-stable point with respect to v_i around the frontier of Π, denoted by u_i.
2.2. Find a point w_i ∈ ∂Π such that ρ(u_i, Π, v_i, Π, w_i) is the shortest polygonal path in ∂Π such that (u_i)_y = (w_i)_y.
2.3. Update Π by replacing ρ(u_i, Π, v_i, Π, w_i) by the edge u_i w_i.
2.4. Let Π_i = u_i w_i + ρ(u_i, Π, v_i, Π, w_i).
2.5. Now let Π_i be the input of Procedure 1 and update the left and right intersection points of all possible up-stable points.

Procedure 3: compute all left or right intersection points
1. Let U = {v_1, v_2, ..., v_k} be the sorted set of up-stable points of Π such that (v_1)_y < (v_2)_y < ··· < (v_k)_y.
2. For each i ∈ {1, 2, ..., k}, do the following:
2.1. Find a closest minimal point with respect to v_i by following the frontier of Π, denoted by u_i.
2.2. Let I = [a, b] where a = (u_i)_y and b = (v_i)_y.
2.3. Compute M_I.
2.4. If M_I ≠ ∅, then apply Procedure 2 and go to Step 2.1.
2.5. Otherwise, find a point w_i such that ρ(u_i, Π, v_i, Π, w_i) is the shortest polygonal path of ∂Π with (u_i)_y = (w_i)_y.
2.6. Set initial left and right intersection points of v_i as follows:
2.6.1. If (v_{i−1})_y = (v_i)_y = (v_{i+1})_y, then let the initial left and right intersection points of v_i be v_{i−1} and v_{i+1}, respectively.
2.6.2. Otherwise, find two points w_il, w_ir such that ρ(w_il, Π, u_i, Π, v_i, Π, w_ir) is the shortest polygonal path of ∂Π with (w_il)_y = (v_i)_y = (w_ir)_y.
2.6.3. Let the initial left and right intersection points of v_i be w_il and w_ir, respectively.
2.7. Update Π by replacing ρ(u_i, Π, v_i, Π, w_i) by the edge u_i w_i.
2.8. Now let Π be the input for Procedure 1; update the left and right intersection points of all possible up-stable points.

Main Algorithm: Decomposition Algorithm
1. Let S = ∅ and T = ∅.
2. Compute the set of up- and down-stable points, denoted by V.
3. Apply Procedure 3 to compute the left and right intersection points for all up-stable points in V.
4. Sort V by decreasing y-coordinate.
5. For each i ∈ {1, 2, ..., k}, with k = |V|, do the following:
5.1. Case 1. v_i is a down-stable point.
5.1.1. Find two points v_il, v_ir ∈ ∂Π such that (v_il)_y = (v_i)_y = (v_ir)_y and (v_il)_x < (v_i)_x < (v_ir)_x.
5.1.2a. Let Π_l = v_il v_i + ρ(v_il, Π, v_i) (up-polygon).
5.1.2b. Let Π_r = v_i v_ir + ρ(v_i, Π, v_ir) (up-polygon).
5.1.2c. Let Π′ = v_il v_ir + ρ(v_il, Π, v_ir) (down-polygon).
5.1.3. Let S = S ∪ {Π_l, Π_r}.
5.1.4. Let Π = Π′.
5.1.5. Let i = i + 1 and go to Step 5.
5.2. Case 2. v_i is an up-stable point.
5.2.1a. Let v_il, v_ir be the left and right intersection points of v_i, respectively.
5.2.2b. Let Π_l = v_il v_i + ρ(v_il, Π, v_i) (down-polygon).
5.2.2c. Let Π_r = v_i v_ir + ρ(v_i, Π, v_ir) (down-polygon).
5.2.2d. Let Π′ = v_il v_ir + ρ(v_il, Π, v_ir) (up-polygon).
5.2.3. Let S = S ∪ {Π′}.
5.2.4. Let Π = {Π_l, Π_r}.
5.2.5. Let i = i + 1 and go to Step 5.
6. For each j ∈ {1, 2, ..., n}, with n = |S|, do the following:
6.1. For each (monotonous) polygon Π_j ∈ S, decompose it into a stack of trapezoids, denoted by T_j.
6.2. Let T = T ∪ T_j.
7. Output T.

3.1 Time Complexity of the Algorithm
Lemma 1. The set of up-stable (or down-stable, or maximal) points of Π can be computed in O(n), where n = |V(Π)|.
Proof. For a sufficiently small number ε > 0, the "start" vertex v_0 of an inner polygon Π(v_0, ε) can be computed in O(n), where n = |V(Π)| (see [3]). Let u, v, w ∈ V(Π) be three consecutive vertices such that u_x < v_x < w_x, and let v′ be the vertex of the inner polygon that corresponds to v. If u_y < v_y, v_y > w_y and v′_y > v_y, then v is an up-stable point; if u_y < v_y, v_y > w_y and v′_y < v_y, then v is a maximal point; if u_y > v_y, v_y < w_y and v′_y < v_y, then v is a down-stable point.
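The linear-time classification in Lemma 1 can be sketched as follows. This is a swapped-in technique, not the inner-polygon test of the paper: it assumes the vertices are listed counter-clockwise and uses the sign of a cross product (convex versus reflex vertex) to decide whether the interior lies below or above a local extremum. Stable and maximal edges (horizontal edges) are not handled, and the name `classify_extreme_vertices` is hypothetical.

```python
def classify_extreme_vertices(polygon):
    """Label local y-extrema of a counter-clockwise simple polygon.

    Assumption (not the paper's test): a convex local maximum has the
    interior below it (maximal point), a reflex one has the interior
    above it (up-stable point); symmetrically for local minima.
    """
    n = len(polygon)
    labels = {}
    for i in range(n):
        (ux, uy), (vx, vy), (wx, wy) = polygon[i - 1], polygon[i], polygon[(i + 1) % n]
        # z-component of (v - u) x (w - v); positive means a left turn (convex)
        cross = (vx - ux) * (wy - vy) - (vy - uy) * (wx - vx)
        convex = cross > 0
        if uy < vy > wy:        # strict local maximum in y
            labels[i] = "maximal" if convex else "up-stable"
        elif uy > vy < wy:      # strict local minimum in y
            labels[i] = "minimal" if convex else "down-stable"
    return labels


# Counter-clockwise square with a notch cut from the top.
notched_square = [(0, 0), (4, 0), (4, 4), (2, 2), (0, 4)]
print(classify_extreme_vertices(notched_square))
```

Each vertex is inspected once with constant work, which matches the O(n) bound of the lemma. The notch tip (2, 2) is reported as down-stable and the two top corners as maximal points.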
Lemma 2. Procedure 1 can be computed in O(n log n), where n = |V(Π)|.
Proof. If Π is monotonous, then (obviously) it can be decomposed into a stack of trapezoids in O(|V(Π)|). Otherwise, by assumption, Π can only have a finite number of down-stable points. Analogous to Step 5.1 in the decomposition algorithm, Π can be decomposed into a set of trapezoids in O(|V(Π)|). Thus, Step 1 can be computed in O(|V(Π)|). All steps following Step 2, except Step 2.1.4, can be computed in O(1). Step 2.1.4 can be computed in O(n log n), where n = |V(Π)|: Let S_d be the set of all down-stable points of Π.
A1. Sort S_d according to y-coordinates; we obtain S_d = {v_1, v_2, ..., v_k} such that (v_1)_y ≤ (v_2)_y ≤ (v_3)_y ≤ ... ≤ (v_k)_y.
A2. Let m = min{|(v_{i−1})_y − (v_i)_y| : |(v_{i−1})_y − (v_i)_y| > 0, i = 2, 3, ..., k}, and M = max{|(v_{i−1})_x − (v_i)_x| : |(v_{i−1})_x − (v_i)_x| > 0, i = 2, 3, ..., k}.
A3. Transform Π by rotating it by angle θ anticlockwise about the origin (i.e., for each point p = (x, y) ∈ Π•, update it by point (x′, y′), where x′ = x cos θ − y sin θ, y′ = x sin θ + y cos θ).
Without loss of generality, assume that m = |(v_1)_y − (v_2)_y| > 0. We have that

$$
\begin{aligned}
|v'_{1y} - v'_{2y}| &= |(v_{1x} - v_{2x}) \sin\theta + (v_{1y} - v_{2y}) \cos\theta| \\
&\ge |(v_{1y} - v_{2y}) \cos\theta| - |(v_{1x} - v_{2x}) \sin\theta| \\
&\ge m \cos\theta - M \sin\theta \;\to\; m \quad (\theta \to 0)
\end{aligned}
$$

Therefore, there exists a sufficiently small angle θ > 0 such that, after rotating, each down-stable point is still a down-stable point, and all down-stable points have unique y-coordinates. This implies that, for each trapezoid t ∈ T_Π, there exist at most two trapezoids t_1, t_2 ∈ T_Π such that t_1 and t_2 each have a common vertex with t. (In this case, t_1 and t_2 have a common vertex which is a down-stable point.) Since Step A2 can be computed in O(|S_d|) and Step A3 can be computed in O(1), it follows that Step 2.1.4 can be computed in O(k), where k is the number of vertices of the decreasing polygonal path from v_e to e, after sorting (Step A1) in O(|S_d| log |S_d|).
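The rotation argument of Steps A2 and A3 can be made concrete with a short sketch. Instead of deriving θ from m and M, the sketch simply halves a trial angle until all y-coordinates are distinct and no existing y-order is reversed. The helper names, the starting angle 0.01 and the halving strategy are assumptions for illustration only, and the input points are assumed to be pairwise distinct.

```python
import math


def rotate(points, theta):
    """Rotate points anticlockwise about the origin by theta (Step A3)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def keeps_order_and_separates(points, rotated):
    """True if all rotated y-values differ and no original y-order is flipped."""
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            dy = points[a][1] - points[b][1]
            dy_rot = rotated[a][1] - rotated[b][1]
            if dy_rot == 0 or dy * dy_rot < 0:
                return False
    return True


def rotate_until_distinct_y(points, theta=0.01, max_halvings=50):
    """Halve theta until the rotation separates all y-coordinates."""
    for _ in range(max_halvings):
        rotated = rotate(points, theta)
        if keeps_order_and_separates(points, rotated):
            return theta, rotated
        theta /= 2
    raise ValueError("no suitable angle found; are the points distinct?")


theta, rotated = rotate_until_distinct_y([(0, 0), (2, 0), (1, 1)])
print(theta, [round(y, 4) for _, y in rotated])
```

The pairwise check is quadratic and only serves to verify the claim of the proof on a small example; the proof itself needs only the bound m cos θ − M sin θ > 0.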
Lemma 3. Procedure 2 can be computed in O(n log n) time, where n is the number of vertices of the original simple polygon Π. Proof. By Lemma 1, Step 1 can be computed in O(n log n), where n is the number of vertices of the original simple polygon Π. Step 2.1 can be computed in O(nu ), where nu is the number of vertices of ρ(ui , Π, vi ). Step 2.2 can be computed in O(nw ), where nw is the number of vertices of ρ(vi , Π, wi ). Steps 2.3 and 2.4 can be computed in O(1). By Lemma 2, Step 2.5 can be computed in O(ni log ni ), where ni = |V (Πi )|. Thus, Step 2 can be computed in O(n log n) altogether, where n is the number of vertices of Π.
Lemma 4. Procedure 3 can be computed in O(n log n) time, where n is the number of vertices of the original simple polygon Π.
Proof. By Lemma 1, Step 1 can be computed in O(n log n), where n is the number of vertices of the original simple polygon Π. Step 2.1 can be computed in O(n_u), where n_u is the number of vertices of ρ(u_i, Π, v_i). Step 2.2 can be computed in O(1). Step 2.3 can be computed in O(|M_I|). By Lemma 3, Step 2.4 can be computed in O(n_u log n_u), where n_u = |V(ρ(u_i, Π, v_i))|. Step 2.5 can be computed in O(n_w), where n_w is the number of vertices of ρ(u_i, Π, w_i). Step 2.6.1 can be computed in O(1). Step 2.6.2 can be computed in O(n_i), where n_i = |V(ρ(u_i, Π, w_il))| + |V(ρ(w_i, Π, w_ir))|. Step 2.6.3 can be computed in O(1). Step 2.7 can be computed in O(1). By Lemma 2, Step 2.8 can be computed in O(n log n), where n is the number of vertices of the updated Π. Therefore, Procedure 3 can be computed in O(n log n), where n is the number of vertices of the original simple polygon Π.
Theorem 1. The given decomposition algorithm has time complexity O(n log n), where n is the number of vertices of the original simple polygon Π.
Proof. Step 1 can be computed in O(1). By Lemma 1, Step 2 can be computed in O(n log n), where n is the number of vertices of the original simple polygon Π. By Lemma 4, Step 3 can be computed in O(n log n). Step 4 can be computed in O(|V| log |V|). Step 5.1.1 can be computed in O(n_i), where n_i = |V(ρ(v_il, Π, v_i))| + |V(ρ(v_i, Π, v_ir))|. Steps 5.1.2a – 5.1.5 can be computed in O(1). Thus, Step 5.1 can be computed in O(n_i). Step 5.2 can be computed in O(1). By Lemma 3, Step 6.1 can be computed in O(|Π_j|). Thus, Step 6 can be computed in O(n). Step 7 can be computed in O(1). Therefore, the decomposition algorithm can be computed in O(n log n), where n is the number of vertices of Π.

Fig. 3. Output of the decomposition algorithm for the polygon of Figure 1
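Step 6.1 of the main algorithm, the decomposition of a monotonous polygon into a stack of trapezoids, can be sketched by cutting the polygon with a horizontal line through every vertex. The sketch below is an illustration under assumptions, not the authors' implementation: the polygon is assumed to be monotone in y with no horizontal edge strictly between its lowest and highest level, and the cross-section at each level is found by a full scan of the edges, so it runs in quadratic rather than linear time. All function names are hypothetical.

```python
def x_range(polygon, y):
    """Leftmost and rightmost boundary x-coordinates on the line at height y."""
    xs = []
    n = len(polygon)
    for i in range(n):
        (ax, ay), (bx, by) = polygon[i], polygon[(i + 1) % n]
        if min(ay, by) <= y <= max(ay, by):
            if ay == by:
                xs += [ax, bx]  # horizontal edge lying on the line
            else:
                xs.append(ax + (y - ay) * (bx - ax) / (by - ay))
    return min(xs), max(xs)


def trapezoid_stack(polygon):
    """One trapezoid per pair of consecutive vertex levels, bottom to top."""
    levels = sorted({y for _, y in polygon})
    stack = []
    for y0, y1 in zip(levels, levels[1:]):
        (l0, r0), (l1, r1) = x_range(polygon, y0), x_range(polygon, y1)
        stack.append([(l0, y0), (r0, y0), (r1, y1), (l1, y1)])
    return stack


def trapezoid_area(t):
    """Area from the two parallel (horizontal) sides and the height."""
    (l0, y0), (r0, _), (r1, y1), (l1, _) = t
    return 0.5 * ((r0 - l0) + (r1 - l1)) * (y1 - y0)


kite = [(2, 0), (4, 1), (3, 3), (1, 2)]  # y-monotone quadrilateral, area 5
stack = trapezoid_stack(kite)
print(len(stack), sum(trapezoid_area(t) for t in stack))
```

A trapezoid whose bottom or top side has length zero is a triangle, as happens for the lowest and highest slab of the example. The linear bound stated in the proof would require walking the left and right chains in parallel instead of rescanning all edges per level.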
4 Conclusions

The paper described an O(n log n) time algorithm for the decomposition of any simple polygon into trapezoids. (See Figure 3 for an example.) Of course, these can then be further divided into triangles. The algorithm was described on these eight pages (compared to 40 journal pages of paper [1]). (For a longer version of this paper, with some examples illustrating processing steps, see CITR-TR-199.) The given algorithm will make it possible to improve the work reported in [2]. The authors conjecture that the O(n log n) sorting step in the given algorithm can be replaced by some type of (linear-time) scan; future studies will show whether this is the case.
References
1. Chazelle, B.: Triangulating a simple polygon in linear time. Discrete Computational Geometry 6, 485–524 (1991)
2. Li, F., Klette, R.: Finding the shortest path between two points in a simple polygon by applying a rubberband algorithm. In: Chang, L.-W., Lie, W.-N. (eds.) PSIVT 2006. LNCS, vol. 4319, pp. 280–291. Springer, Heidelberg (2006)
3. Sunday, D.: Algorithm 3: Fast winding number inclusion of a point in a polygon. (last visit: April 2007) See www.geometryalgorithms.com/Archive/algorithm 0103/
Shape Recognition and Retrieval: A Structural Approach Using Velocity Function

Hamidreza Zaboli, Mohammad Rahmati, and Abdolreza Mirzaei
Department of Computer Engineering, Amirkabir University of Technology
{zaboli, rahmati, mirzaei}@aut.ac.ir

Abstract. In this paper a new method for matching and recognition of shapes extracted from images, based on the structure and skeleton of the shape, is proposed. In this method, a function called "velocity function", derived from the radius function, is introduced and values of this function are calculated for skeletal points. Each skeletal curve is then transformed into a vector containing the values of the velocity function calculated along the curve. Next, a tree-like structure is extracted from the skeleton of the shape so that each leaf of the tree represents a skeletal curve of the skeleton, with a velocity vector as its descriptor. At the fine-grain level, these vectors are matched using dynamic programming; at the coarse-grain level, the tree-like structures are matched using a greedy approach. At the final stage, the difference between the shapes is determined by a distance value resulting from the two-level matching process. This distance is used as a measure for shape matching and recognition. Experiments are performed on two standard binary image databases in the presence of various transformations. Results confirm the efficiency and high recognition rate of our method.

Keywords: Velocity function, radius function, shape matching, shape recognition, skeleton.
1 Introduction

In the computer vision field, object recognition involves two key steps: image segmentation and recognition [1]. Often object recognition deals with shape recognition from a database containing special patterns which have been extracted from shapes based on special criteria. The various methods differ in the criteria and viewpoints they adopt. From a general point of view, these criteria and approaches fall into two main groups: contour-based methods and content-based methods [2]. Contour-based methods use only boundary information of a shape for recognition, whereas content-based methods use all points of the shape. Moreover, the content-based methods are also divided into global and structural methods depending on whether shape content is decomposed into parts or not [2]. In this research, we present a method which belongs to the structural content-based methods. The methods in this area either decompose the internal regions of the shape [3,4,5] or generate the skeleton of the shape [6,7,8,9]. In both approaches some features are extracted from the obtained components or from the skeleton. Matching these features then leads to recognition of the shapes.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 734 – 741, 2007. © Springer-Verlag Berlin Heidelberg 2007
An important feature which may be extracted from the skeleton is the set of values of the radius function at different points of the skeleton. The values of this function on the skeleton describe different parts of the shape. Siddiqi et al. [6,7] have used this function to classify skeletal points into four types. In this paper, we define a new function called "velocity function" and we use it as the basic element of shape description. Because the values of the radius function depend on the style and scale of each shape, the radius function by itself is not a suitable feature in all cases. What we need in order to describe a shape are the variations of this function. These variations are given by the first-order differential of the radius function, which we call the velocity function. An advantage of the velocity function is that it is scale invariant. This property comes implicitly with the differential of the function, without any computational overhead. We use the velocity function and its properties for describing and representing the different parts of shapes.

The paper is organized as follows: in Section 2, the velocity function is defined and values of this function are calculated for points on the skeleton of a shape. In Section 3, the skeleton of the shape is generated and a tree-like structure is extracted from it. In Section 4, the method for comparing the structures is presented and the distance between the structures and the shapes is computed as a single value. In Section 5, experiments are presented, and Section 6 concludes the paper.
2 Definition of the Velocity Function on Skeleton

In order to define the velocity function, it is necessary to introduce the radius function on the skeleton of a shape. Let S denote the skeleton of a shape and consider a point p ∈ S. The radius function R(p) is the radius of the maximal inscribed disc centered at p that touches the boundary of the shape [2,10,11]. As p(x, y) moves along S, the value of the radius function varies. Note that the different parts of a shape, including convexities and concavities, are generated by the variations of the radius function. Therefore, we define the velocity function V(p) as the first-order differential of the radius function R(p) on the points of the skeleton as follows:
$$
\forall p_1, p_2 \in S,\quad p_1(x_1, y_1) \neq p_2(x_2, y_2),\quad \|p_2 - p_1\| < \varepsilon:\qquad V(p_2) = \frac{R(p_2) - R(p_1)}{\|p_2 - p_1\|}
$$
Fig. 1. A part of a shape used to define the velocity function V(p). With respect to the direction, R ( p2 ) > R ( p1 ) ⇒ V2 > 0 .
Here p1 and p2 are points on the skeleton S and ||p2 − p1|| is a distance measure between p1 and p2 (see footnote 1). Fig. 1 illustrates this definition. The velocity function can thus be defined and calculated for all points on the skeleton. To calculate these values on the skeletal curves, we consider each curve segment (see footnote 2) separately, starting at a branch point. We traverse each curve segment from the branch point to its other end (a terminal point or another branch point) and calculate the values of the velocity function for the points on the curve segment. In this way, we obtain for each curve segment a vector containing the values of the velocity function, which we call the "velocity vector". Curves with branch points at both ends have two velocity vectors, one originating at each of the two branch points. These velocity vectors are used for comparing different parts of shapes.
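The computation of a velocity vector can be illustrated with a minimal sketch. It assumes that a skeletal curve is available as an ordered list of sample points, starting at the branch point, together with the radius function value at each sample, and it uses the Euclidean distance between consecutive samples as ||p2 − p1||. The function name `velocity_vector` and the sample data are hypothetical.

```python
import math


def velocity_vector(points, radii):
    """Finite differences of the radius function along one skeletal curve.

    points: ordered (x, y) samples, first sample at the branch point.
    radii:  radius function value R(p) at each sample.
    """
    if len(points) != len(radii):
        raise ValueError("points and radii must have the same length")
    vector = []
    for k in range(1, len(points)):
        (x1, y1), (x2, y2) = points[k - 1], points[k]
        step = math.hypot(x2 - x1, y2 - y1)  # Euclidean ||p2 - p1||
        vector.append((radii[k] - radii[k - 1]) / step)
    return vector


curve = [(0, 0), (3, 4), (3, 9)]
print(velocity_vector(curve, [1, 2, 4.5]))

# Scaling the shape by 2 scales both the radii and the distances by 2.
scaled_curve = [(2 * x, 2 * y) for x, y in curve]
print(velocity_vector(scaled_curve, [2, 4, 9]))
```

Both calls yield the same vector, which illustrates the scale invariance claimed for the velocity function: a uniform scaling multiplies numerator and denominator by the same factor.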
Footnote 1: In our application ||p2 − p1|| is equal to the length of the skeletal curve segment between p1 and p2, although the Euclidean distance between p1 and p2 may also be used.
Footnote 2: We use the terms "curve segment" and "skeletal curve" interchangeably for a segment of the skeleton which connects two branch points, or a branch point to a terminal point or vice versa. There is no branch point between the two end points of a curve segment.

3 Skeleton and Extracting Structure

The skeleton of a shape is a set of thinned curves and lines [10]. There are various types of skeletons for describing shapes. The medial axis is one type of skeleton and is defined as the locus of centers of maximal disks that fit within the shape [2,10,11,12]. We use the Blum medial axis [13] as the skeleton of shapes. Fig. 2(a-c) illustrates the shape of a dog and its medial axis. In Fig. 2(c), the skeleton is decomposed into seven curve segments (skeletal curves) labeled 1 to 7. As shown in Fig. 2(d), in the skeleton decomposition phase, curve segments whose two end points are both branch points, e.g. curve segments 3 and 5, are considered twice.

3.1 Extracting Tree-Like Structure from the Skeleton

We treat the skeleton of a shape as a tree-like structure in which the branch points are children of the root of the tree (branch nodes) and the skeletal curves are the leaves of the tree. We use this structure to describe the shape at the coarse-grain level. Based on this structure, we define three rules for extracting a tree-like structure from the skeleton. Using these rules, we obtain a tree for each skeleton. The rules are as follows:
1. ∀bp ∈ S: Add_as_Child(bp, root)
2. ∀bp ∈ S, ∀curve ∈ S: Origin(curve) = bp ⇒ Add_as_Child(curve, bp)
3. ∀curve: Origin(curve) = null ⇒ create a dummy branch point bp, Add_as_Child(bp, root), Add_as_Child(curve, bp)
• Here "bp" is a branch point, "curve" is a skeletal curve branching out from a "bp" and terminating at another "bp" or at a terminal point, "S" is the skeleton of the shape, and "root" is the root of the tree.
• Procedure Add_as_Child(a,b) adds "a" to the list of children of "b".
• Function Origin(curve) returns the branch point from which "curve" branches out. For curves with no branch point at either end, it returns "null".

In Fig. 2(e) each branch point is placed as a child of the root of the tree. Although this is not actually a tree, we represent it as one in order to have a unique and integrated structure for a shape. As seen in Fig. 2(e), each vector describes a skeletal curve. Therefore, we have a velocity vector for each leaf of the tree.
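The three extraction rules can be sketched as a small routine. The representation is an assumption made for illustration: branch points and curves are identified by strings, the origin of each curve is given in a dictionary (with None standing for "null"), and a curve with branch points at both ends is passed in twice under two names, once per origin. The function name `build_tree` and the labels in the example are hypothetical.

```python
def build_tree(branch_points, curve_origins):
    """Build the tree-like structure as a dict: node -> list of children.

    branch_points: identifiers of the branch points of the skeleton.
    curve_origins: maps each curve identifier to its origin branch point,
                   or to None if the curve has no branch point at its ends.
    """
    tree = {"root": list(branch_points)}          # rule 1
    for bp in branch_points:
        tree[bp] = []
    dummy_count = 0
    for curve, origin in curve_origins.items():
        if origin is None:                        # rule 3: dummy branch point
            dummy_count += 1
            origin = f"dummy_{dummy_count}"
            tree["root"].append(origin)
            tree[origin] = []
        tree[origin].append(curve)                # rule 2 (and end of rule 3)
    return tree


# Two branch points joined by one curve, which is therefore listed twice.
skeleton_tree = build_tree(
    ["b1", "b2"],
    {"c1": "b1", "c2": "b1", "c3_from_b1": "b1", "c3_from_b2": "b2", "c4": "b2"},
)
print(skeleton_tree)

# A shape whose skeleton is a single curve without any branch point.
print(build_tree([], {"c": None}))
```

In the full descriptor each leaf would additionally carry the velocity vector of its curve; the sketch keeps only the identifiers to show the shape of the structure.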
Fig. 2. (a) Shape of a dog. (b) Boundary and skeleton of the shape. (c) The skeleton with labeled skeletal curves. (d) Decomposed medial axis into components consisting of a branch point and some skeletal curves branching out from it. (e) Extracted tree-like structure from the skeleton. Figure (a-e) shows steps for extracting tree-like structure from a given shape.
4 Comparison and Matching of the Structures

As mentioned earlier, the descriptor structure has two levels: a coarse-grain level and a fine-grain level. Comparison and matching of the structures is therefore performed at both levels: at the fine-grain level the velocity vectors are matched, and at the coarse-grain level subtrees and branch nodes are matched.

4.1 Matching the Velocity Vectors and Computing Fine-Grain Level Distances

Before any matching, all velocity vectors must be normalized. The goal of this normalization is to abstract the velocity vectors into attributed strings that capture their characteristics. In this normalization phase, the values of a vector V are mapped to elements of the set {+, 0, −}. Consecutive values of the velocity vector are grouped together depending on whether they are positive, negative or equal to zero, and an attributed string is generated; the grouping is thus based on the sign function. Each element of the attributed string has three attributes: symbol (+, − or 0), length (the number of values corresponding to the element) and sum (the sum of those values). As a result, for each skeletal curve and velocity vector we obtain an attributed string. We call these attributed strings "velocity strings". For matching and computing the distance (as a difference) between two velocity strings, dynamic programming is used. In this matching process, the three attributes must be taken into account. We define three edit operations in our dynamic
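$$
\mathrm{Cost}(i,j) =
\begin{cases}
\min\left\{
\begin{array}{l}
\mathrm{Delete}(i):\ \mathrm{Cost}(i, j-1) + \dfrac{i.\mathrm{length}}{U.\mathrm{length}} + \dfrac{i.\mathrm{sum} / i.\mathrm{length}}{\sum_{t=U.1}^{U.\mathrm{length}} \left( t.\mathrm{sum} / t.\mathrm{length} \right)} \\[2ex]
\mathrm{Insert}(i):\ \mathrm{Cost}(i-1, j) + \dfrac{i.\mathrm{length}}{U.\mathrm{length}} + \dfrac{i.\mathrm{sum} / i.\mathrm{length}}{\sum_{t=U.1}^{U.\mathrm{length}} \left( t.\mathrm{sum} / t.\mathrm{length} \right)} \\[2ex]
\mathrm{Change}(i,j):\ \mathrm{Cost}(i-1, j-1) + \left( \mathrm{Delete}(i) + \mathrm{Insert}(j) \right)
\end{array}
\right\}
& \text{if } i.\mathrm{symbol} \neq j.\mathrm{symbol} \\[6ex]
\mathrm{Cost}(i-1, j-1) + \dfrac{i.\mathrm{length} - j.\mathrm{length}}{\max\{ i.\mathrm{length},\ j.\mathrm{length} \}} + \dfrac{\left( \dfrac{i.\mathrm{sum}}{i.\mathrm{length}} \right) - \left( \dfrac{j.\mathrm{sum}}{j.\mathrm{length}} \right)}{\max\left\{ \dfrac{i.\mathrm{sum}}{i.\mathrm{length}},\ \dfrac{j.\mathrm{sum}}{j.\mathrm{length}} \right\}}
& \text{if } i.\mathrm{symbol} = j.\mathrm{symbol}
\end{cases}
$$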
Fig. 3. Definition of the edit operations delete, insert and change. i,j are elements in the velocity strings U,V respectively and indicate corresponding groups of the values in their velocity vectors. Note that in the case of matching two identical symbols of two strings, a little cost is associated to them.
programming method with respect to the attributes of the elements. These edit operations are delete, insert and change, and a cost is assigned to each operation. The edit operations are applied to one of the two strings to transform it into the other. The sum of the costs of the edit operations used for the transformation specifies the distance between the two velocity strings. Fig. 3 illustrates the edit operations and their costs. Hence the distance between two velocity strings U, V is the minimum cost of the edit operations that transform one of them into the other, computed using dynamic programming. We denote this cost by Distance(U,V).

4.2 Matching Descriptor Structures in the Coarse-Grain Level

The distance between two tree-like structures T1, T2, extracted from skeletons S1, S2, is defined as the sum of the minimum distances between each two corresponding nodes and is denoted by Distance(T1,T2). To compute the distance between two tree-like structures, we use a top-down approach based on a greedy algorithm: for each branch node (with its subtree) of the tree T1, we find a branch node of the tree T2 such that the distance between the two branch nodes and their subtrees is minimal. Next, the two matched nodes and their children are deleted from the trees and the distance between them is taken as the cost of the match. Finally, the sum of these costs gives Distance(T1,T2). As a result we have relation (1):

$$
\mathrm{Distance}(T_1, T_2) = \sum_{a \in T_1,\, b \in T_2} \min\{\mathrm{Distance}(a, b)\} \tag{1}
$$

where a and b are branch nodes. It remains to define the distance between two branch nodes. The distance between two branch nodes a and b is defined as the sum of the minimum distances between each two velocity strings and is denoted by Distance(a,b). Similarly, a greedy approach is used to find the corresponding velocity strings of the two branch nodes. So we have relation (2) for computing the distance between two branch nodes:

$$
\mathrm{Distance}(a, b) = \sum_{u \in a,\, v \in b} \min\{\mathrm{Distance}(u, v)\} \tag{2}
$$

In relation (2), Distance(u,v) is the distance between two velocity strings u and v computed in Section 4.1.
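Two ingredients of the matching process can be sketched briefly: the normalization of a velocity vector into a velocity string (Section 4.1) and the greedy pairing used in relations (1) and (2). The sketch is hedged in several ways. The tolerance `tol` for treating a value as zero is an assumption, the pairwise distance is passed in as a function so that the dynamic-programming cost of Fig. 3 is not reproduced here, and items of the second collection left unmatched are simply ignored, which the text does not specify. The function names are hypothetical.

```python
def velocity_string(vector, tol=1e-9):
    """Group consecutive same-sign values into (symbol, length, sum) elements."""
    elements = []
    for value in vector:
        symbol = "+" if value > tol else "-" if value < -tol else "0"
        if elements and elements[-1][0] == symbol:
            _, length, total = elements[-1]
            elements[-1] = (symbol, length + 1, total + value)
        else:
            elements.append((symbol, 1, value))
    return elements


def greedy_match_cost(items_a, items_b, dist):
    """Greedy version of relations (1) and (2).

    For each item of A, take the closest remaining item of B, remove it,
    and add the distance of the pair to the total.
    """
    remaining = list(items_b)
    total = 0.0
    for a in items_a:
        if not remaining:
            break  # leftover items are ignored in this sketch
        j = min(range(len(remaining)), key=lambda k: dist(a, remaining[k]))
        total += dist(a, remaining.pop(j))
    return total


print(velocity_string([0.5, 0.25, 0.0, -0.5, -0.25]))

# Stand-in distance on plain numbers, only to show the greedy pairing.
print(greedy_match_cost([1, 5], [4, 2], lambda a, b: abs(a - b)))
```

The same `greedy_match_cost` routine can serve at both levels: with velocity strings and the fine-grain distance it plays the role of relation (2), and with branch nodes and Distance(a,b) it plays the role of relation (1). A greedy pairing depends on the order of the items and need not be optimal.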
5 Experimental Results The measure presented in the earlier sections for determining distance between two shapes, may be used for shape recognition and retrieval. To achieve this goal, we compare each given shape with shapes of a database. The shapes which have minimum distances with the given shape, are recognized as similar shapes and are retrieved from the database. In our experiments, we use two standard dataset: kimia image dataset containing about 4000 sample binary images and MPEG-7 dataset. First experiment is looking for a given shape in the dataset and retrieving the same or similar shapes. Fig. 4(a) illustrates a set of 56 sample shapes categorized in 7 groups. Each group contains 8 shapes of a similar class. For each group, the first shape at the left is the query shape and other shapes are seven closest matches from the dataset. As seen in this figure, our method calculates distances and classifies similar shapes properly. Also we compare our method with another skeleton-based method called “shock graph” [6] on the kimia and MPEG-7 shape datasets. In these experiments several groups of shapes in various categories in the presence of different transformations were evaluated. These transformations are scale transformation, rotation, occlusion and missing parts. Results of the comparisons in the presence of different transformations are represented in Fig. 4(b). As seen in Fig. 4(b) at point A, there is no transformation and consequently the recognition rates are high for both of the two methods. At point B, shapes are scale transformed; also the recognition rates for both methods are high. This is because both methods are based on the topology of the skeleton and any scale transformation does not change the topology, connections and structure of the skeleton. At point C, shapes were rotated. As seen at this point, the recognition rates are almost high. 
This is because like the previous case, the skeleton (its topology and structure) is invariant under rotation. In points D and E, shapes are evaluated in the presence of occlusion and missing parts, respectively. In these two cases, we observe a decrease in the recognition rates of the methods. This decrement for our proposed method, is because the skeleton of a shape is modified in the presence of occlusion and missing parts. In the two cases, some skeletal curves and sometimes, some branch points are added to / deleted from the skeleton, respectively. Although as seen in Fig. 4(b) this decrement is about 10 to 20 percent and our method is still performs superior. In the third experiment, we categorized shapes of MPEG-7 dataset into three categories animals, plants and objects and we evaluated recognition rate of our
740
H. Zaboli, M. Rahmati, and A. Mirzaei
(a)
(b)
Fig. 4. (a) 56 shapes classified into 7 classes, each class contains 8 shapes, retrieved from the kimia and MPEG-7 datasets. (b) Recognition rates of our method and shock graph method in the presence of various transformations. A: no transformation, B: scale- transformation, C: rotation, D: occlusion, E: missing parts. Our method is denoted by '+'.
method. In this experiment, shapes are randomly chosen from the dataset and then are classified into the three categories. Fig. 5 shows first nine random shapes selected from the dataset which are categorized properly into the three categories. We evaluated all shapes of the dataset and categorized them using our method. Table 1 summarizes the results of this evaluation and shows recognition rate of the method for recognizing shapes of each class. As seen in the table 1, highest recognition rate belongs to class of animals with 88.63 and then class of objects and plants. High recognition rate of the method for animals is due to the structure of the skeleton of the animals are close to each other. Therefore recognition of the animals is simpler than other classes (like objects) which include wide range of the skeletal structures. Table 1. Results of classifying of the shapes of MPEG-7 dataset into the three classes animals, plants and objects. Recognition rate for animals class is higher than other classes.
Fig. 5. Nine randomly selected shapes from the MPEG-7 dataset. Our method properly classified the shapes into the three classes.
6 Conclusion and Future Work
In this paper, we proposed a new skeleton-based method for the shape recognition and retrieval problem. In this method we introduced the “velocity function” as the first
order differential of the radius function to describe the skeletal curves of the different parts of a shape. To determine the distance between two given shapes, the tree-like structures were matched at two levels, a coarse-grain level and a fine-grain level, based on a greedy approach and a dynamic programming method, respectively. Experiments confirm the high performance of our method in the presence of different transformations. Although we used a greedy approach for the matching process, other optimized algorithms, e.g. the Hungarian algorithm [15], may be used for the matching.
References
1. He, L., Han, C.Y., Everding, B., Wee, W.G.: Graph matching for object recognition and recovery. Pattern Recognition 37, 1557–1560 (2004)
2. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37, 1–19 (2004)
3. Kim, D.H., Yun, I.D., Lee, S.U.: A new shape decomposition scheme for graph-based representation. Pattern Recognition 38, 673–689 (2005)
4. Saber, E., Xu, Y., Tekalp, A.M.: Partial shape recognition by sub-matrix matching for partial matching guided image labeling. Pattern Recognition 38, 1560–1573 (2005)
5. Yu, L., Wang, R.: Shape representation based on mathematical morphology. Pattern Recognition Letters 26, 1354–1362 (2005)
6. Siddiqi, K., Shokoufandeh, A., Dickinson, S.J., Zucker, S.W.: Shock Graphs and Shape Matching. International Journal of Computer Vision 35(1), 13–32 (1999)
7. Siddiqi, K., Kimia, B.B.: Toward a Shock Grammar for Recognition. Technical Report LEMS 143, LEMS, Brown University (1995)
8. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys (CSUR) 33(1), 31–88 (2000)
9. Sebastian, T.B., Klein, P.N., Kimia, B.B.: Recognition of Shapes by Editing Shock Graphs. ICCV 2001, pp. 755–762 (2001)
10. Lam, L., Lee, S.-W., Suen, C.Y.: Thinning Methodologies – A Comprehensive Survey. IEEE Trans. PAMI 14(9) (1992)
11. van Eede, M., Macrini, D., Telea, A., Sminchisescu, C., Dickinson, S.: Canonical Skeletons for Shape Matching. ICPR 2, 64–69 (2006)
12. Dimitrov, P., Phillips, C., Siddiqi, K.: Robust and Efficient Skeletal Graphs. ICCV 1, 417–423 (2000)
13. Blum, H.: A Transformation for extracting new descriptors of Shape. In: Wathen-Dunn, W. (ed.) pp. 362–380. MIT Press, Cambridge (1967)
14. Pelillo, M., Siddiqi, K., Zucker, S.W.: Matching Hierarchical Structures Using Association Graphs. IEEE Trans. PAMI 21(11) (1999)
15. Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5(1), 32–38 (1957)
Improved Morphological Interpolation of Elevation Contour Data with Generalised Geodesic Propagations
Jacopo Grazzini and Pierre Soille
Spatial Data Infrastructures Unit, Institute for Environment and Sustainability, European Commission - DG Joint Research Centre, via E. Fermi 1, TP 262, I-21020 Ispra, Italy
{jacopo.grazzini,pierre.soille}@jrc.it
Abstract. The objective of this work is to show how geodesic propagation techniques known in mathematical morphology can be employed to reconstruct terrain surfaces, and more generally grid data, from contour maps. We present the so-called generalised geodesic distance that was briefly introduced in a previous article. The shortest paths defined by the generalised geodesic distance can be used to linearly interpolate a given set of contour lines. We extend this concept by incorporating the knowledge of slope values over the contours and perform either Hermite interpolation with this distance or linear interpolation, but with a new weighted geodesic propagation function. Experiments are carried out over synthetic elevation data for which they result in nicely interpolated surfaces. The methodology is generic in the sense that it could be used to reconstruct any 2D or 3D digitised shape from its boundaries.
1 Introduction
Interpolation of grid-based contour data for the reconstruction of 2D or 3D surfaces is a common issue in image processing and computer graphics, and arises primarily in applications related to medical imaging, digitisation of objects and geographic information systems [1]. It has in particular been intensively studied for terrain modelling and analysis, as interpolation is often necessary to recover terrain surfaces from topographic contour maps [2,3]. Owing to their simplicity, contour maps have been employed as one of the major forms for recording terrain information. Interpolating the elevation values between the contour lines makes it possible to reconstruct numerical representations of the Earth’s surface known as Digital Elevation Models (DEMs): a DEM is a dense, spatially referenced grid where each point represents a ground height value [4]. Although DEMs can now be directly derived using remote sensing, a great deal of the elevation measurements of the Earth’s surface are still in the form of topographic maps. Indeed, for historical landscapes, they are the only available source of information. Moreover, contour maps enable the retrieval of the true terrain elevation and not the elevation of (natural or man-made) objects above the ground as measured by satellites. In addition, with the increasing availability of high-resolution elevation data for larger areas, low-cost representation and storage of DEMs with sparse data like contours are also preferred [5]. Therefore, there is still a need for generating terrain surfaces from contours.
The usual process for deriving a DEM from grid-based contour lines involves interpolating the elevation value of every pixel in the surface as a function of the values of neighbouring pixels where contours exist [2]. Various algorithms have been proposed for this task in the literature; they consist either in interpolating onto a regular grid or in constructing a triangulation of points on the contour lines (see [3] for an in-depth review). Most of them have been unsatisfactory, as they either do not preserve minor ridges and valleys, or are poor at modelling slopes. The approach we propose in this article continues previous efforts in the field of mathematical morphology (MM) [6,7], as it is intended to preserve the geomorphological properties of the terrain surface. It is interesting for a variety of reasons. First, MM is generally well suited to the processing of elevation data [4]: this is not surprising because in MM, any image is viewed as a topographic surface, the greylevel of a pixel standing for its height. Second, the newly introduced geodesic propagations tend to respect the natural topography of terrain surfaces and mimic, in a way, the hydrological runoff processes [8]. Third, our interpolation incorporates higher-order information regarding terrain slopes, allowing for the generation of smooth realistic surfaces [9]. Finally, the technique enables efficient numerical implementations [4]. The DEMs reconstructed by the proposed methods show noticeable improvements over previous works.
The rest of the paper is organised as follows. In Sec. 2, some background notions and seminal works on contour interpolation are presented. The methodologies using generalised geodesy are detailed in Sec. 3. Experiments are carried out on synthetic data and compared with some well-established methods in Sec. 4. Conclusions, including hints about other possible applications, are given in Sec. 5.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 742–750, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Background Notions
2.1 Formulation of the Contour Interpolation Problem
The problem of contour interpolation can be stated as follows [10]. Given a pair of adjacent contours $C_l$, $C_u$ with elevations $h_l = h_{C_l} < h_u = h_{C_u}$, delimiting a connected inner-space $\Omega$, we aim at reconstructing a surface $h$ over the image space that verifies:
\[ h(p) = h_l \ \text{if } p \in C_l, \qquad h(p) = h \in [h_l, h_u] \ \text{if } p \in \Omega, \qquad h(p) = h_u \ \text{if } p \in C_u \tag{1} \]
so that no artificial oscillations are created between $C_l$ and $C_u$, which are again the contour lines of $h$ for the elevations $h_l$ and $h_u$. Eq. (1) can naturally be extended to the case of $n > 2$ contours $C_i$, with elevations $h_i$, partitioning the plane into $n + 1$ disjoint open sets $\Omega_i$ (Fig. 1, left: a synthetic map with $n = 10$
Fig. 1. Left: synthetic image displaying contour lines; the elevations of the contours are represented in the greyscale range [0,255]. Right: corresponding complemented Euclidean distance field from the contour lines.
contour lines on 7 levels). The special case of summits (i.e. areas with $C_u = \emptyset$) or pits ($C_l = \emptyset$) is not discussed in the following, but can be easily treated by extrapolation from contours [10].
2.2 Univariate Approaches
Following concepts known in hydrology and considering the importance of runoff processes in the topographic evolution of geologic surfaces [8], some techniques attempt to derive DEMs using an appropriate interpolation function along straight lines in the direction of the steepest slope. Indeed, given a point $p$ with elevation $h$, its closest point $q_l$ on the contour $C_l$ with elevation $h_l < h$ can be reached by descending at each step in the direction given locally by the steepest slope; similarly for the closest point $q_u \in C_u$ [10]. Considering the piecewise linear segment $(q_l, p, q_u)$, the estimation of the elevation $h(p)$ can be reduced to a univariate interpolation problem between $(q_l, h_l)$ and $(q_u, h_u)$. In this context, linear interpolation provides the simplest univariate scheme [11]:
\[ h(p) = \frac{h_l \, d(p, C_u) + h_u \, d(p, C_l)}{d(p, C_l) + d(p, C_u)} \qquad \forall p \in \Omega \tag{2} \]
with $d(p, C_L) = \min_{q \in C_L} \|p - q\| = \|p - q_L\|$ the distance (to be defined, e.g. standard Euclidean distance) from $p$ to $C_L$, $L \in \{l, u\}$. Alternatively, steepest slopes can be explicitly estimated. In [10], lower and upper slope functions $s_l$ and $s_u$ are diffused from their estimated values on the contours to the inner-space regions. Introducing $w_L(p) = d(p, C_L) + t_L \, d(p, C_{L^c})$ where $t_L = s_L(p) \, (d(p, C_l) + d(p, C_u)) / (h_u - h_l)$, for $L \in \{l, u\}$ and $L^c \in \{l, u\} \setminus \{L\}$, the height at any $p \in \Omega$ is computed through a monotone Hermite interpolation:
\[ h(p) = \frac{h_u \, d(p, C_l) \, w_l(p) + h_l \, d(p, C_u) \, w_u(p)}{d(p, C_l) \, w_l(p) + d(p, C_u) \, w_u(p)} \qquad \forall p \in \Omega \tag{3} \]
Fig. 2. From left to right: Euclidean, generalised and weighted generalised distances $d(p, C_l)$, $\tau_f(p, C_l)$, and $\tau_g(p, C_l)$, resp., from lower to upper contours, computed over the contour map of Fig. 1.
with $d(p, C_L)$ as before. Other authors propose to perform higher-order interpolation along the steepest slope path, but this generally leads to height values outside the valid interval $[h_l, h_u]$, i.e. not fulfilling Eq. (1).
2.3 Morphology-Based Methods
MM provides a good theoretical framework for terrain modelling and analysis [4] and is very well suited to the generation of grid DEMs from discrete contours. A classical approach consists in creating new intermediate contours between each pair of adjacent contour lines using basic morphological transforms: in [12], dilation and erosion operators are recursively applied, while in [13] the skeleton extraction technique is used. However, the continuity of surfaces between the inner areas is in general not considered. Another family of MM algorithms is based on front propagations of Euclidean or geodesic distances. In [6], linear interpolation is performed along the shortest geodesic paths going through the interpolated pixels: in practice, Eq. (2) is applied with the geodesic distance $d_\Omega(p, C_L)$ from $p$ to $C_L$ (Fig. 2, left), i.e. the minimal geometric length of a path $(q_L, p)$ linking $p$ and a point $q_L \in C_L$ and entirely included in $\Omega$ [14], instead of the Euclidean distance $d(p, C_L)$. This is a fast method, but the created DEMs are not smooth on contour lines. Moreover, the approximated steepest slope path $(q_l, p, q_u)$ tends to stick to the concavities of the contour lines rather than following valley lines. In [15], the notion of generalised geodesic time was suggested in order to alleviate this problem. We propose to adopt and extend this latter approach to improve the interpolation method.
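The two univariate schemes of Sec. 2.2 can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names `linear_height` and `hermite_height` are hypothetical, and the distances passed in may be Euclidean or geodesic, as discussed above.

```python
def linear_height(h_l, h_u, d_l, d_u):
    """Eq. (2): linear interpolation between the lower and upper contours.

    d_l and d_u are the distances from p to C_l and C_u.
    """
    return (h_l * d_u + h_u * d_l) / (d_l + d_u)


def hermite_height(h_l, h_u, d_l, d_u, s_l, s_u):
    """Eq. (3): slope-aware interpolation using the weights w_l and w_u."""
    scale = (d_l + d_u) / (h_u - h_l)
    t_l, t_u = s_l * scale, s_u * scale
    w_l = d_l + t_l * d_u  # w_L = d(p, C_L) + t_L * d(p, C_{L^c})
    w_u = d_u + t_u * d_l
    return (h_u * d_l * w_l + h_l * d_u * w_u) / (d_l * w_l + d_u * w_u)


# A point one unit from the lower contour (h = 10) and three from the upper (h = 20).
print(linear_height(10.0, 20.0, 1.0, 3.0))  # 12.5
```

When both slopes equal the slope of the straight profile, $(h_u - h_l)/(d(p, C_l) + d(p, C_u))$, we get $t_l = t_u = 1$ and $w_l = w_u$, so Eq. (3) reduces to Eq. (2). Both formulas also return $h_l$ on $C_l$ and $h_u$ on $C_u$, as required by Eq. (1).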
3 Interpolation with Generalised Geodesic Propagations
3.1 Geodesic Time Function
Recall [14] that the geodesic distance between two points of a connected set is the length of the shortest path(s) linking these points and remaining in
Fig. 3. Influence zones of the lower contours $C_l$ within the upper connected inner-space $\Omega$ (left); labels were randomly assigned. Estimated upper slopes $s_l$ over the whole image (right).
the set. This idea has been generalised to grey tone geodesic masks in [7]. The time $\tau_f(P)$ necessary for travelling on a path $P$ defined on a discrete greyscale image $f$ (the geodesic mask) is computed as the sum of the intensity values of the points on the path. The geodesic time $\tau_f(p, q)$ separating two points $p$ and $q$ of the geodesic mask $f$ is then equal to the smallest amount of time over all paths linking these points and remaining within the definition domain of $f$. This definition can again be easily extended [7] to the geodesic time $\tau_f(p, Y)$ between a point $p$ and a reference set $Y$. Algorithms based on priority queue data structures enable an efficient implementation of geodesic time [4,6].
3.2 Linear Interpolation Using Geodesic Time
A new geodesic distance was suggested in [15] in order to avoid artifacts occurring in interpolated surfaces when geodesic paths become tangent to the contours. It was proposed to use the geodesic time with the complement of the Euclidean distance to the contours as the geodesic mask. The process starts with the creation, on a regular grid, of the Euclidean distance field from the contour lines, followed by its complementation over $\Omega$ (Fig. 1, right):
\[ f(p) = [d(p, C_l \cup C_u)]^c = \max_{x \in \Omega} \{ d(x, C_l \cup C_u) \} - d(p, C_l \cup C_u) \qquad \forall p \in \Omega. \tag{4} \]
The geodesic-time paths defined with the previous mask $f$ follow low Euclidean distance values (Fig. 2, middle). Univariate interpolation is then performed, as in [6], along these paths: Eq. (2) is used for reconstructing the unknown surface $h$, the distances $d(p, C_l)$ and $d(p, C_u)$ being replaced with the lower and upper geodesic times, resp. $\tau_f(p, C_l)$ and $\tau_f(p, C_u)$. As with standard geodesic interpolation, flat peaks and troughs can be avoided provided that the elevation and location of the peaks and troughs are mapped in the image of contour lines [4].
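The pipeline of this section (distance field, complementation, geodesic time, then Eq. (2)) can be sketched on a toy grid. This is only an illustrative sketch under stated assumptions: the helper names are hypothetical, paths are 4-connected, a path pays the mask value of every pixel it enters while contour pixels cost nothing, and the Euclidean distance is computed by brute force rather than with a distance transform.

```python
import heapq
import math


def geodesic_time(mask, sources):
    """Smallest sum of mask values over 4-connected paths from the source set.

    mask maps each inner-space pixel (row, col) to a non-negative cost.
    Assumption: a path pays the cost of every pixel it enters, and the
    source (contour) pixels themselves cost nothing.
    """
    best = {}
    heap = [(0.0, q) for q in sources]
    heapq.heapify(heap)
    while heap:
        t, (r, c) = heapq.heappop(heap)
        if best.get((r, c), math.inf) < t:
            continue  # stale queue entry
        for n in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if n in mask and t + mask[n] < best.get(n, math.inf):
                best[n] = t + mask[n]
                heapq.heappush(heap, (best[n], n))
    return best


def complement_mask(inside, contours):
    """Eq. (4): f(p) = max_x d(x) - d(p), d being the distance to the contours."""
    d = {p: min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in contours)
         for p in inside}
    d_max = max(d.values())
    return {p: d_max - d[p] for p in inside}


# Toy map: lower contour in column 0, upper contour in column 4.
rows = range(3)
c_low = [(r, 0) for r in rows]
c_up = [(r, 4) for r in rows]
inside = [(r, c) for r in rows for c in (1, 2, 3)]
h_low, h_up = 0.0, 30.0

f = complement_mask(inside, c_low + c_up)
tau_low = geodesic_time(f, c_low)
tau_up = geodesic_time(f, c_up)

# Eq. (2) with the Euclidean distances replaced by the geodesic times.
height = {p: (h_low * tau_up[p] + h_up * tau_low[p]) / (tau_low[p] + tau_up[p])
          for p in inside}
```

The priority queue makes this a Dijkstra-style propagation, in line with the queue-based algorithms mentioned in Sec. 3.1. Note that an inner-space only one pixel wide would give a mask that is zero everywhere and hence a zero denominator in Eq. (2); a practical implementation would need to guard against that case.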
3.3 Hermite Interpolation Using Geodesic Time
Recent methods [10,16,17] showed evidence of the need for incorporating second-order information in the interpolation process, mainly through the estimation of slopes based on the contour lines (see also Sec. 2.2). We propose to adopt an approach somewhat similar to that of [10]: slopes are first computed over the existing isolines, and then estimated over the inner-space using geodesic distances and geodesic influence zones [18]. Indeed, the computation of the distance field in Sec. 3.2 also makes it possible to know, for each pixel $p \in \Omega$, which is (are) the closest pixel(s) $q_l$ on the contour $C_l$. Reciprocally, for every pixel $q_l$ of the contour $C_l$, it is possible to know which inner pixels $p \in \Omega$ are closer to it than to any other pixel of $C_l$. Namely, the set of such pixels $p$ constitutes the (upper) influence zone of $q_l$ inside the inner-space $\Omega$ (Fig. 3, left). Similarly, for a pixel $q_u \in C_u$, it is possible to determine its lower influence zone inside $\Omega$. Therefore, after computing the geodesic distance fields $\tau_f$, one-sided slopes $s_l(q_l)$ and $s_u(q_l)$ can be estimated at every pixel $q_l \in C_l$ taking into account the nearest lower and upper contours of $C_l$, and similarly for every $q_u \in C_u$. For any $p \in \Omega$ the slopes $s_L(p)$ are then defined as a weighted arithmetic mean of the slopes at its closest points $q_l \in C_l$ and $q_u \in C_u$:
\[ s_L(p) = \frac{\tau_f(p, C_l) \, s_L(q_u) + \tau_f(p, C_u) \, s_L(q_l)}{\tau_f(p, C_l) + \tau_f(p, C_u)} \qquad \forall p \in \Omega \tag{5} \]
for $L \in \{l, u\}$. Lower and upper slope grids of $s_l$ and $s_u$ are then generated over the image space (Fig. 3) and the elevation of the unknown surface can be computed using the Hermite interpolation of Eq. (3), as in [10].
3.4 Linear Interpolation Using Weighted Geodesic Time
Instead of using local slopes for performing a higher-order interpolation scheme (see Sec. 3.3), we also propose another approach that consists in incorporating the knowledge about derivatives directly into the computation of the distance fields. Namely, using the lower and upper slopes (Fig. 3, right) defined in Eq. (5), we first introduce the weighted geodesic distance fields:
\[ g_L(p) = s_L(p) \cdot f(p) = s_L(p) \cdot [d(p, C_l \cup C_u)]^c \qquad \forall p \in \Omega \tag{6} \]
for $L \in \{l, u\}$, and then define the propagation function $\tau_{g_L}$ as the time function computed using $g_L$ as a geodesic mask. The idea is to propagate faster from the contour pixels where the slope is steep (Fig. 2, right). Finally, as in Sec. 3.2, the reconstruction of the unknown surface is performed with a linear interpolation using the propagation fields $\tau_g(\cdot, C_L) = \tau_{g_L}$.
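As a small illustration of how Eqs. (5) and (6) fit together, the sketch below computes an interpolated slope and the corresponding weighted mask. The function names and the dictionary-based pixel representation are assumptions made for this example; the resulting mask would then be passed to whatever geodesic-time routine is in use.

```python
def interpolated_slope(tau_l, tau_u, s_at_ql, s_at_qu):
    """Eq. (5): weighted mean of the slopes at the closest contour points.

    tau_l and tau_u are the geodesic times from p to C_l and C_u.
    """
    return (tau_l * s_at_qu + tau_u * s_at_ql) / (tau_l + tau_u)


def weighted_mask(slope, f):
    """Eq. (6): g_L(p) = s_L(p) * f(p) for every inner-space pixel p."""
    return {p: slope[p] * f[p] for p in f}


# Hypothetical values for two inner-space pixels.
f = {(0, 1): 1.0, (0, 2): 0.5}
slope = {
    (0, 1): interpolated_slope(1.0, 3.0, 0.5, 1.5),
    (0, 2): interpolated_slope(2.0, 2.0, 0.5, 1.5),
}
g = weighted_mask(slope, f)
```

On $C_l$ itself the time $\tau_f(p, C_l)$ vanishes, so Eq. (5) returns the slope at $q_l$ unchanged, and symmetrically on $C_u$.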
4 Results and Discussion
We intend to evaluate the different proposed methodologies by visual inspection, which is generally "a good judge of an interpolation method" [19], as the desirable characteristics of an interpolation algorithm are not obvious, since
Fig. 4. From top down, left to right: synthetic contour map with elevation in the range [0, 255], excerpts (square box) of the hillshade representation of the surfaces interpolated using the methods described in [12], [6], Sec. 3.2 and Sec. 3.3 resp., and 3D representation of the whole surface interpolated using the method of Sec. 3.4
different quality criteria may conflict with each other. A synthetic contour map composed of polylines representing the isocontours was interpolated using the geodesic functions described in Sec. 3. It appears that the three proposed approaches perform better than the 'medial-line' technique of [12], which is also derived from MM and implemented in a commercial geospatial software (see Sec. 2.3), or than the original 'geodesic' algorithm [6] (see Sec. 3.2). In particular, the last two approaches (see Sec. 3.3 and 3.4) give rather good results, as they create smooth surface transitions at the contour boundaries (Fig. 4, bottom middle and right: the results being very similar, we present two different possible representations). Besides their efficiency, the algorithms lead to contour line interpolation without flat zones, whereas we can observe with the 'geodesic' and, above all, 'medial-line' methods that the surface between two contour lines has the tendency to drop quickly in a terraced effect (Fig. 4, top). Indeed, the additional information contained in the slopes ensures that the reconstructed DEM is continuous across the contours. The methods also faithfully reconstruct summits and pits (Fig. 4, bottom right). One of the major problems occurs on the medial axis, i.e. the set of points that have more than one closest point on $C_L$ (i.e. where the influence zones of two pixels from the same contour meet [18]), as the transition between slope values might be very strong on this line. However, for smooth terrains where the overall slope variations are not so strong, we
consider this rather an advantage, as it makes it possible to model ridges and valleys as sharp feature lines. We also observe that when computing the weighted generalised geodesic time, the crest lines are generally moved towards the contours of maximal slopes, which is a desirable property.
5 Conclusion
In this paper, we explore the use of generalised geodesic distance/propagation functions for terrain surface interpolation from contours. We adopt the morphological approach suggested in a former paper on this subject, and we propose some possible extensions allowing for improved surface reconstruction. The basic idea is to reduce the surface reconstruction problem to a univariate interpolation problem, assuming that the elevation varies monotonically along the geodesic paths. The originality lies in the definition of new generalised geodesic functions, to be combined with classical interpolation schemes. The proposed methodology is generic in the sense that the introduced concepts could be applied in other interpolation techniques where distance fields are required. One may also think of the more general problem of 3D reconstruction from 2D cross-sections. In addition, this opens new fields of application for geodesic distances/propagations. New operators in MM, based on adaptive structuring elements, could be defined using the proposed geodesic functions. They could also be useful in geospatial applications where special care needs to be given to the geomorphological properties of the underlying surface, e.g. for introducing new distances for kriging techniques of variables defined over a terrain surface.
References
1. Schafer, R., Rabiner, L.: A digital signal processing approach to interpolation. Proc. of IEEE 61(6), 692–702 (1973)
2. Gold, C.: Surface interpolation, spatial adjacency and GIS. In: Raper, J. (ed.) 3D Applications in GIS, pp. 21–35. Taylor & Francis, London, UK (1989)
3. Carrara, A., Bitelli, G., Carla, R.: Comparison of techniques for generating digital terrain models from contour lines. Int. J. Geog. Inf. Sc. 11, 451–473 (1997)
4. Soille, P.: Advances in the analysis of topographic features on discrete images. In: Margaria, T., Yi, W. (eds.) ETAPS 2001 and TACAS 2001. LNCS, vol. 2031, pp. 175–186. Springer, Heidelberg (2001)
5. Grazzini, J., Chrysoulakis, N.: Extraction of surface properties from a high accuracy DEM using multiscale remote sensing techniques. In: Proc. of EnviroInfo, pp. 352–356 (2005)
6. Soille, P.: Spatial distributions from contour lines: an efficient methodology based on distance transformations. J. Vis. Comm. Im. Repres. 2(2), 138–150 (1991)
7. Soille, P.: Generalized geodesy via geodesic time. Patt. Recogn. Lett. 15(12), 1235–1240 (1994)
8. Hutchinson, M.: Calculation of hydrologically sound digital elevation models. In: Proc. of ISSDH, pp. 117–133 (1988)
9. Kidner, D.: Higher-order interpolation of regular grid digital elevation models. Int. J. Rem. Sens. 24(14), 2981–2987 (2003)
10. Hormann, K., Spinello, S., Schröder, P.: $C^1$-continuous terrain reconstruction from sparse contours. In: Proc. of VMV, pp. 289–297 (2003)
11. Lee, J., Chung, S.: Reconstruction of 3D terrain data from contour map. In: Proc. of MVA, pp. 281–284 (1994)
12. Barrett, W., Mortensen, E., Taylor, D.: An image space algorithm for morphological contour interpolation. In: Proc. of Grap. Inter, pp. 16–24 (1994)
13. Thibault, D., Gold, C.: Terrain reconstruction from contours by skeleton construction. GeoInformatica 4, 349–373 (2000)
14. Lantuéjoul, C., Maisonneuve, F.: Geodesic methods in image analysis. Patt. Recogn. 17, 177–187 (1984)
15. Soille, P.: Generalized geodesic distances applied to interpolation and shape description. In: Serra, J., Soille, P. (eds.) Mathematical Morphology and its Applications to Image Processing, pp. 193–200. Kluwer Academic Publishers, Boston, MA (1994)
16. Chai, J., Miyoshi, T., Nakamae, E.: Contour interpolation and surface reconstruction of smooth terrain models. In: Proc. of IEEE Vis, pp. 27–33 (1998)
17. Dakowicz, M., Gold, C.: Extracting meaningful slopes from terrain contours. In: Proc. of ICCS, pp. 144–153 (2002)
18. Préteux, F.: Watershed and skeleton by influence zones: a distance-based approach. J. Math. Imag. Vis. 1, 239–255 (1992)
19. Gousie, M., Franklin, W.: Constructing a DEM from grid-based data by computing intermediate contours. In: Proc. of ISAGIS, pp. 71–77 (2003)
Cylindrical Phase Correlation Method
Jakub Bican and Jan Flusser
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod vodárenskou věží 4, 18208 Prague 8, Czech Republic
[email protected]
Abstract. We present a new image registration algorithm that employs the Phase Correlation Method (PCM) to align rotated and translated 3D images. First, we use PCM and a cylindrical coordinate transform to estimate the rotation angle that aligns two images that are mutually rotated around a known axis. An improvement of this technique is given to eliminate the influence of noise and image differences in non-ideal conditions. Finally, an iterative optimization procedure is proposed which uses these techniques in rigid body registration tasks. We study the performance of the algorithm in a simulated experiment on an MRI brain image.
1 Introduction
Image registration plays a major role in multiframe image processing. The purpose of image registration is to geometrically align two or more images differing by the imaging time, viewpoint, sensor modality and/or the subject of the images. Among the many areas where image registration is employed (such as remote sensing and computer vision), medical image processing is one of the most important. Image registration methods are usually classified into two main groups [14]. Feature-based methods incorporate a feature selection step to detect a set of control points, a feature matching step to find the correspondences between the two sets of control points, and a transform model estimation step to determine the parameters of the selected transformation from the correspondences. A very popular approach in medical image registration, especially in tomographic brain image registration, is an optimization scheme that aims to find an extremum of a similarity or dissimilarity measure on a multidimensional space of parameters of a selected transform model by some numerical optimization process. Methods based on mutual information are the state of the art among these approaches. Fourier methods form a special group of approaches based on the Phase Correlation Method (PCM). PCM was first introduced by Kuglin and Hines [10] as a fast and robust method for the estimation of inter-image shifts. The method was extended by De Castro and Morandi [2] to register translated and rotated images and later by Reddy and Chatterji [12] to register translated, rotated and scaled images. They use a log-polar transform of shift-invariant spectral magnitudes to turn rotation and scaling into translation, which is handled analogously by PCM.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 751–758, 2007. © Springer-Verlag Berlin Heidelberg 2007
This approach is not applicable to 3D image registration, as there is no mapping that converts rotations to translations in 3D. Keller et al. [9] introduced an algorithm for the registration of rotated and translated 3D volumes based on the pseudopolar Fourier transform. Their approach uses the pseudopolar representation of spectral magnitudes to find the rotation axis and estimate the rotation angle without using interpolation. Other authors aim to refine the precision of PCM to the subpixel level. Foroosh et al. [4] estimate subpixel shifts by analyzing the polyphase decomposition of the cross-power spectrum. In [13], Stone et al. first eliminate the effect of aliasing and then use a least-squares fit to 2-D phase difference data. As this is a difficult task, the authors of [5,6] separate the shift estimation in every dimension by SVD or high-order SVD of the cross-power spectrum. An improvement of the robustness of this approach is given in [8]. All of the approaches mentioned above estimate the rotations on the shift-invariant spectral magnitudes of the images. For images whose spectral magnitudes have little structure (e.g. many medical imaging modalities), we find this difficult and unreliable, as most of the information that is important for PCM (e.g. image edges) is lost when the image phase is discarded from the rotation estimation. In this paper, we present a new way of employing PCM to register 3D images. Our method is based on a cylindrical coordinate mapping of the image in the spatial domain and iteratively uses PCM to estimate the update of the rigid body transform with respect to some transform component: rotation or translation. We experimentally study the method’s performance in tomographic brain image registration tasks.
2 Notation and Definitions
Image is a real, discrete and bounded function. In image registration, the moving image $f_M$ is registered with the reference (or fixed) image $f_R$.
Discrete Fourier transform (or spectrum) of an image $f(x)$ is denoted $F(\omega) = \mathcal{F}\{f(x)\}$.
3D rotation rotates $\mathbb{R}^3$ around some central point. In this paper we represent rotations by Euler angles $R(\alpha, \beta, \gamma)$, by axis and angle $R_v(\alpha)$, or by a $3 \times 3$ matrix $M(R)$. Further representations and rotation properties can be found for example in [1,3] or any standard computer graphics textbook.
Rigid body transform $T$ can be viewed as a rotation $R$ followed by a translation $t$. In terms of the matrix representation, it can be implemented as one multiplication of a vector by the rotation matrix and one addition of the translation vector to the rotated result. Rigid body transforms (as well as rotations) can be composed to get new rigid body transforms: $(T_2 \circ T_1)(x) := M(R_2) M(R_1) x + M(R_2) t_1 + t_2$.
Cylindrical coordinate transform $\mathrm{Cyl}$ (defined by the origin and axis of a cylinder) assigns to each triplet of Cartesian coordinates $(x, y, z)$ a new triplet of cylindrical coordinates $(\alpha, r, z)$. The first cylindrical coordinate corresponds to the angle, the second to the radius of the cylinder and the third to the height of the cylinder.
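The composition rule for rigid body transforms can be checked numerically. The sketch below is illustrative only: a transform is stored as a hypothetical `(matrix, translation)` pair, and rotations about the z axis are used purely as convenient test matrices.

```python
import math


def rot_z(angle):
    """3x3 matrix of a rotation by `angle` (radians) about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply_transform(transform, x):
    """T(x) = M x + t for a transform stored as the pair (M, t)."""
    m, t = transform
    return [a + b for a, b in zip(mat_vec(m, x), t)]


def compose(t2, t1):
    """(T2 o T1)(x) = M2 M1 x + M2 t1 + t2."""
    (m2, s2), (m1, s1) = t2, t1
    return mat_mul(m2, m1), [a + b for a, b in zip(mat_vec(m2, s1), s2)]


t1 = (rot_z(0.3), [1.0, 2.0, 3.0])
t2 = (rot_z(-1.1), [0.5, 0.0, -2.0])
x = [0.2, -0.7, 1.5]
composed = apply_transform(compose(t2, t1), x)
sequential = apply_transform(t2, apply_transform(t1, x))
```

Up to rounding error, `composed` and `sequential` agree, which is exactly the statement of the composition formula: the translation of the first transform is rotated by the second rotation before the second translation is added.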
3 Phase Correlation Method
The phase correlation method [10] takes advantage of the Fourier shift theorem, which relates the phase information of the spectra of an image and its shifted copy. If $f_M$ is a shifted copy of $f_R$, then according to the shift theorem the Fourier transforms of the images are related as follows:
\[ \mathcal{F}^{-1}\left\{ \frac{F_M(\omega)}{F_R(\omega)} \right\} = \mathcal{F}^{-1}\left\{ e^{i(\omega_x \Delta x_x + \omega_y \Delta x_y + \omega_z \Delta x_z)} \right\} = \delta(x + \Delta x). \]
In practice (when $f_M$ is not exactly a shifted copy of $f_R$), the quotient of the spectra $F_M$ and $F_R$ is computed as
\[ \mathrm{Corr}(x) = \mathcal{F}^{-1}\left\{ \frac{F_M F_R^*}{|F_M| \, |F_R|} \right\}, \]
so that PCM computes the correlation of whitened images (images with $|F| = 1$). Thus, locating the peak in the correlation surface $\mathrm{Corr}$ yields the offset $\Delta x$ that can be used to align the two images at pixel level.
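A one-dimensional version of PCM can be written with a naive discrete Fourier transform. This is a sketch for illustration, not the authors' code: the function names are hypothetical, the shift is assumed to be circular, and the $O(n^2)$ transform stands in for the FFT that a real implementation would use.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (forward sign convention: exp(-i...))."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = [sum(signal[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n))
           for j in range(n)]
    return [v / n for v in out] if inverse else out


def phase_correlation(f_ref, f_mov):
    """Index of the peak of Corr = F^-1{ F_M F_R* / (|F_M| |F_R|) }.

    With this sign convention the peak sits at s when
    f_mov[k] == f_ref[(k - s) % n].
    """
    spec_ref, spec_mov = dft(f_ref), dft(f_mov)
    cross = []
    for a, b in zip(spec_mov, spec_ref):
        prod = a * b.conjugate()
        mag = abs(prod)
        cross.append(prod / mag if mag > 1e-12 else 0j)  # whitening
    corr = [v.real for v in dft(cross, inverse=True)]
    return max(range(len(corr)), key=corr.__getitem__)


ref = [0.0, 1.0, 3.0, 7.0, 2.0, 0.0, 0.0, 5.0]
mov = [ref[(k - 3) % len(ref)] for k in range(len(ref))]  # circular shift by 3
shift = phase_correlation(ref, mov)
```

Because the cross-power spectrum is normalised to unit magnitude, the inverse transform of an exact circular shift is a single unit impulse, which is why the peak is easy to locate. The sign of the recovered offset depends on the transform convention, so it is stated explicitly in the docstring.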
4 Cylindrical Phase Correlation Method
In this section, we propose new image registration algorithms. First, we describe a technique that uses PCM to find the rotation angle when the two volumes are just rotated around a known axis. Then, we give some improvements that increase the performance of the technique in non-ideal conditions. Finally, we use the technique in an iterative algorithm that registers volumes differing by a rigid body transform.
4.1 Finding Rotation with Known Axis
Now consider that the two 3D images $f_R$ and $f_M$ are related by a rotation. In the 2D case, it is possible to convert a rotation around some central point into a translation by polar-transforming the images with the origin of the polar coordinates located at the center of rotation. The angle of rotation can then be determined by using the 2D version of PCM [12] described above. The direct generalization of this approach to the 3D case using spherical coordinates does not work, as these coordinates do not convert a 3D rotation into a translation. Let us represent the rotation by an axis $v$ and an angle $\alpha$, and assume that the rotation axis $v$ is known. The angle $\alpha$ can be determined by first transforming the images to cylindrical coordinates with the axis of the cylinder aligned with $v$ and then using PCM to find the shift between the transformed images:
\[ \mathrm{offset} = \mathrm{PCM}\left( \mathrm{Cyl}_v(f_R), \mathrm{Cyl}_v(f_M) \right). \]
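The key property used here, that a rotation about the cylinder axis becomes a circular shift along the angular coordinate, can be checked on a single ring of samples. The sketch below rests on several assumptions: the axis $v$ is taken to be the z axis, the volume is a continuous function so that no interpolation is needed, and the rotation angle is a whole number of angular steps. All names are hypothetical.

```python
import math


def to_cylindrical(x, y, z):
    """Cartesian (x, y, z) to cylindrical (alpha, r, z) about the z axis."""
    return math.atan2(y, x) % (2 * math.pi), math.hypot(x, y), z


def sample_ring(volume, radius, z, n_angles):
    """Sample a continuous volume f(x, y, z) on one circle around the z axis."""
    step = 2 * math.pi / n_angles
    return [volume(radius * math.cos(i * step), radius * math.sin(i * step), z)
            for i in range(n_angles)]


def rotated(volume, angle):
    """Volume rotated by `angle` about the z axis: f_M(x) = f_R(R^-1 x)."""
    c, s = math.cos(angle), math.sin(angle)
    return lambda x, y, z: volume(c * x + s * y, -s * x + c * y, z)


def f_ref(x, y, z):
    return x + 2.0 * y * y + z  # arbitrary non-symmetric test volume


n, k = 12, 4  # 12 angular samples, rotation by 4 steps (120 degrees)
ring_ref = sample_ring(f_ref, 2.0, 0.5, n)
ring_mov = sample_ring(rotated(f_ref, k * 2 * math.pi / n), 2.0, 0.5, n)
max_error = max(abs(ring_mov[i] - ring_ref[(i - k) % n]) for i in range(n))
```

Here `max_error` is zero up to rounding, i.e. the ring of the rotated volume is the reference ring shifted circularly by `k` samples. A shift estimator such as PCM applied along the angular coordinate would therefore recover `k`, and hence the angle `k * 2 * pi / n`.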
J. Bican and J. Flusser
The rotation angle corresponds directly to the offset in the first (angular) cylindrical coordinate, and this is the only offset found by PCM, so that the rotation is $R_v(\mathrm{offset}_\alpha)$. In fact, this technique uses exactly the polar-mapping trick of [12] in every plane orthogonal to the rotation axis $v$, but in all these planes together. Moreover, in the angular direction it satisfies the assumption of the Fourier transform and PCM that the image can be periodically extended beyond its bounds.

4.2 Improving the Performance
The approach described in the previous section has two main drawbacks. The first is caused by performing the computations in the discrete domain: when computing the cylindrical transform of the images, it is necessary to use higher-order interpolation, because the cylindrical transform (like a polar transform) samples the space very non-uniformly. The second drawback is that the voxels of the original volume located near the axis of the cylinder have a much greater impact than the voxels located at the perimeter. If the angular and radial coordinates are sampled so that the perimeter of the cylinder is not subsampled and no information is lost, every voxel near the axis is stretched (or interpolated) into several voxels, while the voxels at the perimeter are resampled approximately one-to-one. Moreover, PCM gives the same significance to well-sampled voxels at the perimeter as to resampled voxels originating near the axis, which are also strongly affected by interpolation error.

These drawbacks led us to develop a technique that computes PCM separately for every layer of the cylinder defined by a fixed radius. Every such layer has a different angular resolution that suitably samples the original data: the layer at radius $r$ is sampled in the angular direction by $2\pi r$ samples, i.e. with a resolution (or spacing) of $\frac{2\pi}{2\pi r} = \frac{1}{r}$ radians. Corresponding layers from the reference and moving images are registered by PCM, which results in a correlation surface that gives a degree of match for each angle. The correlation surfaces from all layers are then combined into one, and the angle with the highest value in the combined correlation surface
Fig. 1. MRI brain image
is the resulting parameter of the registration. When combining, each surface is weighted by its radius, which corresponds to the number of pixels it contains. This gives each original voxel the same impact on the result.

function IRE(v, ReferenceImage, MovingImage)
    rmax := "maximal radius of data"
    height := 2*rmax
    GlobalCorrelation := zeros(floor(2*pi*rmax), height)
    for r = 1:rmax
        ReferenceLayer := Cyl(v, ReferenceImage)[r,:,:]
        MovingLayer := Cyl(v, MovingImage)[r,:,:]
        LayerCorrelation := PCM(ReferenceLayer, MovingLayer)
        GlobalCorrelation += r*resize(LayerCorrelation, size(GlobalCorrelation))
    end
    offset := find_max(GlobalCorrelation)
    IRE := offset[0]
end

The inputs to the procedure are the two images and the rotation axis v. When resampling the correlation surface, we use nearest-neighbour interpolation, because the interpolation error is largest for layers close to the cylinder's axis, which have only a small impact due to the weighting of the layers.

4.3 Rigid Body Registration
A rigid body transform is a transform that combines rotations and translations. Finding the optimal parameters of a rigid body transform (six parameters in 3D) is a very common task in image registration [14] (intra-subject studies, multimodal registration, etc.). As mentioned in the introduction, there is a class of registration methods that employ a numerical optimization process to find the optimum of a similarity measure over the parameter space of a transformation model. Our algorithm uses the procedures described above to find the parameters of a rigid body transform such that the PCM metric (the correlation of whitened images) reaches its maximum.

The optimization runs in iterations. Each iteration aims to improve the metric with respect to some subset of the parameters. Such optimization resembles some well-known optimizers (e.g. Powell's direction set method [11]) and is sometimes called alternating optimization. We start with the identity transform $T := \mathrm{id}$ and a set of three linearly independent axes; $x$, $y$ and $z$ are a good choice. The following iterations are repeated to compute the transform update $T_{upd}$. Odd iterations compute PCM to estimate the shift between the reference volume $f_R$ and the moving volume transformed by the current transform, $T(f_M)$:

$$T_{upd} := \mathrm{PCM}(f_R, T(f_M)).$$
Even iterations estimate the rotation component with respect to one of the axes by the procedure above:

$$T_{upd} := \mathrm{IRE}_{\{x|y|z\}}(f_R, T(f_M)).$$

The axes alternate cyclically as the algorithm advances, so that, for example, iterations 2, 4, 6, 8, 10, ... use the axes x, y, z, x, y, ..., respectively. After each iteration, the current transform $T$ is updated by the new $T_{upd}$:

$$T = T \circ T_{upd}.$$

This iterative process is stopped if no non-zero update was found in the last six iterations (no transform parameter can be optimized further), if the maximum number of iterations is reached (time limit), or if the current result is good enough (e.g. the algorithm is stopped by the operator).

The convergence of our algorithm faces problems similar to those of usual optimization techniques. It is not guaranteed that the optimum reached is the global optimum of the similarity metric, nor that it is approached within some well-defined time limit. In contrast to other methods, however, our method is optimal in every step (PCM and IRE find the global optimum with respect to the given parameter subset). Furthermore, in the case of PCM, the optimum is found over three shift parameters in one step. Hence, the method should be able to register images with low spatial correlation and with high initial misregistration, which are properties that commonly limit other methods. Again, this does not guarantee convergence to the global optimum.
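The layer-wise rotation estimate that drives the even iterations (procedure IRE of Sect. 4.2) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' ITK module: each layer is reduced to a single 1D ring of angular samples (the procedure in the paper uses 2D layers over angle and height), the rings are given directly instead of being resampled from a volume, each ring has 8r samples instead of about 2*pi*r so that the toy quarter turn is a whole number of samples, and all function names are hypothetical.

```python
import cmath
import math
import random

def dft(seq, inverse=False):
    """Naive O(N^2) discrete Fourier transform of a 1D sequence."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(seq[j] * cmath.exp(sign * 2j * cmath.pi * k * j / n)
                  for j in range(n))
        out.append(acc / n if inverse else acc)
    return out

def ring_correlation(ref_ring, mov_ring, eps=1e-12):
    """1D phase correlation surface of two equally long angular profiles."""
    f_ref, f_mov = dft(ref_ring), dft(mov_ring)
    cross = [m * r.conjugate() / (abs(m) * abs(r) + eps)
             for m, r in zip(f_mov, f_ref)]
    return [c.real for c in dft(cross, inverse=True)]

def estimate_rotation(ref_layers, mov_layers):
    """Layers are dicts mapping a radius to its list of angular samples."""
    n_global = max(len(ring) for ring in ref_layers.values())
    combined = [0.0] * n_global
    for r, ref_ring in ref_layers.items():
        corr = ring_correlation(ref_ring, mov_layers[r])
        for j in range(n_global):
            # Nearest-neighbour (floor) resize of the layer surface,
            # weighted by the radius of the layer.
            combined[j] += r * corr[j * len(corr) // n_global]
    best = max(range(n_global), key=combined.__getitem__)
    return 2 * math.pi * best / n_global

# Toy data (assumption): four rings, the moving rings are a quarter turn ahead.
rng = random.Random(1)
ref_layers = {r: [rng.random() for _ in range(8 * r)] for r in range(1, 5)}
mov_layers = {r: ring[-2 * r:] + ring[:-2 * r] for r, ring in ref_layers.items()}
angle = estimate_rotation(ref_layers, mov_layers)
print(math.degrees(angle))
```

Weighting by the radius lets the densely sampled outer rings dominate the combined surface, so the coarse resize of the inner rings has little influence on the location of the maximum.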
5 Experimental Results
In this section, we present one of our experiments that study the method's performance. We created an experimental implementation of the method in C++ as a module for the Insight Toolkit (ITK) [7]. We aim to study the influence of the initial misregistration level on the rigid body registration result and on the number of iterations needed for the algorithm to converge. We use CPCM (the cylindrical phase correlation method) to register a randomly rotated and shifted MRI brain image (Fig. 1) with the original. The volume size is 128×128×40 voxels with a voxel size of 1.8×1.8×4.58 mm. The degree of misregistration as well as the registration error is measured by a fixed set of eight points that uniformly sample the reference image's volume. The error (or misregistration) is then measured as the mean Euclidean distance of these points in the moving image to their original counterparts in the reference image. This can be understood as the mean distance of every point of the volume to its transformed counterpart. We continuously generated random transforms, so that there were at least one hundred different transforms for each 1 mm level of initial misregistration. For each misregistration level, the results are the mean values over all transforms that introduced a misregistration of that level. The graph in Fig. 2a shows
[Figure 2, two plots. (a) Mean Error [mm] / No. of Iterations versus Misregistration [mm]; curves: Reg. result (total), Reg. result (error < 10 mm), Iterations (total), Iterations (error < 10 mm); a data cursor marks X: 100, Y: 1.686. (b) % of Cases versus Misregistration [mm]; curves: Downgrades, Large error (error > 10 mm), Early stops.]
Fig. 2. Influence of initial misregistration on the mean error after registration (a), number of iterations (a) and the failure statistics (b)
two alternative views of the results. First, we kept only those results that successfully converged below some reasonable error (here 10 mm, explained below). The graph also plots values that include all results. Figure 2b shows the statistics of three kinds of failures:

1. The method converged, but the final error was larger than the initial misregistration (the alignment was downgraded).
2. The final misregistration error was larger than 10 mm (this also includes case 1 above).
3. The method reached the iteration limit (120 iterations) and was stopped (the error in this situation usually does not decrease below 10 mm, so it is included in case 2 too).

The results can be interpreted as follows: as long as the misregistration is below about 100 mm, the method converges to pixel-level precision with at least 90% reliability, and the number of iterations (i.e. time) grows approximately logarithmically with the misregistration. As the misregistration grows beyond 100 mm (which is approximately the radius of the volume), the failure rate increases and the method's performance decreases, mainly due to cases in which the method converged to some false position. We should point out that these results and trends do not depend on the specific value of the reasonable error mentioned above. We use a value of 10 mm, which is one order of magnitude higher than the pixel size and is still reasonably small, but we could use values of 5-45 mm without any significant effect on the graphs.
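The error measure used in this section (mean Euclidean distance of eight fixed points to their transformed counterparts) can be sketched in Python as follows. The placement of the eight points is an assumption made for illustration only (1/4 and 3/4 of each physical extent), since the text states just that they sample the volume uniformly; the function names and the restriction to a rotation about one axis are likewise hypothetical simplifications.

```python
import math

# Physical extent of the test volume in mm: 128 x 128 x 40 voxels
# with a voxel size of 1.8 x 1.8 x 4.58 mm.
EXTENT = (128 * 1.8, 128 * 1.8, 40 * 4.58)

# Assumption: the eight control points sit at 1/4 and 3/4 of each extent.
POINTS = [(fx * EXTENT[0], fy * EXTENT[1], fz * EXTENT[2])
          for fx in (0.25, 0.75) for fy in (0.25, 0.75) for fz in (0.25, 0.75)]

def rigid_about_z(point, angle, shift, centre):
    """Rotate by `angle` (radians) about the z-parallel axis through `centre`,
    then translate by `shift`."""
    x, y, z = (p - c for p, c in zip(point, centre))
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotated = (cos_a * x - sin_a * y, sin_a * x + cos_a * y, z)
    return tuple(r + c + t for r, c, t in zip(rotated, centre, shift))

def mean_point_distance(points, transform):
    """Mean Euclidean distance between each point and its transformed copy."""
    return sum(math.dist(p, transform(p)) for p in points) / len(points)

centre = tuple(e / 2 for e in EXTENT)
error = mean_point_distance(
    POINTS, lambda p: rigid_about_z(p, 0.0, (3.0, 4.0, 0.0), centre))
print(error)
```

For a pure translation every point moves by the same distance, so the measure equals the length of the shift; for rotations, points far from the axis contribute more, which is why the measure grows with both the angle and the size of the volume.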
6 Conclusion
We presented a new image registration algorithm that is able to geometrically align a mutually translated and rotated pair of 3D images. The method iteratively recovers the translational component of the misalignment by the phase correlation method and the rotational component by applying the same method to cylindrically mapped images. The experimental results on medical data demonstrate the method's usability even for large initial misalignment, which is a common limitation of, for example, methods based on mutual information. On the other hand, the current method is able to register images only with pixel-level precision.
Acknowledgement. The authors would like to thank Helena Trojanová of the Clinic of Nuclear Medicine, Teaching Hospital Královské Vinohrady, Prague, Czech Republic, for kindly providing the MRI data, and the Czech Ministry of Education for support under project No. 1M6798555601 (Research Center DAR).
References

1. Baker, M.J.: Maths - rotations, http://www.euclideanspace.com
2. de Castro, E., Morandi, C.: Registration of translated and rotated images using finite Fourier transform. IEEE Trans. Pattern Analysis and Machine Intelligence 9(5), 700–703 (1987)
3. Elliot, J.P., Dawber, P.G.: Symmetry in Physics. MacMillan, London (1979)
4. Foroosh, H., Zerubia, J.B., Berthod, M.: Extension of phase correlation to subpixel registration. IEEE Trans. Image Processing 11(3), 188–200 (2002)
5. Hoge, W.S.: A subspace identification extension to the phase correlation method. IEEE Trans. Medical Imaging 22(2), 277–280 (2003)
6. Hoge, W.S., Westin, C.F.: Identification of translational displacements between n-dimensional data sets using the high-order SVD and phase correlation. IEEE Trans. Image Processing 14(7), 884–889 (2005)
7. Ibanez, L., Schroeder, W., Ng, L., Cates, J.: The ITK Software Guide, 2nd edn. Kitware, Inc. (2005) ISBN 1-930934-15-7, http://www.itk.org/ItkSoftwareGuide.pdf
8. Keller, Y., Averbuch, A., Miller, O.: Robust phase correlation. In: International Conference on Pattern Recognition, pp. II, 740–743 (2004)
9. Keller, Y., Shkolnisky, Y., Averbuch, A.: Volume registration using the 3-D pseudopolar Fourier transform. IEEE Trans. Image Processing 54(11), 4323–4331 (2006)
10. Kuglin, C.D., Hines, D.C.: The phase correlation image alignment method. In: Proc. Int. Conf. on Cybernetics and Society, pp. 163–165. IEEE, Orlando (1975)
11. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1992)
12. Reddy, B.S., Chatterji, B.N.: An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Trans. Image Processing 5(8), 1266–1271 (1996)
13. Stone, H.S., Orchard, M.T., Chang, E.C., Martucci, S.A.: A fast direct Fourier-based algorithm for subpixel registration of images. IEEE Trans. Geoscience and Remote Sensing 39(10), 2235–2243 (2001)
14. Zitová, B., Flusser, J.: Image registration methods: a survey. Image and Vision Computing 21(11), 977–1000 (2003)
Extended Global Optimization Strategy for Rigid 2D/3D Image Registration

Alexander Kubias1, Frank Deinzer2, Tobias Feldmann1, and Dietrich Paulus1

1 University of Koblenz-Landau, Koblenz, Germany
{kubias, tfeld, paulus}@uni-koblenz.de
2 Siemens Medical Solutions, Forchheim, Germany
[email protected]

Abstract. Rigid 2D/3D image registration is a common strategy in medical image processing. In this paper we present an extended global optimization strategy for rigid 2D/3D image registration that consists of three components: a combination of a global and a local optimizer, a combination of a multi-scale and a multi-resolution approach, and a combination of an in-plane and an out-of-plane registration. The global optimizer Adaptive Random Search is used to provide several coarse registration results on a low resolution level that are refined by the local optimizer Best Neighbor on a higher resolution level. We evaluate the performance and the precision of our registration algorithm using two phantom models. We could confirm that all three components of our optimization strategy lead to a significant improvement of the registration.

Keywords: Image Registration, Global Optimization, Multi-Resolution.
1 Introduction and Related Work
As mentioned in [1], image registration is a very common problem in medical image processing and, thus, automatic image registration is a very important component of current medical imaging systems [2]. In 2D/3D image registration, a pre-operative volume (e.g. CT or MRT) is registered with an intra-operative X-ray image. Thus, the pre-operatively acquired volume can be used for intra-operative therapy guidance [3]. In this paper, we are only interested in intensity-based methods for 2D/3D image registration. These methods consist of two parts, namely the DRR (Digitally Reconstructed Radiograph) generation and the computation of a similarity measure between the DRR and the X-ray image.

As our main contribution, we present an extended global optimization strategy for rigid 2D/3D image registration. Our optimization strategy is a combination of three components. First, a global optimizer and a local optimizer are combined. Nowadays, local optimizers are used frequently for image registration tasks, as they are easy to implement and computationally cheap. In contrast, global optimizers are rarely used because of their high computational cost. However, by using a global optimizer, one avoids being trapped in a local optimum and thereby missing the global optimum. In our optimization method, the optimizer

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 759–767, 2007. © Springer-Verlag Berlin Heidelberg 2007
A. Kubias et al.
Fig. 1. Differentiation of in-plane and out-of-plane transformations. In this example, a translation along the Y-axis or the Z-axis is an in-plane translation, otherwise an out-of-plane translation. A rotation around the X-axis is an in-plane rotation, otherwise an out-of-plane rotation.
Adaptive Random Search (ARS) [4] provides coarse registration results that are refined by the local optimizer Best Neighbor (BN) [5]. Despite its simplicity, the local optimizer Best Neighbor can achieve robust registration results [6] and is therefore frequently used [7].

Secondly, we combine a multi-resolution and a multi-scale approach. The main difference between the two approaches is that a multi-resolution approach uses images at different resolution levels, whereas a multi-scale approach uses images at a single resolution level only. Thus, in a multi-resolution approach images are implicitly smoothed by reducing their size, whereas in a multi-scale approach images must be smoothed explicitly, e.g. by a Gaussian filter [8]. In our multi-resolution approach, the global optimizer is executed on a low resolution level, whereas the local optimizer runs on a higher resolution level [9]. In our multi-scale approach, DRRs are created from volumes with fewer slices and are thereby smoothed. A similar method was proposed in [6], but not combined with a multi-resolution approach. By using a multi-resolution and a multi-scale approach, the runtime of the registration algorithm [10] and the number of local optima can be significantly reduced [2].

Thirdly, we do not estimate the six parameters of the rigid 2D/3D image registration problem in parallel, because in a six-dimensional parameter space many evaluations are needed to obtain a precise registration result. Instead, we split this six-dimensional problem into two three-dimensional problems by first estimating the in-plane parameters and then estimating the out-of-plane parameters. As mentioned in [11], if a perspective projection is used, the effects of translations and rotations in the projected image depend on the axis along or around which the translation or rotation is performed. These different effects are especially visible for translations.
Therefore, in-plane and out-of-plane translations must be distinguished. A translation that moves an object parallel to the image plane is an in-plane translation. In contrast, a translation that moves an object orthogonally to the image plane is called an out-of-plane translation [11]. In Fig. 1, translations along the Y-axis and the Z-axis are in-plane translations, and a translation along the X-axis is an out-of-plane translation. Rotations are treated in a similar way. In Fig. 1, the rotation around the X-axis is an in-plane rotation; the two other rotations are out-of-plane rotations. We present a new method for distinguishing in-plane and out-of-plane transformations and estimating them separately. Other approaches that cope with these problems were proposed in [7] and in [12].

In Section 2 we describe the basics of rigid 2D/3D image registration and the global optimization strategy used for the registration. In Section 3 we show and discuss the results of our implementation. For the sake of performance, we implemented our 2D/3D image registration algorithm completely on the GPU. Finally, in Section 4 we conclude and present some ideas for future work.
2 Extended Global Optimization Strategy

2.1 2D/3D Image Registration
In 2D/3D image registration, a pre-operative volume (e.g. CT or MRT) is registered with an intra-operative X-ray image. In this paper we restrict ourselves to rigid image registration, where the volume can only be translated and rotated with respect to the three coordinate axes. This transformation is given by the parameter vector $x = (t_x, t_y, t_z, r_x, r_y, r_z)^T$. The parameters $t_x, t_y, t_z$ represent the translation in mm along the X-, Y- and Z-axis, whereas the parameters $r_x, r_y, r_z$ describe the rotation vector $r = (r_x, r_y, r_z)^T$, which is internally represented in Rodrigues notation [13].

$$\tilde{p}_p = P \cdot \tilde{p}_w = K \cdot \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \cdot D \cdot \tilde{p}_w \qquad (1)$$

In order to generate a DRR from a volume, each point $\tilde{p}_w$ of the volume in the world coordinate system (WCS) [13] is mapped onto a point $\tilde{p}_p$ in the pixel coordinate system (PCS) [13] according to (1). The points $\tilde{p}_w$ and $\tilde{p}_p$ are given in homogeneous coordinates [13]. The projection matrix $P$ contains the matrix $K$, which consists of the intrinsic camera parameters, a further matrix, which reduces the number of dimensions, and the matrix $D$, which consists of the extrinsic camera parameters [14]. Prior to the DRR generation, a transformation according to the parameter vector $x$ is applied to the volume. Therefore, (1) is extended to

$$\tilde{p}_p = P \cdot T \cdot \tilde{p}_w \qquad (2)$$

In (2), the transformation matrix $T$, which models the translations and rotations of $x$, is applied to the points of the volume. $T$ is the matrix representation of $x$, obtained by using the Rodrigues formula in [14] and extending it with the translation of $x$. Afterwards, the DRR $I_{DRR}(x)$ is produced from the transformed volume (according to the current transformation $x$) and compared to the X-ray image $I_{FLL}$ by applying a similarity measure $S(I_{FLL}, I_{DRR}(x))$. The X-ray image $I_{FLL}$ was acquired beforehand by an X-ray C-arm system. By using a suitable optimization technique, the similarity between the DRR and the X-ray image is increased until the optimal parameter vector $x^*$ is found:

$$x^* = \operatorname*{argmax}_{x} \; S(I_{FLL}, I_{DRR}(x)) \qquad (3)$$
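The mapping in (1) and (2) can be illustrated by a small Python sketch that builds $T$ from a parameter vector and projects one point. It is a sketch under assumptions, not the authors' GPU code: the rotation vector is taken to be axis times angle in radians (the units are not stated in the text), the intrinsic matrix `K` and the identity extrinsics `D` are made-up example values, and all function names are hypothetical.

```python
import math

def rodrigues(r):
    """Rotation vector (axis * angle, radians assumed) -> 3x3 rotation matrix."""
    theta = math.sqrt(sum(comp * comp for comp in r))
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (comp / theta for comp in r)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [[c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
            [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
            [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v]]

def rigid_matrix(x):
    """4x4 homogeneous matrix T for x = (tx, ty, tz, rx, ry, rz)."""
    tx, ty, tz, rx, ry, rz = x
    rot = rodrigues((rx, ry, rz))
    return [rot[0] + [tx], rot[1] + [ty], rot[2] + [tz], [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# The 3x4 matrix of equation (1) that drops the homogeneous coordinate.
REDUCE = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

def project(K, D, T, p_w):
    """Pixel coordinates of the world point p_w following equation (2)."""
    P = matmul(matmul(K, REDUCE), D)
    ph = matmul(matmul(P, T), [[p_w[0]], [p_w[1]], [p_w[2]], [1.0]])
    return (ph[0][0] / ph[2][0], ph[1][0] / ph[2][0])

# Example values (assumptions): focal length 1000 px, principal point (512, 512).
K = [[1000.0, 0.0, 512.0], [0.0, 1000.0, 512.0], [0.0, 0.0, 1.0]]
D = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T = rigid_matrix((1.0, 0.0, 0.0, 0.0, 0.0, 0.0))
pixel = project(K, D, T, (1.0, 2.0, 10.0))
print(pixel)
```

A DRR generator would apply this mapping to every voxel (or cast rays through the transformed volume); the sketch only shows how the six parameters enter the projection.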
2.2 In-Plane and Out-of-Plane Registration
As in-plane transformations cause the crucial movements in the projected image and out-of-plane transformations do not affect the projected image very much, it is useful to distinguish between them. We propose a new optimization strategy in which the six-dimensional registration problem is divided into a three-dimensional in-plane registration and a three-dimensional out-of-plane registration. All six parameters are initialized with zeros beforehand. In a first step, an in-plane registration is executed in which only the three in-plane parameters are estimated. These parameters are easier to estimate, as they cause major changes in the projected image. Meanwhile, the three out-of-plane parameters remain unchanged. In the second step, the out-of-plane registration is executed, in which only the three out-of-plane parameters are determined. The starting point for the out-of-plane registration is the parameter vector returned by the in-plane registration. As the out-of-plane transformations cause only small effects in the projected image, it is difficult to obtain a stable estimate for them. Nevertheless, their estimation can be improved significantly if the in-plane transformations are already properly determined.

The identification of the in-plane and out-of-plane transformations is not trivial, because it depends on the point of view from which the volume is viewed. In contrast to the world coordinate system (WCS), it is known for the camera coordinate system (CCS) that translations along the X-axis and the Y-axis are in-plane translations and that the rotation around the Z-axis is an in-plane rotation. Naturally, the three other transformations are out-of-plane transformations. Thus, the in-plane and out-of-plane transformations can be determined independently of the current point of view from which the volume is viewed.

In order to exploit the benefits of the CCS, the WCS is rotated by $R$ so that it has the same orientation as the CCS except for a translation. This rotation, which is expressed by the rotation matrix $R$, can be computed from the projection matrix $P$ by using the QR-decomposition [14]. In the rotated WCS, the transformation $T$ is applied to the volume. Finally, the rotated WCS is transformed back into its original position by the inverse rotation $R^{-1}$. Thus, (2) must be amended to

$$\tilde{p}_p = P \cdot R \cdot T \cdot R^{-1} \cdot \tilde{p}_w \qquad (4)$$
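The two-stage estimation can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `best_neighbor` is a simple hill climb in the spirit of the Best Neighbor optimizer (the exact variant is not specified in the text), `toy_similarity` is a made-up separable stand-in for the image similarity $S$, and the parameter vector is assumed to be expressed in the camera-aligned frame, so that indices 0, 1 and 5 ($t_x$, $t_y$, $r_z$) are the in-plane parameters. For brevity the same local search is used in both stages, whereas the paper applies the split inside the global optimization step.

```python
IN_PLANE = (0, 1, 5)      # t_x, t_y, r_z in the camera-aligned frame (assumed ordering)
OUT_OF_PLANE = (2, 3, 4)  # t_z, r_x, r_y

def best_neighbor(similarity, x, active, step=4.0, min_step=1e-3):
    """Hill climb over the `active` parameters only; all others stay fixed."""
    x = list(x)
    best = similarity(x)
    while step >= min_step:
        neighbors = []
        for i in active:
            for delta in (-step, step):
                y = list(x)
                y[i] += delta
                neighbors.append((similarity(y), y))
        score, y = max(neighbors, key=lambda pair: pair[0])
        if score > best:
            best, x = score, y   # move to the best neighbour
        else:
            step /= 2.0          # no improvement: refine the step size
    return x

def staged_registration(similarity, x0=(0.0,) * 6):
    """First the in-plane parameters, then the out-of-plane ones."""
    x = best_neighbor(similarity, x0, IN_PLANE)
    return best_neighbor(similarity, x, OUT_OF_PLANE)

# Toy similarity (assumption): out-of-plane parameters get small weights,
# mimicking their weak effect on the projected image.
TARGET = [14.0, 15.0, 16.0, 9.0, 10.0, 11.0]
WEIGHTS = [1.0, 1.0, 0.05, 0.05, 0.05, 1.0]

def toy_similarity(x):
    return -sum(w * (a - t) ** 2 for w, a, t in zip(WEIGHTS, x, TARGET))

estimate = staged_registration(toy_similarity)
print(estimate)
```

Each stage searches a three-dimensional space with six neighbours per step instead of twelve. On a real similarity measure the parameters are coupled, so the staged result is a starting point for a joint refinement rather than a guaranteed optimum.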
2.3 Multi-resolution and Multi-scale Registration
As mentioned before, a combination of a multi-resolution and a multi-scale approach is applied in our optimization strategy. To this end, an X-ray image, whose original size was $1024^2$ pixels, is scaled to different resolutions with $512^2$, $256^2$, $128^2$ and $64^2$ pixels. The scaling is done by a mean value filter with a 2 × 2 mask. The DRR is always created in the same size as the X-ray image. The multi-scale approach is achieved by using different resolutions of a CT volume for the DRR generation. A similar multi-scale approach was described in [6], but not combined with a multi-resolution approach. The scaling of the CT volume is done by computing the mean value in a neighborhood of 2 × 2 × 2 voxels. For generating DRRs in a high resolution, we use a CT volume with $256^3$ voxels. If the DRR is to be produced in a lower resolution, a CT volume with $128^3$ or even $64^3$ voxels is used. Thus, an artificial smoothing of the DRR is achieved.

Our final optimization strategy is divided into a global optimization step and a local optimization step. First, the global optimization step is executed, which achieves a coarse alignment of the X-ray image and the CT volume on the lowest resolution level, in which the X-ray image has $64^2$ pixels and the CT volume has $64^3$ voxels. For this purpose, the global optimizer ARS creates $n_{init}$ initial parameter vectors, of which $n_{new}$ parameter vectors are optimized over $n_{iter}$ iterations. One of the parameters $n_{init}$, $n_{new}$ and $n_{iter}$ will be determined in the experiments. As described in Section 2.2, only the in-plane transformations are estimated at first. Subsequently, the out-of-plane transformations are determined. The starting point for this global out-of-plane registration is the result of the global in-plane registration. After the convergence of the global optimizer, the $n_{local}$ best parameter vectors of the global optimizer are used as starting points for the local optimization.

The local optimization is executed on a higher resolution level, using an X-ray image with $256^2$ pixels and a CT volume with $256^3$ voxels. In contrast to the global optimization, the local optimizer estimates all six parameters in parallel; otherwise, new local optima could be generated. Each of the $n_{local}$ parameter vectors is locally optimized, one after the other. The best of these $n_{local}$ parameter vectors is returned as the result of the registration algorithm. For the sake of performance, we do not use higher resolution levels for the CT volume or the X-ray image. Nevertheless, the optimization process could be continued on a higher resolution level in order to obtain even better results.
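The hand-over from the global to the local step can be sketched in Python as follows. Several simplifications are assumptions made for illustration: plain uniform random sampling stands in for Adaptive Random Search (so the parameters $n_{new}$ and $n_{iter}$ do not appear), the same made-up function `toy_similarity` plays the role of both the coarse and the fine similarity measure, the search bounds and sample counts are arbitrary, and `best_neighbor` is a simple hill climb rather than the authors' exact optimizer.

```python
import math
import random

def toy_similarity(x):
    """Multimodal stand-in for S(I_FLL, I_DRR(x)) with many local optima."""
    return sum(math.cos(3.0 * v) - 0.05 * (v - 2.0) ** 2 for v in x)

def random_search(similarity, bounds, n_init, n_keep, rng):
    """Global stage stand-in: draw n_init vectors, keep the n_keep best."""
    samples = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_init)]
    samples.sort(key=similarity, reverse=True)
    return samples[:n_keep]

def best_neighbor(similarity, x, step=1.0, min_step=1e-3):
    """Local stage: hill climb over all parameters in parallel."""
    x = list(x)
    best = similarity(x)
    while step >= min_step:
        neighbors = []
        for i in range(len(x)):
            for delta in (-step, step):
                y = list(x)
                y[i] += delta
                neighbors.append((similarity(y), y))
        score, y = max(neighbors, key=lambda pair: pair[0])
        if score > best:
            best, x = score, y   # move to the best neighbour
        else:
            step /= 2.0          # no improvement: refine the step size
    return x

bounds = [(-20.0, 20.0)] * 6
starts = random_search(toy_similarity, bounds, n_init=1000, n_keep=10,
                       rng=random.Random(0))
refined = [best_neighbor(toy_similarity, s) for s in starts]
result = max(refined, key=toy_similarity)
print(toy_similarity(result))
```

Keeping several starting points instead of only the best one hedges against the coarse stage ranking a point near a local optimum above a point near the global one; the local stage can only improve each candidate, so the final choice is at least as good as the best coarse result.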
3 Results and Discussion
In order to evaluate the performance and the precision of our algorithm, we use a GeForce 7800 GS AGP with 256 MB. The GPU programming is done entirely in OpenGL and Cg. The CPU source code is executed on a single Pentium 4 CPU with 3.2 GHz. In the following, we perform several experiments in order to determine reasonable values for the parameters proposed in Section 2.3 and in order to
Fig. 2. Phantom model of a head and a thorax
show the benefits of our optimization strategy. For the experiments, we use the two phantom models whose X-ray images are shown in Fig. 2. The CT volumes of these phantom models were acquired by an X-ray C-arm system. The X-ray image of the head is registered to the volume of the head, and the X-ray image of the thorax is registered to the volume of the thorax. As the ground truth position is known for both phantom models, the registration error can be measured in mm using the Euclidean distance in the image. From these individual distances, the 95th, 90th and 75th percentiles are computed. As the ground truth position is given by the parameter vector $x^* = (0, 0, 0, 0, 0, 0)$, we add an artificial transformation in order to obtain the starting point for our registration. The artificial transformation translates the volume 14 mm along the X-axis, 15 mm along the Y-axis and 16 mm along the Z-axis, and rotates it 9 degrees around the X-axis, 10 degrees around the Y-axis and 11 degrees around the Z-axis. The resulting starting point is used in all experiments.

Nowadays, registration problems are often solved using a local optimizer. Therefore, in our first experiment we compare a local optimization strategy (6-dim local) to a global optimization strategy (6-dim global). In both strategies, the six parameters of the parameter space are estimated in parallel. We also compare these two approaches to our global in-plane and out-of-plane optimization strategy (in-out global). In our optimization strategy, the parameters from Section 2.3 are set to $n_{init} = 10000$, $n_{new} = 100$, $n_{iter} = 100$ and $n_{local} = 10$. One of these values is determined in the subsequent experiment. The global optimizer is executed on the lowest resolution level, in which the X-ray image has $64^2$ pixels and the CT volume has $64^3$ voxels, and the local optimizer is executed on a higher resolution level, in which the X-ray image has $256^2$ pixels and the CT volume has $256^3$ voxels.
As shown in Table 1, our global optimization strategy was about six times slower than the local optimization strategy. Compared to this local approach, however, the registration error of our optimization strategy was about 90% smaller. The local optimizer was often trapped in a local optimum and thus could not achieve precise registration results. Compared to the six-dimensional global optimization strategy, our strategy provided slightly worse results for the head. As the head
Table 1. We compared our optimization strategy (in-out global) to a local and a global optimization strategy, in which all six parameters were estimated in parallel. For each approach, the registration error and the runtime are given. Our strategy achieved about 90% smaller registration errors than the local optimizer. The measured values were averaged over 50 registration runs.

registration      95th percentile     90th percentile     75th percentile     runtime
results           head     thorax     head     thorax     head     thorax     head    thorax
6-dim local       41.7mm   51.3mm     36.4mm   44.9mm     30.3mm   36.6mm     91s     93s
6-dim global      3.4mm    31.3mm     2.4mm    24.9mm     1.0mm    4.6mm      343s    229s
in-out global     5.0mm    4.9mm      4.0mm    3.9mm      2.6mm    2.7mm      547s    579s
Table 2. We could confirm for our approach (in-out global) that the registration error and the computation time depend on the number of initial parameter vectors. The parameters were set to $n_{new} = 100$, $n_{iter} = 100$ and $n_{local} = 10$. The measured values were averaged over 50 registration runs.

n_init    95th percentile     90th percentile     75th percentile     runtime
          head     thorax     head     thorax     head     thorax     head     thorax
10^3      5.2mm    5.0mm      4.1mm    4.0mm      3.1mm    2.8mm      463s     540s
10^4      5.0mm    4.9mm      4.0mm    3.9mm      2.6mm    2.7mm      547s     579s
10^5      4.8mm    4.8mm      3.8mm    3.8mm      2.7mm    2.7mm      1822s    2252s
can be easily registered, an improvement of the registration cannot be accomplished by using the in-plane and out-of-plane separation. However, our strategy achieved clearly better results for the registration of the thorax: the registration error was about 80% smaller than that of the six-dimensional global optimization strategy.

As mentioned before, the parameters $n_{init}$, $n_{new}$, $n_{iter}$ and $n_{local}$ have to be determined for our global in-plane and out-of-plane optimization strategy. Within this paper, we only present the experiment in which the value of the parameter $n_{init}$ is determined. The values of the parameters $n_{new} = 100$, $n_{iter} = 100$ and $n_{local} = 10$ were also chosen so as to enable an efficient and precise registration process. We analyzed how much the performance and the precision of the registration are affected by the number of initial parameter vectors. As shown in Table 2, the precision of the registration improved with an increasing number of initial parameter vectors. We decided to use $n_{init} = 10^4$ initial parameter vectors, as they provided a better registration result than $n_{init} = 10^3$ initial parameter vectors and were only slightly worse than $n_{init} = 10^5$ initial parameter vectors. Compared to $n_{init} = 10^3$ initial parameter vectors, the runtime of the registration process was only slightly longer. We did not use $n_{init} = 10^5$ initial parameter vectors, as they needed much more computation time and achieved only slightly better registration results.
4 Conclusions and Future Work
In this paper we presented an extended global optimization strategy for rigid 2D/3D image registration that consists of three components: a combination of a global and a local optimizer, a combination of a multi-scale and a multi-resolution approach, and a combination of an in-plane and an out-of-plane registration. The global optimizer Adaptive Random Search was used to provide several coarse registration results that were refined by the local optimizer Best Neighbor. We evaluated the performance and the precision of our registration algorithm using two phantom models. We could confirm that all three components of our optimization strategy lead to an improvement of the registration. Compared to a local optimizer, the registration error of our optimization strategy was about 90% smaller. In the future, we will extend our global optimization algorithm to feature-based similarity measures. Additionally, we will analyze other global optimizers and integrate them into our optimization strategy.
A Fast B-Spline Pseudo-inversion Algorithm for Consistent Image Registration
Antonio Tristán and Juan Ignacio Arribas
Laboratorio de Procesado de Imagen, Universidad de Valladolid, Spain
[email protected], [email protected]
http://www.lpi.tel.uva.es
Abstract. Recently, the concept of consistent image registration has been introduced to refer to a set of algorithms that estimate both the direct and inverse deformation together, that is, they alternately exchange the roles of the target and the scene images; it has been demonstrated that this technique improves the registration accuracy, and that the biological significance of the obtained deformations is also improved. When dealing with free form deformations, the inversion of the transformations obtained becomes computationally intensive. In this paper, we suggest the parametrization of such deformations by means of a cubic B-spline, and its approximate inversion using a highly efficient algorithm. The results show that the consistency constraint notably improves the registration accuracy, especially in cases of heavy initial misregistration, with very little computational overhead.
Keywords: Free form image registration, consistent registration, B-splines, inverse transformation.
1 Introduction
Elastic image registration is an important and challenging problem in medical imaging, and it has been successfully applied to a number of problems of great interest in recent years [1]. The anatomy being imaged quite often involves deformable tissues, such as the breast [2,3], the brain [4], the kidney, and so on [1]; hence the importance of fully deformable models for image registration. The idea of consistent image registration was introduced in [5] and further explored in [6], and the convenience of this technique has been pointed out by an increasing number of authors [4,7,8,9,10]. The idea is that, when registering a pair of images, that is, deforming a model to match a given scene, it is useful to estimate the inverse transformation as well, that is, the one that matches the scene to the model, constraining both transformations to be the inverse of each other. This helps to reduce the presence of local minima, usually yielding more biologically meaningful results [5,6,7], and even more accurate and robust estimates of the deformation [9,10]. It is especially well suited to elastic registration, where local minima may become especially problematic. Consistent image registration is therefore an interesting topic in medical imaging, where elastic registration becomes more useful each day.

Although it is a relatively recent idea, a number of algorithms have been developed around this technique. The common way to impose the consistent constraint has been to add a penalising term to the cost function [4,6,10]. In [9], the authors propose a physical deformation model in which the forces that deform the image are symmetrised following the action and reaction principle. The method proposed here is different in the sense that it requires the effective inversion of the geometric transformation obtained; similar approaches can be found in [5,7], where the authors iteratively invert the deformation at each point, or in [8], where direct and inverse optical flow are rectified in each iteration with a linear estimate of the inverse transformation. The use of a penalising term is well suited to variational approaches [6,10], but the effective inversion of the transformation may be a better choice in non-iterative algorithms. However, this inversion is computationally intensive with elastic registration, as is the case in [5], where a fixed point technique must be performed at each pixel.

The main contribution of our paper is the development of a technique that provides a very efficient way to impose the consistent constraint using a parametric elastic transformation (a cubic B-spline, BS); to the best of our knowledge, it is the first attempt to apply the direct inversion methodology with a parametric deformation. Although our algorithm does not result in an exact inversion of the deformation, but in a pseudo-inversion instead, experiments demonstrate that it substantially improves the registration accuracy, with a reasonable computational overhead. In section 2 we briefly introduce the use of B-splines as interpolators. Section 3 is devoted to the pseudo-inversion algorithm. In section 4 we show some results that demonstrate the usefulness of the algorithm, and finally, in section 5, we give some final considerations and conclude.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 768–775, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 B-Spline Representation of the Deformation
The use of BS in image registration is widespread and has been widely validated [2,3]; their main advantage is their very low computational load. Compared to the classical Thin Plate Splines (TPS, [11]), they have the drawback of worse regularity properties; however, the hierarchical interpolation technique presented in [12] overcomes this difficulty in practice. Here, we focus on the two-dimensional case, although the extension to 3-D is immediate. Let J be an image with compact support $\Omega \equiv [0, X] \times [0, Y] \subset \mathbb{R}^2$; the deformation at a point $(x, y) \in \Omega$ is then given by the tensor product:

\[
\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix}
x + \sum_{k=0}^{3} \sum_{l=0}^{3} B_k(s)\, B_l(t)\, \Phi^{x}_{(i+k)(j+l)} \\
y + \sum_{k=0}^{3} \sum_{l=0}^{3} B_k(s)\, B_l(t)\, \Phi^{y}_{(i+k)(j+l)}
\end{pmatrix},
\tag{1}
\]

where $i = \lfloor Mx/X \rfloor - 1$, $j = \lfloor Ny/Y \rfloor - 1$, $s = Mx/X - \lfloor Mx/X \rfloor$, $t = Ny/Y - \lfloor Ny/Y \rfloor$ (here $\lfloor x \rfloor$ denotes the greatest integer not exceeding $x$), and $\Phi^{x,y}_{ij}$ are the interpolation parameters to compute, one for each of the $2 \times (M + 3) \times (N + 3)$ grid nodes (note that there is a grid for each dimension), with $M + 3$ and $N + 3$ the number of nodes in each dimension, so that $-1 \le i \le M + 1$, $-1 \le j \le N + 1$, and $0 \le s, t < 1$. $B_i(\cdot)$, $i = 0, 1, 2, 3$, are the BS kernels, which are third degree polynomials [12], so that BS interpolators are piecewise polynomials with the nice property of being twice differentiable. They also have another interesting property, their locality: a change in one of the spline coefficients $\Phi^{x,y}_{ij}$ affects only a neighbourhood of the corresponding node $(i, j)$. In practice, this means that to interpolate a new control point $(x, y) \in \Omega$, only a few (16, in the 2D case) grid nodes have to be changed. This property is used in [12] to hierarchically interpolate the control points of the deformation in a least squares sense. This is the same approach followed in [2], and in our own work as well.
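As an illustration of Eq. (1), the following is a minimal sketch of evaluating the deformation at one point. The kernel polynomials are the standard uniform cubic B-spline basis functions; the function names, the storage layout (node $(i, j)$ kept at list position `[i + 1][j + 1]`) and the restriction to the half-open domain $0 \le x < X$, $0 \le y < Y$ are assumptions made for this example, not details given in the paper.

```python
import math

def bspline_kernels(u):
    """Uniform cubic B-spline kernels B_0..B_3 evaluated at u in [0, 1)."""
    return (
        (1.0 - u) ** 3 / 6.0,
        (3.0 * u ** 3 - 6.0 * u ** 2 + 4.0) / 6.0,
        (-3.0 * u ** 3 + 3.0 * u ** 2 + 3.0 * u + 1.0) / 6.0,
        u ** 3 / 6.0,
    )

def deform(x, y, phi_x, phi_y, M, N, X, Y):
    """Return (x', y') as in Eq. (1), for 0 <= x < X and 0 <= y < Y.

    phi_x and phi_y are (M + 3) x (N + 3) nested lists; the coefficient of
    node (i, j), with i and j starting at -1, is stored at [i + 1][j + 1]
    (an assumed layout for this sketch).
    """
    i = math.floor(M * x / X) - 1
    j = math.floor(N * y / Y) - 1
    s = M * x / X - math.floor(M * x / X)
    t = N * y / Y - math.floor(N * y / Y)
    bs, bt = bspline_kernels(s), bspline_kernels(t)
    dx = dy = 0.0
    # Locality: only the 4 x 4 block of nodes around (x, y) contributes.
    for k in range(4):
        for l in range(4):
            w = bs[k] * bt[l]
            dx += w * phi_x[i + k + 1][j + l + 1]
            dy += w * phi_y[i + k + 1][j + l + 1]
    return x + dx, y + dy
```

Because the four kernels sum to one for every argument, a grid with constant coefficients produces a constant displacement, which is a convenient sanity check for an implementation.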
3 Pseudo-inversion Algorithm
A BS does not need to be invertible, and in case it is, its inverse transformation is not necessarily another BS. To impose the consistent constraint, we define the pseudo-inverse BS $\Psi$ of a given BS $\Phi$ as a BS with the same support and grid nodes as $\Phi$; we use these grid nodes as control points (the values to interpolate) and then we force the displacements at the grid points of $\Psi$ to equal exactly the inverse of the deformation given by $\Phi$ at these points. This means that the inversion is exact at the grid nodes, and only approximate elsewhere. To obtain the inverse transformation at the grid nodes, we follow an approach similar to [5], which in fact is a variant of the fixed point method. From eq. (1), the deformation at a point $\mathbf{x} \equiv (x, y) \in \Omega$ can be written in a compact form:

\[
\mathbf{x}' = g(\mathbf{x}) = \mathbf{x} + f(\mathbf{x}).
\tag{2}
\]

For a given point $\mathbf{x}_0 \in \Omega$, finding the value of the inverse transform $g^{-1}(\mathbf{x}_0)$ amounts to finding the point $\mathbf{x}_t \in \Omega$ such that $g(\mathbf{x}_t) = \mathbf{x}_t + f(\mathbf{x}_t) = \mathbf{x}_0$; then $g^{-1}(\mathbf{x}_0) = \mathbf{x}_0 + f^{-1}(\mathbf{x}_0) = \mathbf{x}_t$, where in this case $f^{-1}$ is not the inverse of $f$ but simply a convenient notation. Thus, for each grid node $\mathbf{x}_0$, we are interested in finding the point $\mathbf{x}_t$ such that:

\[
\mathbf{x}_0 - f(\mathbf{x}_t) = \mathbf{x}_t \;\Leftrightarrow\; h_{\mathbf{x}_0}(\mathbf{x}_t) = \mathbf{x}_t,
\tag{3}
\]

which reduces to finding a fixed point of the function $h_{\mathbf{x}_0}$, defined for each grid node. To do so, starting from the initial approximation $\mathbf{x}_t^0$, the updating rule is:

\[
\mathbf{x}_t^{n+1} = h_{\mathbf{x}_0}(\mathbf{x}_t^n) = \mathbf{x}_0 - f(\mathbf{x}_t^n) \;\Leftrightarrow\; f^{-1}(\mathbf{x}_0) = -f(\mathbf{x}_t^n).
\tag{4}
\]

If the sequence $\{\mathbf{x}_t^n\}_{n=0}^{\infty}$ converges to a finite value, this value is $\mathbf{x}_t$. Unlike in [5], (4) is only used at the grid nodes; at the same time, the successive evaluations are exact here due to the continuous nature of BS, while in [5] interpolation is needed, so convergence here will be faster.
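The updating rule of Eq. (4) can be sketched in a few lines. This is only an illustrative sketch: the starting point (the node itself), the stopping tolerance, the iteration cap and the toy displacement field `toy_displacement` are assumptions chosen for the example, not values taken from the paper.

```python
import math

def invert_at(x0, f, tol=1e-10, max_iter=100):
    """Iterate x_t <- x0 - f(x_t) and return x_t, so that x_t + f(x_t) ~= x0."""
    xt = x0  # assumed initial approximation: the grid node itself
    for _ in range(max_iter):
        fx, fy = f(xt)
        nxt = (x0[0] - fx, x0[1] - fy)
        # Stop when two successive iterates are closer than the tolerance.
        if math.hypot(nxt[0] - xt[0], nxt[1] - xt[1]) < tol:
            return nxt
        xt = nxt
    return xt

def toy_displacement(p):
    """Hypothetical smooth displacement field with small gradients."""
    x, y = p
    return 0.3 * math.sin(y), 0.2 * math.cos(x)
```

For a grid node `x0`, the inverse displacement is then `(xt[0] - x0[0], xt[1] - x0[1])`, which equals $-f(\mathbf{x}_t)$ up to the tolerance. The iteration is only guaranteed to converge when the displacement field changes slowly enough for $h_{\mathbf{x}_0}$ to be a contraction, as is the case for the toy field above.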
Once the deformation is known at the grid nodes, it only remains to compute the parameters $\Phi$ of the BS. At the grid nodes $s = t = 0$ [12], and we can write:

\[
\begin{pmatrix} x' - x \\ y' - y \end{pmatrix} =
\begin{pmatrix}
\sum_{k=0}^{2} \sum_{l=0}^{2} \alpha_k \alpha_l \Phi^{x}_{(i+k)(j+l)} \\
\sum_{k=0}^{2} \sum_{l=0}^{2} \alpha_k \alpha_l \Phi^{y}_{(i+k)(j+l)}
\end{pmatrix},
\tag{5}
\]

where $\alpha_2 = \alpha_0 = \frac{1}{6}$ and $\alpha_1 = \frac{2}{3}$ are the result of evaluating the BS kernels at 0. Thus, (5) defines 2 linear systems of $(M + 3) \times (N + 3)$ equations with $(M + 3) \times (N + 3)$ unknowns. Nevertheless, each value of the displacement $f^{-1}(x_i, y_j)$ affects only nine $\Phi^x$ coefficients and another nine $\Phi^y$, so the coefficient matrix is highly sparse. The generalisation to three or more dimensions is immediate; in fact, the tensorial character of BS interpolation allows the systems of equations to be written by blocks, as shown in eq. (6):

\[
\begin{pmatrix}
\alpha_1 \Lambda & \alpha_2 \Lambda & 0 & 0 & \cdots & 0 & 0 \\
\alpha_0 \Lambda & \alpha_1 \Lambda & \alpha_2 \Lambda & 0 & \cdots & 0 & 0 \\
0 & \alpha_0 \Lambda & \alpha_1 \Lambda & \alpha_2 \Lambda & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \alpha_0 \Lambda & \alpha_1 \Lambda
\end{pmatrix}
\begin{pmatrix}
\Phi^{x,y,\ldots}_{-1} \\ \Phi^{x,y,\ldots}_{0} \\ \Phi^{x,y,\ldots}_{1} \\ \vdots \\ \Phi^{x,y,\ldots}_{M+1}
\end{pmatrix}
=
\begin{pmatrix}
f^{-1}_{x,y,\ldots}(x_{-1}) \\ f^{-1}_{x,y,\ldots}(x_{0}) \\ f^{-1}_{x,y,\ldots}(x_{1}) \\ \vdots \\ f^{-1}_{x,y,\ldots}(x_{M+1})
\end{pmatrix},
\tag{6}
\]

where the vector unknowns $\Phi$ and the displacement values $f^{-1}(x_i)$ are the result of fixing the first of the grid indices, $i$, and the $d$-dimensional matrices have been arranged as column vectors. By the very nature of the BS [12], the extreme nodes in each dimension ($i = -1$, $i = M + 1$, for the $x$ coordinate) correspond to points not included in $\Omega$, so their inversion is not done following (4). Instead, we set $f^{-1}(x_{-1}) = f^{-1}(x_0)$ and $f^{-1}(x_{M+1}) = f^{-1}(x_M)$. Besides, matrix $\Lambda$ must present the same block-tridiagonal structure, implying that the value of the parameter at $y$ is only affected by the value of the displacement at the same $y$ and at the two adjoining ones, so that the system can be solved recursively. Starting with the first dimension we may write, for both systems:

\[
\alpha_0 \Lambda \Phi_{n-1} + \alpha_1 \Lambda \Phi_n + \alpha_2 \Lambda \Phi_{n+1} = f_n^{-1}, \quad -1 < n < M + 1,
\tag{7}
\]
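where we have adopted the notation $f_n^{-1} = f^{-1}(x_n)$ and extended the $n$ subindex to the boundary values $\Phi_{-2}$ and $\Phi_{M+2}$. We aim to find a solution in the form:

\[
\Phi_{n+1} = \Sigma_n \Phi_n + \theta_n, \quad -2 \le n \le M,
\tag{8}
\]

and substituting (8) into (7), one obtains:

\[
\begin{aligned}
& \alpha_0 \Lambda \Phi_{n-1} + \alpha_1 \Lambda \Phi_n + \alpha_2 \Lambda \left( \Sigma_n \Phi_n + \theta_n \right) = f_n^{-1} \;\Rightarrow \\
& \frac{\alpha_0}{\alpha_2} \Phi_{n-1} + \frac{\alpha_1}{\alpha_2} \Phi_n + \Sigma_n \Phi_n + \theta_n = \frac{1}{\alpha_2} \Lambda^{-1} f_n^{-1} \;\Rightarrow \\
& \left( \frac{\alpha_1}{\alpha_2} I_D + \Sigma_n \right) \Phi_n = -\Phi_{n-1} + \frac{1}{\alpha_2} \Lambda^{-1} f_n^{-1} - \theta_n .
\end{aligned}
\tag{9}
\]

Identifying terms in (9), and recalling that $\alpha_0 = \alpha_2$, the following recursion is obtained:

\[
\begin{aligned}
\Sigma_{n-1} &= -\left( \frac{\alpha_1}{\alpha_2} I_D + \Sigma_n \right)^{-1}, & -1 < n \le M + 1, \\
\theta_{n-1} &= \left( \frac{\alpha_1}{\alpha_2} I_D + \Sigma_n \right)^{-1} \left( \frac{1}{\alpha_2} \Lambda^{-1} f_n^{-1} - \theta_n \right), & -1 < n \le M + 1.
\end{aligned}
\tag{10}
\]

At the same time, if $\Sigma_{M+1}$ can be written as $\gamma \cdot I_D$, it is easily proved by induction that $\Sigma_n = \gamma_n \cdot I_D$, and thus the solution of the system follows:

\[
\begin{aligned}
\Phi_n &= \gamma_{n-1} \Phi_{n-1} + \theta_{n-1}, & -1 \le n \le M + 1, \\
\gamma_{n-1} &= -\frac{1}{\frac{\alpha_1}{\alpha_2} + \gamma_n}, & -1 \le n \le M + 1, \\
\theta_{n-1} &= -\gamma_{n-1} \left( \frac{1}{\alpha_2} \Lambda^{-1} f_n^{-1} - \theta_n \right), & -1 \le n \le M + 1.
\end{aligned}
\tag{11}
\]

Matrix $\Lambda$ is again block-tridiagonal, hence the computation of $\Lambda^{-1} f_n^{-1}$ may be done by applying the algorithm recursively to the successive dimensions until reaching the last one, in which $\Lambda$ will simply be 1. Regarding the first term in the recursion, we have found experimentally that the best choice is to force $\Phi_{M+2} = \Phi_{M+1}$, and thus we set $\gamma_{M+1} = 1$ and $\theta_{M+1} = 0$.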
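For the last dimension, where $\Lambda = 1$, the recursion of Eq. (11) reduces to a scalar backward sweep followed by a forward sweep. The sketch below illustrates this case only. The function name is hypothetical, and the left boundary value $\Phi_{-2} = 0$ is an assumption made here to match the first row of Eq. (6); the right boundary uses $\gamma_{M+1} = 1$, $\theta_{M+1} = 0$ as stated in the text.

```python
ALPHA0 = 1.0 / 6.0
ALPHA1 = 2.0 / 3.0
ALPHA2 = 1.0 / 6.0

def solve_recursion(f_inv):
    """Scalar case (Lambda = 1) of Eq. (11).

    f_inv lists the inverse displacements at the nodes n = -1..M+1.
    Returns the coefficients phi at the same nodes. The derivation relies
    on ALPHA0 == ALPHA2, as in Eq. (9).
    """
    K = len(f_inv)
    # gamma[p] and theta[p] hold the values for node n = p - 2 (n = -2..M+1).
    gamma = [0.0] * (K + 1)
    theta = [0.0] * (K + 1)
    gamma[K] = 1.0  # gamma_{M+1} = 1, i.e. phi_{M+2} = phi_{M+1}
    theta[K] = 0.0  # theta_{M+1} = 0
    # Backward sweep: from n = M+1 down to n = -1.
    for p in range(K, 0, -1):
        gamma[p - 1] = -1.0 / (ALPHA1 / ALPHA2 + gamma[p])
        theta[p - 1] = -gamma[p - 1] * (f_inv[p - 1] / ALPHA2 - theta[p])
    # Forward sweep: phi_n = gamma_{n-1} * phi_{n-1} + theta_{n-1}.
    phi = []
    prev = 0.0  # assumed boundary value phi_{-2} = 0
    for p in range(1, K + 1):
        prev = gamma[p - 1] * prev + theta[p - 1]
        phi.append(prev)
    return phi
```

Both sweeps are linear in the number of nodes, which is what makes the pseudo-inversion cheap compared with a general sparse solver. In higher dimensions the division `f_inv[p - 1] / ALPHA2` would be replaced by a recursive solve with $\Lambda$, as described above.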
4 Results
To validate the proposed technique, we have chosen a simple and easy to implement registration algorithm, based on the work by Xiao et al. [3]. It is a classical block-matching algorithm, where a grid (note that this grid is different from that of the BS interpolator) is superimposed on the images to register; then a block of pixels is placed at each node and moved to the surrounding positions to find the corresponding block in the other image that best matches it. The quality of the match is measured in [3] as the linear correlation between the blocks, but we use Mutual Information (MI, [13]) instead, since this allows us to perform multimodal registration. As in [3], a Bayesian framework is used to smooth the estimated displacements at each node. The last step in [3] is the interpolation of the displacements by a gradient-descent based optimization of the BS coefficients. We use the technique suggested in [12] instead, since it is more efficient and it has been successfully tested [2]. Finally, we swap the roles of the two images in the block-matching step, so that we can estimate the inverse BS transformation; then we use the pseudo-inversion algorithm and average the obtained BS coefficients with the coefficients estimated for the direct BS transformation. This procedure is enclosed in a multiresolution framework, which is known to result in more robust and accurate estimates [1]. We have used 5 resolution levels; for the block matching, the block size was 11 × 11 pixels, and a block was centered at every sixth pixel. The BS grid had a size of 64 × 64 nodes.
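To make the block similarity measure concrete, here is a minimal sketch of mutual information between two pixel blocks, estimated from their joint histogram of grey levels. The histogram-based estimator and the function name are assumptions made for illustration; the paper only states that MI [13] is used and does not say how it is estimated.

```python
import math
from collections import Counter

def mutual_information(block_a, block_b):
    """Mutual information (in nats) of two equally sized lists of grey levels."""
    n = len(block_a)
    joint = Counter(zip(block_a, block_b))  # joint histogram
    count_a = Counter(block_a)              # marginal histograms
    count_b = Counter(block_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    return mi
```

The value does not change when the grey levels of one block are relabelled one-to-one, which is why MI remains usable when the two images come from different modalities, where linear correlation would fail.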
Fig. 1. Example of the test deformations. From left to right: original PD slice (with the random displacements). T1 slice deformed with a 3.6 bending energy TPS. Result of matching the deformed T1 slice to the original PD slice. All images have a size of 181 × 217 pixels.
The data used to test the algorithm are those available at BrainWeb (http://www.bic.mni.mcgill.ca/brainweb) [14]. They consist of an MRI phantom volume of the brain, comprising PD, T1, and T2 modalities. The use of synthetic images allows us to use a gold standard: over the starting images (slices taken from the MRI volume), we establish 12 uniformly distributed landmarks, and for each of them a random vector is generated that represents the deformation at that point. The deformation is then interpolated with a TPS [11], so the results are not biased by the type of interpolation. This process may generate deformations of a very diverse nature, so we have parametrised the extent of the deformation by its bending energy [11]. Besides, the use of known deformations allows the computation of the exact error at each image point as the Euclidean distance between the estimated displacement vector and its true value. An example of the kind of deformations generated, and the registration result, is given in Fig. 1. The results comprise first the successful registration rate, where a registration is considered successful whenever the mean registration error is below one pixel (i.e. the minimum achievable displacement); we have used all possible combinations of MRI images (PD, T1, and T2) and for each successful registration we have computed the mean error and the 90% confidence interval. Results are shown in Table 1, for a total of 100 experiments with a bending energy of 3.6 (as in Fig. 1), with and without the consistent constraint. Comparing each pair of cases makes it evident that the consistent constraint notably improves not only the robustness (the percentage of successful registrations) but the accuracy (mean error) as well. As a final result, we show in Fig. 2 the behaviour of the mean registration error as the deformation becomes larger (in terms of its bending energy), for the PD/T2 case; we have chosen it as the most adverse case, as shown in Table 1, and even so the advantage of the consistent constraint is evident for reasonably large deformations.
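The error measure and the success criterion described above can be written down directly. The following is a small sketch under the assumption that displacement fields are given as lists of `(dx, dy)` pairs sampled at the same image points; the function names are illustrative only.

```python
import math

def mean_registration_error(estimated, true):
    """Mean Euclidean distance (in pixels) between estimated and true displacements."""
    dists = [math.hypot(ex - tx, ey - ty)
             for (ex, ey), (tx, ty) in zip(estimated, true)]
    return sum(dists) / len(dists)

def is_successful(estimated, true, threshold=1.0):
    """A registration counts as successful when the mean error is below one pixel."""
    return mean_registration_error(estimated, true) < threshold
```

With synthetic deformations the true field is known at every pixel, so this error can be evaluated over the whole image rather than only at a few landmarks.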
Table 1. Registration performance in terms of the successful registration rate, and the mean value and 90% confidence interval of the error of the estimated displacement in pixels, both without and with the consistent constraint

Scene  Model  Without consistent constraint  With consistent constraint
PD     PD     98% (0.55 ± 1.13)              100% (0.51 ± 1.04)
PD     T1     83% (0.78 ± 1.68)              91% (0.63 ± 1.50)
PD     T2     86% (0.77 ± 1.63)              95% (0.78 ± 1.57)
T1     PD     80% (0.82 ± 1.81)              90% (0.75 ± 1.65)
T1     T1     97% (0.61 ± 1.31)              99% (0.56 ± 1.18)
T1     T2     84% (0.82 ± 1.82)              90% (0.75 ± 1.67)
T2     PD     83% (0.79 ± 1.66)              95% (0.70 ± 1.52)
T2     T1     84% (0.79 ± 1.71)              92% (0.73 ± 1.61)
T2     T2     99% (0.55 ± 1.14)              99% (0.51 ± 1.05)
[Plot: Mean registration error vs. bending energy of the deformation (PD/T2). Horizontal axis: bending energy, 0 to 12; vertical axis: mean registration error in pixels, 0.5 to 2. Two curves: without the consistent constraint and with the consistent constraint; the bending energy of 3.6 is marked.]

Fig. 2. Mean registration error versus bending energy of the deformation to recover, with and without the consistent constraint. Curves are a third degree polynomial fit of 400 randomly generated samples of the PD/T2 case.
5 Conclusion
BS are able to represent very complex and large deformations, with little computational load and nice smoothness properties when combined with hierarchical interpolation [12]; all these facts make them an interesting choice in elastic image registration [2,3]. We have presented an efficient algorithm to find a BS transformation that approximates the inverse transformation of another given BS, which allows us to introduce the concept of consistent image registration and all its benefits in the algorithms that make use of BS interpolation. Compared to [5], we have to invert the transformation exactly at only the 2 × M × N BS grid nodes, instead of at all image pixels, drastically reducing the computational effort. Once the transformation is inverted at these points, the algorithm presented in section 3 computes the BS parameters very efficiently. The main drawback of this technique is that the inversion is exact at the grid nodes, but only approximate elsewhere. However, the results demonstrate that it is accurate enough for our purposes, and they confirm the hypothesis that the consistent constraint produces more accurate and robust estimates of the deformation, as has been previously suggested [9,10].

Acknowledgments. The authors want to thank Dr. R. Cárdenes and Dr. J. Cid-Sueiro for their comments. This work was supported by grant numbers TEC0406647-C03-01, from the Comisión Interministerial de Ciencia y Tecnología, Spain, and FP6-507609 SIMILAR Network of Excellence from the European Union.
References
1. Zitova, B., Flusser, J.: Image registration methods: a survey. Image and Vision Computing 21(11), 977–1000 (2003)
2. Rueckert, D., Sonoda, L., Hayes, C., Hill, D., Leach, M., Hawkes, D.: Nonrigid registration using free-form deformations: application to breast MR images. IEEE Trans. on Medical Imaging 18(8), 712–721 (1999)
3. Xiao, G., Brady, J., Noble, J., Burcher, M., English, R.: Nonrigid registration of 3-D free-hand ultrasound images of the breast. IEEE Trans. on Medical Imaging 21(4), 405–412 (2002)
4. Shen, D., Davatzikos, C.: HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans. on Medical Imaging 21(11), 1421–1439 (2002)
5. Christensen, G., Johnson, H.: Consistent image registration. IEEE Trans. on Medical Imaging 20(7), 568–582 (2001)
6. Johnson, H., Christensen, G.: Consistent landmark and intensity-based image registration. IEEE Trans. on Medical Imaging 21(5), 450–461 (2002)
7. Cachier, P., Rey, D.: Symmetrization of the non-rigid registration problem using inversion-invariant energies: application to multiple sclerosis. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI'2000, pp. 472–481 (2000)
8. Thirion, J.P.: Image matching as a diffusion process: an analogy with Maxwell's demons. Medical Image Analysis 2(3), 243–260 (1998)
9. Rogelj, P., Kovacic, S.: Symmetric image registration. Medical Image Analysis 10(3), 484–493 (2006)
10. Zhang, Z., Jiang, Y., Tsui, H.: Consistent multi-modal non-rigid registration based on a variational approach. Pattern Recognition Letters 27(7), 715–725 (2006)
11. Bookstein, F.: Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. on Pattern Anal. and Machine Intell. 11(6), 567–585 (1989)
12. Lee, S., Wolberg, G., Shin, S.: Scattered data interpolation with multilevel B-splines. IEEE Trans. on Visualization and Computer Graphics 3(3), 228–244 (1997)
13. Viola, P., Wells III, W.M.: Alignment by maximization of mutual information. In: Proc. of Fifth International Conference on Computer Vision, 1995, pp. 16–23 (1995)
14. Kwan, R.S., Evans, A., Pike, G.: MRI simulation-based evaluation of image-processing and classification methods. IEEE Trans. on Medical Imaging 18(11), 1085–1097 (1999)
Robust Least-Squares Image Matching in the Presence of Outliers
Patrice Delmas, Georgy Gimel'farb, Al Shorin, and John Morris
Department of Computer Science, Tamaki Campus, The University of Auckland, Auckland, Private Bag 92019, New Zealand
{p.delmas,g.gimelfarb,j.morris}@auckland.ac.nz, [email protected]

Abstract. Although interfering (outlying) details complicate image recognition and retrieval, 'soft masking' of outliers shows considerable promise for robust pixel-by-pixel image matching or reconstruction from principal components (PC). Modeling the differences between two images or between an image and its PC estimate (obtained as a projection onto a subspace of PCs) with a mixed distribution of random noise and outliers, the masks are produced by a simple iterative Expectation-Maximisation based procedure. Experiments with facial images (extracted from the MIT face database) demonstrate the efficiency of this approach.
1 Introduction
Least-squares image matching and image reconstruction from principal components (PC) found by PCA are widely used in computer vision and image retrieval. For PCA, all images in a database must be co-registered to reduce the number of characteristic PCs representing the database and thus optimize the PCA process. However, many undesirable local features, e.g. occlusions in faces caused by hair, glasses, scarves, etc., are not easily handled. A pixel affected by one of these interfering artefacts is usually classified as an outlier to emphasize its divergence from the probability model describing 'valid' signal deviations. Outliers degrade image matching, recognition and reconstruction accuracy. The definition of an outlier depends on the particular recognition or reconstruction problem. For instance, visual occlusions may not be the only outliers. In facial images, the background is likely to vary considerably, so including it in the matching and reconstruction process is a bad strategy. At present, face databases usually contain cropped images which completely eliminate backgrounds, e.g. the MIT database [1] (see Fig. 1). However, it is more natural to regard background pixels as outliers and eliminate them automatically. This paper preserves the traditional least-squares framework as developed in image matching and PCA but attempts to eliminate outliers by soft masking of suspicious pixels. Each pixel is weighted by a mask value ranging from 0 to 1. Random 'valid' and outlying signal variations are modelled with a mixture of their probability distributions. The masks are inferred with an Expectation-Maximisation (EM) algorithm resembling earlier model identification schemes [7,8]. The resulting iterative image reconstruction and matching algorithms are faster and more flexible than more conventional M-estimator or EM-based methods [11,12].

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 776–783, 2007. © Springer-Verlag Berlin Heidelberg 2007
[Figure: face images 01, 03, 05, 07, 09, 11, 13, 15, 17 and 19 of the database subset, the centroid c, and the eigenfaces e1 to e9.]

Eigenfaces and eigenvalues
              e1    e2    e3    e4    e5    e6    e7    e8    e9
λj × 10^-7    4.5   1.4   0.79  0.64  0.40  0.33  0.27  0.22  0.15
νj (%)        50.3  66.7  75.6  82.7  87.2  90.9  94.0  96.5  98.1

Fig. 1. MIT face database [1] subset (10 image pairs were used; only the first image for each face is shown here), its centroid c and first nine grey-coded normalised eigenfaces $e_j$ with eigenvalues $\lambda_j$ and cumulative relative variances $\nu_j = \frac{1}{\Lambda} \sum_{i=1}^{j} \lambda_i$; $\Lambda = \sum_{i=1}^{19} \lambda_i$, represented by the $j$ top eigenfaces; $j = 1, \ldots, 19$
Section 2 introduces our EM-based soft masking for image reconstruction with a simple probability model of errors and outliers. Section 3 shows how a similar model may be used for symmetric EM-based least-squares matching of two noisy images in the presence of outliers. Experiments validating these algorithms are described in Section 4.
2 Robust Image Reconstruction
The reconstruction, $\hat{\mathbf{g}}$, of an original image, $\mathbf{g}$, using characteristic PCs for a certain image database is inevitably imprecise if $\mathbf{g}$ is not in the database or if only a small number of these PCs are kept to represent an image. Let $\{\mathbf{g}_j : j = 1, \ldots, n\}$ be a set of $n$ grayscale images, each containing $p$ pixels. Each image, $\mathbf{g}_j$, is represented by a column vector, $\mathbf{g}_j = [g_{j1}, \ldots, g_{jp}]^T$, where $g_{ji}$ is the signal for pixel $i$ in image $j$ in the mean-deviation form: $g_{ji} = \tilde{g}_{ji} - c_i$, where $\tilde{g}_{ji}$ is the original grey value and $c_i = \frac{1}{n} \sum_{j=1}^{n} \tilde{g}_{ji}$ is the mean grey value for pixel $i$ over the set. Let the database images be represented by a $p \times n$ matrix, $G = [\mathbf{g}_1 \ldots \mathbf{g}_n]$; usually $p \gg n$. The tractable PCA [2,3] finds the $n - 1$ top-rank eigenvectors, $\{\mathbf{e}_j : j = 1, \ldots, n - 1\}$, with (possibly) non-zero eigenvalues for the $p \times p$ covariance matrix, $S = \frac{1}{n-1} G G^T$, by computing the like $n - 1$ eigenvectors and eigenvalues, $\{(\mathbf{e}'_j, \lambda'_j) : j = 1, \ldots, n - 1\}$, of the $n \times n$ matrix, $S' = G^T G$. Because $G (G^T G) \mathbf{e}'_j = \lambda'_j G \mathbf{e}'_j$, the PCs and eigenvalues for $S$ are $\mathbf{e}_j = \frac{1}{|G \mathbf{e}'_j|} G \mathbf{e}'_j$ and $\lambda_j = |G \mathbf{e}'_j| \lambda'_j$; $j = 1, \ldots, n - 1$, respectively. PCA-based image reconstruction projects an input image, $\mathbf{g}$, onto a linear subspace of $m$ eigenvectors, $1 \le m \le n - 1$, ordered by decreasing eigenvalues:

\[
\hat{\mathbf{g}} = \sum_{j=1}^{m} y_j \mathbf{e}_j \quad \text{where} \quad y_j = \mathbf{g}^T \mathbf{e}_j \equiv \sum_{i=1}^{p} g_i e_{ji}
\]
We assume reconstruction error, $\boldsymbol{\varepsilon} = \mathbf{g} - \hat{\mathbf{g}}$, is modeled by an independent random field of pixel errors, $\varepsilon_i = g_i - \hat{g}_i$. Each error, $\varepsilon$, represents either signal noise, which we define here to include all noise inherent to lossy reconstruction, or an outlier. The error probability model is a mixture of a noise distribution, $N(\varepsilon|\sigma)$, with unknown variance, $\sigma^2$, and an outlier distribution, $U(\varepsilon)$:

\[
\Pr(\varepsilon) = \rho N(\varepsilon|\sigma) + (1 - \rho) U(\varepsilon)
\tag{1}
\]

where $1 - \rho$ is an unknown prior probability of outliers. We also assume discrete errors, $\varepsilon \in E = \{-Q + 1, \ldots, -1, 0, 1, \ldots, Q - 1\}$, where $Q$ is the number of grey values (typically, $Q = 256$). For simplicity, let $N(\varepsilon|\sigma)$ be a discrete distribution stemming from the zero-centred normal distribution of noise and $U(\varepsilon)$ be a uniform distribution of differences:

\[
N(\varepsilon|\sigma) = \frac{1}{Z_\sigma} e^{-\frac{\varepsilon^2}{2\sigma^2}}; \quad U(\varepsilon) = \frac{1}{2Q + 1} \quad \text{where} \quad Z_\sigma = \sum_{\delta=-Q+1}^{Q-1} e^{-\frac{\delta^2}{2\sigma^2}}
\tag{2}
\]
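To see how the mixture of Eqs. (1) and (2) separates noise from outliers, the following minimal sketch evaluates both distributions and the Bayes posterior probability that an error was produced by noise. The function names are hypothetical, and the value $U(\varepsilon) = 1/(2Q + 1)$ is taken as printed in Eq. (2); recomputing $Z_\sigma$ on every call is a simplification made for brevity.

```python
import math

def noise_pmf(eps, sigma, Q=256):
    """Discrete zero-centred normal N(eps | sigma) on E = {-Q+1, ..., Q-1}."""
    z = sum(math.exp(-d * d / (2.0 * sigma ** 2)) for d in range(-Q + 1, Q))
    return math.exp(-eps * eps / (2.0 * sigma ** 2)) / z

def outlier_pmf(eps, Q=256):
    """Uniform outlier distribution U(eps), as given in Eq. (2)."""
    return 1.0 / (2 * Q + 1)

def noise_posterior(eps, sigma, rho, Q=256):
    """Posterior probability that the error eps is noise rather than an outlier."""
    num = rho * noise_pmf(eps, sigma, Q)
    return num / (num + (1.0 - rho) * outlier_pmf(eps, Q))
```

For example, with `sigma = 5.0` and `rho = 0.5`, an error of 0 receives a posterior close to 1, while an error of 100 grey levels receives a posterior close to 0, so the corresponding pixel would be almost fully masked out.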
The probability model of Eq. (2) implies that 'signal noise' can be separated from outliers by masking out pixels with significantly larger than expected errors in the reconstructed images. The masks are built by an EM-based iterative procedure of model identification: similar procedures have been used [7,8]. Let $\boldsymbol{\gamma} = [\gamma_1, \ldots, \gamma_p]^T$ be a mask for data vector, $\mathbf{g}$. A binary mask, $\gamma_i \in \{0, 1\}$, excludes ($\gamma_i = 0$) or retains ($\gamma_i = 1$) signals, i.e. converts $\mathbf{g}$ into $\mathbf{g}^\circ$ where $g_i^\circ = g_i \gamma_i$. To fit the conventional EM framework for the maximum likelihood identification of the error model, the mask is replaced with its mathematical expectation as a soft mask [10]. The value $\gamma_i \in [0, 1]$ converges to the probability of the error, $\varepsilon_i$, being induced by signal noise (Hastie et al. [10] refer to this as a 'responsibility'): the smaller the value, the more likely the error is an outlier. The robust EM-based reconstruction algorithm is then:

Input: An image, $\mathbf{g}$, and $m$ eigenvectors, $\mathbf{e}_j$; $j = 1, \ldots, m$.

Initial step $t = 0$: Reconstruct $\mathbf{g}$ with the unit mask, $\boldsymbol{\gamma}^{[0]} = [\gamma_i^{[0]} = 1 : i = 1, \ldots, p]$, so that $\hat{\mathbf{g}}^{[0]} = \sum_{j=1}^{m} \left( \sum_{i=1}^{p} g_i e_{ji} \right) \mathbf{e}_j$. Set the prior, $\rho^{[0]} = 0.5$, to avoid singularities.

Iteration $t = 1, 2, \ldots$: Reset the mask and prior for the errors, $\varepsilon_i^{[t]} = g_i - \hat{g}_i^{[t-1]}$:

\[
\sigma_{[t]}^2 = \frac{\sum_{i=1}^{p} \left( \varepsilon_i^{[t]} \right)^2 \gamma_i^{[t-1]}}{\sum_{i=1}^{p} \gamma_i^{[t-1]}}; \quad
\gamma_i^{[t]} = \frac{\rho^{[t-1]} N\left( \varepsilon_i^{[t]} | \sigma_{[t]} \right)}{\rho^{[t-1]} N\left( \varepsilon_i^{[t]} | \sigma_{[t]} \right) + \left( 1 - \rho^{[t-1]} \right) U\left( \varepsilon_i^{[t]} \right)}; \quad
\rho^{[t]} = \frac{1}{p} \sum_{i=1}^{p} \gamma_i^{[t]},
\]

and update the reconstruction:

\[
\hat{\mathbf{g}}^{[t]} = \sum_{j=1}^{m} \left( \sum_{i=1}^{p} \left( \gamma_i^{[t]} g_i + \left( 1 - \gamma_i^{[t]} \right) \hat{g}_i^{[t-1]} \right) e_{ji} \right) \mathbf{e}_j
\]

Stopping rule: Terminate if $\max_{i=1,\ldots,p} |\gamma_i^{[t]} - \gamma_i^{[t-1]}| \le \theta_w$ or $t > \theta_i$, where $\theta_w$ and $\theta_i$ are fixed empirical thresholds.

3 Robust Symmetric Least-Squares Matching
In this section, we consider robust matching of images distorted by having different contrast (gain) and offset values. Let two images, $g_1$ and $g_2$, be derived from an ideal unknown template, $g = (g_i : i = 1, \ldots, p)$, with differing uniform contrast ($a_1$ and $a_2$) and offset ($b_1$ and $b_2$) values, perturbed by independent per-pixel errors caused either by noise or outliers. For the pixels affected by noise only, $i \in \{1, \ldots, p\}$,
$$g_{1i} = a_1 g_i + b_1 + \varepsilon_{1i}; \quad g_{2i} = a_2 g_i + b_2 + \varepsilon_{2i} \tag{3}$$
where the errors $\varepsilon_{1i}$ and $\varepsilon_{2i}$ have a centred normal distribution with the same variance. Outliers have uniformly distributed signal values and thus vary with no relation to the template transformations, so that conventional least-square matching may not work properly.

To derive the robust version, let a soft mask, $\gamma_i \in [0, 1]$, be applied, as before, to the pair $(g_{1i}, g_{2i})$. The maximum likelihood signal dissimilarity, $D_{12} = \min_\theta \left(-\ln \Pr(g_1, g_2)\right)$, combines non-outliers and outliers. Let $\varepsilon_i$ denote the residual error, or difference between the signals for the pixel, $i$, after their contrast and offset variations are eliminated. Then
$$D_{12} = -\sum_{i=1}^{p} \left[\gamma_i \ln N(\varepsilon_i) + (1 - \gamma_i) \ln U(\varepsilon_i)\right] = \frac{1}{2\sigma^2} \Phi_{12} + \nu \ln(Z_\sigma) + \Psi_{12} \tag{4}$$
where $\nu = \sum_{i=1}^{p} \gamma_i$, $\Psi_{12} = -\sum_{i=1}^{p} (1 - \gamma_i) \ln U(\varepsilon_i)$, and $\Phi_{12}$ is the minimum total squared error with respect to the model parameters, $\theta = \{a_1, a_2, b_1, b_2, g\}$:
$$\Phi_{12} = \min_\theta \sum_{i=1}^{p} \gamma_i \left(\varepsilon_{1i}^2 + \varepsilon_{2i}^2\right) = \min_\theta \sum_{i=1}^{p} \gamma_i \left[(g_{1i} - a_1 g_i - b_1)^2 + (g_{2i} - a_2 g_i - b_2)^2\right] = \sum_{i=1}^{p} \gamma_i \hat{\varepsilon}_i^2 = \frac{1}{2}\left(S_{11} + S_{22} - \sqrt{(S_{11} - S_{22})^2 + 4 S_{12}^2}\right)$$
Here, $S_{jk} = \sum_{i=1}^{p} \gamma_i \hat{g}_{ji} \hat{g}_{ki}$; $j, k = 1, 2$, where $\hat{g}_{ji} = g_{ji} - \mu_j$ and $\mu_j = \frac{1}{\nu} \sum_{i=1}^{p} \gamma_i g_{ji}$, and $\hat{\varepsilon}_i^2 = \hat{\varepsilon}_{1i}^2 + \hat{\varepsilon}_{2i}^2 = (\alpha_2 \hat{g}_{1i} - \alpha_1 \hat{g}_{2i})^2$ is the squared residual per-pixel matching error, where
$$\alpha_1^2 = \frac{1}{2}\left(1 + \frac{S_{11} - S_{22}}{\left[(S_{11} - S_{22})^2 + 4 S_{12}^2\right]^{1/2}}\right); \quad \alpha_2^2 = \frac{1}{2}\left(1 - \frac{S_{11} - S_{22}}{\left[(S_{11} - S_{22})^2 + 4 S_{12}^2\right]^{1/2}}\right)$$
The estimated noise variance is $\sigma^2 = \frac{1}{2\nu} \Phi_{12}$, and the local minimum of $\Phi_{12}$ is obtained by the EM-based iterative procedure that re-evaluates the soft masks and model parameters, similar to Section 2.
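For a fixed mask, the closed-form quantities above are straightforward to evaluate. The following sketch computes $\Phi_{12}$, $\alpha_1$, $\alpha_2$, $\mu_1$, $\mu_2$ and $\sigma^2$ from two image vectors. The function name `symmetric_fit` is hypothetical, and because the text specifies only $\alpha_1^2$ and $\alpha_2^2$, the sign convention ($\alpha_1 \ge 0$, $\alpha_2$ taking the sign of $S_{12}$) is an assumption chosen so that the residual vanishes for perfectly linearly related images.

```python
import math

def symmetric_fit(g1, g2, gamma):
    """Weighted symmetric least-squares fit for a fixed soft mask gamma."""
    nu = sum(gamma)
    mu1 = sum(w * a for w, a in zip(gamma, g1)) / nu
    mu2 = sum(w * b for w, b in zip(gamma, g2)) / nu
    s11 = sum(w * (a - mu1) ** 2 for w, a in zip(gamma, g1))
    s22 = sum(w * (b - mu2) ** 2 for w, b in zip(gamma, g2))
    s12 = sum(w * (a - mu1) * (b - mu2) for w, a, b in zip(gamma, g1, g2))
    root = math.sqrt((s11 - s22) ** 2 + 4.0 * s12 ** 2)
    phi = 0.5 * (s11 + s22 - root)          # minimum total squared error
    ratio = (s11 - s22) / root if root > 0.0 else 0.0
    alpha1 = math.sqrt(max(0.0, 0.5 * (1.0 + ratio)))
    # Assumed sign convention: alpha2 follows the sign of S12.
    alpha2 = math.copysign(math.sqrt(max(0.0, 0.5 * (1.0 - ratio))), s12)
    sigma2 = phi / (2.0 * nu)               # estimated noise variance
    return phi, alpha1, alpha2, mu1, mu2, sigma2
```

For example, with `g2 = 2 * g1 + 3` and a unit mask, the fit returns a zero error and coefficients proportional to (1, 2), as expected for a pure gain and offset change.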
P. Delmas et al.

Fig. 2. Conventional (b) and robust (c) PCA-based reconstruction using the 10 top eigen-faces for the distorted faces DF01-01 to DF01-14; row (a) - distorted version of face (#1); row (d) - grey-coded soft masks (black - 0; white - 1)
The robust symmetric matching algorithm is as follows:

Input: Two images $g_1$ and $g_2$.

Initial step $t = 0$: Match the images with the mask $\gamma^{[0]} = [\gamma_i^{[0]} = 1 : i = 1, \ldots, p]$ in order to find $\Phi_{12}^{[0]}$, $\sigma_{[0]}^2 = \frac{1}{2p} \Phi_{12}^{[0]}$, $D_{12}^{[0]}$ (the conventional matching score), $\alpha_1^{[0]}$, $\alpha_2^{[0]}$, $\mu_1^{[0]}$, $\mu_2^{[0]}$. Set the prior $\rho_{[0]} = 0.5$.

Iteration $t = 1, 2, \ldots$: Reset the mask and prior for the current residual errors $\varepsilon_i^{[t]} = \alpha_2^{[t-1]} \left(g_{1i} - \mu_1^{[t-1]}\right) - \alpha_1^{[t-1]} \left(g_{2i} - \mu_2^{[t-1]}\right)$:
$$\gamma_i^{[t]} = \frac{\rho_{[t-1]} N\left(\varepsilon_i^{[t]} \mid \sigma_{[t-1]}\right)}{\rho_{[t-1]} N\left(\varepsilon_i^{[t]} \mid \sigma_{[t-1]}\right) + (1 - \rho_{[t-1]})\, U\left(\varepsilon_i^{[t]}\right)}; \quad \nu_{[t]} = \sum_{i=1}^{p} \gamma_i^{[t]}; \quad \rho_{[t]} = \frac{\nu_{[t]}}{p}$$
and update $S_{jk}^{[t]}$; $j, k = 1, 2$, $\Phi_{12}^{[t]}$, $\sigma_{[t]}^2 = \frac{1}{2\nu_{[t]}} \Phi_{12}^{[t]}$, $D_{12}^{[t]}$, $\alpha_1^{[t]}$, $\alpha_2^{[t]}$, $\mu_1^{[t]}$, and $\mu_2^{[t]}$.

Stopping rule: Terminate if $|D_{12}^{[t]} - D_{12}^{[t-1]}| \le \theta_r D_{12}^{[t]}$ or $t > \theta_i$, where $\theta_r$ and $\theta_i$ are the fixed thresholds.
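The matching iteration can be sketched as follows. This is only an illustration under stated assumptions: $N$ is taken to be a zero-mean Gaussian density of the scalar residual (so $Z_\sigma = \sqrt{2\pi}\,\sigma$), $U$ is a constant density `1 / signal_range`, the sign of $\alpha_2$ follows $S_{12}$, and the relative stopping test uses $|D_{12}^{[t]}|$ because the score may be negative. The names `symmetric_fit`, `robust_match` and `signal_range` are hypothetical.

```python
import math

def symmetric_fit(g1, g2, gamma):
    """Closed-form weighted fit: returns (phi, alpha1, alpha2, mu1, mu2)."""
    nu = sum(gamma)
    mu1 = sum(w * a for w, a in zip(gamma, g1)) / nu
    mu2 = sum(w * b for w, b in zip(gamma, g2)) / nu
    s11 = sum(w * (a - mu1) ** 2 for w, a in zip(gamma, g1))
    s22 = sum(w * (b - mu2) ** 2 for w, b in zip(gamma, g2))
    s12 = sum(w * (a - mu1) * (b - mu2) for w, a, b in zip(gamma, g1, g2))
    root = math.sqrt((s11 - s22) ** 2 + 4.0 * s12 ** 2)
    ratio = (s11 - s22) / root if root > 0.0 else 0.0
    a1 = math.sqrt(max(0.0, 0.5 * (1.0 + ratio)))
    a2 = math.copysign(math.sqrt(max(0.0, 0.5 * (1.0 - ratio))), s12)
    return 0.5 * (s11 + s22 - root), a1, a2, mu1, mu2

def robust_match(g1, g2, theta_r=1e-4, theta_i=50, signal_range=256.0):
    """Returns the robust score D12 and the final soft mask."""
    p = len(g1)
    u = 1.0 / signal_range                       # assumed uniform density U
    gamma, rho = [1.0] * p, 0.5
    phi, a1, a2, mu1, mu2 = symmetric_fit(g1, g2, gamma)
    sigma2 = max(phi / (2.0 * p), 1e-12)
    # Conventional score D12^[0]: all pixels treated as non-outliers.
    d_prev = phi / (2.0 * sigma2) + p * math.log(math.sqrt(2.0 * math.pi * sigma2))
    d = d_prev
    for _ in range(theta_i):
        eps = [a2 * (x - mu1) - a1 * (y - mu2) for x, y in zip(g1, g2)]
        norm = math.sqrt(2.0 * math.pi * sigma2)
        gamma = []
        for e in eps:
            n = rho * math.exp(-e * e / (2.0 * sigma2)) / norm
            gamma.append(n / (n + (1.0 - rho) * u))
        nu = sum(gamma)
        rho = min(max(nu / p, 1e-6), 1.0 - 1e-6)
        phi, a1, a2, mu1, mu2 = symmetric_fit(g1, g2, gamma)
        sigma2 = max(phi / (2.0 * nu), 1e-12)
        # Eq. (4) with Psi12 = -(p - nu) * ln(u) for a constant U.
        d = (phi / (2.0 * sigma2)
             + nu * math.log(math.sqrt(2.0 * math.pi * sigma2))
             - (p - nu) * math.log(u))
        if abs(d - d_prev) <= theta_r * abs(d):
            break
        d_prev = d
    return d, gamma
```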
4 Experimental Results and Conclusions
A subset of the MIT database [1] containing two images with different lighting for each of the ten persons (in total, 20 normalised 200 × 200 pixel greyscale images) was used (see Fig. 1). The top seven out of 19 eigenfaces built with the tractable PCA [2,3] represent 94% of the overall variance of the original signals.

Reconstruction. 14 distorted versions of the original face image (#1 in Fig. 1) were created: Fig. 2 shows the distorted faces, the two reconstructions and the soft masks used by the robust PCA-based reconstruction. The latter is slightly better than the conventional one for relatively small occlusions (DF01-01 – DF01-04), notably outperforms it for moderate occlusions (DF01-05 – DF01-10), but both fail for large distortions (DF01-11 – DF01-14).

Image (derived from MIT FDB 01)   C1 (closest match)   C2 (2nd match)
DV1                               01: 0.40             02: 1.08
DV1-B                             01: 1.11             02: 1.32
DV2                               01: 1.414            02: 1.415
DV2-B                             01: 1.479            02: 1.495
(entries are database face: $D_{12} \cdot 10^{-5}$; the soft-mask row of the figure is not reproduced)

Fig. 3. Robust matching of images derived from MIT face #1 (Fig. 1): closest (C1) and second closest (C2) scores, $D_{12}$ (Eq. (4)), and grey-coded soft masks
Matching. Figure 3 illustrates robust image-to-image matching. The contrast and offset values of face (#1) were changed significantly to produce images DV1 and DV2. The bottom section of the face was occluded in images DV1-B and DV2-B. Further trials are shown in Figure 4: various features (hair, glasses, beards, etc.) were added to three faces from the MIT database (#1, #3 and #9). The corrupted faces were then matched against the database (Fig. 1). Soft masking of outliers led to a significant increase in recognition and reconstruction accuracy when compared to conventional matching. In 13 out of 24 cases in Fig. 4, best matches using conventional distance measures were incorrect.

Test image   Closest match: D12      Test image   Closest match: D12
M01-a        01: 1.03                M03-a        03: 1.61
M01-b        01: 1.35                M03-b        03: 1.60
M01-c        01: 1.39                M03-c        03: 1.27
M01-d        01: 1.30                M03-d        03: 1.90
M01-e        01: 0.96                M03-e        03: 1.49
M01-f        01: 1.08                M03-f        03: 1.50
M01-g        01: 0.80                M03-g        03: 1.83
M01-h        01: 1.02                M09-a        09: 1.51
M01-i        01: 1.51                M09-b        09: 1.72
M01-j        01: 1.50                M09-c        09: 1.80
M01-k        01: 1.85                M09-d        09: 1.79
M01-l        01: 1.73                M09-e        09: 1.99

Fig. 4. Closest match for the made-up faces ‘01’, ‘03’ and ‘09’ from the MIT database in Fig. 1: the robust score $D_{12}$ (Eq. (4)) and grey-coded soft masks

Conclusions. Iterative readjustment of the mask within an EM framework results in large movements across the parameter space, i.e. a more global search roughly in the direction of the global minimum of the matching score [10]. As a result, our reconstruction and matching algorithms are robust to outliers and faster than other alternatives; thus they hold promise for several applications. These results were obtained on the MIT database [1] with a uniform background. Other face databases with non-trivial backgrounds and more realistic outliers than the artefacts generated here will be considered in future work.

Acknowledgement. This work was partially supported by the University of Auckland Research Committee (UARC) Grant 3607167/9343.
References
1. MIT face database [on-line] (accessed August 24, 2006), http://vismod.media.mit.edu/pub/images/
2. Turk, M., Pentland, A.: Face recognition using eigenfaces. In: Proc. CVPR’91, Lahaina, Maui, Hawaii, June 3–6, 1991, pp. 586–591. IEEE CS Press, Los Alamitos (1991)
3. Turk, M.: Eigenfaces and beyond. In: Zhao, W., Chellappa, R. (eds.) Face Processing: Advanced Modeling and Methods, pp. 55–86. Academic Press (Elsevier), San Diego (2006)
4. Lai, S.-H.: Robust image matching under partial occlusion and spatially varying illumination change. Computer Vision and Image Understanding 78, 84–98 (2000)
5. Huber, P.J.: Robust Statistics. John Wiley & Sons, New York (1981)
6. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. John Wiley & Sons, New York (1987)
7. Gimel’farb, G., Farag, A.A., El-Baz, A.: Expectation-Maximization for a linear combination of Gaussians. In: Proc. 17th ICPR, Cambridge, UK, August 22–27, vol. 3, pp. 422–425. IEEE CS Press, Los Alamitos (2004)
8. Frey, B.J., Jojic, N.: A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Trans. on Pattern Analysis and Machine Intelligence 27, 1392–1416 (2005)
9. Hasler, D., Sbaiz, L., Süsstrunk, S., Vetterli, M.: Outlier modeling in image matching. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 301–315 (2003)
10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Heidelberg (2001)
11. De la Torre, F.D., Black, M.J.: Robust principal component analysis for computer vision. In: Proc. 8th ICCV’01, Vancouver, Canada, July 9–12, 2001, pp. 362–369. IEEE CS Press, Los Alamitos (2001)
12. Skocaj, D., Bischof, H., Leonardis, A.: A robust PCA algorithm for building representations from panoramic images. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 761–775. Springer, Heidelberg (2002)
A Neural Network String Matcher
Abdolreza Mirzaei, Hamidreza Zaboli, and Reza Safabakhsh
Department of Computer Engineering, Amirkabir University, Tehran 15914, Iran
{amirzaei, zaboli, safabakhsh}@aut.ac.ir
Abstract. The aim of this work is to code the string matching problem as an optimization task and to solve this optimization problem by means of a Hopfield neural network. The proposed method uses TCNN, a Hopfield neural network with decaying self-feedback, to find the best-matching (i.e., the lowest global distance) path between an input and a template. The proposed method goes beyond ‘exact’ string matching. For example, wildcard characters as well as characters that never match may be used in either string. It can also compute the edit distance between the two strings. It shows a very good performance in various string matching tasks.
Keywords: String matching, Parallel, Chaotic Neural Network, TCNN, and Optimization using Hopfield NN.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 784–791, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction
The general string-matching problem is to find a shift s at which the pattern X appears in the text. The most straightforward approach to string matching is to test each possible shift s in turn, which is hardly optimal. In the string-matching-with-errors problem, the goal is to find the location in the text where the pattern X is close to a substring, or factor, of the text. This measure of closeness is chosen to be an edit distance. The edit distance is the minimum number of fundamental operations needed to transform the test string into the prototype string. Thus the string-matching-with-errors problem finds use in the same types of problem as basic string matching, the only difference being that there is a certain tolerance for a match [1].

The string matching problems just outlined are conceptually very simple. The challenge arises when the problems are large. Surveys and comparisons of well-known string matching algorithms can be found in [2], [3], [4]. Although several linear time algorithms have been developed in the last two decades, they all need a preprocessing step [4], [5], [6], [7]. Automata-based methods also need a preprocessing step to create an automaton for each input pattern [4], [5], [8]. Among the methods proposed for string-matching-with-errors we can mention dynamic programming methods [9], the generalized Boyer-Moore algorithm [10], and the automata-based method [5], all of which require extra time and space for preprocessing.

Since VLSI technology has developed rapidly, hardware approaches for string matching have also been proposed [11], [12], [13], [14]. Since software algorithms that use preprocessing with look-up table methods cannot be applied to the design of a special-purpose hardware string matcher [15], we need algorithms suitable for hardware implementation. It is known that neural networks are capable of yielding high-quality solutions to large-scale problems [16]. This paper uses a chaotic Hopfield neural network for string matching. The advantages of this method over traditional methods include massive parallelism and convenient hardware implementation. Another advantage is that the Hopfield network procedure is more general, so that a range of string matching problems can be solved by it.

The remainder of this paper is organized as follows. Section 2 reviews the theoretical basis of edit distance algorithms. Section 3 introduces the Hopfield neural network and its application in optimization. Section 4 gives a description of the proposed method and Section 5 discusses the performance of the developed technique. Section 6 derives the main conclusions from this work.
2 Problem Description
We start with the calculation of the edit distance between two strings. In Section 5 we will see that the proposed algorithm can be used to solve other problems too. The edit distance between x and y describes how many fundamental operations are required to transform x into y. These fundamental operations are:
• Substitution: A character in x is replaced by the corresponding character in y.
• Insertion: A character in y is inserted into x, thereby increasing the length of x by one character.
• Deletion: A character in x is deleted, thereby decreasing the length of x by one character.
The calculation of the edit distance between x and y can be visualized by aligning the characters of x along the X-axis and the characters of y along the Y-axis. The calculation of the edit distance is then equivalent to finding the best-matching (i.e. the lowest global distance) path between x and y. Let C be an (m + 1) × (n + 1) matrix of integer costs or distances, indexed from 0, and let δ(·,·) denote a generalization of the Kronecker delta function, having value 1 if the two arguments (characters) match and 0 otherwise. The variables m and n are equal to the lengths of x and y, denoted by |x| and |y|. The basic equations for the calculation of the edit distance are
$$C[0, 0] = 0 \tag{1}$$
$$C[i, 0] = i \quad \text{for } i = 1 \text{ to } m \tag{2}$$
$$C[0, j] = j \quad \text{for } j = 1 \text{ to } n \tag{3}$$
$$C[i, j] = \min\big[\underbrace{C[i-1, j] + 1}_{\text{insertion}},\ \underbrace{C[i, j-1] + 1}_{\text{deletion}},\ \underbrace{C[i-1, j-1] + 1 - \delta(x[i], y[j])}_{\text{no change/exchange}}\big] \tag{4}$$
Equation 4 is used to find the minimum cost in each entry of C, column by column. As can be seen from this equation, the only valid paths to a given point must pass through a point to the left of and/or below it. The final distance C[m, n] gives the overall matching distance for the x and y patterns.
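Equations (1) to (4) translate directly into a short dynamic programming routine. The sketch below is the standard textbook procedure, given as a baseline rather than as the proposed neural method; the function name `edit_distance` is chosen here for illustration, and Python's 0-based string indexing shifts the character indices by one.

```python
def edit_distance(x, y):
    """Edit distance between strings x and y via Eqs. (1)-(4)."""
    m, n = len(x), len(y)
    # C has (m + 1) x (n + 1) entries; C[0][0] = 0 by construction (Eq. 1).
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i                              # Eq. (2)
    for j in range(1, n + 1):
        C[0][j] = j                              # Eq. (3)
    for j in range(1, n + 1):                    # fill column by column
        for i in range(1, m + 1):
            delta = 1 if x[i - 1] == y[j - 1] else 0
            C[i][j] = min(C[i - 1][j] + 1,       # Eq. (4), first term
                          C[i][j - 1] + 1,       # Eq. (4), second term
                          C[i - 1][j - 1] + 1 - delta)  # no change/exchange
    return C[m][n]
```

For instance, `edit_distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion).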
3 The Transiently Chaotic Neural Network
A Hopfield neural network consists of a set of neurons, where the output of each neuron is connected to the input of all other neurons except itself. The weights on these connections are fixed and are constrained to be symmetrical. The network is characterized by an energy function, which decreases as the neural states change over time. Once started, the network relaxes towards a stable state corresponding to a local minimum of its energy function.

The work of Hopfield and Tank [17] showed that neural networks can be applied to solving combinatorial optimization problems. They suggested that a near-optimal solution of the traveling salesman problem (TSP) can be obtained by finding a local minimum of an appropriate energy function, which is implemented by a neural network. Typically, the network energy function is made equivalent to the objective function, which is to be minimized, while each of the constraints of the optimization problem is included in the energy function as a penalty term. Clearly, a constrained minimum of the optimization problem will also optimize the energy function. Unfortunately, a minimum of the energy function does not necessarily correspond to a minimum of the objective function, because the energy function is likely to contain several terms that contribute many local minima. Furthermore, even if the network does manage to converge to a feasible solution, its quality is likely to be poor compared to other techniques, since the Hopfield network is a descent technique and converges to the first local minimum it encounters.

We use the transiently chaotic neural network (TCNN) proposed by Chen and Aihara [18]. TCNN, a Hopfield neural network with decaying self-feedback, has been developed as a new approach to extend the problem solving ability of the standard Hopfield neural network. The difference equations describing the dynamics of the TCNN are as follows:
$$x_{ij}(t) = f(y_{ij}(t)) = \frac{1}{1 + e^{-y_{ij}(t)/\varepsilon}} \tag{5}$$
$$y_{ij}(t+1) = k\, y_{ij}(t) + \alpha \left( \sum_k \sum_l w_{ijkl}\, x_{kl}(t) + I_{ij} \right) - z_{ij}(t) \left( x_{ij}(t) - I_0 \right) \tag{6}$$
$$z_{ij}(t+1) = (1 - \beta)\, z_{ij}(t) \tag{7}$$
where $x_{ij}(t)$ is the output of neuron $ij$, $y_{ij}(t)$ is the internal state of neuron $ij$, $w_{ijkl}$ is the connection weight from neuron $ij$ to neuron $kl$ ($w_{ijkl} = w_{klij}$), $I_{ij}$ is the input bias of neuron $ij$, $\alpha$ is a positive scaling parameter for neuronal input, $k$ is the damping factor of the nerve membrane ($0 \le k \le 1$), $z(t)$ is the self-feedback connection weight (refractory strength), $\beta$ is the damping factor of $z(t)$ ($0 \le \beta \le 1$), $I_0$ is a positive parameter and $\varepsilon$ is the steepness parameter of the output function ($\varepsilon > 0$).
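One update of Eqs. (5) to (7) can be sketched as below. Neurons are flattened into a single index for brevity, the weight matrix is stored densely, and the value `eps=0.004` is only an assumed placeholder, since the steepness parameter is not stated in the text; the names `sigmoid` and `tcnn_step` are hypothetical.

```python
import math

def sigmoid(y, eps):
    """Eq. (5), written so that large |y / eps| cannot overflow."""
    t = y / eps
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def tcnn_step(y, z, W, I, k=0.9, alpha=0.015, beta=0.003, I0=0.65, eps=0.004):
    """One synchronous TCNN update.

    y, z, I: lists with one entry per neuron (internal state, self-feedback
    weight, input bias); W: square list-of-lists of connection weights.
    Returns (y_next, z_next, x), where x holds the outputs at time t.
    """
    x = [sigmoid(v, eps) for v in y]                       # Eq. (5)
    y_next = []
    for a in range(len(y)):
        net = sum(W[a][b] * x[b] for b in range(len(y))) + I[a]
        y_next.append(k * y[a] + alpha * net - z[a] * (x[a] - I0))  # Eq. (6)
    z_next = [(1.0 - beta) * v for v in z]                 # Eq. (7)
    return y_next, z_next, x
```

Because $z(t)$ decays geometrically, the self-feedback term fades and the dynamics gradually approach those of an ordinary Hopfield network.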
4 String Matching Using Hopfield Neural Network
To solve the string matching problem, the Hopfield network is organized as an n × m matrix of neurons, where n and m are the total numbers of characters in the pattern and the text. A neuron $n_{Xi}$ explores the hypothesis that character X in the pattern corresponds to character i in the text. We will consider the constraints existing on the matching process (or that we can impose on that process) and use those constraints to come up with an efficient algorithm. The constraints we shall impose are straightforward and not very restrictive:
• Every character in the pattern and the text must be used in a matching path.
• Monotonic condition: the path will not turn back on itself; both the i and j indexes either stay the same or increase, they never decrease.
• Continuity condition: The path advances one step at a time. Both i and j can only increase by 1 on each step along the path.
• Local distance scores are combined by adding to give a global distance.
We said that every character in the pattern and the text must be used in a matching path. This means that if we take a point (i, j) in the matrix (where i indexes the pattern character and j the text character), then the previous point must be (i − 1, j − 1), (i − 1, j) or (i, j − 1). According to the above-mentioned constraints, we introduce the following energy function:
$$E = \frac{A}{2} \sum_{i=1}^{n} \Big(1 - \sum_{j=1}^{m} P_{ij}\Big)^2 + \frac{B}{2} \sum_{j=1}^{m} \Big(1 - \sum_{i=1}^{n} P_{ij}\Big)^2 + \frac{C}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k>i} \sum_{l<j} P_{ij} P_{kl} + \frac{D}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k<i} \sum_{l>j} P_{ij} P_{kl} + \frac{E}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} P_{ij}\, d_{ij} \tag{8}$$
where A, B, C, D and E are positive constants and $P_{ij}$ is the output of neuron $n_{ij}$, which takes on values from 0 to 1. The first two terms basically enforce the uniqueness constraint. It can be seen that if more than one neuron is excited in one row or column of the network, the energy of the system tends to increase. This constraint is used (with small coefficients A and B) in order to make the solution more monotonic (if one character is matched with two characters, one of them must be deleted or inserted to make the two strings equal). The minimum energy is obtained only if one of the neurons in each column or row is excited. The third and fourth terms are the inhibiting terms that implement the fact that the matching paths cannot go backwards in time. For any ON neuron, they inhibit its upper left and lower right areas, as shown in Figure 1.
Fig. 1. Inhibiting areas due to monotonicity constraint
The fifth term is the main term, which is used to enforce the similarity constraint. If the two characters are equal, their contribution to the energy function is lower, and the two characters form a correct matching pair with a high probability. $d_{ij}$ is given by
$$d_{ij} = 1 - \delta(x[i], y[j]) \tag{9}$$
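To make the five terms of Eq. (8) concrete, the sketch below evaluates the energy for a given output matrix. It is a direct, unoptimized transcription (the quadruple sums cost O(n²m²)); the function name `energy` and the argument layout are chosen here for illustration, with `P[i][j]` and `d[i][j]` indexed by pattern character i and text character j.

```python
def energy(P, d, A, B, C, D, E):
    """Energy of Eq. (8) for outputs P and local distances d (Eq. 9)."""
    n, m = len(P), len(P[0])
    # Terms 1 and 2: one active neuron per row and per column.
    rows = sum((1.0 - sum(P[i])) ** 2 for i in range(n))
    cols = sum((1.0 - sum(P[i][j] for i in range(n))) ** 2 for j in range(m))
    # Terms 3 and 4: penalize pairs of active neurons that violate monotonicity.
    back_c = sum(P[i][j] * P[k][l]
                 for i in range(n) for j in range(m)
                 for k in range(i + 1, n) for l in range(j))
    back_d = sum(P[i][j] * P[k][l]
                 for i in range(n) for j in range(m)
                 for k in range(i) for l in range(j + 1, m))
    # Term 5: accumulated local distance along the active neurons.
    cost = sum(P[i][j] * d[i][j] for i in range(n) for j in range(m))
    return 0.5 * (A * rows + B * cols + C * back_c + D * back_d + E * cost)
```

For two identical two-character strings, the diagonal assignment has zero energy, whereas the anti-diagonal assignment is penalized by both the monotonicity terms and the distance term.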
The objective now is to compute the network parameters, which include the external inputs of the network and the network connection weights. Comparing Eq. (8) with the energy function of a Hopfield neural network, given in Eq. (10), we can compute these parameters.
$$H_H = -\frac{1}{2} \sum_i \sum_j \sum_k \sum_l W_{ij,kl}\, P_{ij} P_{kl} - \sum_i \sum_j I_{ij} P_{ij} \tag{10}$$
where each neuron has the output level $P_{ij}$ (bounded by zero and one) and the bias current (or negative threshold) $I_{ij}$. The weights, which determine the strength of the connections from $ij$ to $kl$, are given by $W_{ij,kl}$.
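The comparison of Eq. (8) with Eq. (10) is not carried out explicitly in the text. One consistent assignment, obtained here by expanding the squares in Eq. (8), dropping the constant $(An + Bm)/2$, and symmetrizing the quadratic terms, is sketched below; it is offered as a worked reading of the two equations rather than as the authors' stated parameters. Here $\delta_{ik}$ is the ordinary Kronecker delta on indices and $\mathbb{1}[\cdot]$ is an indicator.

```latex
\begin{align*}
W_{ij,kl} &= -A\,\delta_{ik} - B\,\delta_{jl}
  - \frac{C + D}{2}\Big(\mathbb{1}[k > i,\ l < j] + \mathbb{1}[k < i,\ l > j]\Big),\\
I_{ij} &= A + B - \frac{E}{2}\, d_{ij}.
\end{align*}
```

With this choice $W_{ij,kl} = W_{kl,ij}$, as required for a Hopfield network. Note that the expansion also yields a self-coupling of $-(A + B)$ for $(k, l) = (i, j)$; whether that term is kept or absorbed elsewhere is not specified in the text.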
5 Experimental Results
To evaluate the proposed method, two types of experiment were performed. The TCNN parameters used in the experiments are k = 0.9, $I_0$ = 0.65, z(1) = 0.08, β = 0.003 and α = 0.015, which are the same as the ones used by Chen and Aihara [18]. In our experiments every neuron $n_{ij}$ is initialized to a value inversely proportional to $d_{ij}$. The criterion
$$\sum_{i,j=1}^{N} \left| P_{ij}(t+1) - P_{ij}(t) \right| < 5 \times 10^{-5}$$
is used to decide whether the network has converged. At the end of each experiment, every neuron that has an output greater than a predefined threshold (0.5) is considered as fired. A post-processing step is performed for rows and columns with no fired neuron, in which the greatest activation in that row or column is replaced by 1 (the neuron is considered as fired).

The first experiment illustrates the use of TCNN in exact string matching. In this case, the number of insert operations (right moves) preceding the first matched character gives a valid shift (i.e. the variable s). In this experiment, random characters are added to the pattern “book” to create texts of 10, 15, 20, 40, 60, 80, 100 and 120 characters. Then the ability of the network to find a valid shift (for which the pattern “book” appears in the text) is evaluated. Table 1 summarizes the results of 100 different runs in terms of the rate of correct estimation of a valid shift and the number of iterations for convergence. These results are shown in Figure 2. We note that, for a constant value of β, the number of iterations for convergence becomes asymptotically constant as the text size increases, so that the text size has no significant effect on it. On the other hand, the accuracy of matching degrades as the text size increases. This is because, with the same number of iterations, the network does not have enough time to reach the optimal solution for larger matching problems. The value of β must therefore be selected such that the number of iterations for convergence increases for larger problems.

Table 1. The results of using TCNN to find a valid shift with different text size
Length of reference string                        10      15      20      40      60      80      100     120
Rate of correct estimation of valid shift  Mean   100     100     100     97.8    96.6    87      79.8    76.2
                                           Std    0       0       0       4.91    4.27    7.42    11.01   12.19
Number of iterations for convergence       Mean   123.24  126.56  127.96  130.36  130.94  133.18  134     134.62
                                           Std    2.003   1.502   3.002   2.669   1.595   1.354   0.869   0.918

Fig. 2. The results of using TCNN to find a valid shift with different text size. (a) Rate of correct estimation of valid shift (accuracy (%) versus text size) (b) Number of iterations for convergence (versus text size).

To assess the effect of β on network convergence and on the quality of the obtained paths, we analyzed the matching of a text of size 40 and the pattern “book” with β = 0.001, 0.003, 0.005, 0.01, 0.015 and 0.02 for 100 runs. The results of this experiment are shown in Table 2 and Figure 3. In TCNN, the variable z(t) corresponds to the temperature in the stochastic annealing process and β determines the cooling speed. By increasing β, the chaotic phase of the network finishes earlier and the network converges with fewer iterations (as can be seen in Figure 3(b)). However, the searching capability of the network, and therefore the quality of the solutions, decreases in this case (see Figure 3(a)). So, in order to achieve acceptable results when matching large texts, β must be selected small enough.

Table 2. The results of matching a text of size 40 and the pattern “book” with different values of β
β                                                 0.001     0.003   0.005    0.01     0.015    0.02
Rate of correct estimation of valid shift  Mean   98.2      97.8    96.9     96.7     95.6     93.00
                                           Std    3.8239    4.91    4.99     5.31     5.68     6.15
Number of iterations for convergence       Mean   229.1900  130.36  84.2300  49.4900  35.8400  28.5300
                                           Std    29.5626   2.669   1.5535   0.4886   0.2716   0.2359

Fig. 3. The results of using TCNN to find a valid shift with different values of the β parameter. (a) Rate of correct estimation of valid shift (accuracy (%) versus β) (b) Number of iterations for convergence (versus β).

The second experiment deals with the calculation of the edit distance between two strings. This experiment shows the capability of the approach to perform approximate matching (i.e. matching with errors). To compute the edit distance, we count the extra ON neurons in rows and columns that have more than one neuron ON and add the results to the matching cost. The matching cost for two equal characters is zero. In order to provide a quantitative evaluation of the proposed method in the calculation of the edit distance between two strings, the text “We are all pencils in the hand of God”* is corrupted by randomly substituting, deleting or adding a character from/to it, and then the modified text is matched against the original text. The result is considered optimal if the edit distance computed from the network is the same as the one computed by the well-known dynamic programming algorithm and if the derived path is a valid path. Table 3 shows the ratio of finding the correct edit distance in 100 runs. This table indicates that, if the distortion is moderate, the network will find the optimal edit distance with a high probability.

Table 3. The result of finding the edit distance between two strings: a fraction of a sentence is corrupted and then it is matched against the original one
Distortion                                             5%    10%   20%   30%   40%
The ratio of finding the correct edit distance (%)    100   97    91    79    67

* Mother Teresa.
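The convergence test and the read-out of fired neurons described at the start of this section can be sketched as follows. The text does not say whether empty rows or empty columns are repaired first, so the order used here (rows, then columns) is an assumption, as are the function names `converged` and `fire`.

```python
def converged(P_new, P_old, tol=5e-5):
    """Sum of absolute output changes between two iterations below tol."""
    return sum(abs(a - b)
               for row_new, row_old in zip(P_new, P_old)
               for a, b in zip(row_new, row_old)) < tol

def fire(P, threshold=0.5):
    """Binary read-out: threshold, then repair rows/columns with no fired neuron."""
    n, m = len(P), len(P[0])
    F = [[1 if v > threshold else 0 for v in row] for row in P]
    for i in range(n):                       # rows with no fired neuron
        if not any(F[i]):
            best = max(range(m), key=lambda j: P[i][j])
            F[i][best] = 1
    for j in range(m):                       # columns with no fired neuron
        if not any(F[i][j] for i in range(n)):
            best = max(range(n), key=lambda i: P[i][j])
            F[best][j] = 1
    return F
```

The resulting binary matrix is the matching path from which the shift or the edit distance is then read off.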
6 Conclusion
The main contribution of this paper is a demonstration that the string matching problem can be formulated as an optimization task. The advantages of this method, which uses a Hopfield network to solve the resulting optimization problem, over traditional methods include massive parallelism and convenient hardware implementation. Another advantage is that the Hopfield network procedure is more general, so that a range of string matching problems can be solved by it.
References
1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley-Interscience, Chichester (2000)
2. Aho, A.V.: Algorithms for Finding Patterns in Strings. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, vol. A, pp. 257–297. Elsevier Science Publishers, Amsterdam (1990)
3. Jokinen, P., Tarhio, J., Ukkonen, E.: A Comparison of Approximate String Matching Algorithms. Software-Practice and Experience 26, 1439–1457 (1996)
4. Lecroq, T.: Experimental Results on String Matching Algorithms. Software-Practice and Experience 25, 727–765 (1995)
5. Baeza-Yates, R., Gonnet, G.H.: A New Approach to Text Searching. Comm. ACM 35, 74–81 (1992)
6. Cole, R., Hariharan, R., Paterson, M., Zwick, U.: Tighter Lower Bounds on the Exact Complexity of String Matching. SIAM J. Comput. 24, 30–45 (1995)
7. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. McGraw-Hill, New York (1990)
8. Knuth, D.E., Morris, J., Pratt, V.: Fast Pattern Matching in Strings. SIAM J. Comput. 6, 323–350 (1977)
9. Wagner, R.A.: The String-to-String Correction Problem. Journal of the ACM 21, 168–173 (1974)
10. Tarhio, J., Ukkonen, E.: Approximate Boyer-Moore String Matching. SIAM J. Comput. 22, 243–260 (1993)
11. Cheng, H.D., Fu, K.S.: VLSI Architecture for String Matching and Pattern Matching. Pattern Recognition 20, 125–141 (1987)
12. Isenman, M.E., Shasha, D.E.: Performance and Architectural Issues for String Matching. IEEE Trans. on Computers 39(2), 238–250 (1990)
13. Mukherjee, A.: Hardware Algorithms for Determining Similarity between Two Strings. IEEE Trans. on Computers 38(4), 600–603 (1989)
14. Sastry, R., Ranganathan, N., Remedios, K.: CASM: A VLSI Chip for Approximate String Matching. IEEE Trans. on Pattern Analysis and Machine Intelligence 17, 824–830 (1995)
15. Park, J.H., George, K.M.: Parallel String Matching Algorithms Based on Dataflow. In: Proceedings of the 32nd Hawaii International Conference on System Sciences, vol. Track 3, IEEE, Los Alamitos (1999)
16. Tagliarini, G.A., Christ, J.F., Page, E.W.: Optimization using neural networks. IEEE Trans. on Computers 40(12), 1347–1358 (1991)
17. Hopfield, J.J., Tank, D.W.: ‘Neural’ Computation of Decisions in Optimization Problems. Biological Cybernetics 52, 141–152 (1985)
18. Chen, L., Aihara, K.: Chaotic simulated annealing by a neural network model with transient chaos. Neural Networks 8(6), 915–930 (1995)
Incorporating Spatial Information into 3D-2D Image Registration
Guoyan Zheng
MEM Research Center, University of Bern, Stauffacherstrasse 78, CH-3014, Bern, Switzerland
[email protected]

Abstract. This paper addresses the problem of estimating the 3D rigid pose of a CT volume of an object from its 2D X-ray projections. We use maximization of mutual information, an accurate similarity measure for multi-modal and mono-modal image registration tasks. However, it is known that the standard mutual information measure only takes intensity values into account without considering spatial information and its robustness is questionable. In this paper, instead of directly maximizing mutual information, we propose to use a variational approximation derived from the Kullback-Leibler bound. Spatial information is then incorporated into this variational approximation using a Gibbs random field model. The newly derived similarity measure has a least-squares form and can be effectively minimized by a multi-resolution Levenberg-Marquardt optimizer. Experimental results are presented on X-ray and CT datasets of a plastic phantom and a cadaveric spine segment.
Keywords: mutual information, 3D-2D registration, X-ray, CT, Gibbs random field, Kullback-Leibler bound.
1 Introduction
3D-2D registration of a three-dimensional (3D) CT volume with two-dimensional (2D) X-ray images has shown great potential in a number of image-guided therapy applications. The reported techniques to achieve this registration can be split into two main categories: feature-based methods and intensity-based methods. Feature-based methods require a prerequisite segmentation stage which is error-prone and hard to achieve automatically. The errors in segmentation can lead to errors in the final registration. In contrast, intensity-based methods directly compare the X-ray image with the associated digitally reconstructed radiograph (DRR), which is obtained by simulating X-ray projection of the CT volume. No segmentation is required. In this work, we use maximization of mutual information (MI), an accurate similarity measure for multi-modal and mono-modal image registration tasks [1][2][3]. However, it is known that the standard mutual information measure only takes intensity values into account without considering spatial information and its robustness is questionable [4].

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 792–800, 2007. © Springer-Verlag Berlin Heidelberg 2007
Several attempts have been made to adapt the MI-based registration framework to incorporate spatial information of individual images [4][5][6][7][8]. However, the resultant similarity measure either requires computing the entropy of higher dimensional probability distributions, which is not advisable because statistical uncertainty increases with dimension due to the scarcity of data, or is not robust to outliers. In this paper, instead of directly maximizing mutual information, we propose to use a variational approximation derived from the Kullback-Leibler bound [9]. Spatial information is then incorporated into this variational approximation using a Gibbs random field (GRF) model [10]. The newly derived similarity measure has a least-squares form and can be effectively minimized by a multi-resolution Levenberg-Marquardt non-linear optimizer.

The paper is organized as follows. In Section 2, we describe the proposed approach. In Section 3, we present the experimental results, followed by conclusions in Section 4.
2 The Proposed Approach
2.1 Derivation of a Variational Approximation to the MI
In this work, we assume that the X-ray images are calibrated for their intrinsic parameters and corrected for distortion. If multiple X-ray images are used, they are all registered to a common reference frame. Therefore, the goal of a 3D-2D registration is to compute the rigid transformation T that relates the coordinate frame of the CT volume to the reference coordinate frame of the X-ray images. In the following, we focus on the derivation based on the qth X-ray image (where q = 1, ..., Q) and its associated DRR. Let us denote the values of the X-ray image (V) as v(x) and the corresponding values of the DRR (U) created from the CT volume given the current transformation estimate as u(x; T). In this work, we regard the image values v(x) and u(x; T) as random variables with associated probability density functions p(v(x)) and p(u(x; T)), respectively. The joint probability density function of these two random variables is p(v(x), u(x; T)). The conditional probability density function of v(x) given the values of u(x; T) is expressed as p(v(x)|u(x; T)). The mutual information of two random variables is derived from the entropy values of the variables, both separately and jointly, as given by:

\[ H(V) = -\int p(v)\,\log p(v)\,dv; \qquad H(V, U) = -\iint p(v, u)\,\log p(v, u)\,dv\,du \tag{1} \]

and the conditional entropy of two random variables is:

\[ H(V \mid U) = -\iint p(v, u)\,\log p(v \mid u)\,dv\,du \tag{2} \]

The entropy can be seen as a measure of the uncertainty of a random variable. The mutual information between two random variables is defined by:

\[ S^{q}_{MI}(V, U, T) = H(V) + H(U) - H(V, U) = H(V) - H(V \mid U) \tag{3} \]
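As a numerical illustration of Eqs. 1-3, the following minimal Python sketch evaluates the entropies and the mutual information of a discrete joint distribution, such as a normalized joint histogram of two images. The function names and the list-of-lists layout `joint[v][u]` are assumptions made for this illustration and do not come from the paper; sums replace the integrals because the variables are discrete.

```python
import math

def entropy(probs):
    """Shannon entropy -sum p log p of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """Return the two expressions of Eq. 3 for joint[v][u] = p(v, u)."""
    p_v = [sum(row) for row in joint]          # marginal p(v)
    p_u = [sum(col) for col in zip(*joint)]    # marginal p(u)
    h_v, h_u = entropy(p_v), entropy(p_u)
    h_vu = entropy([p for row in joint for p in row])
    # Conditional entropy H(V|U) = -sum p(v,u) log p(v|u), Eq. 2
    h_v_given_u = -sum(
        p * math.log(p / p_u[b])
        for row in joint
        for b, p in enumerate(row)
        if p > 0
    )
    return h_v + h_u - h_vu, h_v - h_v_given_u

# Perfectly dependent binary variables: both expressions equal log 2.
mi_a, mi_b = mutual_information([[0.5, 0.0], [0.0, 0.5]])
print(mi_a, mi_b, math.log(2))
```

For independent variables, for example `[[0.25, 0.25], [0.25, 0.25]]`, both expressions evaluate to zero up to rounding.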
G. Zheng
After some substitution, we can write Eq. 3 as:

\[ S^{q}_{MI}(V, U, T) = \iint p(v(x), u(x; T))\,\log p(v(x) \mid u(x; T))\,dv\,du + H(V) \tag{4} \]
The optimal estimate of the rigid transformation can then be obtained by:

\[ \hat{T} = \arg\max_{T} \sum_{q=1}^{Q} S^{q}_{MI}(V, U, T) \tag{5} \]
where Q is the number of input X-ray images. Eq. 5 is the standard registration framework using maximization of mutual information. Histogram-based methods [2] as well as Parzen-window-based methods [1] have been proposed to compute the mutual information. It is known that the standard mutual information measure takes only intensity values into account without considering spatial information, and its robustness is questionable. It can be shown using the Kullback-Leibler bound [9] that:

\[ S^{q}_{MI}(V, U, T) \ge \iint p(v(x), u(x; T))\,\log q(v(x) \mid u(x; T))\,dv\,du + H(V) \tag{6} \]

where q(v(x)|u(x; T)) is an arbitrary variational distribution. We call the right side of Eq. 6 the variational approximation to mutual information (VA-MI) and denote it as $S^{q}_{VA-MI}(V, U, T)$. The approximation is exact if q(v(x)|u(x; T)) ≡ p(v(x)|u(x; T)). As we are dealing with discrete images, the values v(x = (i, j)) and u((x = (i, j)); T) that we observe from the images can be regarded as random samples from p(v(x), u(x; T)). Note that H(V) in Eq. 6 does not depend on T. Ignoring this constant term, we can further approximate $S^{q}_{VA-MI}(V, U, T)$ by its sample estimate:

\[ S^{q}_{VA-MI}(V, U, T) \approx \frac{1}{I \times J} \sum_{i=1}^{I} \sum_{j=1}^{J} \log q(v(i, j) \mid u((i, j); T)) \tag{7} \]
where I × J is the size of the X-ray image in pixels. Using Eq. 7, we convert the maximization of mutual information into an optimal labeling problem in which the labels are the conditional intensity values (v(i, j)|u((i, j); T)). Such a problem can be effectively solved using a Gibbs random field model.

2.2 Effective Realization of the Variational Approximation
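Before the bound of Eq. 6 (Section 2.1) is realized with a GRF, it can be checked numerically on a small discrete example. The sketch below is only an illustration under stated assumptions: the helper names `conditional` and `va_mi` and the 2 × 2 joint distribution are hypothetical, and sums replace the integrals. It evaluates the right side of Eq. 6 once with the true conditional p(v|u), where the bound is tight, and once with an uninformative q, where it is strictly lower.

```python
import math

def conditional(joint):
    """True conditional p(v|u) computed from joint[v][u] = p(v, u)."""
    p_u = [sum(col) for col in zip(*joint)]
    return [[p / p_u[b] for b, p in enumerate(row)] for row in joint]

def va_mi(joint, q):
    """Right side of Eq. 6: sum p(v,u) log q(v|u) + H(V), with q[v][u] = q(v|u)."""
    p_v = [sum(row) for row in joint]
    h_v = -sum(p * math.log(p) for p in p_v if p > 0)
    cross = sum(
        p * math.log(q[a][b])
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )
    return cross + h_v

joint = [[0.4, 0.1], [0.1, 0.4]]
exact = va_mi(joint, conditional(joint))         # q = p: equals the mutual information
loose = va_mi(joint, [[0.5, 0.5], [0.5, 0.5]])   # uninformative q: lower value
print(exact, loose)
```

Any valid q (columns summing to one) yields a value no larger than `exact`, which is why maximizing the approximation over q tightens it towards the mutual information.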
Gibbs Random Field Theory. Gibbs random field theory is a branch of probability theory that relates the probabilities of the various possible states of a system to the energies associated with them. It has been used extensively in image restoration, segmentation, object recognition and matching [10]. In what follows, the basic concepts of the GRF are reviewed for completeness. For rigorous expositions, one may refer to [10].
Incorporating Spatial Information into 3D-2D Image Registration
Let $L = \{(i, j) : 1 \le i \le I,\ 1 \le j \le J\}$ be an I × J integer lattice; then $D = \{D_{i,j};\ (i, j) \in L\}$ denotes a family of random variables, i.e., a random field, defined on L. A system d can be viewed as a discrete sample realization of D that assumes a certain state at each pixel site. $d = \{d_{1,1}, d_{1,2}, \cdots, d_{I,J}\}$ is referred to as a configuration of D. The complete set of all configurations is denoted as $\mathcal{D}$.

Definition 1. A neighborhood system for L is defined as $N = \{N^{r}_{i,j};\ (i, j) \in L\}$, where $N^{r}_{i,j}$ is the set of sites around (i, j) and is defined as follows:

\[ N^{r}_{i,j} = \{(i', j') \mid (i', j') \in L,\ (i', j') \neq (i, j),\ |(i', j') - (i, j)| \le r\} \tag{8} \]
where r is a positive integer that determines the size of the neighborhood system.

Definition 2. A clique c is a subset of L in which every pair of sites are neighbors. Single pixels are also considered cliques. The set of all cliques related to the pixel site (i, j) is denoted by $C_{i,j}$.

Definition 3. D is a Gibbs random field with respect to (w.r.t.) the neighborhood system N if and only if:

\[ p(d) = \frac{1}{z}\, e^{-E(d)} \tag{9} \]

where p(d) is called a Gibbs measure of d; z is a normalization constant called the partition function and E(d) is the energy function of the form:

\[ E(d) = \sum_{i,j}^{I,J} \sum_{c \in C_{i,j}} W_c(d_{i,j}) \tag{10} \]
where $W_c$ is called the clique potential. Generally, $W_c$ is a function of the cliques around the site under consideration.

Realization of the Variational Approximation Using a GRF. To estimate the conditional distribution q(v(x)|u(x; T)), knowing the exact values of v(i, j) and u((i, j); T) is not important. We are more interested in knowing the conditional difference between v(i, j) and u((i, j); T). Having this knowledge and using the concept of the Gibbs measure, we can approximate the conditional distribution q(v|u) using a GRF w.r.t. the neighborhood system N defined on the lattice $L = \{(i, j) : 1 \le i \le I,\ 1 \le j \le J\}$ of the X-ray image. However, we still cannot directly compare v(i, j) to u((i, j); T) at each pixel site because there are inherent differences between the X-ray image and the associated DRR. In this paper, we propose to use a local normalization to circumvent this problem. The rationale behind it is that, in a local region, the intensity differences between different sites are mainly caused by the imaged object, provided no external object is present in the field of view.

Definition 4. A local region of size r for the pixel site (i, j) ∈ L is the set of sites defined by:

\[ R^{r}_{i,j} = \{(i', j') \mid (i', j') \in L,\ |(i', j') - (i, j)| \le r\} \tag{11} \]
The local normalization of both the X-ray image and the associated DRR is then performed as follows:

\[ \bar{v}(i, j) = \frac{v(i, j) - m_v(R^{r}_{i,j})}{\sigma_v(R^{r}_{i,j})}; \quad \text{and} \quad \bar{u}((i, j); T) = \frac{u((i, j); T) - m_u(R^{r}_{i,j})}{\sigma_u(R^{r}_{i,j})} \tag{12} \]

where $m_v(R^{r}_{i,j})$, $\sigma_v(R^{r}_{i,j})$ and $m_u(R^{r}_{i,j})$, $\sigma_u(R^{r}_{i,j})$ are the mean value and the standard deviation calculated from the intensity values of all sites in the local region $R^{r}_{i,j}$ of the X-ray image and of the associated DRR, respectively. We can now model the difference image

\[ s((i, j); T) = \bar{v}(i, j) - \bar{u}((i, j); T) \tag{13} \]
as a GRF w.r.t. the neighborhood system N defined on the lattice $L = \{(i, j) : 1 \le i \le I,\ 1 \le j \le J\}$. According to the relationship between the probability measure and the energy function of a GRF at a single site, we have:

\[ \log q(v(i, j) \mid u((i, j); T)) \approx \log p(s((i, j); T)); \quad \text{and} \quad \log p(s((i, j); T)) = -E(s((i, j); T)) = -\sum_{c \in C_{i,j}} W_c(s((i, j); T)) \tag{14} \]
We can further expand the clique potentials in Eq. 14 according to the clique size. In this work, we only consider cliques of size up to two. Using such an approximation, we derive a new similarity measure. We call it the GRF model based variational approximation to mutual information (GRF-VA-MI) and denote it as $S^{q}_{GRF-VA-MI}(V, U, T)$. It has the form:

\[ -S^{q}_{VA-MI}(V, U, T) \approx S^{q}_{GRF-VA-MI}(V, U, T) = \sum_{i,j}^{I,J} W_c(s((i, j); T)) + \sum_{i,j}^{I,J} \frac{1}{\mathrm{card}(N^{r}_{i,j})} \cdot \sum_{(i', j') \in N^{r}_{i,j}} W_c(s((i, j); T), s((i', j'); T)) \tag{15} \]
where $S^{q}_{VA-MI}(V, U, T)$ is negated because mutual information is maximized, whereas energy must be minimized. The first term on the right-hand side is the potential function for single-pixel cliques and the second term is the potential function for all pairwise cliques. $\mathrm{card}(N^{r}_{i,j})$ denotes the number of pixels in the neighborhood $N^{r}_{i,j}$. The selection of the potential function in Eq. 15 is a critical issue in GRF modeling [10]. How this selection affects the registration performance is worth investigating in the future. In this work, we simply use the following form:

\[ W_c(s((i, j); T)) = [s((i, j); T)]^2, \qquad W_c(s((i, j); T), s((i', j'); T)) = [s((i, j); T) - s((i', j'); T)]^2 \tag{16} \]
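The chain from Eq. 12 to Eq. 16 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function names are hypothetical, images are plain lists of lists, the distance in Eqs. 8 and 11 is taken to be the maximum norm (a square window), windows are clipped at the image border, and pixels with zero local deviation are set to zero.

```python
import math

def window(rows, cols, i, j, r):
    """Lattice sites (a, b) with max(|a - i|, |b - j|) <= r (assumed norm), clipped."""
    return [(a, b)
            for a in range(max(0, i - r), min(rows, i + r + 1))
            for b in range(max(0, j - r), min(cols, j + r + 1))]

def local_normalize(img, r):
    """Eq. 12: subtract the local mean, divide by the local standard deviation."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            vals = [img[a][b] for a, b in window(rows, cols, i, j, r)]
            mean = sum(vals) / len(vals)
            std = math.sqrt(sum((x - mean) ** 2 for x in vals) / len(vals))
            # Flat regions have zero deviation; they are left at 0 (an assumption).
            out[i][j] = (img[i][j] - mean) / std if std > 1e-12 else 0.0
    return out

def grf_va_mi(v, u, r):
    """Eq. 15 with the quadratic potentials of Eq. 16 (lower means more similar)."""
    rows, cols = len(v), len(v[0])
    vb, ub = local_normalize(v, r), local_normalize(u, r)
    s = [[vb[i][j] - ub[i][j] for j in range(cols)] for i in range(rows)]  # Eq. 13
    energy = 0.0
    for i in range(rows):
        for j in range(cols):
            energy += s[i][j] ** 2                       # single-pixel clique
            nbrs = [(a, b) for a, b in window(rows, cols, i, j, r) if (a, b) != (i, j)]
            if nbrs:                                     # pairwise cliques, averaged
                energy += sum((s[i][j] - s[a][b]) ** 2 for a, b in nbrs) / len(nbrs)
    return energy

v = [[(3 * i + 5 * j * j) % 11 for j in range(6)] for i in range(6)]
u_same = [[2 * x + 5 for x in row] for row in v]   # same structure, other intensity scale
u_shift = [row[1:] + row[:1] for row in v]          # misaligned by one pixel
print(grf_va_mi(v, u_same, 1), grf_va_mi(v, u_shift, 1))
```

Because of the local normalization, a linear change of intensity between the two images leaves the measure at (numerically) zero, while the misaligned image yields a clearly positive energy.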
Now the registration framework described in Eq. 5 is changed to:

\[ \hat{T} = \arg\min_{T} \Big[ \sum_{q=1}^{Q} S^{q}_{GRF-VA-MI}(V, U, T) \Big] \tag{17} \]
Fig. 1. Experimental datasets. Left two images: phantom and one of the X-ray images; right two images: volume rendering result showing the location of one of the fiducial markers and one of the X-ray images with the projection of the interventional instrument (delineated by a black box).

Table 1. Data specifications

CT Data Specification
test object   rows × columns × slices   data res. (mm³)       start res. (mm³)      end res. (mm³)
phantom       512 × 512 × 93            0.36 × 0.36 × 2.5     2.88 × 2.88 × 2.5     2.88 × 2.88 × 2.5
spine         512 × 512 × 72            0.36 × 0.36 × 1.25    2.88 × 2.88 × 1.25    2.88 × 2.88 × 1.25

X-ray Data Specification
test object   width × height × images   data res. (mm²)       start res. (mm²)      end res. (mm²)
phantom       768 × 576 × 2             0.39 × 0.39           3.12 × 3.12           3.12 × 3.12
spine         768 × 576 × 2             0.39 × 0.39           3.12 × 3.12           3.12 × 3.12
2.3 Implementation Details
To accelerate the registration process, we exploit a spline-based multi-resolution 3D-2D registration scheme [11]. A cubic-spline data model is used to compute the multi-resolution data pyramids for the CT volume, the X-ray images, and the DRRs, as well as for the gradient and the Hessian of the GRF-VA-MI. The registration is then performed from the coarsest resolution to the finest one. At each resolution level, the size of the local region used for the normalization is always equal to that of the neighborhood system used in Eq. 15. To improve the capture range, we use two different sizes of neighborhood system: r=15 and r=3. The GRF-VA-MI with the bigger neighborhood system is first minimized via a Levenberg-Marquardt non-linear least-squares optimizer. The estimated Tˆ is then used as the starting value for optimizing the GRF-VA-MI with the smaller neighborhood system.
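The two-stage strategy can be sketched on a toy problem. The code below is not the registration scheme of [11]: it is a hypothetical one-parameter Levenberg-Marquardt loop (names `levenberg_marquardt`, `make_residuals` and `two_stage` are assumptions) applied to a synthetic 1D signal, in which the width r of a Gaussian bump merely stands in for the neighborhood size. It only illustrates the control flow: minimize the least-squares cost with the larger r first, then restart from that estimate with the smaller r.

```python
import math

def levenberg_marquardt(residuals, t, iters=50, lam=1e-3, h=1e-4):
    """Minimize sum(residuals(t)**2) over a single scalar parameter t."""
    for _ in range(iters):
        r0 = residuals(t)
        cost0 = sum(x * x for x in r0)
        # Central-difference Jacobian of the residual vector w.r.t. t
        jac = [(a - b) / (2 * h) for a, b in zip(residuals(t + h), residuals(t - h))]
        grad = sum(j * x for j, x in zip(jac, r0))
        hess = sum(j * j for j in jac)
        step = -grad / (hess * (1.0 + lam) + 1e-12)    # damped Gauss-Newton step
        if sum(x * x for x in residuals(t + step)) < cost0:
            t, lam = t + step, max(lam / 10.0, 1e-9)   # accept, reduce damping
        else:
            lam *= 10.0                                # reject, increase damping
    return t

def make_residuals(r, t_true, n=128):
    """Toy 1D residuals; the bump width r stands in for the neighborhood size."""
    def bump(x):
        return math.exp(-((x - n / 2) ** 2) / (2.0 * r * r))
    return lambda t: [bump(x - t_true) - bump(x - t) for x in range(n)]

def two_stage(t0, t_true, radii=(15, 3)):
    """Minimize with the larger r first, then refine with the smaller r."""
    t = t0
    for r in radii:
        t = levenberg_marquardt(make_residuals(r, t_true), t)
    return t

print(two_stage(t0=14.5, t_true=2.5))
```

In this toy both stages share the same minimum, so the second stage only confirms the first; on real data the smoother cost of the larger neighborhood is what allows a distant starting value, and the smaller neighborhood is what sharpens the final estimate.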
3 Experimental Results
We conducted two studies on X-ray and CT datasets of a plastic phantom and a cadaveric spine segment (Fig. 1). The data sizes, the original data resolution, and the start and end resolutions of the X-ray and the CT datasets are summarized in Table 1. The ground truth transformations of both datasets were obtained by performing paired-point matching on implanted fiducial markers. The phantom was custom-made to simulate good conditions. In contrast, the quality of the X-ray images of the cadaveric spine was poor and projections of interventional instruments were present (see the rightmost image of Fig. 1). Using the datasets of both objects downsampled to the start resolution, we first compared the behavior of the GRF-VA-MI to that of an MI-based measure using a histogram-based implementation [2] and to that of the similarity measure introduced in [11], which is a sum-of-squared-differences (SSD) measure based on global normalization. The results are presented in Fig. 2. All similarity measures behaved similarly on the phantom dataset but differently on the spine segment dataset. The GRF-VA-MI shows superior behavior compared to the others. More specifically, all curves of the GRF-VA-MI have clear minima and are smoother than those of the others. Fig. 2 also shows that using a bigger neighborhood system, which is equivalent to incorporating a wider range of spatial information, leads to a smoother energy function, whereas using a smaller neighborhood system results in higher accuracy.

Fig. 2. Probe through the minimum of similarity measures on the phantom data (the first row) and on the spine data (the second row). The ordinate shows the value of the similarity measures normalized to the range [0.0, 1.0], given as functions of each parameter in the range of [−15°, 15°] or [−15 mm, 15 mm] away from its ground truth (the first three columns represent the translational probe along the X, Y and Z axes, respectively; the last three columns represent the rotational probe about each axis).

Table 2. Study results using the datasets of the cadaveric spine segment
absolute parameter range (°, mm)    (2, 2)   (4, 4)   (6, 6)   (8, 8)   (10, 10)   (12, 12)
average of the initial mTRE (mm)    2.3      4.6      6.9      9.2      11.5       13.5
success percentage                  100      100      100      99       95         85
average of the final mTRE (mm)      0.8      0.8      0.8      0.8      0.8        0.8

The second study was performed only on the spine segment dataset to evaluate the performance of the registration scheme using the GRF-VA-MI. In this study, we perturbed the ground truth by randomly varying each parameter in the range of [−2°, 2°] or [−2 mm, 2 mm] to get 200 positions, then another 200 positions in the range of [−4°, 4°] or [−4 mm, 4 mm], and so on up to the range of [−12°, 12°] or [−12 mm, 12 mm]. We then performed the registration starting from these perturbed positions and counted the success rate. Using a method similar to
that reported in [12], we regarded a registration as successful if the mean target registration error (mTRE) evaluated on the fiducial markers was smaller than 1.5 mm. The capture range was then defined as the average of the initial mTRE at which a 95% success rate is achieved. The study results are presented in Table 2. When the absolute parameter range is (12°, 12 mm), the average CPU time measured on a 3.0 GHz Pentium machine was 26.7 seconds. It was found that the capture range of the GRF-VA-MI was much larger than those reported in [7] and in [12], although the attained accuracy was lower than that reported in [12]. This might be explained by the large inter-slice distance (2.5 mm in this work vs. 0.31 mm in [12]) and the region outliers in the X-ray images.
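The evaluation criteria above can be written down compactly. The following sketch is an illustration only: the function names are hypothetical, the mTRE is assumed to be the mean Euclidean distance between corresponding target points, and the capture range is read from Table 2 as the largest average initial mTRE whose success percentage is still at least 95%.

```python
import math

def mtre(estimated, truth):
    """Mean Euclidean distance between corresponding 3D target points (mm)."""
    return sum(math.dist(p, q) for p, q in zip(estimated, truth)) / len(truth)

def success_percentage(final_mtres, threshold_mm=1.5):
    """Share of registrations whose final mTRE is below the success threshold."""
    return 100.0 * sum(e < threshold_mm for e in final_mtres) / len(final_mtres)

def capture_range(initial_mtre, success_pct, required=95.0):
    """Largest average initial mTRE whose success percentage is still >= required."""
    ok = [m for m, s in zip(initial_mtre, success_pct) if s >= required]
    return max(ok) if ok else None

# Values copied from Table 2
initial = [2.3, 4.6, 6.9, 9.2, 11.5, 13.5]
success = [100, 100, 100, 99, 95, 85]
print(capture_range(initial, success))   # 11.5
```

Under this reading, Table 2 corresponds to a capture range of 11.5 mm.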
4 Conclusions
In this paper, we derived a novel similarity measure grounded in information theory and Gibbs random field theory: the GRF model based variational approximation to mutual information, obtained from the Kullback-Leibler bound. The newly derived similarity measure enabled us to effectively incorporate spatial information into a 3D-2D registration. Results from the experiments performed on the datasets of a plastic phantom and of a cadaveric spine segment show that the newly derived similarity measure has a larger capture range than those previously reported and attains satisfactory accuracy.
References
1. Wells, W., Viola, P., et al.: Multi-modal volume registration by maximization of mutual information. MedIA 1, 35–51 (1996)
2. Maes, F., Collignon, A., et al.: Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 16, 187–198 (1997)
3. Pluim, J.P., et al.: Mutual information based registration of medical images: a survey. IEEE Trans. Med. Imaging 22, 986–1004 (2003)
4. Pluim, J., et al.: Image registration by maximization of combined mutual information and gradient information. IEEE Trans. Med. Imaging 19, 809–814 (2000)
5. Rueckert, D., Clarkson, M.J., et al.: Non-rigid registration using higher-order mutual information. SPIE Medical Imaging: Image Processing 3979, 438–447 (2000)
6. Sabuncu, M.R., Ramadge, P.J.: Spatial information in entropy-based image registration. In: Gee, J.C., Maintz, J.B.A., Vannier, M.W. (eds.) WBIR 2003. LNCS, vol. 2717, Springer, Heidelberg (2003)
7. Russakoff, D.B., Tomasi, C., et al.: Image similarity using mutual information of regions. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 596–607. Springer, Heidelberg (2004)
8. Gan, R., Chung, A.C.S.: Multi-dimensional mutual information based robust image registration using maximum distance-gradient-magnitude. In: Christensen, G.E., Sonka, M. (eds.) IPMI 2005. LNCS, vol. 3565, pp. 210–221. Springer, Heidelberg (2005)
9. Barber, D., Agakov, F.V.: The IM algorithm: a variational approach to information maximization. In: NIPS'03, vol. 16, MIT Press, Cambridge, MA (2004)
10. Li, S.Z.: Markov random field modeling in computer vision. Springer, Heidelberg (1995)
11. Jonić, S., Thévenaz, P., et al.: An optimized spline-based registration of a 3D CT to a set of C-arm images. Int. J. Biomed. Imaging, Article ID 47197, 1–12 (2006)
12. van de Kraats, E.B., Penney, G.P., et al.: Standardized evaluation methodology for 2-D-3-D registration. IEEE Trans. Med. Imaging 24, 1177–1189 (2005)
Performance Evaluation and Recent Advances of Fast Block-Matching Motion Estimation Methods for Video Coding

Berenice Ramirez

Centro de Investigacion y Desarrollo de Tecnologia Digital, Av. del Parque 1310, 22510 Tijuana B.C., Mexico
[email protected] http://www.citedi.mx/portal
Abstract. This paper presents a performance evaluation of a set of recently developed fast block-matching motion estimation methods. These methods are Kite-Cross-Diamond search (KCDS), Cross-Diamond-Hexagonal search (CDHS), Efficient-Hexagonal-Inner search (EHIS) and a method newly proposed in this paper, called Cross-Half-Diamond low resolution search (CHDS), which is an improvement of Cross-Diamond-Hexagonal search. The Diamond search (DS) and Full search (FS) algorithms are also implemented. These methods are evaluated based on the mean square error and the number of search points. Based on experimental results, it is concluded that DS, CDHS, CHDS and EHIS are appropriate for small and large motion content, except EHIS for high frequencies and complex motion content. The KCDS method has the lowest performance compared with the other algorithms.

Keywords: Motion Estimation, Block-Matching, Video Coding, Performance Evaluation.
1 Introduction
Fast block-matching motion estimation (ME) began with the 2D-logarithmic search, the three step search (TSS), and the modified conjugate direction search, developed between 1981 and 1984 and mentioned in [1]. These methods influenced later algorithms developed between 1990 and 1996: the four step search (4SS) [2], the cross search (CS) [3], the new three step search (N3SS) [4], and the block-based gradient descent search (BBGDS) [5]. Later, in 1998, the Diamond Search (DS) algorithm [6] was developed based on statistical studies, and from 2000 onwards new algorithms based on DS were developed: the optimized diamond search [7], the cross diamond search (CDS) [8], the efficient three step search (E3SS) [9], and the algorithms evaluated in this work, developed between 2004 and 2006: the kite-cross-diamond search (KCDS) [10], the cross-diamond-hexagonal search (CDHS) [11], and the improvement of CDHS called cross-half-diamond low resolution search (CHDS), presented in Section 2 of this paper. The efficient-hexagonal-inner search (EHIS) [12] is based on hexagon search patterns, following the line of [13]. All the algorithms mentioned in this section are called conventional fast block-matching motion estimation algorithms, because they consider only translational motion.

On the other hand, there are non-conventional motion estimation algorithms. Those found in this work are the fast motion estimation method based on particle swarm optimization (PSO) [14], those that use Lie operators [15, 16], a very recent method using a Kalman filter [17], those that take advantage of the tools available in video coding standards, such as the scheme for motion estimation using multireference frames in H.264/AVC [19] and the automatic block split pyramidal motion estimation for MPEG-4 [18], and finally those based on wavelets [20].

The focus of this paper is to report the performance evaluation of recently developed fast block-matching motion estimation methods that are conventional algorithms. In Section 2, the methods evaluated in this work are presented concisely, together with a brief description of their bases. Section 3 presents the experimental environment and the experimental results: Tables 1 and 2 give the average MSE and the average number of search points, respectively. Finally, Section 4 presents the conclusions.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 801–808, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Fast Block-Matching ME Methods
The fast block-matching motion estimation methods KCDS, CDHS, and CHDS evaluated in this work are based on statistical studies. The first results can be found in [6]. The authors treat the motion vector as a random variable and compute its cumulative probability distribution function for a set of test sequences. For their experiments they use the FS algorithm, the MSE as the matching criterion, and a 15×15 search window (SW). They found that motion vectors are most likely to lie within a circular support with a radius of 2 pels centered on the zero motion vector (the center of the SW); this circular support is referred to in the sections below as the circular area. From these results they developed the DS algorithm. Later works followed this line and carried out further statistical experiments, whose results can be found in [8] and [11]. In [8], the cumulative probability distribution function is computed for the square, diamond, and cross search patterns. The authors found that motion vectors are more likely to lie along the horizontal and vertical directions than at other locations at the same radius. From these results the cross shape became an important search pattern. Finally, [11] reports the probability of finding motion vectors at the corners of a diamond pattern, a case the authors call a diamond-corner-hit. Based on their experimental results, they proposed hexagonal patterns to reduce the number of search points. The EHIS algorithm does not belong to the DS line; it only uses a hexagon pattern. The first work on a hexagon-based search pattern can be found in [13].

2.1 Kite-Cross-Diamond Search
This algorithm was proposed by Lam et al. in 2004 [10]. The patterns used in this method are a cross pattern, a diamond pattern, and four kite patterns called up-kite, down-kite, left-kite, and right-kite. The algorithm starts exploring the search window with a 3×3 cross pattern centered on zero motion. Depending on the direction of the minimum distortion, the kite pattern corresponding to that direction is used to find the global minimum point. However, the experimental results of Zhu [6] showed that motion vectors are most likely to lie in the circular area, and the results of Cheung [8] showed that horizontal and vertical motions are more probable. KCDS focuses only on the 3×3 cross-shaped area centered on zero motion and does not explore other possible directions; as a result, it finds local minimum points. Compared with the experimental results of the other methods, shown in Section 3, KCDS has the lowest performance.

2.2 Cross-Diamond-Hexagonal Search
This algorithm was proposed by Cheung and Po in 2005 [11]. It uses different patterns, which can be found in [11], and incorporates the results of Zhu [6], Cheung [8], and the authors' own results published in [11]. The algorithm starts by exploring the 5×5 cross-shaped area centered on zero motion, using a small cross pattern and a large cross pattern. As a result, this method explores the circular area very efficiently compared with the other algorithms implemented in this work. It achieves a precision comparable to that of DS while reducing the number of search points. Furthermore, according to its authors, when a diamond pattern is used, the number of search points can be reduced by switching to hexagonal patterns.

2.3 Cross-Half-Diamond Low Resolution Search: An Improvement of CDHS
This improvement of CDHS is a contribution of this work, developed in 2006. It is based on the contributions of Zhu [6] and Cheung [8] described in the previous sections. CHDS keeps the search method of CDHS to explore the 5×5 cross-shaped area centered on zero motion, but when the search moves away from the circular area, CHDS proposes four low-resolution half-diamond patterns, shown in Fig. 1, for the different directions of the global minimum point. The search is shown in Fig. 2. The proposal of these patterns is based on the contributions in [6] and [8], interpreted as follows: if the probability of finding motion vectors decreases as the search moves away from the circular area, it is not necessary to explore the rest of the SW accurately. Furthermore, horizontal and vertical displacements are more probable. Finally, the CHDS method slightly improves the speed of CDHS while keeping its precision. However, it is only useful in video with long motions.

Fig. 1. Search patterns of CHDS a) upwards, b) downwards, c) towards the left, and d) towards the right

Fig. 2. Example of a CHDS coarse search using the towards-the-right search pattern

2.4 Efficient-Hexagonal-Inner Search
This algorithm was proposed by Su et al. in 2005 [12]. The algorithm uses a prediction scheme and a hexagon pattern to reduce the number of search points. EHIS focuses on improving the inner search using six groups of distortions. These groups of distortions are compared, and the smallest one gives the locations inside the hexagon for the fine-resolution search. The efficiency of this algorithm depends strongly on the prediction scheme.
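All methods in this section share the same ingredients: a block, a matching criterion, a search window, and a count of evaluated search points. The sketch below illustrates them for the two reference methods, FS and DS. It is an illustration under stated assumptions rather than the code used in this evaluation: the function names are hypothetical, frames are lists of rows, the caller must keep the displaced block inside the frame, and the large and small diamond patterns follow the usual description of diamond search [6] (nine and five points), which this paper does not spell out.

```python
def mse(cur, ref, bx, by, dx, dy, n):
    """MSE between the n x n block of cur at (bx, by) and the ref block displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            diff = cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx]
            total += diff * diff
    return total / (n * n)

def full_search(cur, ref, bx, by, n, p):
    """Test every displacement in the (2p+1) x (2p+1) window; return (vector, points)."""
    best_cost, best_mv, points = None, (0, 0), 0
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            cost = mse(cur, ref, bx, by, dx, dy, n)
            points += 1
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, points

# Large and small diamond patterns; the center comes first so that it wins ties.
LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0), (-1, -1), (-1, 1), (1, -1), (1, 1)]
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]

def diamond_search(cur, ref, bx, by, n, p):
    """Repeat the large diamond until its center is best, then apply the small diamond."""
    cache = {}  # distinct displacements evaluated so far
    def cost(mv):
        if max(abs(mv[0]), abs(mv[1])) > p:
            return float("inf")          # outside the search window
        if mv not in cache:
            cache[mv] = mse(cur, ref, bx, by, mv[0], mv[1], n)
        return cache[mv]
    center = (0, 0)
    while True:
        best = min(((center[0] + ox, center[1] + oy) for ox, oy in LDSP), key=cost)
        if best == center:
            break
        center = best
    best = min(((center[0] + ox, center[1] + oy) for ox, oy in SDSP), key=cost)
    return best, len(cache)

# Synthetic frames: cur is ref translated by (3, -2); the block lies well inside the frame.
size = 48
ref = [[(x - 10) ** 2 + 3 * (y - 30) ** 2 for x in range(size)] for y in range(size)]
clamp = lambda k: min(size - 1, max(0, k))
cur = [[ref[clamp(y - 2)][clamp(x + 3)] for x in range(size)] for y in range(size)]
print(full_search(cur, ref, 16, 16, 16, 7))      # ((3, -2), 225)
print(diamond_search(cur, ref, 16, 16, 16, 7))
```

With p = 7 the window is the 15×15 SW mentioned above, so FS always evaluates 225 points per block; the pattern-based methods trade this exhaustive guarantee for far fewer evaluations.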
3 Experimental Results
The experimental environment is an SP@ML MPEG-2 encoder with the Group of Pictures (GOP) length set to 5, the frame distance set to 1, a 16 × 16 pixel block, and a 15 × 15 search window. For image sequences with small and moderate motion content, Claire and Tennis are used, respectively; for complex and large motion content, Flowergarden and Football are used, respectively. The MSE between the original image sequence and the image sequence reconstructed using only motion vectors is computed to evaluate the precision of the methods. Furthermore, the number of search points needed to compute the motion vectors is counted as a measure of the speed of the methods. Table 1 presents the average MSE, and Table 2 the average number of search points. For the Claire sequence, as shown in Table 1, DS, KCDS, CDHS, CHDS, and EHIS achieve an MSE very close to that of FS. Therefore, these algorithms accurately estimate short motion vectors. For the Tennis sequence, DS, CDHS, CHDS, and EHIS have MSE values similar to that of FS for translational motion, as shown in Fig. 3 (frames 1 to 17), but, as expected, the performance of the methods decreases for other motion models, in this case zoom motion content, shown in Fig. 3 from frame 18 onwards. According to the experimental results, KCDS is accurate only for image sequences with very small motion content, although it has the smallest number of search points in all sequences except Claire. For the Flowergarden sequence, DS, CDHS, and CHDS achieve an MSE close to that of FS. In terms of average MSE, EHIS is almost 240 units away from FS, whereas the worst of the other three algorithms is almost 50 units away.

Table 1. Average MSE Values
Method   Claire   Tennis   Flowergarden   Football
FS       3.16     91.26    287.5          439.00
DS       3.16     103.26   307.29         505.39
KCDS     3.51     215.81   1320.71        852.86
CDHS     3.17     110.24   334.82         521.86
CHDS     3.17     110.36   315.36         522.21
EHIS     3.30     135.90   527.04         579.00
Table 2. Average Number of Search Points
Method   Claire   Tennis   Flowergarden   Football
DS       10.58    12.97    16.63          16.02
KCDS     6.99     6.71     6.84           6.71
CDHS     4.80     8.38     14.60          11.85
CHDS     4.80     8.10     13.68          11.18
EHIS     9.36     10.39    12.25          11.00
Fig. 3. MSE for the Tennis sequence after applying the fast search methods
Fig. 4. Reconstructed frame of Flowergarden sequence, using only motion vectors computed with FS method
Fig. 5. Reconstructed frame of Flowergarden sequence, using only motion vectors computed with CDHS method

Therefore, DS, CDHS, and CHDS are more appropriate for image sequences with high frequencies and complex motion content. Figures 4 and 5 show a frame of the Flowergarden sequence reconstructed by the MPEG encoder using only the motion vectors obtained with the FS and CDHS algorithms, respectively. For the Football sequence, DS, CDHS, and CHDS achieve similar MSE values, but they are up to almost 85 units away from the MSE of FS. EHIS is 140 units away from the MSE of FS, and almost 57 units away from the worst of the methods just mentioned. Thus, it does not match the performance of the other methods, but it follows them closely.
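The distances quoted in this section follow directly from Table 1. The short sketch below recomputes them; the dictionary layout and the helper name `gap_to_fs` are assumptions made for this illustration, and the numbers are copied from Table 1.

```python
# Average MSE values copied from Table 1
avg_mse = {
    "FS":   {"Claire": 3.16, "Tennis": 91.26,  "Flowergarden": 287.5,   "Football": 439.00},
    "DS":   {"Claire": 3.16, "Tennis": 103.26, "Flowergarden": 307.29,  "Football": 505.39},
    "KCDS": {"Claire": 3.51, "Tennis": 215.81, "Flowergarden": 1320.71, "Football": 852.86},
    "CDHS": {"Claire": 3.17, "Tennis": 110.24, "Flowergarden": 334.82,  "Football": 521.86},
    "CHDS": {"Claire": 3.17, "Tennis": 110.36, "Flowergarden": 315.36,  "Football": 522.21},
    "EHIS": {"Claire": 3.30, "Tennis": 135.90, "Flowergarden": 527.04,  "Football": 579.00},
}

def gap_to_fs(method, sequence):
    """Difference between a method's average MSE and that of full search."""
    return avg_mse[method][sequence] - avg_mse["FS"][sequence]

for sequence in ("Flowergarden", "Football"):
    for method in ("DS", "CDHS", "CHDS", "EHIS"):
        print(sequence, method, round(gap_to_fs(method, sequence), 2))
```

For Flowergarden this gives 239.54 for EHIS and 47.32 for CDHS; for Football it gives 83.21 for CHDS and 140.0 for EHIS, whose distance to CHDS is 56.79.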
4 Conclusions
In this paper, the performance evaluation and recent advances of fast block-matching motion estimation methods are reported. Based on experimental results, DS, CDHS, CHDS, and EHIS are appropriate for small and large motion content, except EHIS for high frequencies and complex motion content. The KCDS method has the lowest performance compared with the other algorithms; it performs well only for small motion content.
References
1. Musmann, H., Pirsch, P., Grallert, H.: Advances in Picture Coding. Proceedings IEEE 73(4), 523–548 (1985)
2. Po, L.M., Ma, W.C.: A Novel Four-Step Search Algorithm for Fast Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology 6(3), 313–317 (1996)
3. Ghanbari, M.: The Cross-Search Algorithm for Motion Estimation. IEEE Trans. on Communications 38(7), 950–953 (1990)
4. Li, R., Zeng, B., Liou, M.L.: A New Three-Step Search Algorithm for Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology 4(4), 438–443 (1994)
5. Liu, L.K., Feig, E.: A Block-Based Gradient Descent Search Algorithm for Block Motion Estimation in Video Coding. IEEE Trans. on Circuits and Systems for Video Technology 6(4), 419–423 (1996)
6. Zhu, S., Ma, K.: A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation. IEEE Trans. on Image Processing 9(2), 287–290 (2000)
7. Zhu, C., Lin, X., Chau, L.P., Ang, H.A., Ong, C.Y.: An Optimized Diamond Search Algorithm for Block Motion Estimation. IEEE Int'l Symposium on Circuits and Systems 2(2), 488–491 (2002)
8. Cheung, C., Po, L.: A Novel Cross-Diamond Search Algorithm for Fast Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology 12(12), 1168–1177 (2002)
9. Chau, L-P., Jing, X.: Efficient Three-Step Search Algorithm for Block Matching Motion Estimation. IEEE Int'l Conf. on Acoustics 3, 421–424 (2003)
10. Lam, C.W., Po, L.M., Cheung, C.H.: A Novel Kite-Cross Diamond Search Algorithm for Fast Block Matching Motion Estimation. IEEE Int'l Symposium on Circuits and Systems 3, 729–732 (2004)
11. Cheung, C., Po, L.: A Novel Cross-Diamond-Hexagonal Search Algorithms for Fast Block Motion Estimation. IEEE Trans. on Multimedia 7(1), 16–22 (2005)
12. Su, C., Hsu, Y., Chang, C.: Efficient Hexagonal Inner Search for Fast Motion Estimation. IEEE Int'l Conference on Image Processing 1, 1093–1096 (2005)
13. Zhu, C., Lin, X., Chau, L.P.: Hexagon-based Search Pattern for Fast Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology 12(5), 349–355 (2002)
14. Du, G-Y., Huang, T-S.: A Novel Fast Motion Estimation Method Based on Particle Swarm Optimization. IEEE Proc. of 2005 Int'l Machine Learning and Cybernetics 8, 5038–5042 (2005)
15. Nalasani, M., Pan, W.: On the Complexity and Accuracy of Motion Estimation using Lie Operators. IEEE Proc. of the 36th Southeastern Symposium on System Theory, 16–20 (2004)
16. Veeravalli, A.G., Pan, W.D., Nalasani, M., Mohan, M.K.S.: Covariance Analysis of Non-translational Motion-compensated Frame Differences. In: IEEE Proc. of the 37th Southeastern Symposium on System Theory, pp. 462–465 (2005)
17. Kuo, C-M., Chung, S-C., Shih, P-Y.: Kalman Filtering Based Rate-Constrained Motion Estimation for Very Low Bit Rate Video Coding. IEEE Trans. on Circuits and Systems for Video Technology 16(1), 3–18 (2006)
18. Zhu, X., Zhu, S., Liu, T.: Automatic Block Split Pyramidal Motion Estimation for MPEG 4 Video Encoder. In: IEEE 5th World Congress on Intelligent Control and Automation, vol. 5, pp. 3958–3961 (2004)
19. Kim, S-E., Han, J-K., Kim, J-G.: An Efficient Scheme for Motion Estimation using Multireference Frames in H.264/AVC. IEEE Trans. on Multimedia 8(3), 457–466 (2006)
20. Park, H-W., Kim, H-S.: Motion Estimation using Low-Band-Shift Method for Wavelet-based Moving-Picture Coding. IEEE Trans. on Image Processing 9(4), 577–587 (2000)
Spectral Eigenfeatures for Effective DP Matching in Fingerprint Recognition

Boris Danev and Toshio Kamei

Common Platform Software Research Laboratories, NEC Corporation, 1753, Shimonumabe, Nakahara, Kawasaki, Kanagawa 211-8666, Japan
[email protected], [email protected]

Abstract. Dynamic Programming (DP) matching has been applied to solve distortion in spectral-based fingerprint recognition. However, spectral data is redundant, and its size is huge. PCA could be used to reduce the data size, but it leads to a loss of topographical information in the projected vectors. This allows only inter-vector similarity estimates such as Euclidean or Mahalanobis distances, which prove inadequate in the presence of the distortion that occurs when a finger is swept over a line sensor. In this paper, we propose a novel two-step PCA to extract compact eigenfeatures amenable to DP matching. The first PCA extracts eigenfeatures of Fourier spectra from each image line. The second extracts eigenfeatures from all lines to form the feature templates. In matching, the feature templates are inversely transformed to line-by-line representations on the first PCA subspace for DP matching. Fingerprint matching experiments demonstrate the effectiveness of our proposed approach in template size reduction and accuracy improvement.
1 Introduction
Automated fingerprint identification has long been used in law enforcement, where it has demonstrated the identification capability of fingerprints [1,2]. Fingerprint recognition is now spreading to a wide range of applications, especially consumer devices such as personal computers, mobile phones, and USB keys [3]. In these devices, embedded fingerprint sensors authenticate users. The most popular sensor is the sweep-type line sensor because of its cost and size: users sweep their fingers across the sensor, and the fingerprint images are captured and recognized. However, current fingerprint recognition technologies do not yet satisfy all the requirements of embedded devices, such as accuracy, size and cost.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 809–816, 2007. © Springer-Verlag Berlin Heidelberg 2007

Matsumoto et al. [4] proposed fingerprint verification methods based on DP matching of spectral features. In their work, spectral features such as Linear Prediction Coefficients (LPC) were extracted from each image line, and DP treated the spectral features as one-dimensional signals to compensate for distortion in fingerprint images. Although this approach might have the potential to satisfy the requirements of embedded devices, spectral data contains redundant information, and its size is huge.

Principal Component Analysis (PCA) is effective in reducing data size while keeping the original information, and it is therefore widely used in biometrics [5,6,7]. PCA projects the original feature into a PCA subspace, and inter-vector distances such as the Euclidean or Mahalanobis distance are often used to compute the similarity between projected vectors. For spectral features of fingerprints, PCA could be used to reduce the data size, but it leads to loss of topographical information in the projected vectors. Therefore, DP matching is not applicable to the projected vectors. Even if inter-vector distances were used, it would be difficult to reach sufficient accuracy because of the distortion that occurs when a finger is swept over a line sensor. Our challenge is how to introduce an efficient PCA technique into DP matching of spectral features.

In this paper, we propose a novel two-step PCA to extract compact eigenfeatures amenable to DP matching. The first PCA is used to extract eigenfeatures of Fourier spectra from each image line. Then a second PCA is applied to extract the eigenfeatures from all image lines in order to form the feature templates. In the matching phase, the feature templates are inversely transformed into the first PCA subspace to reconstruct line-by-line representations, and DP matching is applied to these representations. Furthermore, in order to enhance the second PCA, we introduce automatically generated images into the estimation of the second PCA matrix. These images are generated from the original images based on warping information provided by DP matching.
2 Spectral Eigenfeatures for DP Matching

2.1 Spectral Eigenfeature Extraction Using Two-Step PCA
In feature extraction, spectral eigenfeatures are extracted using two-step PCA to obtain compact features amenable to DP matching, as summarized in Fig. 1(a). First, we apply the one-dimensional Discrete Fourier Transform (DFT) line by line to the normalized image $f(m, n)$ of a given fingerprint:

\[ F(\omega, n) = \frac{1}{\sqrt{M}} \sum_{m=0}^{M-1} f(m, n) \exp\left(-2\pi i \frac{m\omega}{M}\right) \tag{1} \]

where $0 \le m \le M-1$ (horizontal) and $0 \le n \le N-1$ (vertical).

In the first PCA, a projected vector $g_l$, also called an eigenfeature, is extracted from the Fourier amplitude spectrum of each horizontal image line $l$ using a PCA matrix $W_L$ to obtain its principal components:

\[ g_l = W_L^t s_l \tag{2} \]

Here, $s_l$ denotes the Fourier amplitude spectrum $|F(\omega, l)|$ in vector form, $s_l = [\, |F(1, l)| \;\; |F(2, l)| \;\; \cdots \;\; |F(M/2-1, l)| \,]^t$, where the DC component and the redundant half of the spectrum are removed. Extraction for all lines is written as $G = W_L^t S$ in step (a) in Fig. 1, using $G$, an array of the $g_l$, and the matrix $S = [s_0 \; s_1 \; \cdots \; s_{N-1}]$, also called the spectral image in this paper.
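As a minimal sketch of Eqs. (1) and (2), the line-wise amplitude spectra and their projection can be written with NumPy as follows. The function names, the array layout (one image line per row) and the random orthonormal stand-in for $W_L$ are assumptions made for illustration; a real $W_L$ would come from training.

```python
import numpy as np

def spectral_image(f):
    """Return S = [s_0 ... s_{N-1}] for an image f stored as (N lines, M pixels)."""
    n_lines, m = f.shape
    # Eq. (1): one 1-D DFT per horizontal line, scaled by 1/sqrt(M).
    spectrum = np.fft.fft(f, axis=1) / np.sqrt(m)
    # Keep amplitudes for omega = 1 .. M/2 - 1 (no DC, no mirrored half).
    amplitude = np.abs(spectrum[:, 1:m // 2])
    return amplitude.T  # shape (M/2 - 1, N), one column per image line

def first_pca(S, W_L):
    """Eq. (2) for all lines at once: G = W_L^t S."""
    return W_L.T @ S

# Illustration with placeholder data (not real fingerprints).
rng = np.random.default_rng(0)
f = rng.random((96, 96))
# Hypothetical stand-in for a trained PCA matrix: 47 x 12, orthonormal columns.
W_L = np.linalg.qr(rng.standard_normal((47, 12)))[0]
S = spectral_image(f)
G = first_pca(S, W_L)
print(S.shape, G.shape)  # (47, 96) (12, 96)
```

Each column of `S` is one $s_l$, so a single matrix product yields all eigenfeatures $g_l$ at once.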
Fig. 1. A scheme for spectral eigenfeature extraction and matching. Panels: (a) Extraction (Fourier spectra $S$, 1st PCA with $W_L^t$, raster scan, 2nd PCA with $W_I^t$, template feature $h$); (b) Matching (template feature $h$, inverse transform with $W_I$, reconstructed features $G$).
The second PCA projects a vector $g$, obtained by raster-scanning the elements of $G$, into the PCA subspace specified by a matrix $W_I$ to obtain the second eigenfeature $h$, as shown in step (c):

\[ h = W_I^t \phi(G) = W_I^t \phi(W_L^t S) \tag{3} \]

where $\phi(G)$ denotes the raster-scanning of the matrix $G$. The second eigenfeature $h$ is enrolled as a feature template used in matching. The numbers of principal components in the first and second PCA are determined experimentally.

2.2 DP Matching Using Eigenfeatures
DP matching [4,8] is used to match feature templates. To restore topographical information, a reference template $h_R$ and a test template $h_T$ are inversely transformed into the first PCA subspace to reconstruct line-by-line representations $\hat{G}_R$, $\hat{G}_T$ using the second PCA matrix $W_I$:

\[ \hat{G}_R = \phi^{-1}(W_I h_R), \qquad \hat{G}_T = \phi^{-1}(W_I h_T) \tag{4} \]

where $\phi^{-1}$ denotes the inverse operation of the raster-scanning $\phi$. As the reconstructed line-by-line representations $\hat{G}_R$, $\hat{G}_T$ are approximations of the principal components of the spectral features of horizontal lines, DP matching is performed in the vertical direction to obtain the similarity between $\hat{G}_R$ and $\hat{G}_T$.

2.3 Training of PCA Matrices
A PCA matrix is given as the set of $\kappa$ eigenvectors in the eigenvector matrix $V$ corresponding to the $\kappa$ largest eigenvalues of the eigenvalue problem $CV = V\Lambda$, where $\Lambda$ is the eigenvalue matrix and $C$ is the covariance matrix calculated from a training set. $K$ normalized images $f^k(m, n)$ $(0 \le k \le K-1)$ and their spectral images $S^k = [s_0^k \; s_1^k \; \cdots \; s_{N-1}^k]$ are used as the training set to calculate the first and second PCA matrices $W_L$ and $W_I$. Since each line is projected using the same PCA matrix in the first PCA, the matrix can be calculated from a training set $\{s_0^0, s_1^0, \cdots, s_0^1, s_1^1, \cdots, s_{N-1}^{K-1}\}$ consisting of all Fourier amplitude spectra $s_l^k$ of all spectral images $S^k$.

The second PCA, on the other hand, extracts the principal components of all features obtained in the first PCA. In order to enlarge the training dataset, we introduce an automatically generated training set into the estimation of the second PCA. The generated training set $\{X^{pq}\}$ is produced by warping images in the original training set. Image $X^{pq}$ is generated by transforming an original image $f^q$ into the topography of the reference image $f^p$ using the warping information obtained in the DP matching between the spectral images of $f^p$ and $f^q$. For example, in the generated image in Fig. 2(a) the position of the ridge bifurcation marked by rectangles has changed, because DP matching of the original image against the reference image has tried to compensate for the deformation in the fingerprint. When the reference image comes from a different finger, the generated image in Fig. 2(b) looks consistently stretched compared to the original.

Fig. 2. Automatically generated images for training. (a) Generated image from the same finger. (b) Generated image from a different finger.

Using the spectral images $S_{X^{pq}}$ derived from the generated images $X^{pq}$, the covariance matrix $C_I$ for the second PCA is calculated:

\[ C_I = \frac{1}{K^2-1} \sum_{p=1}^{K} \sum_{q=1}^{K} (x^{pq}-\bar{x})(x^{pq}-\bar{x})^t, \qquad x^{pq} = \phi(W_L^t S_{X^{pq}}) \tag{5} \]

Therefore, the number of training samples increases from $K$ to $K^2$.
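The training rule above (keep the $\kappa$ eigenvectors of the covariance matrix with the largest eigenvalues) can be sketched as follows. This is a generic illustration under stated assumptions: the helper name `pca_matrix`, the column-per-sample layout and the synthetic data are hypothetical, and the warping step that produces the $K^2$ generated samples is not shown.

```python
import numpy as np

def pca_matrix(X, kappa):
    """X holds one training vector per column; returns (W, eigenvalues)."""
    centered = X - X.mean(axis=1, keepdims=True)
    # Sample covariance; with K^2 generated samples the divisor is the
    # K^2 - 1 that appears in Eq. (5).
    C = centered @ centered.T / (X.shape[1] - 1)
    # eigh solves C V = V Lambda for symmetric C, eigenvalues ascending.
    eigenvalues, V = np.linalg.eigh(C)
    top = np.argsort(eigenvalues)[::-1][:kappa]  # indices of the kappa largest
    return V[:, top], eigenvalues[top]

rng = np.random.default_rng(1)
# Hypothetical training vectors: 5 dimensions, 200 samples, most variance on axis 0.
scales = np.array([[10.0], [3.0], [1.0], [0.5], [0.1]])
X = rng.standard_normal((5, 200)) * scales
W, lam = pca_matrix(X, kappa=2)
print(W.shape, lam)
```

The same helper would serve for both $W_L$ (columns are the line spectra $s_l^k$) and $W_I$ (columns are the raster-scanned vectors $x^{pq}$); only the training set differs.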
3 Experiments

3.1 Dataset
For experiments, we used a database of fingerprint images captured with an optical line sensor with 858 dpi resolution. The database contains images from 18 people with 8 different fingers, 3 samples per finger.
Fig. 3. Accuracy dependency of DFT and GDS on image size (equal error rate (%) versus image size: 64×64, 96×96, 128×128, 192×192)
We pre-processed the database as follows. We first manually identified a reference point, usually the core; where no core was present, a minutia near the core region was selected. Then a square region of size 96×96 was cropped with the reference point at its center, and pixel-value normalization was applied [9]. The database was then divided into two datasets: the first 144 images (48 fingers × 3 samples) formed the training set, while the other 288 images (96 fingers × 3 samples) formed the testing set. Thus, the training and testing sets were disjoint.

3.2 Spectral Features
DFT vs GDS. Matsumoto et al. proposed the Group Delay Spectrum (GDS) as an alternative to the DFT for the spectral representation of each horizontal image line in DP matching of fingerprints [4]. We compared DFT with GDS for different image sizes using their approach. A Hamming window was applied to each horizontal line in the DFT calculation. As shown in Fig. 3, the GDS feature performed better than the DFT feature for large regions (e.g., 192×192) and scored poorly for small ones (e.g., 96×96). The reasons for this are probably twofold. First, as GDS uses phase information of the LPC spectrum, the short one-dimensional signal is not long enough to estimate the linear prediction coefficients properly. Second, the larger regions in our database included background areas, which tends to degrade the discriminant capability of the DFT feature. Thus, we selected the DFT of 96×96 cropped images for the experiments described below.

Dimensionality of Features. Since the width of the cropped images is 96, the dimensionality of the Fourier amplitude spectrum $s_l$ becomes 96/2 − 1 = 47. The size of $S$ becomes 96 × 47 = 4512. The first 12 principal components were extracted in the first PCA, as this proved experimentally to be enough to represent each Fourier amplitude spectrum without losing matching accuracy. The size of $G$ becomes 96 × 12 = 1152.
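The dimensionalities quoted above can be traced through the second PCA (Eq. 3) and the inverse transform (Eq. 4) with a short sketch. Several things here are assumptions made for illustration: the row-major raster scan, the template size of 200 (the best-performing value reported later), the function names, and the random orthonormal stand-in for a trained $W_I$.

```python
import numpy as np

def second_pca(G, W_I):
    """Eq. (3): raster-scan G (assumed row-major) and project, h = W_I^t phi(G)."""
    return W_I.T @ G.reshape(-1)

def reconstruct(h, W_I, shape):
    """Eq. (4): G_hat = phi^{-1}(W_I h), undoing the same raster scan."""
    return (W_I @ h).reshape(shape)

M = N = 96       # cropped image is 96x96
P, D = 12, 200   # components kept in the first and second PCA
spectrum_dim = M // 2 - 1
print(spectrum_dim, spectrum_dim * N, P * N)  # 47 4512 1152

rng = np.random.default_rng(2)
G = rng.standard_normal((P, N))  # placeholder for W_L^t S
# Hypothetical stand-in for a trained W_I: 1152 x 200, orthonormal columns.
W_I = np.linalg.qr(rng.standard_normal((P * N, D)))[0]
h = second_pca(G, W_I)
G_hat = reconstruct(h, W_I, G.shape)
print(h.shape, G_hat.shape)  # (200,) (12, 96)
```

The stored template is the 200-element `h`; the 12×96 array `G_hat` exists only transiently at matching time, which is what keeps the template compact while still giving DP matching a line-by-line representation.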
Fig. 4. Accuracy dependency on the dimensionality of feature template (equal error rate (%) versus dimensionality, 10 to 1152, with the reference algorithm shown as Ref). (a) Euclidean distance (Ia, Ib). (b) Proposed matching (IIa, IIb).

Table 1. Summary of the matching results

Algorithm  Matching            Training Set      Best EER  Template Size
Ref [4]    DP matching         N/A               0.42%     4512
Ia         Euclidean distance  original images   3.81%     200
Ib         Euclidean distance  generated images  3.47%     400
IIa        Proposed matching   original images   0.42%     1152
IIb        Proposed matching   generated images  0.26%     200

3.3 Matching Experiments
We evaluated our proposed algorithm against other algorithms with respect to feature template size and matching accuracy. The reference algorithm is the algorithm in [4], using DFT instead of GDS. The Equal Error Rate (EER) of the reference algorithm was 0.42% at image size 96×96. We also compared our algorithm with an algorithm based on inter-vector matching between the feature templates $h$ of Eq. 3 using the Euclidean distance. The effect of image generation in the computation of $W_I$ was also investigated.

Fig. 4 shows how accuracy depends on the dimensionality of the feature template. In the figure, the suffixes a and b (e.g., Ia, Ib) denote the training set used in the computation of $W_I$: a denotes that the original training images were used, and b that the generated training images were used. While inter-vector matching (Ia, Ib) achieved an EER of around 4%, our proposed algorithm (IIb) reached the reference accuracy of EER = 0.42% with 90 dimensions and achieved the best accuracy, EER = 0.26%, with 200 dimensions, as shown in Fig. 4. The figure also shows that automatic image generation is significantly effective in the training of IIb. Table 1 summarizes the experimental results. Template size in the table means the dimensionality of the feature template at the best EER.

Fig. 5. Performance measures of subspaces (plotted against dimensionality, 10 to 1152, for Ia, Ib, IIa, IIb). (a) Cumulative contribution ratio $\Gamma(\kappa)$. (b) Fisher ratio $F(\kappa)$.

The proposed algorithm is also very efficient in terms of speed and memory consumption. At the best EER, it required 200 × 4 = 800 bytes for template storage in single-precision floating point, while the template size of the reference algorithm is 18,048 bytes. The memory sizes for $W_L$ and $W_I$ are 47 × 12 × 4 = 2,256 bytes and 1152 × 200 × 4 = 921,600 bytes, respectively. Matching time was about 10 ms on an Intel Pentium 4 3.2 GHz processor using MATLAB code.
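To illustrate why warping along the line axis helps where a rigid inter-vector distance does not, here is a small dynamic-programming sketch. It is not the DP formulation of [4,8], whose step pattern and normalization are not given in this paper; it assumes a plain dynamic-time-warping recursion with three allowed steps, a Euclidean local cost and a length normalization, and the toy vectors are invented.

```python
import numpy as np

def dp_distance(G_r, G_t):
    """DTW-style distance between two (P, N) arrays, warping along the line axis."""
    n, m = G_r.shape[1], G_t.shape[1]
    # Local cost: Euclidean distance between line i of G_r and line j of G_t.
    local = np.linalg.norm(G_r[:, :, None] - G_t[:, None, :], axis=0)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)  # length normalization (an assumption)

a, b, c = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
G_r = np.stack([a, b, c, c], axis=1)  # lines a, b, c, c
G_t = np.stack([a, a, b, c], axis=1)  # same line pattern, locally stretched
print(dp_distance(G_r, G_t))          # 0.0: the warp absorbs the stretch
print(np.linalg.norm(G_r - G_t))      # about 1.732: the rigid comparison does not
```

In the toy example the two arrays contain the same sequence of lines with a different local stretch, which is a crude stand-in for the sweep distortion discussed in the paper.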
4 Discussion
We present the results of performance measures to explain empirically the reasons behind the significant improvement achieved by the proposed algorithm. The cumulative contribution ratio and the Fisher ratio of the second PCA subspace are computed as the performance measures.

The cumulative contribution ratio $\Gamma(\kappa)$ shows the extent to which $\kappa$ principal components reflect the original information. It is defined as

\[ \Gamma(\kappa) = \sum_{i=1}^{\kappa} \sigma_i^2 \Big/ \sum_{i=1}^{n} \sigma_i^2 , \]

where $\sigma_i^2$ is the variance of the $i$-th principal component, and $n$ is the dimensionality of the original space. The variances $\sigma_i^2$ were computed using the testing set. The values of $\Gamma(\kappa)$ with image generation (Ib, IIb) are much larger than those without image generation (Ia, IIa), as shown in Fig. 5(a). This means that the subspaces produced with image generation represent the original information effectively. This effect is mainly obtained from the increase in training database size through image generation.

The Fisher ratio is a measure of discriminant efficiency. For subspaces, it can be computed as $F(\kappa) = \mathrm{tr}(S_\kappa^B)/\mathrm{tr}(S_\kappa^W)$, where $S_\kappa^B$ and $S_\kappa^W$ are the between-class and within-class covariance matrices of the $\kappa$-dimensional subspace. It is displayed in Fig. 5(b). Applying DP matching in subspaces (IIa, IIb) yielded higher Fisher ratios, which translates into better discriminant efficiency. Furthermore, the PCA subspace calculated with generated images (IIb) exhibits the highest Fisher ratios. This effect is obtained from the warping in DP matching. In the case of IIb, the distance between two features is calculated in the topography warped by DP matching, while the generated images were transformed into a similar topography. The similarity of the topography makes the image generation effective in the case of IIb.

Additionally, an interesting phenomenon is observed in cases IIa and IIb at dimensionalities larger than 400. The subspaces of IIa and IIb composed of 1152 principal components are theoretically the same, which explains the identical Fisher ratio reached at the end. We expected the curves IIa and IIb in Fig. 5(b) to join more smoothly, as seen in the accuracy curves in Fig. 4(b). The joining does occur, but it is fairly steep. A possible explanation is that, although the higher components in case IIb represent mostly noise, in case IIa discriminant information still remains in the higher components, because the PCA without image generation could not pack the relevant information into the lower subspace owing to the insufficient training size and the inconsistency of topography.
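The two performance measures can be computed along the following lines. This is a sketch under assumptions: the Fisher ratio below uses the standard between-class and within-class scatter of plain feature vectors, whereas for IIa and IIb the paper measures distances in the DP-warped topography, which is not reproduced here. The function names and toy numbers are illustrative.

```python
import numpy as np

def cumulative_contribution(variances):
    """Gamma(kappa) for kappa = 1..n, given per-component variances sigma_i^2."""
    v = np.asarray(variances, dtype=float)
    return np.cumsum(v) / v.sum()

def fisher_ratio(Y, labels):
    """tr(S_B) / tr(S_W) for samples Y (one row each) with class labels."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    overall_mean = Y.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        members = Y[labels == c]
        class_mean = members.mean(axis=0)
        # The trace of a scatter matrix equals the sum of squared deviations.
        between += len(members) * np.sum((class_mean - overall_mean) ** 2)
        within += np.sum((members - class_mean) ** 2)
    return between / within

print(cumulative_contribution([4, 3, 2, 1]))               # [0.4 0.7 0.9 1. ]
print(fisher_ratio([[0], [2], [10], [12]], [0, 0, 1, 1]))  # 25.0
```

In the toy Fisher example the class means are 1 and 11 around an overall mean of 6, giving a between-class scatter of 100 against a within-class scatter of 4.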
5 Conclusions
We have proposed a novel way to extract spectral eigenfeatures together with an effective way of matching them by Dynamic Programming. We have also introduced an image generation technique to enhance the principal component analysis. In that context, we have shown that PCA can successfully extract very compact eigenfeatures. Learning with image generation significantly improved the proposed scheme, allowing the use of only 800 bytes per template at EER = 0.26%, compared to the 18 KB used by the reference approach at EER = 0.42%.
Acknowledgment We would like to thank the Student Exchange Office at Swiss Federal Institute of Technology (EPFL) and Prof. Takeshi Mizuno from Saitama University for the help provided in finding a post-diploma research internship in NEC Corporation.
References
1. Asai, K., Kato, Y., Hoshino, Y., Kiji, K.: Automatic fingerprint identification. Proc. Soc. of Photo-Optical Instrumentation Engineers 182, 49–56 (1979)
2. Ratha, N., Bolle, R. (eds.): Automatic Fingerprint Recognition Systems. Springer, Heidelberg (2004)
3. Mainguet, J.: Biometrics for large-scale consumer products. In: Proc. Intl. Conf. on Artificial Intelligence, pp. 310–314 (2003)
4. Matsumoto, N., Sato, S., Fujiyoshi, H., Umezaki, T.: Evaluation of a fingerprint verification method based on LPC analysis. Trans. of IEE 122(5), 799–807 (2002)
5. Moghaddam, B., Pentland, A.: Probabilistic visual learning for object representation. IEEE Trans. on Pattern Anal. Machine Intell. 19(7), 696–710 (1997)
6. Kamei, T., Mizoguchi, M.: Fingerprint preselection using eigenfeatures. In: Proc. Intl. Conf. on Computer Vision and Pattern Recognition, pp. 918–923 (1998)
7. Kamei, T.: Face retrieval by an adaptive Mahalanobis distance using a confidence factor. Proc. IEEE Intl. Conf. on Image Processing 1, I153–I156 (2002)
8. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. Speech and Signal Processing 26(1), 43–49 (1978)
9. Jain, A., Prabhakar, S., Hong, L.: A multichannel approach to fingerprint classification. IEEE Trans. on Pattern Anal. Machine Intell. 21(4), 348–359 (1999)
Registering Long-Term Image Series
Detlev Droege and Dietrich Paulus
Active Vision Group, Institute for Computational Visualistics, University of Koblenz-Landau, Universitätsstr. 1, 56070 Koblenz, Germany
{droege,paulus}@uni-koblenz.de
http://www.uni-koblenz.de/agas

Abstract. In this article we present a method to register image sequences taken at long time intervals. We suggest an improvement to Ward's registration method for high dynamic range images to achieve limited sub-pixel accuracy. We show the chosen method to be appropriate even for heavily changing lighting conditions.
1 Introduction

Image series generated from long-term outdoor observations pose some difficult problems, for comparative processing as well as for the generation of presentation material. Several external parameters vary over time and thus have considerable influence on the images taken over the course of months or even years. Major influences are caused by

– weather conditions (e.g. sunny, rainy or foggy),
– time-of-day dependent sun position,
– seasonal sun position and vegetation,
– natural changes (growing vegetation) and
– intended changes (e.g. construction works).

Additional influences are caused by properties of the recording equipment, such as

– automatic exposure control,
– positional inaccuracies (e.g. due to wind or mechanical instabilities) and
– focus control that may lead to images of different sharpness.

Fig. 1 shows three frames from one of the image sequences recorded on our campus. The data set consists of 52 sequences (i.e. camera positions), each holding 815 images taken during the main construction phase of the buildings (26 months) and another 520 images recorded over a period of 18 months when the exterior of the buildings was mostly finished. The purpose of this material is to document the construction of the new buildings on the campus.

Fig. 1. Images taken on three consecutive days around noon

Observing the construction process of a building over such a long period of time using a pan-tilt camera results in a sequence of images that is not directly suitable for producing a fast-motion film of the growing building. The negative effects observed when a film is created by simply lining up consecutive images can be attributed to the reasons mentioned above:

– positional jitter due to mechanical inaccuracies and wind,
– bright, high contrast (direct sunlight) vs. dim, low contrast (cloudy),
– cast shadows on sunny days, almost no shadows on other days,
– color shifts due to varying illumination (e.g. bluish/grayish for cloudy days, yellowish for sunny days, reddish for sunrise and sunset),
– moving trees/bushes (wind) and
– (dis-)appearing objects (e.g. trucks or equipment).

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 817–822, 2007. © Springer-Verlag Berlin Heidelberg 2007

The goal is to stitch the images captured while the camera pans and tilts into one larger image per recording date. These images are then combined to form a fast-motion movie. A two-stage registration process is needed: first the images of one day are aligned to a single panorama, then the panoramas have to be registered for the fast-motion sequence. However, the two stages cannot be carried out independently, as usual techniques like bundle adjustment might cause local displacements of some of the mosaic images. Further, to suppress flickering, illumination changes need to be compensated. Moving objects, such as pedestrians or cars, need to be distinguished from progress in the construction: whereas the moving objects need to be suppressed, the changes to the buildings need to be preserved in the images. The recordings started almost 8 years ago, and the camera that was used is of relatively low quality. This poses additional challenges for image processing, which will need to increase resolution and image quality.

We will show in Sect. 2 how easy and robust rigid image registration can be obtained with a method that was originally used to create HDR images. First quantitative results are shown in Sect. 3. We conclude with an outlook on our further plans in Sect. 4.
2 Approach

The drawbacks of the given image material invalidate most of the usual methods for automatic registration. We tested several approaches and ended up with an algorithm that was originally developed to register exposure steps for high dynamic range images.

2.1 Related Work

Various authors report on algorithms that they apply to long image sequences. Whereas [1] estimates the background using a statistical color model, the authors in [2] estimate the foreground. Markov random fields are used in [3] to adapt the processing routines to illumination changes.
The algorithms above operate on video streams. The major difference between our problem and conventional processing of video streams is the delay between the recording of successive frames, which causes considerable differences in illumination and scenic content.

2.2 Image Registration

Standard intensity-based approaches to image registration fail because of the heavily varying lighting conditions in our image data. Color-based approaches as described in [4] did lead to some improvement, but were only usable for those sequences with a sufficient number of non-moving colored objects within the scene. This precondition cannot be met for all camera positions.

Common feature-based methods suffer from problems similar to those of the intensity-based methods: they rely on a considerable amount of similarity of the same feature between two consecutive images. However, this assumption often does not hold for pictures taken with a delay of 24 hours. In [5], Brown and Lowe introduced the application of Lowe's SIFT features [6] to the automatic recognition of panoramic images. SIFT features are not only scale invariant, as their name suggests, but also to some degree invariant to illumination changes. However, because of the rather low resolution of our input images (352 × 288 pixels), only a small number of SIFT features are detected. For recurring architectural elements these features have very similar feature description vectors, which leads to incorrect correspondences between them. Other features attach to shadow corners and thus do not have stable positions between images. The remaining feature correspondences do not outnumber the outliers by a sufficient margin to provide usable results.
Fig. 2. Binary images computed from the first two images in Fig. 1. Exclusion information is additionally depicted in gray
2.3 HDR Image Registration

Finally, the image registration method presented by Ward in [7] proved to be well suited to our needs, although it was initially developed to register high dynamic range (HDR) images. While one would expect this method to be capable of dealing with illumination changes, it also showed surprisingly good results when registering images with very different shadow situations. Ward's algorithm uses a resolution hierarchy and a simplified image-based correspondence measure based on binarized images (Fig. 2). The key to his approach is the selection of the threshold, which should be the median of the image (after conversion to gray scale). To avoid a negative influence of intensity values close to the threshold, Ward additionally employs an exclusion mask which contains all pixel positions with values close to the threshold in any of the images (Fig. 3). XOR-ing the binarized images and AND-ing the result with the (inverted) exclusion mask provides a dissimilarity measure by counting all 1-bits. The best position is thus determined in a [−1, 0, 1] range in the vertical and horizontal directions at every resolution level, around the current optimum of the preceding level.¹

Fig. 3. Isolated exclusion masks corresponding to Fig. 2

2.4 Extension

For high-resolution images like those made with state-of-the-art digital cameras, often consisting of several million pixels, a registration accuracy of one discrete pixel, resulting in a positional error of ±0.5 pixels, is usually sufficient to obtain satisfying results. For low-resolution images like ours (100K pixels) this accuracy is not sufficient to obtain a smooth impression in consecutive frames of the generated film. To get a certain degree of sub-pixel accuracy, we extended Ward's approach by adding enlarged versions of the images to the resolution hierarchy (zoomed by cubic interpolation). While with high-resolution images this would have considerable impact on memory usage and runtime, our low-resolution images merely approach the size of smaller HDR images even when scaled to 4 × 4 times their size.
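The median-threshold registration described in Sect. 2.3 can be sketched as follows. This is an illustrative reading of the description, not Ward's reference implementation: the tolerance of 4 gray levels, the 2×2 block averaging used to build the pyramid, the function names and the synthetic test scene are all assumptions.

```python
import numpy as np

def bitmap_and_mask(gray, tol=4):
    """Median threshold bitmap plus a mask that is False near the threshold."""
    med = np.median(gray)
    return gray > med, np.abs(gray - med) > tol

def shift(a, dx, dy):
    """Move the content of a by (dx, dy) pixels, filling the border with False."""
    h, w = a.shape
    out = np.zeros_like(a)
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        a[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def halve(gray):
    """Next pyramid level: average 2x2 blocks."""
    h, w = gray.shape[0] // 2 * 2, gray.shape[1] // 2 * 2
    return gray[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def register(ref, img, levels=3):
    """Return the (dx, dy) shift of img that best aligns it with ref."""
    if levels > 0:
        dx, dy = register(halve(ref), halve(img), levels - 1)
        dx, dy = 2 * dx, 2 * dy  # carry the coarse estimate to this level
    else:
        dx = dy = 0
    b_ref, m_ref = bitmap_and_mask(ref)
    b_img, m_img = bitmap_and_mask(img)
    best = None
    for sy in (dy - 1, dy, dy + 1):
        for sx in (dx - 1, dx, dx + 1):
            # XOR the bitmaps, then drop pixels excluded in either image.
            diff = (b_ref ^ shift(b_img, sx, sy)) & m_ref & shift(m_img, sx, sy)
            err = int(np.count_nonzero(diff))
            if best is None or err < best[0]:
                best = (err, sx, sy)
    return best[1], best[2]

def scene(x, y):
    """Smooth synthetic gray-scale scene (hypothetical test data)."""
    return 128 + 80 * (np.sin(x / 6) * np.cos(y / 9) + 0.5 * np.sin((x + 2 * y) / 13))

y, x = np.mgrid[0:64, 0:64].astype(float)
ref = scene(x, y)
img = scene(x - 4, y + 4)  # same scene, content moved 4 px right and 4 px up
print(register(ref, img, levels=2))
```

Under this reading, the extension of Sect. 2.4 would amount to appending enlarged copies of both images as additional, finer levels of the same hierarchy and dividing the final shift by the zoom factor, so that a 4× enlargement yields steps of a quarter pixel.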
¹ This can be accelerated considerably by employing bitwise operations on 32- or even 64-bit processor register words, but this gain is not as significant for small pictures like those present in our case.

3 Experiments

As noted in Sect. 2.2, the rapidly changing illumination conditions heavily influence the ability to register the images, which is in turn needed for any retrospective estimation of the lighting conditions. To focus on this problem first, the initial tests were done on image sequences showing the almost finished buildings of the campus. The changes appearing even in these rather static scenes proved to be challenging enough before moving on to the construction scenes, which are even more difficult to process. Some considerable influences on the comparability of the images, like the presence of sunblinds, had surprisingly little impact on the registration quality (see Fig. 4).

Fig. 4. Even these sunblinds caused only one misregistration in a sequence of 520 images

Fig. 5. Heavily changing scenery causes misregistration

Evaluation was done by visual inspection. At a frame rate of 25 fps these sequences play for approx. 20 s, and even slight misregistrations are easily observed in the material during playback. Most sequences in which about 40-50% of the visible area did not change too much registered almost perfectly. On average, such scenes (like the one shown in Fig. 4) had 1 to 3 misregistrations within 520 images. Sequences with larger areas of change (e.g. Fig. 5) suffered from a large number of misregistrations, whose share depended on the type of change and sometimes even exceeded 50%.
4 Conclusion

The registration method introduced in [7] proves to be very well suited to our need to register image series with heavily varying lighting conditions. Our extension to achieve a limited amount of sub-pixel accuracy further improved the quality of the results. For our given scenario of known camera positions, the current registration problems with heavily changing scenes can easily be avoided by introducing exclusion masks for the changing areas. Such masks integrate seamlessly with the registration algorithm. In a further step, attempts could be made to determine such masks automatically.

Based on these results, the sequences of the construction phase, which are more difficult to process, will be approached. By splitting these sequences into several subsequences with only tolerable change over time, they could be registered similarly to the later scenes. The alignment of the subsequences will be addressed with a hierarchical approach. After the registration, the illumination intensity and color changes will be addressed.
References
1. François, A.R., Medioni, G.G.: Adaptive color background modeling for real-time segmentation of video streams. In: Proceedings of the International Conference on Imaging Science, Systems and Technology, Las Vegas, NA, pp. 227–232 (1999)
2. Wu, Z., Chen, C.: A new foreground extraction scheme for video streams. In: MULTIMEDIA ’01. Proceedings of the ninth ACM international conference on Multimedia, New York, NY, USA, pp. 552–554. ACM Press, New York, NY, USA (2001)
3. Wesolkowski, S., Fieguth, P.W.: Adaptive color image segmentation using markov random fields. In: ICIP 2002, International Conference on Image Processing, Proceedings. vol. III, pp. 769–772 (2002)
4. Droege, D., Hong, V., Paulus, D.: Farbnormierung auf langen Bildfolgen. In: Franke, K.H. (ed.) 9. Workshop Farbbildverarbeitung, Ilmenau, Zentrum für Bild- und Signalverarbeitung e.V, pp. 107–112 (2003)
5. Brown, M., Lowe, D.G.: Recognising panoramas. In: ICCV ’03. Proceedings of the Ninth IEEE International Conference on Computer Vision, Washington, DC, USA, IEEE Computer Society, Los Alamitos. 1218 (2003)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
7. Ward, G.J.: Fast, robust image registration for compositing high dynamic range photographs from hand-held exposures. Journal of graphics tools 8(2), 17–30 (2003)
Graph Similarity Using Interfering Quantum Walks
David Emms, Edwin R. Hancock, and Richard C. Wilson
Department of Computer Science, University of York, YO10 5DD, UK
Abstract. We consider how continuous-time quantum walks can be used for graph matching, both exact and inexact, and measuring graph similarity. Our approach is to simulate the quantum walk on the two graphs in parallel by using an auxiliary graph that incorporates both graphs. The auxiliary graph allows quantum interference to take place between the two walks. Modelling the resultant interference amplitudes, which result from the differences in the two walks, we calculate probabilities for matches between pairs of vertices from the graphs. Using the Hungarian algorithm on these probabilities we recover a mapping between the graphs. To calculate graph similarity, we combine these probabilities with edge consistency information to give a consistency measure. We analyse our approach experimentally using synthetic graphs.
1 Introduction
Quantum algorithms have attracted considerable attention in the theoretical computer science community, primarily because of the speed-up they achieve over classical algorithms [6,11]. However, quantum algorithms also have a richer structure than their classical counterparts, since they use quantum rather than classical states as their basic representational unit. For example, the state of a quantum walk is complex-valued, whereas the state of a classical random walk takes only non-negative real values. In addition, (quantum) interference is an important feature of quantum walks, and it plays a central role in our algorithm.

From a practical perspective, there have been a number of useful applications of random walks in, for example, the analysis of routing problems in network and circuit theory. Of more recent interest is the use of ideas from random walks to define the page-rank index for internet search engines such as Google [1]. In the pattern recognition community there have been several attempts to use random walks for graph matching. These include the work of Robles-Kelly and Hancock, which has used both a standard spectral method and a more sophisticated one based on ideas from graph seriation to convert graphs to strings, so that string matching methods may be used [10]. Meila and Shi use a random walk based on pairwise similarities between image pixels to carry out clustering and thus segmentation of images [9]. Gori, Maggini and Sarti [5], on the other hand, have used ideas borrowed from page-rank to associate a spectral index with graph nodes and have then used standard subgraph isomorphism methods for matching the resulting attributed graphs.

Quantum walks have been introduced as quantum counterparts of random walks [7] and possess a number of interesting properties not exhibited by classical random walks. The amplitudes of the paths of quantum walks have been used to define a matrix representation of graphs that is able to lift the cospectrality of certain classes of graphs that are typically hard to distinguish [2]. Also, quantum commute times have been used to successfully embed graphs in two dimensions rather than in a submanifold of 2D space, as can occur when classical commute times are used [4].

In this paper we continue previous work which introduced a novel auxiliary graph structure for graph matching. We make use of the continuous-time quantum walk, a model that is significantly different in structure from the discrete-time quantum walk. The continuous-time quantum walk has a number of advantages over its discrete relative. The auxiliary structure can be designed so that the walks interfere continuously rather than only at a final stage. In addition, complex-valued states can be utilized for the walk. Our approach is to connect every vertex of one graph to every vertex of the other graph via ‘auxiliary vertices’. A walk is then simulated on this new graph and the auxiliary vertices serve as sites on which the walks interfere. We model the resulting amplitudes on these interference sites to calculate probabilities for vertex-vertex matches between the graphs. Using the Hungarian algorithm [8] on these probabilities, we recover the mapping between the graphs. We also show how the probabilities can be combined with edge-connectivity constraints to construct a similarity measure. We carry out experiments on sets of synthetic graphs in order to evaluate our approach.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 823–831, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Graphs
Let $G = (V, E)$ be an unweighted graph with vertex set $V$ and edge set $E = \{\{u, v\} \mid u, v \in V,\ u \text{ adjacent to } v\}$. The graph has an adjacency matrix, $W$, with $W(u, v) = 1$ if $\{u, v\} \in E$, and 0 otherwise. Let $n = |V|$ be the total number of vertices in the graph. We define the degree matrix to be $D = \mathrm{diag}(d(1), d(2), \ldots, d(n))$, where $d(u) = |\{v \mid u \sim v\}|$ is the degree of vertex $u$. The Laplacian matrix, $L = W - D$, has elements

$$L_{uv} = \begin{cases} 1 & \text{if } \{u, v\} \in E; \\ -d(u) & \text{if } u = v; \\ 0 & \text{otherwise.} \end{cases}$$

The Laplacian matrix is used to define the time evolution of the quantum walk on the graph. For a brief description of the classical random walk see [4].
2.1 Quantum Random Walk on a Graph
The state space for the continuous-time quantum random walk is the set of vertices, $V$. However, the state is described by a complex state vector, which we write (using Dirac’s notation) as $|\psi_t\rangle \in \mathbb{C}^{|V|}$. This can be written componentwise as $|\psi_t\rangle = \sum_{u \in V} a_u(t)|u\rangle$, where the squared magnitude of $a_u(t)$ gives the probability of the walk being at $u \in V$ at time $t$ according to the rule $P(X_t = u) = a_u a_u^*$, where $a_u^*$ is the complex conjugate of $a_u$. The axioms of probability give that $|a_u(t)| \in [0, 1]$ for all $u \in V$, $t \in \mathbb{R}^+$, and $\sum_{u \in V} a_u a_u^* = 1$. Transitions occur only between adjacent vertices and at a rate $\mu$. The evolution of the state vector is given by

$$\frac{d}{dt}|\psi_t\rangle = -i\mu L|\psi_t\rangle.$$

Since the evolution of the probability vector of the walk at time $t$ depends on the state vector of the walk (not merely the probability vector), the quantum walk is not a Markov chain. However, given an initial state, $|\psi_0\rangle$, we get that $|\psi_t\rangle = e^{-i\mu L t}|\psi_0\rangle$.
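The evolution rule above can be illustrated with a minimal pure-Python sketch. It approximates $e^{-i\mu L t}|\psi_0\rangle$ by a truncated Taylor series, which is only one possible way to simulate the walk and is chosen here for self-containment; the function names `laplacian`, `evolve` and `probabilities`, the truncation length and the example path graph are assumptions of this sketch, not taken from the paper.

```python
def laplacian(adj):
    """Return L = W - D for a 0/1 adjacency matrix given as a list of lists."""
    n = len(adj)
    return [[adj[u][v] - (sum(adj[u]) if u == v else 0) for v in range(n)]
            for u in range(n)]


def evolve(adj, psi0, t, mu=1.0, terms=80):
    """Approximate exp(-i * mu * L * t) |psi0> with a truncated Taylor series."""
    lap = laplacian(adj)
    n = len(adj)
    psi = [complex(a) for a in psi0]
    term = list(psi)  # Taylor term of order 0
    for k in range(1, terms):
        # term_k = (-i * mu * t / k) * L * term_{k-1}
        term = [(-1j * mu * t / k) * sum(lap[u][v] * term[v] for v in range(n))
                for u in range(n)]
        psi = [a + b for a, b in zip(psi, term)]
    return psi


def probabilities(psi):
    """P(X_t = u) = a_u * conj(a_u) for every vertex u."""
    return [(a * a.conjugate()).real for a in psi]


# Path graph 0 - 1 - 2 with the walk started on the middle vertex.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
probs = probabilities(evolve(path, [0, 1, 0], t=1.0))
```

Because $L$ is real and symmetric, the evolution is unitary, so the entries of `probs` should sum to one up to rounding error. For larger graphs or longer times an eigendecomposition of $L$ would be the more usual choice.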
3 Algorithm
The algorithm can be broken down into a number of separate parts. An auxiliary graph is created from the two input graphs. A quantum walk is simulated on the auxiliary graph, which produces a set of $|V_1||V_2|$ complex-valued interference amplitudes, each one corresponding to one of the possible matches between a pair of vertices from the two graphs. (Non-zero interference amplitudes arise from differences in the amplitudes of the walk between a pair of vertices from the two graphs.) The probabilities that each of these interference amplitudes corresponds to a true match are calculated. To carry out graph matching, the Hungarian method is applied to the reciprocals of these probabilities. Alternatively, the similarity measure is calculated using these probabilities and edge consistencies from the two graphs. The component parts of the algorithm are described in the following subsections.

3.1 Auxiliary Matching Structure
Given two graphs, $G$ and $H$, an auxiliary graph, $\Gamma(G, H)$, is created. The auxiliary graph is symmetric with respect to interchanging its two arguments (i.e. $\Gamma(G, H) \cong \Gamma(H, G)$), and its structure is shown in Figure 1. Let $G = (V_G, E_G)$ and $H = (V_H, E_H)$, then $\Gamma = (V_\Gamma, E_\Gamma)$, where the vertex and edge sets can be decomposed such that $V_\Gamma = V_G \cup V_H \cup V_A$ and $E_\Gamma = E_G \cup E_H \cup E_A$. The auxiliary vertices, $V_A$, serve to connect the two graphs and act as sites on which the interference takes place. The auxiliary edges connect each of the $|V_G||V_H|$ pairs of vertices, one vertex from each graph, by way of one of the auxiliary vertices. That is,

$$V_A = \{v_{\{g_i, h_j\}} \mid g_i \in V_G,\ h_j \in V_H\}$$

and

$$E_A = \{\{g_i, v_{\{g_i, h_j\}}\}, \{h_j, v_{\{g_i, h_j\}}\} \mid g_i \in V_G,\ h_j \in V_H\}.$$

The vertices $g_i \in V_G$ and $h_j \in V_H$ are linked via an auxiliary vertex, denoted $v_{\{g_i, h_j\}}$. The auxiliary graph is similar to the association graph; however, information about the structure of the two graphs comes from incorporating the original graphs themselves rather than through the connections between the auxiliary vertices.
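The construction of $\Gamma(G, H)$ can be sketched as follows. The vertex ordering (vertices of $G$ first, then those of $H$, then one auxiliary vertex per pair) and the names `auxiliary_graph` and `aux_index` are assumptions made for this illustration only.

```python
def auxiliary_graph(adj_g, adj_h):
    """Build the adjacency matrix of Gamma(G, H).

    Returns the matrix and a dict mapping each pair (g, h) to the index of
    its auxiliary vertex. Vertex order (an assumption of this sketch):
    vertices of G, then vertices of H, then the auxiliary vertices.
    """
    ng, nh = len(adj_g), len(adj_h)
    n = ng + nh + ng * nh
    adj = [[0] * n for _ in range(n)]
    # Copy the edges of G and H unchanged.
    for i in range(ng):
        for j in range(ng):
            adj[i][j] = adj_g[i][j]
    for i in range(nh):
        for j in range(nh):
            adj[ng + i][ng + j] = adj_h[i][j]
    # One auxiliary vertex per pair (g, h), joined to g and to h only.
    aux_index = {}
    for g in range(ng):
        for h in range(nh):
            a = ng + nh + g * nh + h
            aux_index[(g, h)] = a
            adj[g][a] = adj[a][g] = 1
            adj[ng + h][a] = adj[a][ng + h] = 1
    return adj, aux_index


# Two single-edge graphs give 2 + 2 + 4 = 8 vertices in the auxiliary graph.
edge = [[0, 1], [1, 0]]
gamma, aux_index = auxiliary_graph(edge, edge)
```

In this layout every auxiliary vertex has degree two, and each original vertex gains one extra neighbour per vertex of the other graph.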
Fig. 1. The auxiliary graph, $\Gamma(G, H)$, showing the vertices $g_1, g_2, g_3 \in V_G$ and $h_1, h_2, h_3 \in V_H$ connected by way of auxiliary vertices

Fig. 2. The probabilities associated with the interference for a particular vertex showing the time, $T$, of the first maximum in the interference amplitudes (axes: time, 0 to 3, against probability, 0 to $6 \times 10^{-5}$)

3.2 The Quantum Walk on the Auxiliary Graph
The state of the quantum walk on $\Gamma$ at time $t$ is given by $|\psi_t\rangle = \sum_{u \in V_\Gamma} a_u(t)|u\rangle$. The starting state has amplitudes

$$a_u(0) = \begin{cases} \dfrac{d(u)}{C} & \text{if } u \in V_G; \\ -\dfrac{d(u)}{C} & \text{if } u \in V_H; \\ 0 & \text{otherwise,} \end{cases}$$

where $C = \sqrt{\sum_{u \in V_G \cup V_H} d(u)^2}$ is a normalisation constant, so that $\sum_{u \in V_\Gamma} a_u(0) a_u^*(0) = 1$. This starting state is used rather than a starting state that is constant and positive on the vertices of one graph and constant and negative on the vertices of the other, since such a state is a stationary state of the walk. The state of the walk at time $T$ is calculated, where $T$ is the time when the first maximum in probability is achieved by one of the interference amplitudes, as shown in Figure 2.

If there is an isomorphism, $\rho : V_G \to V_H$, between the two graphs, then $a_{v_{\{g,h\}}}(t) = 0$ for all $t \in \mathbb{R}^+$ whenever $\rho(g) = h$. If, however, some noise has been applied, we still expect $a_g(t) \approx -a_h(t)$ (the two walks start with opposite signs), thus $a_{v_{\{g,h\}}}(t)$ will be close to zero in comparison to the other interference amplitudes. That this is the case is demonstrated in the following section, in which we model the amplitudes for ‘matching’ and ‘non-matching’ pairs of vertices for graphs subject to noise.

3.3 Modelling the Interference Amplitudes
In this subsection we provide a model for the amplitudes for ‘true’ and ‘false’ matches between vertices and use this model to calculate the probability that an observed amplitude corresponds to a true pairing of vertices from the two graphs.
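Before turning to the model, the claim of Section 3.2 that true matches between isomorphic graphs give zero interference amplitudes can be checked numerically. The sketch below is self-contained and makes several assumptions that are not stated in the paper: $d(u)$ is taken to be the degree of $u$ in its own input graph, the walk is evaluated at a fixed short time rather than at the time $T$ of the first maximum, and the matrix exponential is approximated by a truncated Taylor series. All function names are hypothetical.

```python
import math


def auxiliary_graph(adj_g, adj_h):
    """Adjacency matrix of Gamma(G, H) and the index of each auxiliary vertex."""
    ng, nh = len(adj_g), len(adj_h)
    n = ng + nh + ng * nh
    adj = [[0] * n for _ in range(n)]
    for i in range(ng):
        for j in range(ng):
            adj[i][j] = adj_g[i][j]
    for i in range(nh):
        for j in range(nh):
            adj[ng + i][ng + j] = adj_h[i][j]
    aux = {}
    for g in range(ng):
        for h in range(nh):
            a = ng + nh + g * nh + h
            aux[(g, h)] = a
            adj[g][a] = adj[a][g] = 1
            adj[ng + h][a] = adj[a][ng + h] = 1
    return adj, aux


def evolve(adj, psi0, t, mu=1.0, terms=80):
    """Truncated Taylor series for exp(-i * mu * L * t) |psi0>, with L = W - D."""
    n = len(adj)
    lap = [[adj[u][v] - (sum(adj[u]) if u == v else 0) for v in range(n)]
           for u in range(n)]
    psi = [complex(a) for a in psi0]
    term = list(psi)
    for k in range(1, terms):
        term = [(-1j * mu * t / k) * sum(lap[u][v] * term[v] for v in range(n))
                for u in range(n)]
        psi = [a + b for a, b in zip(psi, term)]
    return psi


def interference_amplitudes(adj_g, adj_h, t, mu=1.0):
    """Amplitude on each auxiliary vertex v_{g,h} at time t."""
    ng, nh = len(adj_g), len(adj_h)
    adj, aux = auxiliary_graph(adj_g, adj_h)
    # Assumption: d(u) is the degree of u within its own input graph.
    deg = [sum(row) for row in adj_g] + [sum(row) for row in adj_h]
    c = math.sqrt(sum(d * d for d in deg))
    psi0 = ([d / c for d in deg[:ng]] + [-d / c for d in deg[ng:]]
            + [0.0] * (ng * nh))
    psi_t = evolve(adj, psi0, t, mu)
    return {pair: psi_t[index] for pair, index in aux.items()}


# G is the path 0 - 1 - 2; H is the same path relabelled so that 0 is the centre.
g = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
amps = interference_amplitudes(g, h, t=0.01)
```

In this example the pairs (centre, centre) and (end, end) are matched by some isomorphism and their amplitudes vanish up to rounding error, whereas pairs that mix an end vertex with the centre acquire a non-zero amplitude.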
To provide related graphs, an initial graph, $G$, with $n$ vertices is generated by randomly connecting each pair of vertices in the graph with probability $p$. A related graph, $H$, is generated by adding and removing at random $e$ edges from $G$. Figure 3 shows the distribution of the interference amplitudes for such a pair of graphs. We see that the false matches are the most widely distributed and are in a banded structure. The true matches which are not directly affected by noise all appear in the central band, and the true matches which have been directly affected by noise lie close to the central band. Thus we can immediately assign a probability of zero to the matches corresponding to the amplitudes outside the central band and concentrate only on amplitudes in the central band. Also, as the amount of noise increases, fewer true matches remain in the central band and, as will become clear, this will improve the performance of the measure. The histograms representing the distribution of the amplitudes for true and false matches in the central band are shown in Figure 4.

Let $a = x + iy$ be a particular amplitude and let $X$ and $Y$ be the sets of real and imaginary components for some set of interference amplitudes (we will write $X_T$ and $X_F$ etc. if we need to distinguish between the sets for true and false matches). Let $\mu_Z$ and $\sigma_Z$ be the mean and standard deviation of a set $Z$. To calculate the probability that a particular amplitude corresponds to a true match, we model the real and imaginary parts of the amplitudes representing true matches as originating from a bivariate Gaussian distribution with zero mean. Thus

$$P(x, y \mid t) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left(-\frac{z}{2(1-\rho^2)}\right), \quad \text{where } z = \frac{x^2}{\sigma_X^2} + \frac{y^2}{\sigma_Y^2} - \frac{2\rho x y}{\sigma_X\sigma_Y}$$

and $\rho = \dfrac{\mu_{XY} - \mu_X\mu_Y}{\sigma_X\sigma_Y}$ is the correlation of the real and imaginary parts. The same model is used for the probability distribution, $P(x, y \mid f)$, for amplitudes resulting from false matches, but with different values of the standard deviations and correlation.
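The zero-mean bivariate Gaussian density and the correlation estimate can be written directly from the formulas above. The following is a minimal sketch; the function names and the use of population (rather than sample) statistics are assumptions of this illustration.

```python
import math


def bivariate_gaussian(x, y, sigma_x, sigma_y, rho):
    """Zero-mean bivariate Gaussian density P(x, y) with correlation rho."""
    z = ((x / sigma_x) ** 2 + (y / sigma_y) ** 2
         - 2.0 * rho * x * y / (sigma_x * sigma_y))
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - rho ** 2)
    return math.exp(-z / (2.0 * (1.0 - rho ** 2))) / norm


def correlation(xs, ys):
    """rho = (mean(xy) - mean(x) * mean(y)) / (sigma_x * sigma_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return (mean_xy - mean_x * mean_y) / (sigma_x * sigma_y)


def amplitude_likelihood(amplitude, sigma_x, sigma_y, rho):
    """Density of a complex amplitude a = x + iy under the model."""
    return bivariate_gaussian(amplitude.real, amplitude.imag,
                              sigma_x, sigma_y, rho)
```

Calling `amplitude_likelihood` once with the parameters for true matches and once with those for false matches gives the two conditional densities $P(x, y \mid t)$ and $P(x, y \mid f)$.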
Fig. 3. Scatter plot of interference amplitudes (axes: real parts against imaginary parts; series: false matches, true matches with no edges changed, true matches with edges changed)

Fig. 4. Histogram of the real part of the interference amplitudes for true and false matches (axes: real part against frequency)

It can be seen from Figure 4 that $\sigma_{X_T}$ and $\sigma_{X_F}$ differ markedly, as do $\sigma_{Y_T}$ and $\sigma_{Y_F}$. Clearly, in practice, we cannot distinguish between the amplitudes for true and false matches in order to determine the standard deviations, thus they must be estimated from measurements of the distribution of the two sets together. The variance and fourth central moment of the observed distributions of the amplitudes could be used to obtain estimates for all of the standard deviations [3]; however, the differences in magnitudes lead to inaccurate estimates. Furthermore, we have found that the consistency measure which we will outline below is robust with respect to the estimates used, and so for these, and the correlations, we set $10\sigma_{R_f} = \sigma_{R_t}$, $10\sigma_{I_f} = \sigma_{I_t}$, $\rho_t = 0.16$ and $\rho_f = -0.75$ based on our analysis. The difference between $\rho_t$ and $\rho_f$, which is apparent in Figure 3, makes the bivariate Gaussian distributions particularly powerful at modelling the probabilities for the interference amplitudes for true and false matches.

Given a particular amplitude $\alpha = x + iy$ and the conditional probabilities of observing this amplitude if the match is true, $P(x, y \mid t)$, and if it is false, $P(x, y \mid f)$, Bayes’ rule can be used to calculate the probability that a particular amplitude corresponds to a true match. Let $N$ be the number of interference amplitudes in the central band. We have that

$$p(\alpha) = p(\alpha \mid t)\,p(t) + p(\alpha \mid f)\,p(f),$$

where $p(t) = \frac{|V|}{N}$ and $p(f) = 1 - \frac{|V|}{N}$. Note that this slightly overestimates $p(t)$ and underestimates $p(f)$, as there are actually fewer than $|V|$ true matches in the central band, but this does not greatly affect the similarity measure. Thus, the probability that some interference amplitude corresponds to a true match is given by

$$p(t \mid \alpha) = \frac{p(\alpha \mid t)\,p(t)}{p(\alpha)}.$$
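The Bayes step can be sketched in a few lines. The function below takes the two conditional densities as inputs, so it does not depend on how they were obtained; the name `posterior_true` and the handling of a zero denominator are assumptions of this sketch.

```python
def posterior_true(lik_true, lik_false, n_vertices, n_band):
    """p(t | alpha) from the two likelihoods and the priors |V|/N and 1 - |V|/N."""
    p_t = n_vertices / n_band
    p_f = 1.0 - p_t
    evidence = lik_true * p_t + lik_false * p_f
    if evidence == 0.0:
        return 0.0  # amplitude is impossible under both models
    return lik_true * p_t / evidence


# With equal likelihoods the posterior falls back to the prior |V| / N.
example = posterior_true(1.0, 1.0, n_vertices=10, n_band=100)
```

When the two likelihoods are equal the amplitude carries no information and the result is simply the prior; the posterior moves towards 1 as the true-match likelihood dominates.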
4 Correspondence Measure
We combine the probabilities for vertex matches with structural information in order to give a ‘correspondence measure’, which quantifies the quality of the match between the graphs. Consider two pairs of vertices $(g_1, h_1)$ and $(g_2, h_2)$, where $g_1, g_2 \in V_G$ and $h_1, h_2 \in V_H$, corresponding to two possible matches with probabilities $p(g_1, h_1)$ and $p(g_2, h_2)$ respectively. The quantity $p(g_1, h_1)\,p(g_2, h_2)$ will in general be larger when both $(g_1, h_1)$ and $(g_2, h_2)$ are correct matches than when only one, or neither, of them is. Thus we can define a correspondence measure by

$$M_E(G, H) = \sum_{\{g_1, g_2\} \in E_G} \ \sum_{\{h_1, h_2\} \in E_H} p(g_1, h_1)\,p(g_2, h_2).$$

Note that this sum need only be carried out over the sets of interference amplitudes with a non-zero probability that they are correct matches and can thus be calculated in time $O(N^2)$.
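A direct implementation of the double sum is sketched below. Since the edges are unordered, the formula leaves open which end of $\{g_1, g_2\}$ is paired with which end of $\{h_1, h_2\}$; this sketch assumes both pairings are summed, which is an interpretation rather than something the text specifies. The function name and the edge-list input format are likewise hypothetical.

```python
def correspondence_measure(edges_g, edges_h, p):
    """Sum p(g1, h1) * p(g2, h2) over all pairs of edges from G and H.

    edges_g, edges_h: lists of (u, v) tuples, one per undirected edge.
    p: matrix with p[g][h] the probability that g matches h.
    Assumption: both orientations of each edge pair contribute.
    """
    total = 0.0
    for g1, g2 in edges_g:
        for h1, h2 in edges_h:
            total += p[g1][h1] * p[g2][h2] + p[g1][h2] * p[g2][h1]
    return total


# A triangle compared with itself under a perfect (identity) matching.
triangle = [(0, 1), (0, 2), (1, 2)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
measure = correspondence_measure(triangle, triangle, identity)
```

With a perfect matching every edge of the triangle is consistent with exactly one edge of its copy, so the measure counts the three preserved edges.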
5 Experiments
We divide our experimental section into two parts. In the first part we evaluate the proposed method for graph matching and in the second part we evaluate our consistency measure.

We randomly generate graphs using the method described in Section 3.3 with $n$ between 10 and 58 and $p$ between 0.3 and 0.7. We then carry out a random permutation of the vertices and run the algorithm with the permuted and unpermuted graphs as the two inputs to try to recover the isomorphism. We assign unit cost for matching a pair of vertices with zero interference amplitude and infinite cost to those with non-zero interference amplitudes. Table 1 gives the proportion of false matches which are assigned unit matching cost and the proportion of complete correct graph matches. For graphs with few vertices a large proportion of false matches between vertices have zero interference amplitudes; however, this proportion decreases rapidly as the number of vertices increases. Moreover, we are always able to recover the isomorphism despite this small proportion of mis-identified false matches.

Table 1. The proportion of false matches with zero interference amplitudes and the rate of successful reconstruction of the isomorphism between the two graphs

Vertices   Mis-identified false matches   Successful graph match rate
10         0.316                          1
12         0.168                          1
14         0.064                          1
16         0.02                           1
18         0.016                          1
20         0                              1
22         0                              1
24         0.004                          1
26         0                              1
28         0                              1
30         0.008                          1
32         0.116                          1
34         0.08                           1
36         0.06                           1
38         0.028                          1
40         0.06                           1
42         0                              1
44-58      0                              1
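The assignment step used in these experiments can be sketched as follows. The function `hungarian` is a standard $O(n^3)$ formulation of the Hungarian method for a square cost matrix, written out in pure Python for self-containment. A large finite penalty stands in for the infinite cost mentioned above, because infinite entries would break the arithmetic; the penalty value, the zero tolerance and the example amplitudes are assumptions of this sketch.

```python
PENALTY = 1e6  # finite stand-in for the infinite cost (assumption of this sketch)


def hungarian(cost):
    """Return a list m with m[row] = column minimising the total cost."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (n + 1)   # column potentials
    p = [0] * (n + 1)     # p[j] = row currently assigned to column j (1-based)
    way = [0] * (n + 1)
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Augment along the alternating path that was found.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    mapping = [0] * n
    for j in range(1, n + 1):
        mapping[p[j] - 1] = j - 1
    return mapping


def match_costs(amplitudes, tol=1e-9):
    """Unit cost for (numerically) zero amplitudes, a large penalty otherwise."""
    return [[1.0 if abs(a) < tol else PENALTY for a in row] for row in amplitudes]


# Hypothetical interference amplitudes for a pair of three-vertex graphs.
amplitudes = [[0.0, 0.3, 0.0], [0.2, 0.0, 0.1], [0.0, 0.4, 0.5]]
mapping = hungarian(match_costs(amplitudes))
```

Here `mapping[g]` is the vertex of the second graph assigned to vertex `g` of the first. For the noisy case described in Section 3, the cost matrix would instead hold the reciprocals of the match probabilities.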
We now consider the performance of the algorithm in the presence of noise. Given a graph obtained as before, we permute the vertices and then add and remove random edges in order to create graphs with 1%, 2%, 3%, 5% and 10% noise. Figure 5 shows the performance of the algorithm as we increase the number of vertices and the noise. As before, we are always able to recover the permutation when there is no noise. In the presence of noise, even 10%, we are still able to recover the correct permutation in almost all cases, and this recovery rate even shows a slight improvement as the number of vertices increases.

We now provide an experimental evaluation of the performance of the similarity measure. Again we randomly generate graphs, this time with edit distance from the original of 0 to 20 (approximately 9% noise). We calculate the similarity measure between the original graph and each of the related graphs. We repeat this whole process for 100 sets of 21 graphs derived from randomly generated graphs. Figure 6 shows the mean similarity measure as a function of edit distance for this set of graphs and also the relative deviation. The mean of the similarity measure decreases monotonically as edit distance increases. Furthermore, at larger values of edit distance the rate of change of the similarity measure decreases; such behaviour could be useful in the field of robust statistics. At smaller edit distances, however, the similarity measure is sensitive to small changes in edit distance and its relative deviation is small, providing a good measure of graph similarity.
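The generation of test graphs described above can be sketched as follows. The text does not say exactly how the noise percentage maps to a number of edge edits, so the sketch simply toggles a given number of distinct vertex pairs (adding the edge if absent, removing it if present); that interpretation, the function names and the fixed random seed are all assumptions.

```python
import random


def random_graph(n, p, rng):
    """Connect each pair of the n vertices independently with probability p."""
    adj = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u][v] = adj[v][u] = 1
    return adj


def perturb(adj, e, rng):
    """Toggle e distinct vertex pairs, giving a graph at edit distance e."""
    n = len(adj)
    pairs = [(u, v) for u in range(n) for v in range(u + 1, n)]
    noisy = [row[:] for row in adj]
    for u, v in rng.sample(pairs, e):
        noisy[u][v] = noisy[v][u] = 1 - noisy[u][v]
    return noisy


def permute(adj, rng):
    """Relabel the vertices at random; vertex u becomes vertex perm[u]."""
    n = len(adj)
    perm = list(range(n))
    rng.shuffle(perm)
    out = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            out[perm[u]][perm[v]] = adj[u][v]
    return out, perm


rng = random.Random(0)          # fixed seed so the example is repeatable
original = random_graph(12, 0.5, rng)
related, perm = permute(perturb(original, 5, rng), rng)
```

The pair (`original`, `related`) then plays the role of the two inputs to the matching algorithm, with `perm` as the ground-truth mapping to be recovered.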
Vertices   Noise
           0%     1%      2%      3%      5%      10%
30         1      0.982   0.972   0.96    0.953   0.95
32         1      0.987   0.973   0.968   0.963   0.954
34         1      0.991   0.974   0.97    0.963   0.957
36         1      0.991   0.979   0.973   0.963   0.952
38         1      0.98    0.971   0.965   0.962   0.957
40         1      0.99    0.978   0.97    0.964   0.96
42         1      0.985   0.973   0.968   0.968   0.959
44         1      0.986   0.976   0.973   0.964   0.962
46         1      0.98    0.976   0.973   0.967   0.963
48         1      0.984   0.973   0.97    0.968   0.964
50         1      0.987   0.977   0.975   0.972   0.967
52         1      0.98    0.978   0.975   0.97    0.966
54         1      0.986   0.979   0.974   0.971   0.969
56         1      0.983   0.976   0.975   0.971   0.969

Fig. 5. The rate for successfully recovering the permutation applied to the graph as a function of noise and number of vertices

Fig. 6. The similarity measure as a function of edit distance (axes: edit distance, 0 to 20, against similarity measure)

6 Conclusion
In this paper we have looked at one of the ways in which the richer structure inherent in quantum processes can be utilized classically. We have described an auxiliary graph that can be used for the purpose of graph matching. We simulated a continuous-time quantum walk on this auxiliary structure, which allows the complex-valued amplitudes on the two graphs being compared to interfere. The differences in the amplitudes on the two graphs manifest themselves as non-zero interference amplitudes. By modelling these amplitudes using bivariate Gaussian distributions, the real and imaginary parts of the amplitudes, as well as the relationship between them, are used to calculate probabilities for matches between the graphs. We have shown that these probabilities can be used to recover the mapping between a pair of graphs even in the presence of significant noise, and that in our experiments the performance is not reduced as the number of vertices increases. These probabilities can also be used to calculate a similarity measure using consistencies in vertex connectivity.
References
1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30(1–7), 107–117 (1998)
2. Emms, D., Severini, S., Wilson, R.C., Hancock, E.: Coined quantum walks lift the co-spectrality of graphs and trees. In: Rangarajan, A., Vemuri, B., Yuille, A.L. (eds.) EMMCVPR 2005. LNCS, vol. 3757, pp. 332–345. Springer, Heidelberg (2005)
3. Emms, D., Wilson, R., Hancock, E.: Graph matching using interference of coined quantum walks. In: IEEE 17th International Conference on Pattern Recognition (August 2006)
4. Emms, D., Wilson, R., Hancock, E.: Graph embedding using quantum commute times. In: GbR (2007)
5. Gori, M., Maggini, M., Sarti, L.: Graph matching using random walks. In: IEEE 17th ICPR (August 2004)
6. Grover, L.: A fast quantum mechanical algorithm for database search. In: STOC ’96. Proc 28th ACM Theory of computing, pp. 212–219. ACM Press, New York (1996)
7. Kempe, J.: Quantum random walks – an introductory overview. Contemporary Physics 44(4), 307–327 (2003)
8. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 83–97 (1955)
9. Meila, M., Shi, J.: A random walks view of spectral segmentation (2001)
10. Robles-Kelly, A., Hancock, E.: Graph edit distance from spectral seriation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 365–378 (2005)
11. Shor, P.W.: Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput. 26, 1484–1509 (1997)
Visual Speech Recognition Using Motion Features and Hidden Markov Models

Wai Chee Yau¹, Dinesh Kant Kumar¹, and Hans Weghorn²

¹ School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V, Melbourne, Victoria 3001, Australia
² Information Technology, BA-University of Cooperative Education, Stuttgart, Germany
[email protected]

Abstract. This paper presents a novel visual speech recognition approach based on motion segmentation and hidden Markov models (HMM). The proposed method identifies utterances from mouth video, without evaluating voice signals. The facial movements in the video data are represented using 2D spatial-temporal templates (STT). The proposed technique combines discrete stationary wavelet transform (SWT) and Zernike moments to extract rotation invariant features from the STTs. HMMs are used as the speech classifier to model English phonemes. The preliminary results demonstrate that the proposed technique is suitable for phoneme classification with a high accuracy.

Keywords: visual speech recognition, hidden Markov models, Zernike moments.
1 Introduction
Speech-based systems are emerging as attractive interfaces that provide the flexibility for users to control machines using speech. In spite of the advancements in speech technology, speech recognition systems are not widely used in mainstream human computer interfaces (HCI). The difficulty of audio-only speech recognition systems is their sensitivity to changes in acoustic conditions. The performance of such systems degrades when the acoustic signal strength is low, or in situations with high ambient noise levels. Possible methods to overcome these limitations include visual sensing [11,14] and recording of facial muscle activity [1]. The vision-based techniques are more desirable options, as such techniques are non-intrusive.

Research in which audio visual speech recognition (AVSR) systems are made more robust and able to recognize complex speech patterns has been reported [10,5]. While AVSR systems are suitable for applications such as telephony in noisy environments, such systems are not useful when it is essential to maintain silence. The need therefore arises for visual-only, voice-less communication systems. Such systems are also known as visual speech recognition (VSR) systems.

This research proposes to use video data related to facial movements for VSR applications. The possible advantages of a VSR system are that it is: (i) not affected by audio noise, (ii) not affected by changes in acoustic conditions and (iii) does not require the user to make a sound. Visual features proposed in the literature can be categorized into shape-based, pixel-based and motion-based features. The shape-based features rely on the shape of the mouth. The first VSR system was developed by Petajan [9] using shape-based features such as height and width of the mouth. Researchers have reported on the use of artificial markers on the speaker’s face to extract lip contours [6]. The use of artificial markers is not suitable for practical speech-controlled applications. VSR systems that use pixel-based features assume that the pixel values around the mouth area contain salient speech information [8,11]. Pixel-based and shape-based features are extracted from static frames and can be viewed as static features. Motion-based features are features that directly utilize the dynamics of speech. Few researchers have focused on using motion-based features for VSR. Nevertheless, the dynamic features are reported to be the most discriminative when static and motion features are compared [4].

This paper proposes a novel VSR technique based on motion features extracted using spatial-temporal templates (STT). One of the advantages of the proposed technique is that it does not require the use of artificial markers on the speaker’s face. This paper proposes a system where the camera is attached in place of the microphone to commonly available headsets. Potamianos et al. [11] have demonstrated that using mouth videos captured from cameras attached to a wearable headset produces better results than full face videos. Another advantage of this arrangement is that it is no longer necessary to identify the region of interest, reducing the computation required.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 832–839, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 Theory

2.1 Visual Speech Model
This paper models visual speech based on visemes. Visemes are the atomic units of visual movements associated with phonemes (basic units of speech sound). Visemes can be concatenated to form words and sentences, thus providing the flexibility to expand the vocabulary of the system. The pronunciation of different speech sounds (such as /p/ and /b/) may be associated with identical visible facial movements. Thus, each viseme may correspond to more than one phoneme, resulting in a many-to-one mapping of phonemes to visemes. This paper adopts a viseme model established for facial animation applications by an international audiovisual object-based video representation standard known as MPEG-4. Based on this model, the English phonemes can be mapped to 14 visemes as shown in Table 1.

Table 1. Viseme model of the MPEG-4 standard for English phonemes

Viseme number   Corresponding phonemes   Vowel or consonant   Example words
1               p, b, m                  consonant            put, bed, me
2               f, v                     consonant            far, voice
3               T, D                     consonant            think, that
4               t, d                     consonant            tick, door
5               k, g                     consonant            kick, gate
6               tS, dZ, S                consonant            chair, join, she
7               s, z                     consonant            sit, zeal
8               n, l                     consonant            need, lead
9               r                        consonant            read
10              A:                       vowel                car
11              e                        vowel                bed
12              I                        vowel                tip
13              Q                        vowel                top
14              U                        vowel                book

2.2 Motion Segmentation

The proposed technique performs motion segmentation on the video data to generate spatial-temporal templates (STT). STTs are grayscale images that show where and when facial movements occur in the video [2,14]. An STT is generated using an accumulative image difference approach. The facial movement is segmented by detecting the changes between consecutive frames. Intensity values of successive frames of the video are subtracted to generate the difference of frames (DOFs). The DOFs are converted to binary images, represented as the function $B(x, y)$, by thresholding the DOFs to obtain a change or no-change classification. $B(x, y)$ is assigned a pixel value of 1 at spatial coordinates where the intensity values of two consecutive frames are appreciably different based on a threshold value. The intensity value of the STT at pixel location $(x, y)$ is defined by

$$STT(x, y) = \max_{t = 1, \ldots, N-1} B_t(x, y) \times t \qquad (1)$$
where $N$ is the total number of frames of the mouth video and $B_t(x, y)$ represents the binarised version of the DOF of frame $t$. In Eq. 1, $B_t(x, y)$ is multiplied by a linear ramp of time to implicitly encode the temporal information of motion into the STT. Computing the STT values for all the pixel coordinates $(x, y)$ of the image sequence using Eq. 1 produces a grayscale image (STT) where the brightness of the pixels indicates the history of facial movements in the image sequence. Figure 1 illustrates the STTs of the fourteen visemes used in the experiments.

Fig. 1. Spatial-temporal templates (STT) of fourteen visemes based on the viseme model of MPEG-4 standard

The motivation for using STTs to segment the facial movements is the ability of the STT to remove static elements and preserve the short-duration facial movements in the video data. Also, the STT is insensitive to the speaker’s skin color due to the image subtraction process. This paper suggests a model to approximate the speed variations of speech by normalizing the overall duration of the utterance. The paper applies the discrete stationary wavelet transform (SWT) to the STT to obtain a transform representation that is insensitive to small variations between different videos of the same utterance. A 2-D SWT at level 1 is applied to the STT. The SWT decomposition of the STT generates four sub-images. The approximate (LL) sub-image is the smoothed version of the STT and is used to represent the STT.

2.3 Feature Extraction
The proposed technique adopts Zernike moments as features to represent the SWT LL image. Earlier work of the authors compares three features (Zernike moments, Hu moments and geometric moments) and demonstrates that Zernike moments have the best image representation ability and are least sensitive to rotational, translational and scale changes of the mouth [14]. Zernike moments are computed by projecting the image function $f(x, y)$ onto the orthogonal Zernike polynomial $V_{nl}$ of order $n$ with repetition $l$, defined within a unit circle (i.e. $x^2 + y^2 \le 1$). The Zernike moment $Z_{nl}$ of order $n$ and repetition $l$ is given by

$$Z_{nl} = \frac{n+1}{\pi} \int_0^{2\pi} \int_0^{1} [V_{nl}(\rho, \theta)]^{*} f(\rho, \theta)\, \rho\, d\rho\, d\theta \qquad (2)$$

where $|l| \le n$ and $(n - |l|)$ is even. $f(\rho, \theta)$ is the intensity distribution of the approximate image of the STT mapped to the unit circle, with radius $\rho$ and angle $\theta$, where $x = \rho\cos\theta$ and $y = \rho\sin\theta$. The main advantage of Zernike moments is the simple rotational property of the features [7]. Changes in the orientation of the mouth in the image result in a phase shift of the Zernike moments of the rotated image compared to the features of the non-rotated image [14]. Thus, the absolute value of a Zernike moment is invariant to rotation of the image pattern. This paper uses the absolute values of the Zernike moments as the rotation invariant features. 49 Zernike moments, comprising the 0th up to the 12th order moments, are extracted from the approximate image of the STT.

2.4 Classification Using Hidden Markov Models
An HMM is a finite state network based on stochastic processes. The strength of the left-right HMM lies in its ability to statistically model time-varying speech features [12]. The left-right HMM is a commonly used classifier in speech recognition [10]. This paper adopts single-stream, continuous HMMs to classify the motion features. Continuous HMMs are employed as opposed to discrete HMMs to avoid the loss of information that occurs in the quantization of the features. The motion features are assumed to be Gaussian distributed. Each viseme is modelled using a left-right HMM with three states, one Gaussian mixture component per state and a diagonal covariance matrix. During training of the HMMs, the unknown HMM parameter vectors, consisting of the transition probabilities and observation probabilities, are estimated iteratively from the training samples using the Baum-Welch algorithm. In the classification stage, the unknown motion features are presented to the 14 trained HMMs and the features are assigned to the viseme class whose HMM produces the output with the highest likelihood.
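The classification stage can be illustrated with a minimal sketch of the forward algorithm for a left-right HMM with one diagonal-covariance Gaussian per state. Training with Baum-Welch is not shown; the model parameters below are made-up values for illustration, and the dictionary layout (`stay`, `means`, `vars`) is an assumption of this sketch. How the feature vectors are arranged into observation sequences is not specified in this section, so the sketch simply accepts a list of vectors.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def logsumexp(values):
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_likelihood(obs, model):
    """Forward algorithm (log domain) for a left-right HMM starting in state 0.

    model["stay"][s] is the probability of remaining in state s; the
    remaining mass 1 - stay[s] moves to state s + 1 (the last state has 1.0).
    """
    stay, means, variances = model["stay"], model["means"], model["vars"]
    n = len(stay)
    alpha = [log_gauss_diag(obs[0], means[0], variances[0])] + [-math.inf] * (n - 1)
    for x in obs[1:]:
        new_alpha = []
        for s in range(n):
            paths = [alpha[s] + math.log(stay[s])]
            if s > 0:
                paths.append(alpha[s - 1] + math.log(1.0 - stay[s - 1]))
            new_alpha.append(logsumexp(paths)
                             + log_gauss_diag(x, means[s], variances[s]))
        alpha = new_alpha
    return logsumexp(alpha)


def classify(obs, models):
    """Return the label of the model with the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, models[label]))


# Two hypothetical three-state viseme models over one-dimensional features.
models = {
    "viseme_a": {"stay": [0.6, 0.6, 1.0],
                 "means": [[0.0], [1.0], [2.0]], "vars": [[1.0], [1.0], [1.0]]},
    "viseme_b": {"stay": [0.6, 0.6, 1.0],
                 "means": [[5.0], [6.0], [7.0]], "vars": [[1.0], [1.0], [1.0]]},
}
label = classify([[0.1], [0.9], [2.1]], models)
```

In the setting described above there would be 14 such models, one per viseme, each trained on the Zernike moment features.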
3 Methodology
Experiments were conducted to test the proposed visual speech recognition technique. Fourteen visemes from the viseme model of the MPEG-4 standard (highlighted in bold in Table 1) were evaluated. Each viseme represents one pattern of facial movement. The speaker pronounced each vowel or consonant in isolation. Far fewer audio-visual speech databases are publicly available than audio-only speech databases, and most of them are recorded in an ideal studio environment with controlled lighting. To evaluate the performance of the approach in a real-world environment, video data was recorded using an inexpensive web camera in a typical office environment. This was done with a view to a practical voiceless communication system based on low-resolution video recordings. The camera focused on the mouth region of the speaker and was kept stationary throughout the experiment. The following factors were kept the same during the recording of the videos: window size and view angle of the camera, background, and illumination. 280 video files (240 x 240 pixels) were recorded and stored as true color (.AVI) files. One STT was generated from each AVI file. An example of the STT for each viseme is shown in Figure 1. A level-1 SWT using the Haar wavelet was applied to the STTs and the approximation image (LL) was used for analysis. 49 Zernike moments were used as features to represent the SWT approximation image of each STT. The Zernike moment features were used to train the hidden Markov model (HMM) classifier. Each viseme is modelled using one HMM. The leave-one-out method was used in the experiment. The HMMs were trained with 266 training samples and evaluated on the 14 remaining samples (one sample from each viseme group). This process was repeated 20 times with different sets of training and testing data, and the average recognition rates of the HMMs over the 20 repetitions were computed. Figure 2 shows a block diagram of the proposed visual speech recognition technique.

Fig. 2. Block diagram of the proposed technique
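The evaluation protocol of this section can be sketched in a few lines of Python. The sketch assumes that the 280 recordings consist of 20 repetitions of each of the 14 visemes and that round k holds out the k-th repetition of every viseme; the section only states that different training and test sets were used, so this particular assignment and the function name are assumptions.

```python
def evaluation_rounds(n_visemes=14, n_repeats=20):
    """Yield (train, test) splits with one held-out sample per viseme."""
    # Each sample is identified by (viseme index, repetition index).
    samples = [(v, k) for v in range(n_visemes) for k in range(n_repeats)]
    for held_out in range(n_repeats):
        test = [s for s in samples if s[1] == held_out]
        train = [s for s in samples if s[1] != held_out]
        yield train, test

rounds = list(evaluation_rounds())
train_sizes = {len(train) for train, _ in rounds}
test_sizes = {len(test) for _, test in rounds}
```

With these defaults every round trains on 266 samples and tests on 14, and the reported recognition rate would be the mean of the per-round accuracies.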
4 Results and Discussion
The classification accuracies of the HMMs are tabulated in Table 2. The average recognition rate of the proposed visual speech recognition system is 88.2%. The results indicate that the proposed technique based on motion features is suitable for viseme recognition. The technique is highly accurate for vowel classification: an average success rate of 97% is achieved in recognizing vowels. The classification accuracies for consonants are slightly lower due to the poor recognition rate of one consonant, /n/. One possible reason for the misclassification of /n/ is the inability of a vision-based technique to capture the movements of occluded speech articulators. The movement of the tongue within the mouth cavity is not visible (it is occluded by the teeth) in the video data during the pronunciation of /n/. Thus, the STT of /n/ does not contain information on the tongue movement, which may have resulted in a high error rate for /n/. The same visual-only speech recognition task (based on the 14 visemes of the MPEG-4 standard) is tested in [3], and a similar error rate is obtained using shape-based features extracted from static images. Nevertheless, the errors made by the proposed system using motion features differ from the errors reported in [3], which uses static features. This indicates that complementary information exists in the static and dynamic features of visual speech. For example, the proposed system has a much lower error rate in identifying the visemes /m/, /t/ and /r/ using facial movement features than the results reported in [3]. This shows that motion features are better at representing phones that involve distinct facial movements (such as the bilabial movements of /m/). The static features of [3] yield better results in classifying visemes with ambiguous or occluded motion of the speech articulators, such as /n/. The results demonstrate a computationally inexpensive system that can easily be implemented on a DSP chip for voiceless communication applications. The proposed system has been designed for specific applications such as the control of machines using simple commands consisting of discrete utterances, without requiring the user to make a sound.

Table 2. Recognition rates of the proposed system based on the viseme model of the MPEG-4 standard

Viseme    Recognition Rate (%)
m         95
v         90
T         70
t         80
g         85
tS        95
s         95
n         40
r         100
A:        100
e         100
I         95
Q         95
U         95
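The summary figures quoted in the discussion can be recomputed from Table 2. The short Python sketch below does so; the grouping of A:, e, I, Q and U as the vowel visemes is an assumption inferred from the reported 97% vowel average, and the variable names are illustrative.

```python
# Recognition rates (%) copied from Table 2.
rates = {"m": 95, "v": 90, "T": 70, "t": 80, "g": 85, "tS": 95, "s": 95,
         "n": 40, "r": 100, "A:": 100, "e": 100, "I": 95, "Q": 95, "U": 95}

# Assumption: these five entries are the vowel visemes.
vowels = ["A:", "e", "I", "Q", "U"]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

overall = mean(rates.values())                                    # about 88.2
vowel_mean = mean(rates[v] for v in vowels)                       # 97.0
consonant_mean = mean(r for v, r in rates.items() if v not in vowels)
consonant_mean_without_n = mean(r for v, r in rates.items()
                                if v not in vowels and v != "n")
```

Under this grouping, leaving /n/ out raises the consonant average from about 83.3% to 88.75%, which shows how strongly this single viseme affects the consonant results.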
5 Conclusion
This paper reports on a technique to identify utterances from videos without using sound signals. A high recognition rate is obtained in classifying isolated English phonemes using hidden Markov models (HMMs). The results suggest that the proposed technique, based on the different patterns of facial movements, is suitable for identifying phonemes. For future work, the authors intend to test the method on a larger vocabulary set covering words. Furthermore, the investigation shall be extended from an English-speaking environment to other languages such as German. Potential applications of such a system are driving computerized machinery in noisy environments and voiceless communication.
References
1. Arjunan, S.P., Kumar, D.K., Yau, W.C., Weghorn, H.: Unspoken Vowel Recognition Using Facial Electromyogram. IEEE EMBC, New York (2006)
2. Bobick, A.F., Davis, J.W.: The Recognition of Human Movement Using Temporal Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 257–267 (2001)
3. Foo, S.W., Dong, L.: Recognition of Visual Speech Elements Using Hidden Markov Models. In: Chen, Y.-C., Chang, L.-W., Hsu, C.-T. (eds.) PCM 2002. LNCS, vol. 2532, pp. 607–614. Springer, Heidelberg (2002)
4. Goldschen, A.J., Garcia, O.N., Petajan, E.: Continuous Optical Automatic Speech Recognition by Lipreading. In: 28th Annual Asilomar Conf. on Signal Systems and Computer (1994)
5. Hazen, T.J.: Visual Model Structures and Synchrony Constraints for Audio-Visual Speech Recognition. IEEE Transactions on Audio, Speech and Language Processing 14(3), 1082–1089 (2006)
6. Kaynak, M.N., Qi, Z., Cheok, A.D., Sengupta, K., Chung, K.C.: Audio-visual modeling for bimodal speech recognition. IEEE Transactions on Systems, Man and Cybernetics 34, 564–570 (2001)
7. Khotanzad, A., Hong, Y.H.: Invariant Image Recognition by Zernike Moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 489–497 (1990)
8. Liang, L., Liu, X., Zhao, Y., Pi, X., Nefian, A.V.: Speaker Independent Audio-Visual Continuous Speech Recognition. In: IEEE Int. Conf. on Multimedia and Expo (2002)
9. Petajan, E.D.: Automatic Lip-reading to Enhance Speech Recognition. In: GLOBECOM'84 (1984)
10. Potamianos, G., Neti, C., Gravier, G., Senior, A.W.: Recent Advances in Automatic Recognition of Audio-Visual Speech. Proc. of IEEE 91 (2003)
11. Potamianos, G., Neti, C., Huang, J., Connell, J.H., Chu, S., Libal, V., Marcheret, E., Haas, N., Jiang, J.: Towards Practical Deployment of Audio-Visual Speech Recognition. In: ICASSP, IEEE (2004)
12. Rabiner, L.R.: A tutorial on HMM and selected applications in speech recognition. Proc. IEEE 77(2-2), 257–286 (1989)
13. Teh, C.H., Chin, R.T.: On Image Analysis by the Methods of Moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 10, 496–513 (1988)
14. Yau, W.C., Kumar, D.K., Arjunan, S.P.: Visual Speech Recognition Method Using Translation, Scale and Rotation Invariant Features. In: IEEE International Conference on Advanced Video and Signal based Surveillance, Sydney, Australia (2006)
Feature Extraction of Weighted Data for Implicit Variable Selection

Luis Sánchez, Fernando Martínez, Germán Castellanos, and Augusto Salazar

Control & Digital Signal Processing Group, Universidad Nacional de Colombia Sede Manizales
{lgsanchezg,fmartinezt,cgcastellanosd,aesalazarj}@unal.edu.co
Abstract. Approaches based on obtaining relevant information from overwhelmingly large sets of measures have recently been adopted as an alternative to specialized features. In this work, we address the problem of finding a relevant subset of features and a suitable rotation (combined feature selection and feature extraction) as a weighted rotation. We focus on two types of rotations: Weighted Principal Component Analysis and Weighted Regularized Discriminant Analysis. The objective is to maximize the J4 ratio. Tests were carried out on artificially generated classes with several non-relevant features. Tests on real data were also performed: segmentation of nailfold capillaroscopic images and the NIST-38 database (prototype selection).

Keywords: Relevance, Feature Selection, PCA, LDA.
1 Introduction
Many approaches based on obtaining relevant information from overwhelmingly large sets of measures have recently been adopted, replacing those that rely on specialized (expert-based) features, which are sometimes difficult to obtain. Reducing the size of the data, either by encoding or by removing irrelevant information, becomes necessary if one wants to achieve good performance in the inference. In the present work, we study weighting schemes for attaining subset selection by finding a relevant subset of features and a suitable rotation (feature selection and feature extraction) as a weighted rotation. The attention is then focused on a type of supervised Weighted Principal Component Analysis (WPCA) and on Weighted Regularized Discriminant Analysis (WRDA). Feature extraction methods such as PCA and LDA have been widely treated, and WPCA has been reported in image processing applications to improve on the results obtained with classic PCA [1]. Although the properties of changing the problem formulation of the PCs are discussed in [2], the estimation of the weight matrix is not addressed. Other works have directed their efforts towards variable selection using weights for ranking relevance [3], [4]; however, their work is not directly related to WPCA or WRDA. In our approach, we make use of the J4 feature selection criterion for assessing the relevance of a given set [5]. In order to test the effectiveness of the weighting algorithms, several tests on artificially generated data were carried out. Results are encouraging, since non-relevant variables received weights much closer to zero than with classic Regularized Discriminant Analysis (RDA). Experiments on real data encompass segmentation of nailfold capillaroscopic images and NIST-38 digit detection using dissimilarities (prototype selection).

This research has been made under grant of the project titled "Técnicas de Computación de Alto Rendimiento en la Interpretación Automatizada de Imágenes Médicas y Bioseñales", DIMA, Universidad Nacional de Colombia Sede Manizales.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 840–847, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Weighted Rotation, Variable Selection and Relevancy
The variable selection problem can be understood as selecting a subset of p features from a larger set of c measures. This type of search is guided by some evaluation function that has been defined as the relevancy of a given set [6]. Typically, such procedures involve exhaustive search (binary selection), so the relevance measure must take into account the number of dimensions p to state whether there is a significant improvement when dimensions are added or rejected. On the other hand, feature extraction techniques take advantage of the data to encode the representation efficiently in lower dimensions [7]. We can capitalize on this property to keep a fixed dimension (projected space) and assess the relevancy of the projected set. Binary search algorithms involve explicit enumeration of the subsets, and optimal selection is guaranteed only after exhaustive search. Suboptimal heuristics such as greedy search have been introduced to reduce the computational complexity, but the relevance criterion involves the size of the subset, and proper tuning of the objective function becomes problematic. Weighting schemes are more flexible in this sense. Although weights do not provide the optimal solution in most cases, they can reach fair results. We can combine feature extraction methods with weighted data to maintain a fixed set size and accommodate the weights in such a manner that relevance is maximized. Next, we describe the employed methodologies and algorithms.

2.1 PCA and Probabilistic PCA
Classic PCA. Principal Component Analysis has been the mainstream technique for data analysis in a large number of applications. Its attractiveness lies in its simplicity and its capacity for reducing dimensionality by minimizing the squared reconstruction error of a linear combination of latent variables, known as Principal Components. The model parameters can be computed directly from the centered data matrix, either by Singular Value Decomposition or by diagonalization of the positive semidefinite covariance matrix. Among other methods developed for estimating the PCs, there is a probabilistic framework discussed in [8, 9], which we adopt for the computation of our weighted algorithm via the Expectation Maximization (EM) procedure for computing p principal components.

Weighted Probabilistic PCA. Consider the latent variable model for the data

\[ x = Cz + v; \qquad z \sim N(0, I), \quad v \sim N(0, R) \tag{1} \]

The latent variables z are assumed to be independent and identically distributed according to a unit-variance spherical Gaussian. Notice the difference between the PPCA model and PCA, where the variance of the latent variables can be associated with the diagonal elements of the eigenvalue matrix (the diagonal matrix of the singular value decomposition of the covariance matrix). The model also considers a general perturbation matrix R, but [8] restricts it to a scaled identity matrix (isotropic noise). Now we modify the formulation of the model to introduce weights on the measured variables and thus perform the weighted rotation. Let D be a diagonal matrix containing the weight of the i-th variable in the element $d_{ii}$. If we take the new observed variable to be $d_{ii} x_i$, the observed vector becomes Dx and the model is now defined as

\[ Dx = Cz + v \tag{2} \]
z and v are still distributed as in (1). From equation (2), an EM algorithm can be derived that estimates the unknown state (latent variable) in the e-step and, in the m-step, maximizes the expected joint likelihood of the estimated z and the observed data Dx by choosing C and R under the model assumptions. For the modified model (weighted variables), we have the two variants of EM:

– Noise-free model
  e-step: $Z^T = (C^T C)^{-1} C^T D X^T$
  m-step: $C = D X^T Z (Z^T Z)^{-1}$
– Noisy model (SPCA)
  e-step: $\beta = C^T (C C^T + \epsilon I)^{-1}; \quad \mu_z = \beta D X^T; \quad \Sigma_z = nI - n\beta C + \mu_z \mu_z^T$
  m-step: $C = D X^T \mu_z^T \Sigma_z^{-1}; \quad \epsilon = \mathrm{trace}(D X^T X D - C \mu_z X D) / (nc)$

Here $\epsilon$ denotes the variance of the isotropic noise, that is, $R = \epsilon I$.
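To make the noise-free iteration concrete, the following Python sketch runs the e-step and m-step for the special case of a single principal direction (p = 1), where the matrix inverses reduce to scalar divisions. The function name, the starting vector, the fixed iteration count, and the toy data are assumptions made for this illustration; the rows of `X` are centred observations and `d` holds the diagonal of D.

```python
import math

def weighted_em_pca_1d(X, d, iters=200):
    """Noise-free EM for one principal direction of the weighted data X D.

    X: list of n centred observations, each a list of c values.
    d: list of c non-negative weights (the diagonal of D).
    """
    Y = [[di * xi for di, xi in zip(d, row)] for row in X]    # rows of X D
    c_dim = len(d)
    # Arbitrary start; it must not be orthogonal to the leading direction.
    C = [1.0 / math.sqrt(c_dim)] * c_dim
    for _ in range(iters):
        cc = sum(v * v for v in C)
        # e-step: z = (C^T C)^{-1} C^T D x for every observation
        Z = [sum(ci * yi for ci, yi in zip(C, row)) / cc for row in Y]
        zz = sum(z * z for z in Z)
        # m-step: C = D X^T Z (Z^T Z)^{-1}
        C = [sum(row[j] * z for row, z in zip(Y, Z)) / zz
             for j in range(c_dim)]
    norm = math.sqrt(sum(v * v for v in C))
    return [v / norm for v in C]

# Toy data: most of the variance lies along the first variable.
X = [[2.0, 0.1], [-2.0, -0.1], [1.0, -0.1], [-1.0, 0.1]]
u_equal = weighted_em_pca_1d(X, [1.0, 1.0])    # follows the first variable
u_masked = weighted_em_pca_1d(X, [0.0, 1.0])   # first variable weighted out
```

Setting a weight to zero removes the corresponding variable from the estimated direction, which is the mechanism the weighting scheme relies on for implicit variable selection.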
Up to this point, we have only defined how the PCA rotation can be estimated. Before moving on to weight updating, we briefly review some aspects of Regularized Discriminant Analysis.

2.2 Regularized Discriminant Analysis
RDA was proposed in [10] for small-sample, high-dimensional data sets to overcome the degradation of the discriminant rule. In our particular case, we refer to regularized linear discriminant analysis. The aim of this technique is to find a linear projection of the space in which the scatter between classes is maximized and the within-class scatter is minimal. One way of finding such a projection is to maximize the ratio between the projected between-class matrix $\Sigma_B$ and within-class matrix $\Sigma_W$ [5]. Conditional extremes can be obtained with Lagrange multipliers; the solutions are the k − 1 leading generalized eigenvectors W of $\Sigma_B$ and $\Sigma_W$, that is, the leading eigenvectors of $\Sigma_W^{-1} \Sigma_B$. The need for regularization arises from small samples, where $\Sigma_W$ cannot be directly inverted. The solution is then rewritten as

\[ (\Sigma_W + \delta I)^{-1} \Sigma_B W = W \Lambda. \]

After weighting the data, that is, XD, we can recast the J quotient as

\[ J_D = \frac{|W^T D \Sigma_B D W|}{|W^T D \Sigma_W D W|}. \]

2.3 Variable Weighting and Relevance Criteria
From the previous sections, we now know the desired weighted linear transformation Φ that we want to apply. Data will be projected onto a subspace of fixed dimension. This dimensionality is chosen depending on the rotation criterion; for instance, for a two-class problem tested with WRDA, the fixed dimension should be 1, as demonstrated for LDA. In order to assess the relevance of a given weighted projection of fixed dimension, we introduce a separability measure. The search will end in some local maximum of the target function. The parameter to be optimized is the weight matrix D, and the selected criterion is the ratio of the traces of the aforementioned between-class and within-class matrices; this criterion is known as J4 [5]. For weighted-projected data this measure is given by

\[ J_4(D, \Phi) = \frac{\mathrm{trace}(\Phi^T D \Sigma_B D \Phi)}{\mathrm{trace}(\Phi^T D \Sigma_W D \Phi)} \tag{3} \]

where Φ can be replaced by either the PCA rotation or the RDA rotation W. The size of Φ is (c × f), where f denotes the fixed dimension, that is, the number of projection vectors φ: $\Phi = (\phi_1 \; \phi_2 \; \cdots \; \phi_f)$. In order to apply matrix derivatives easily, we rewrite D in terms of its diagonal entries and represent it as a column vector d. For this purpose, equation (3) can be rewritten in terms of Hadamard products as follows:

\[ J_4(d) = \frac{d^T \left( \Sigma_B \circ \sum_{i=1}^{f} \phi_i \phi_i^T \right) d}{d^T \left( \Sigma_W \circ \sum_{i=1}^{f} \phi_i \phi_i^T \right) d} \tag{4} \]

This target function is quite similar in nature to the one obtained for LDA¹. Therefore, the solution for d with constrained L2 norm is given by the leading eigenvector of

\[ \left( \Sigma_W \circ \sum_{i=1}^{f} \phi_i \phi_i^T + \delta I \right)^{-1} \left( \Sigma_B \circ \sum_{i=1}^{f} \phi_i \phi_i^T \right). \]
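The equivalence between the trace form of equation (3) and the Hadamard form of equation (4) can be checked numerically. The Python sketch below evaluates both; the function names and all matrices are toy values chosen for illustration, and each projection vector is stored as one list (one column of Φ).

```python
def quad_form(v, S):
    """Return v^T S v for a vector v and a square matrix S."""
    n = len(v)
    return sum(v[i] * S[i][j] * v[j] for i in range(n) for j in range(n))

def j4_trace(d, Phi, S_B, S_W):
    """Eq. (3): trace(Phi^T D S_B D Phi) / trace(Phi^T D S_W D Phi)."""
    weighted = [[di * pi for di, pi in zip(d, phi)] for phi in Phi]  # D phi_i
    num = sum(quad_form(v, S_B) for v in weighted)
    den = sum(quad_form(v, S_W) for v in weighted)
    return num / den

def hadamard_with_projector(S, Phi):
    """Return S o sum_i(phi_i phi_i^T), where o is the Hadamard product."""
    c = len(S)
    P = [[sum(phi[i] * phi[j] for phi in Phi) for j in range(c)]
         for i in range(c)]
    return [[S[i][j] * P[i][j] for j in range(c)] for i in range(c)]

def j4_hadamard(d, Phi, S_B, S_W):
    """Eq. (4): ratio of two quadratic forms in the weight vector d."""
    return (quad_form(d, hadamard_with_projector(S_B, Phi))
            / quad_form(d, hadamard_with_projector(S_W, Phi)))

# Toy case: only the first variable separates the classes.
S_B = [[4.0, 0.0], [0.0, 0.0]]
S_W = [[1.0, 0.0], [0.0, 1.0]]
Phi = [[0.6, 0.8]]                                   # one projection vector
j_all = j4_trace([1.0, 1.0], Phi, S_B, S_W)          # both variables kept
j_relevant = j4_trace([1.0, 0.0], Phi, S_B, S_W)     # noise variable removed

# A less structured case used only to compare the two formulations.
S_B3 = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 1.0]]
S_W3 = [[3.0, 0.5, 0.2], [0.5, 2.0, 0.1], [0.2, 0.1, 1.0]]
Phi3 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
d3 = [0.5, 2.0, 1.5]
gap = abs(j4_trace(d3, Phi3, S_B3, S_W3) - j4_hadamard(d3, Phi3, S_B3, S_W3))
```

In the toy case, setting the weight of the non-discriminative variable to zero raises J4, which is the behaviour the weight update is meant to exploit.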
Notice that this type of description assumes the elements of Φ to be static, although derivatives with respect to d do exist. To overcome this issue, we interleave the computation of d and Φ until both converge [11, 12].

¹ Regularization can be thought of as Tikhonov filtering of singular solutions of (4) or as the SNR from sampling populations with equal means.

WPCA Algorithm. We take advantage of the iterative nature and convergence of the EM estimation of parameters in probabilistic frameworks. The procedure is as follows:

1. Normalize each feature vector to have zero mean and $\|\cdot\|_2 = 1$.
2. Start with some initial set of orthonormal vectors $U^{(0)}$.
3. Compute $d^{(r)}$ from the solution given in Section 2.3, and reweight the data.
4. Compute the e- and m-steps of the desired latent variable model (Noise-free or Sensible). The normalized columns of the estimated C, that is, with $\|\cdot\|_2 = 1$, become the columns of $U^{(r)}$.
5. Compare $U^{(r)}$ and $U^{(r-1)}$ for some ε and return to step 3 if necessary².
6. Orthogonalize the obtained subspace as follows: $\mathrm{svd}(U^T D X^T X D U) = A S A^T$; $U_{end} = A^T U$.

WRDA Algorithm. In this case we do not worry about missing the derivatives of the rotation with respect to the data weights, since both functions (weight and rotation calculation) have similar directions. The procedure is as follows:

1. Set the dimension to k − 1, where k is the number of classes.
2. Normalize each feature vector to have zero mean and $\|\cdot\|_2 = 1$.
3. Start with some initial set of orthonormal vectors $W^{(0)}$.
4. Compute $d^{(r)}$ from the solution given in Section 2.3, and reweight the data.
5. Compute $W^{(r)}$ from the solution given in Section 2.2.
6. Compare $W^{(r)}$ and $W^{(r-1)}$ for some ε and return to step 4 if necessary (as in WPCA).
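The rotation step of the WRDA procedure relies on the regularized solution of Section 2.2. For two classes the between-class matrix has rank one, so the leading eigenvector of $(\Sigma_W + \delta I)^{-1} \Sigma_B$ is proportional to $(\Sigma_W + \delta I)^{-1}(m_a - m_b)$, where $m_a$ and $m_b$ are the class means. The Python sketch below uses this closed form; the function names, the small linear solver, and the toy data are assumptions for illustration only.

```python
import math

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter(rows, mean):
    """Sum of outer products of the deviations from the class mean."""
    c = len(mean)
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
             for j in range(c)] for i in range(c)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rda_direction(class_a, class_b, delta):
    """Unit vector proportional to (S_W + delta I)^{-1} (m_a - m_b)."""
    m_a, m_b = mean_vector(class_a), mean_vector(class_b)
    S_a, S_b = scatter(class_a, m_a), scatter(class_b, m_b)
    c = len(m_a)
    reg = [[S_a[i][j] + S_b[i][j] + (delta if i == j else 0.0)
            for j in range(c)] for i in range(c)]
    w = solve(reg, [a - b for a, b in zip(m_a, m_b)])
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]

# Toy data: the classes differ only in the first variable, and the
# within-class scatter is singular (no spread along the first variable).
class_a = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
class_b = [[3.0, 0.0], [3.0, 1.0], [3.0, 2.0]]
w = rda_direction(class_a, class_b, delta=0.1)
```

With `delta = 0` the within-class scatter of this toy set cannot be inverted, which mirrors the small-sample situation that motivates the regularization term.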
In the following section we test our algorithms in toy and real data problems.
3 Experimental Setup
Tests are carried out on three types of data: artificially generated classes (toy data) and two binary classification tasks on real problems, namely medical image segmentation and digit recognition.

3.1 Toy Data
For comparison purposes, we generate artificial observation vectors in the same way as described for the linear problem in [4]. Figure 1 shows the resulting weight vector for Noise-free Weighted PCA (3 principal directions); the remaining irrelevant features exhibited the same behavior (zero values). The space reduced by WPCA is also presented along with that of classic PPCA. Figure 1(c) shows that the classes are well separated compared with the probabilistic principal components depicted in Figure 1(b). We also test the accuracy of the algorithms with respect to the number of training observations. A fixed set of 1000 observations was kept for validation while training sets of different sizes were generated. Figure 2 shows the classification accuracy for different training sizes, computed as the mean over 20 training sets of the same size. Classification results come from linear classifiers; 2(a) assumes equal Gaussian distributions for each class and 2(b) uses a linear SVM.

² We make use of the sum of absolute values of the diagonal elements of $(U^{(r)})^T U^{(r-1)}$, which is compared to the value obtained for $(U^{(r-1)})^T U^{(r-2)}$.
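The stopping rule used when comparing consecutive bases in the WPCA algorithm can be sketched as follows. The sketch assumes that each basis is stored as a list of column vectors and that convergence is declared when the alignment value changes by less than a tolerance; the exact comparison and the tolerance are not specified in the text, so both are assumptions, as are the function names.

```python
def alignment(U_new, U_old):
    """Sum of |diag(U_new^T U_old)| for bases given as lists of columns."""
    return sum(abs(sum(a * b for a, b in zip(u_new, u_old)))
               for u_new, u_old in zip(U_new, U_old))

def has_converged(U_r, U_r1, U_r2, eps=1e-6):
    """Compare the alignment of (r, r-1) with that of (r-1, r-2)."""
    return abs(alignment(U_r, U_r1) - alignment(U_r1, U_r2)) < eps

identity = [[1.0, 0.0], [0.0, 1.0]]
flipped = [[-1.0, 0.0], [0.0, -1.0]]     # same subspace, opposite signs
rotated = [[0.6, 0.8], [-0.8, 0.6]]      # a clearly different basis

same = alignment(identity, identity)
sign_invariant = alignment(identity, flipped)
still_moving = has_converged(identity, identity, rotated)
```

Taking absolute values makes the measure insensitive to sign flips of individual columns, which leave the spanned subspace unchanged.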
Fig. 1. Weight vector for WPCA. Spanned spaces for PPCA and WPCA. The size of the training set is 100 observations per class. Panels: (a) weight vector for WPCA Noise-free (absolute weight versus variable index); (b) Principal Components (reduced three-dimensional space with PPCA); (c) Weighted Principal Components Noise-free (reduced three-dimensional space with WPCA).
Fig. 2. Comparison of projection methods feeding linear classifiers. Panels: (a) linear, pooled covariance matrix; (b) linear SVM. Axes: validation error versus number of training observations per class. Curves: WRDA, RDA, First PC, WPCA with 3 Components, First 3 PPCs.
3.2 Nailfold Capillaroscopic (NC) Images
Information about connective tissue diseases can be obtained from capillaroscopic images (see Figure 3(a)). In [13], machine learning approaches were successfully applied to medical image segmentation, which motivates adapting the proposed method to a similar problem. In the present case, each pixel is represented by a 24-dimensional vector obtained from standard color space transformations: RGB, HSV, YIQ, YCbCr, LAB, XYZ, UVL, CMYK. The underlying idea is the contrast enhancement of capillary images. We tested the algorithm on 20 images. After applying the weighting algorithms, we obtained a set of relevant color transformations. For WRDA: the Q plane from the YIQ space, Cr from YCbCr, and A from the LAB transformation. For WPCA with one principal component: the Q plane from the YIQ space and A from the LAB transformation (see Figure 4). We make use of ROC curves to assess the quality of the transformation. Scores are obtained from the likelihoods of discriminant functions of Gaussian distributions with equal priors. Figure 5(a) shows the plots of the 4 ROC curves. The areas under the curves are 0.9557, 0.9554, 0.9549, and 0.7295 for WRDA, WPCA, RDA, and PCA, respectively. Notice that there is no significant change when working with only 2 or 3 color planes of the original 24, while the rotation offered by PCA does not perform as well as the other rotations. Even though segmentation performance can be increased if more involved methods are brought into consideration, the results obtained can be used as a preprocessing stage for subsequent improvements.

Fig. 3. Sample image from the Nailfold Capillaroscopic data: (a) acquired image; (b) desired segmentation

Fig. 4. Enhanced contrast plane by means of different linear transformations

Fig. 5. Receiver Operating Characteristic curves (true positive rate versus false positive rate) for (a) NC images and (b) NIST-38 digit detection

3.3 NIST-38 Digit Recognition
The MNIST digit database is a well-known benchmark for testing algorithms. We want to observe the feasibility of applying our approach to dissimilarity representations. Prototype selection can be seen as a variable selection process in dissimilarity spaces [14]. We construct our training set by randomly picking 900 examples per class. These samples are split into an observation set (600 images) and a prototype set (the remaining 300 images); the same rule applies to the validation sets. The dissimilarity matrices are (1200×600) and contain Euclidean distances. ROC curves are displayed for WRDA, WPCA (1 WPC), RDA, and PCA (1 PC) (see Figure 5(b)); their respective areas under the curve are 0.995547, 0.991719, 0.998769, and 0.509725. Even though RDA presents a slight increase in performance, the number of prototypes it needs is 600, compared to the 30 to 40 prototypes selected by WPCA and WRDA.
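Areas under ROC curves such as those reported above can be computed directly from classifier scores through the rank (Mann-Whitney) statistic, which equals the area under the empirical ROC curve. The Python sketch below illustrates the computation; the function name is illustrative and the scores are made-up toy values, not data from the experiments.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The result equals the area
    under the empirical ROC curve.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy scores: higher values should indicate the positive class.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
area = auc(positives, negatives)          # 8 of the 9 pairs are ordered correctly
chance = auc([0.5, 0.5], [0.5, 0.5])      # uninformative scores give 0.5
```

An area close to 0.5, such as the value reported for PCA with one component on the digit task, therefore indicates scores that barely separate the two classes.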
4 Conclusions
We have presented a feature selection criterion based on weighting variables followed by a projection onto a subspace of fixed dimension, which is a suitable method for evaluating the relevance of a given set. Results have shown that reduced dimensionality does not affect the overall performance of the inference system (linear cases such as capillaroscopic images). WRDA may perform better for classification than WPCA and can be catalogued as a wrapper method, since its projections are class dependent (supervised projection). Derivations from the probabilistic point of view are worthy of further investigation, and nonlinear extensions are currently under study. Results obtained for artificial data are encouraging, since feature selection performed very well on small training samples (see Figure 2).
References
1. Skočaj, D., Leonardis, A.: Weighted incremental subspace learning. In: Workshop on Cognitive Vision, Zurich (September 2002)
2. Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer Series in Statistics. Springer, Heidelberg (2002)
3. Jebara, T., Jaakkola, T.: Feature selection and dualities in maximum entropy discrimination. In: 16th Conference on Uncertainty in Artificial Intelligence (2000)
4. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature selection for SVMs. In: NIPS (2001)
5. Webb, A.R.: Statistical Pattern Recognition, 2nd edn. John Wiley and Sons, West Sussex, England (2002)
6. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. AI 97(1-2) (1997)
7. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: Computer Vision and Pattern Recognition (1991)
8. Tipping, M., Bishop, C.: Probabilistic principal component analysis. J. R. Statistics Society 61, 611–622 (1999)
9. Roweis, S.: EM algorithms for PCA and SPCA. Neural Information Processing Systems 10, 626–632 (1997)
10. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84, 165–175 (1989)
11. Wolf, L., Shashua, A.: Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weighted-based approach. Journal of Machine Learning Research (2005)
12. Direct feature selection with implicit inference. In: ICCV (2003)
13. Li, S., Fevens, T., Krzyzak, A., Li, S.: Automatic clinical image segmentation using pathological modelling, PCA and SVM. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, Springer, Heidelberg (2005)
14. Pekalska, E., Duin, R.P.W., Paclík, P.: Prototype selection for dissimilarity-based classifiers. Elsevier, Pattern Recognition 39 (2006)
Analysis of Prediction Mode Decision in Spatial Enhancement Layers in H.264/AVC SVC

Koen De Wolf, Davy De Schrijver, Wesley De Neve, Saar De Zutter, Peter Lambert, and Rik Van de Walle

Ghent University – IBBT, Department of Electronics and Information Systems – Multimedia Lab, Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium
{koen.dewolf, davy.deschrijver, wesley.deneve, saar.dezutter, peter.lambert, rik.vandewalle}@ugent.be
http://multimedialab.elis.ugent.be/
Abstract. On top of the prediction modes defined in the H.264/AVC standard, Scalable Video Coding defines prediction modes for inter-layer prediction. These inter-layer prediction modes allow the re-use of coded data from the base layer, at the cost of a larger search space at the encoder and, as a result, a longer encoding time. In this paper, we investigate the relation between the coding decisions taken in the base layer and in the enhancement layer. Our tests have shown that a number of relations can be clearly identified. We have observed that the co-located macroblock of a base layer macroblock coded in P_8x8 mode has a 40 % chance of being coded in P_8x8 mode as well. Further, we have observed that the P_Skip mode is only used when the quantization parameter in the enhancement layer is high. For macroblocks coded in B_Skip mode, the co-located macroblock in the enhancement layer will be coded in B_Skip mode when it is highly quantized (probability of 63 % to 92 % for quantization parameter 30). These observations can be used to construct a model for fast mode decision in SVC.
1 Introduction
H.264/AVC Scalable Video Coding (SVC), as proposed by the Joint Video Team (JVT), can be classified as a layered video specification based on the single-layer H.264/AVC standard [1], [2]. Additional Enhancement Layers (ELs) contain information pertaining to the embedded spatial and SNR enhancements. Single-layer prediction techniques similar to those in H.264/AVC are applied, in particular intra and motion-compensated prediction. However, additional inter-layer prediction mechanisms have been developed to minimize redundant information between different layers. In the case of Rate-Distortion Optimized (RDO) coding, the encoder complexity is significantly increased due to the large search space constructed by the numerous prediction modes incorporated in SVC. These prediction modes can be divided into two groups. The first group contains the prediction modes that originate from the single-layer H.264/AVC specification, i.e., the various intra and motion-compensated prediction modes. The second group contains the inter-layer prediction modes. In this paper, we analyze and discuss the mode decisions of an SVC encoder. This analysis can be used to construct a model for fast mode decision. The use of such a model in an SVC encoder can significantly reduce the encoding time at the cost of a lower coding efficiency (i.e., possibly a higher bit rate together with lower visual quality). Such an analysis was already performed by He Li et al. [3] with version 2 of the reference software (dated 2005). However, in the meantime several new coding tools have been added to the SVC specification. Furthermore, in our analysis, we also take the Quantization Parameters (QPs) into account. This paper is organized as follows. First, we provide a detailed outline of the prediction modes that are available in SVC, used in the context of spatial scalability. In the third section, we analyze and discuss the relation between the coding decisions in the base layer and the enhancement layer. Finally, conclusions are provided in Sect. 4.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 848–855, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 SVC Prediction Modes
As mentioned in the introduction, SVC is a layered extension of the H.264/AVC video coding specification. The structure of a possible SVC encoder is shown in Fig. 1. In this figure, the original input video sequence is down-scaled in order to obtain the pictures for all the different spatial layers (resulting in spatial scalability). In each spatial layer, a motion-compensated pyramidal decomposition is performed, taking into account the characteristics of each layer (i.e., GOP structure, bit rate, ...). This temporal decomposition results in a motion vector field on the one hand and residual texture data on the other hand. This information is coded using techniques similar to those in H.264/AVC, extended with progressive SNR refinement features. Several methods that allow the reuse of coded information among different spatial resolution layers are under investigation by the JVT. In particular, the layered structure of SVC allows the reuse of motion vectors, residual data, and intra-texture information of lower spatial and SNR layers for the prediction of higher-layer pictures in order to reduce inter-layer redundancy. In the next section, these inter-layer prediction methods are discussed in more detail.

2.1 Intra-layer Prediction
Depending on the slice type, a macroblock (MB) in a slice can be coded using one of the following three prediction modes: intra prediction (I-macroblock), uni-directional prediction (P-macroblock), or bi-directional prediction (B-macroblock). Whereas the first prediction mode only uses the values of spatially neighboring samples of the block (intra-slice prediction), the latter two modes rely on Motion-Compensated (MC) prediction signals based on samples originating from preceding or subsequent pictures. The prediction error is obtained by subtracting the prediction signals from the original signal. Usually, this prediction error (or residual) is then transform-coded. In a next step, these transform coefficients are quantized and entropy coded.
[Figure 1: block diagram of a two-layer SVC encoder. The base layer encoder and the enhancement layer encoder each contain ME, MC, intra prediction, T/Q, T⁻¹/Q⁻¹, and reorder & entropy encoding stages; their outputs are multiplexed into the SVC bit stream. Dashed arrows A, B (MVR), C, and D mark the inter-layer prediction paths.]
Fig. 1. Possible SVC encoder structure that supports two spatial layers
Intra-Slice Prediction Modes. Three types of intra prediction are defined in the H.264/AVC specification: I_4x4, I_16x16, and I_PCM. In the first two types, the prediction signal (4x4 and 16x16 samples, respectively) is constructed by copying or interpolating the values of previously coded neighboring samples. Nine modes are defined for the I_4x4 type and four for the I_16x16 type. These modes stipulate the direction of the interpolation or the direction of the copy of the signal¹. The I_PCM mode (Intra-slice Pulse Code Modulation) allows an encoder to bypass the prediction and transform coding processes, directly sending the values of the samples to the entropy encoder.
¹ Except for mode 2, in which the prediction signal is constructed by taking the average of adjacent pixels.
Motion-Compensated Prediction Modes. The concepts of MC prediction, as defined in H.264/AVC, are used in the SVC ELs as well. Hence, variable block-size motion compensation, quarter-sample motion vector accuracy, multiple reference pictures, and weighted prediction are supported by SVC. For an introduction to these tools, we refer to [2]. In case of uni-directional coding, only one MC prediction block is used. The translation of this block, commonly referred to as the motion vector, and the index of the picture in the reference list, i.e., list 0 (L0), used for this prediction need to be coded. This reference list may contain pictures before and after the current picture in display order. H.264/AVC allows variable block sizes, ranging from 16x16, 16x8, 8x16 to 8x8 (P_L0_16x16, P_L0_L0_16x8, P_L0_L0_8x16, and P_8x8 respectively). An 8x8 partition can be further divided into 8x4, 4x8, and 4x4 sub-partitions. This means that for each of these partitions both a motion vector and a picture reference index need to be coded, with the exception that all sub-partitions of an 8x8 partition must use the same reference picture. An additional uni-directional prediction type is called P_Skip. For this type, no motion vector, reference index, or residual is coded. The prediction signal (16x16 samples) is constructed by using the picture with index 0 in the picture reference list (L0). The motion vector is equal to the motion vector predictor² of that MB. In case of bi-directional coding, at most two motion-compensated prediction signals can be used. These signals originate from pictures that are stored in two separate reference lists: list 0 (L0) and list 1 (L1). For B-macroblocks, four modes of MC prediction are defined in H.264/AVC: list 0, list 1, bi-predictive, and direct. In case of mode list 0 and list 1, only one motion vector per partition is transmitted, together with the index of the picture in the appropriate reference list. For the bi-predictive mode, a weighted average of the MC prediction blocks is used to obtain the prediction signal. In the direct mode, the coding mode is inferred from the neighboring prediction modes and can be list 0, list 1, or bi-predictive. When an MB is coded in direct mode and no prediction error signal is coded, this mode is referred to as B_Skip mode. Together with the possible MB partitions, this results in 24 prediction modes.
2.2
Inter-layer Prediction
On top of the MC prediction and intra prediction modes, as defined in H.264/AVC, four prediction methods are defined that re-use coding information from a Base Layer (BL). This is called Inter-Layer Prediction (ILP). The BL for a particular EL is the layer that can be used for ILP. This BL does not need to be the layer with the lowest quality nor the lowest spatial resolution. In the first ILP method, BL motion information is re-used for efficient coding of the EL motion data. In this mode, commonly referred to as base layer mode, no additional motion vectors are transmitted for the EL. When the base layer is a down-scaled version of the current layer, the motion vectors and the MB partitioning mode are up-sampled accordingly. This is illustrated in Fig. 1 by the dashed arrow A. Also, for the current MB, the same reference indices as for the corresponding 8x8 partitions of the base layer are used [4].
² The motion vector predictor is the average of the motion vectors of the upper and left MB.
The second ILP prediction mode is called quarter pel refinement mode and is an extension to the base layer mode. The motion vectors, the partitioning mode, and the reference indices of the current MB are derived from the corresponding sub-macroblock in the base layer. Additionally, a quarter pel Motion Vector Refinement (MVR) is transmitted for that particular block (dashed arrow B in Fig. 1). This allows the MC prediction to be improved.
The third ILP method is called residual prediction mode. Here, residual information of MC coded MBs from the base layer can be used for the prediction of the residual of the current layer. A flag, indicating whether residual prediction is used, is transmitted for each MB of the current EL. When residual prediction is applied, the base layer residuals of the corresponding MBs are block-wise up-sampled using a bi-linear filter with constant border extension. In this way, only the difference between the residual of the current layer, obtained after motion compensation (MC), and the up-sampled residual of the base layer is coded (dashed arrow C in Fig. 1).
The fourth ILP mode is inter-layer intra prediction (dashed arrow D in Fig. 1). In this mode, an intra-predicted MB of a slice in the base layer can be used for the prediction of the co-located block in the current EL. To this end, the up-sampled version of the decoded block is used. In SVC, the 6-tap filter used for half-sample interpolation, as defined in the H.264/AVC specification, is used for the interpolation of these decoded samples. The drawback of this approach is that a decoder needs to decode all the referred intra-coded MBs in the base layer.
Note that residual prediction mode can be combined with the base layer mode, quarter pel refinement mode, and the inter-layer intra prediction mode. The combination of base layer mode and residual prediction mode is also called BL_Skip mode. A detailed analysis of the coding performance and time-complexity of the ILP modes is given in [5].
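The motion-related ILP modes above can be illustrated with a minimal sketch. It assumes dyadic spatial scalability (QCIF to CIF, so a scale factor of 2), motion vectors stored in quarter-sample units, and a refinement restricted to -1, 0, or +1 per component; that refinement range and all names below are assumptions made for illustration, not taken from the SVC draft.

```python
from typing import NamedTuple


class MotionVector(NamedTuple):
    x: int  # horizontal component, in quarter-sample units (assumed)
    y: int  # vertical component, in quarter-sample units (assumed)


def upsample_mv(bl_mv: MotionVector, scale: int = 2) -> MotionVector:
    """Base layer mode: scale a BL motion vector to the EL resolution."""
    return MotionVector(bl_mv.x * scale, bl_mv.y * scale)


def refine_mv(predicted: MotionVector, refinement: MotionVector) -> MotionVector:
    """Quarter pel refinement mode: add the transmitted refinement (MVR)."""
    # Assumption: each refinement component lies in {-1, 0, +1}.
    if not (-1 <= refinement.x <= 1 and -1 <= refinement.y <= 1):
        raise ValueError("refinement components are assumed to lie in [-1, 1]")
    return MotionVector(predicted.x + refinement.x, predicted.y + refinement.y)


bl_mv = MotionVector(x=6, y=-3)                  # 1.5 and -0.75 samples in the base layer
el_pred = upsample_mv(bl_mv)                     # no motion data transmitted for the EL
el_mv = refine_mv(el_pred, MotionVector(1, 0))   # only the small refinement is transmitted
```

In base layer mode the EL reuses `el_pred` directly; in quarter pel refinement mode only the small correction is coded on top of it.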
3
Prediction Mode Decision
The prediction mode decision is taken by minimizing the Lagrangian cost functional J = D + λR over all possible MB coding modes. Here, D represents the distortion between the original and the reconstructed signal, R is the bit rate needed for coding the motion vectors and residual data, and λ is the Lagrangian multiplier, which depends on the chosen QP setting [6]. The value of this Lagrangian parameter is layer-specific. The minimization of J is a very time-consuming operation, as this cost function needs to be evaluated for all possible modes, motion vectors (within the search window), and reference pictures. Fast Mode Decision and Fast Motion Estimation algorithms have been developed in order to reduce the complexity of this decision-making process [7], [8], [9]. A side effect of such an algorithm is the introduction of a coding efficiency penalty, as the mode decision will be sub-optimal.
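The exhaustive decision rule can be sketched in a few lines. The distortion and rate values below are hypothetical numbers chosen only to illustrate how λ shifts the decision; they are not measurements from the reference software, and the function names are illustrative.

```python
from typing import Dict, Tuple


def lagrangian_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """J = D + lambda * R for one candidate mode."""
    return distortion + lam * rate_bits


def select_mode(candidates: Dict[str, Tuple[float, float]], lam: float) -> Tuple[str, float]:
    """Evaluate every candidate (exhaustive search) and return the cheapest one."""
    best_mode, best_cost = "", float("inf")
    for mode, (distortion, rate_bits) in candidates.items():
        cost = lagrangian_cost(distortion, rate_bits, lam)
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost


# Hypothetical (distortion, rate in bits) pairs for a single MB.
candidates = {
    "BL_Skip": (900.0, 4.0),
    "P_L0_16x16": (700.0, 30.0),
    "P_8x8": (400.0, 95.0),
    "I_4x4": (350.0, 160.0),
}

low_lambda_choice = select_mode(candidates, lam=1.0)    # rate is cheap: detailed mode wins
high_lambda_choice = select_mode(candidates, lam=10.0)  # rate is expensive: skip mode wins
```

With these illustrative numbers a small λ selects P_8x8 and a large λ selects BL_Skip, which shows why the multiplier, and hence the QP, influences the mode distribution.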
[Figure 2: four bar charts for the Foreman sequence, one per base layer QP. Horizontal axis: Prediction Modes (BL_Skip, P_Skip, P_L0_16x16, P_L0_L0_16x8, P_L0_L0_8x16, P_8x8, I_4x4). Vertical axis: Selection Probability (%), 0 to 50. Legend: (QP_BL, QP_EL) pairs with QP_EL = 12, 18, 24, 30.]
Fig. 2. Prediction mode decision in spatial EL for the Foreman sequence when the co-located MB in BL is coded in P_8x8 mode. Base layer QP = 12 (top-left), 18 (top-right), 24 (bottom-left), and 30 (bottom-right).
3.1
Test Configuration
The statistical analysis of the used prediction modes is performed on five sequences: Crew, Foreman, Mobile & Calendar, Mother & Daughter, and Stefan. Two spatial layers are used: QCIF and CIF, both at 30 Hz. The GOP size is set to 16; intra-coded slices are inserted every 32 frames. Full-search ME and all ILP modes are enabled. We coded all sequences with version 8 of the reference software [10] using 16 different QP combinations {(QP_BL, QP_EL) | QP_BL, QP_EL ∈ {12, 18, 24, 30}}.
3.2
Results and Discussion
Due to space limitations, we are unable to publish all results in detail. Therefore, we discuss two of the most frequently selected MB prediction modes in the BL, i.e., P_8x8 mode for P-pictures and B_Skip mode for B-pictures.
Co-located MB is Coded in P_8x8 Mode. This mode is used in the BL for about 30 % of the MBs in P-pictures. We can see from the graphs in Fig. 2 that approximately 40 % of the co-located MBs in the enhancement layer are also coded in P_8x8 mode, except when the EL is highly quantized. In that case, BL_Skip mode is selected most (24-40 %). When the QP of the EL is low, the I_4x4 mode is selected second most (22-25 %). For the tested configurations, the P_Skip mode is only used for higher QPs in the EL.
[Figure 3: four bar charts for the Stefan sequence, one per base layer QP. Horizontal axis: Prediction Modes (BL_Skip, B_Skip, B_Direct, B_L0_16x16, B_L1_16x16, B_Bi_16x16, B_Bi_Bi_16x8, B_Bi_Bi_8x16, B_8x8). Vertical axis: Selection Probability (%), 0 to 100. Legend: (QP_BL, QP_EL) pairs with QP_EL = 12, 18, 24, 30.]
Fig. 3. Prediction mode decision in spatial EL for the Stefan sequence when the co-located MB in BL is coded in B_Skip. Base layer QP = 12 (top-left), 18 (top-right), 24 (bottom-left), and 30 (bottom-right).
Co-located MB is Coded in B_Skip Mode. In Fig. 3, the prediction mode decisions (expressed in %) in the spatial EL for the Stefan sequence are shown when the co-located MB in the base layer is coded using the B_Skip mode. Regardless of the QP of the BL, for high QPs in the EL, the MB will mostly be coded in B_Skip mode (with a selection probability ranging from 39 % to 86 % for QP_EL = 24 and from 63 % to 92 % for QP_EL = 30). For low QPs in the EL, the B_Skip mode is seldom used. Instead, the used prediction modes are more or less equally divided over BL_Skip, B_Direct, B_Bi_16x16, and B_8x8. Also, we observe that for less quantized MBs in the BL, the BL_Skip and B_Skip modes are selected more often in the EL. The same behaviour is observed for the other sequences.
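One way such statistics could feed a fast mode decision is sketched below: EL mode counts are collected per (BL mode, QP_BL, QP_EL) context and the encoder then tests only the most probable modes. This is a hypothetical illustration rather than the model of the paper; the function names, the `coverage` threshold, and the toy observations are all assumptions.

```python
from collections import Counter, defaultdict


def build_mode_statistics(observations):
    """Count EL modes per context; observations are (bl_mode, qp_bl, qp_el, el_mode)."""
    counts = defaultdict(Counter)
    for bl_mode, qp_bl, qp_el, el_mode in observations:
        counts[(bl_mode, qp_bl, qp_el)][el_mode] += 1
    return counts


def candidate_modes(counts, bl_mode, qp_bl, qp_el, coverage=0.9):
    """Return the most probable EL modes until `coverage` of the observed cases is reached."""
    histogram = counts.get((bl_mode, qp_bl, qp_el))
    if not histogram:
        return None  # no statistics for this context: fall back to exhaustive search
    total = sum(histogram.values())
    selected, covered = [], 0
    for mode, count in histogram.most_common():
        selected.append(mode)
        covered += count
        if covered / total >= coverage:
            break
    return selected


# Toy observations (invented for the example, not the measured distributions).
context = ("P_8x8", 24, 12)
observations = (
    [context + ("P_8x8",)] * 4
    + [context + ("I_4x4",)] * 3
    + [context + ("BL_Skip",)] * 2
    + [context + ("P_L0_16x16",)] * 1
)
stats = build_mode_statistics(observations)
shortlist = candidate_modes(stats, "P_8x8", 24, 12, coverage=0.85)
```

A lower `coverage` value shortens the candidate list and saves more encoding time, at the price of a larger coding efficiency penalty.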
4
Conclusion
SVC is an extension of H.264/AVC providing spatial, temporal, and SNR scalability with a high compression efficiency. This compression efficiency is achieved by relying on the available coding modes. Exhaustive search techniques are used to select the best coding mode for each MB. In this way, these techniques achieve the highest possible coding efficiency, but at the cost of a higher computational complexity. We have analyzed the relation between the coding mode decisions made in the BL and the EL. Our tests have shown that a number of relations can be clearly identified. We have observed that the EL MB co-located with a BL MB coded in P_8x8 mode has a 40 % chance of also being coded in P_8x8 mode. Moreover, we have noticed that the P_Skip mode is only used for high QPs in the EL. For BL MBs coded in B_Skip mode, the co-located MB in the EL will be coded in the B_Skip mode when the EL is highly quantized (63 % to 92 % for QP_EL = 30). The observations in this paper can be used to construct a model for fast mode decision. Such a model can be used to guide the mode decision algorithm of an SVC encoder, thereby reducing the overall encoding time.
Acknowledgements. The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.
References
1. ITU-T, ISO/IEC JTC 1: Advanced video coding for generic audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC (2003)
2. Wiegand, T., Sullivan, G., Bjøntegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13, 560–576 (2003)
3. Li, H., Li, Z., Wen, C., Chau, L.P.: Fast mode decision for spatial scalable video coding. In: IEEE International Symposium on Circuits and Systems (ISCAS) (2006)
4. Wiegand, T., Sullivan, G., Reichel, J., Schwarz, H., Wien, M. (eds.): Joint Scalable Video Model 8: Joint Draft 8 with proposed changes, Doc. JVT-U202. JVT (2006)
5. De Wolf, K., De Schrijver, D., De Zutter, S., Van de Walle, R.: Scalable Video Coding: Analysis and coding performance of inter-layer prediction. In: Proceedings of the 9th International Symposium on Signal Processing and its Applications, Dubai (U.A.E.), SuviSoft Oy Ltd, 4 (2007)
6. Sullivan, G., Baker, R.: Rate-distortion optimized motion compensation for video compression using fixed or variable size blocks. In: Proceedings of the IEEE Global Telecommunications Conference, Phoenix, AZ, vol. 3, pp. 85–90 (1991)
7. Pan, F., Lin, X., Rahardja, S., Lim, K., Li, Z., Wu, D., Wu, S.: Fast mode decision algorithm for intraprediction in H.264/AVC video coding. IEEE Transactions on Circuits and Systems for Video Technology 15, 813–822 (2005)
8. Dai, Q., Zhu, D., Ding, R.: Fast mode decision for inter prediction in H.264. In: Proceedings of International Conference on Image Processing (ICIP) (2004)
9. Lin, Z., Yu, H., Pan, F.: A scalable fast mode decision algorithm for H.264. In: IEEE International Symposium on Circuits and Systems (ISCAS) (2006)
10. Vieron, J., Wien, M., Schwarz, H. (eds.): Joint Scalable Video Model (JSVM) 8 software, Doc. JVT-Q203. JVT (2006)
Object Recognition by Implicit Invariants
Jan Flusser¹, Jaroslav Kautsky², and Filip Šroubek¹
¹ Institute of Information Theory and Automation, AS CR, Pod vodárenskou věží 4, 182 08, Prague 8, Czech Republic
² The Flinders University of South Australia, Adelaide, Australia
Abstract. The use of traditional moment invariants is limited to a certain set of simple geometric transforms, such as rotation, scaling and affine transform. This paper presents a novel concept of so-called implicit moment invariants, which enable us to recognize objects under a broader set of geometric deformations.
1
Introduction
Recognition of objects and patterns that are deformed in various ways has been a goal of much recent research. There are basically three major approaches to this problem: full search, image normalization, and invariant descriptors. The approach using invariant descriptors appears to be the most promising one and has been used extensively. Its basic idea is to describe the object by a set of features which are not sensitive to particular deformations and which provide enough discrimination power to distinguish among objects belonging to different classes. In 2D object recognition, various moment invariants have become classical and frequently used shape descriptors during the last forty years. Even if they suffer from some intrinsic limitations (the most important of which is their globalness, which prevents them from being used for recognition of occluded objects), they often serve as the "first-choice descriptors" and as a reference method for evaluating the performance of other shape descriptors. All moment invariants ever studied (see for instance [1,2,3,4]) are so-called explicit invariants. An explicit invariant is a functional (let us denote it as E) acting on the space of image functions which does not change its value if the image f undergoes a certain deformation τ from the set of admissible deformations, i.e., which satisfies the condition E(f) = E(τ(f)) for any image f. Many systems of explicit moment invariants have been described with respect to rotation, scaling, affine transform, contrast changes, and linear filtering. However, there are several classes of image deformations which occur frequently in practice but for which explicit moment invariants are not known or have even been proven not to exist. Typical examples are the projective transform, cylindrical and spherical projections, the quadratic transform, and other polynomial transforms of the image coordinates.
To overcome this, we propose in this paper a new concept of implicit invariants. An implicit invariant I is a functional defined on image pairs such that I(f, τ(f)) = 0 for any image f and deformation τ. According to this definition, explicit invariants are just particular cases of implicit invariants. Clearly, if an explicit invariant exists, we can set I(f, g) = E(f) − E(g). As we show later on in the paper, there are many types of image deformations where explicit moment invariants do not exist while implicit moment invariants do. In those cases, implicit invariants can be used as features for object recognition. Unlike explicit invariants, implicit invariants do not provide a description of a single image because they are always defined for a pair of images. For recognition purposes this is not a drawback. We consider I(f, g) to be a "distance measure" (even if it does not exhibit all properties of a metric) between f and g factorized by τ. For each database template $g_i$, we can calculate the value of $I(f, g_i)$ and then classify f according to the minimum.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 856–863, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
General Moments
Definition 1. Let $p_0, p_1, \dots, p_{n-1}, \dots$ be some basis functions defined on a bounded $D \subset \mathbb{R}^N$ and let $f$ be an image function having a finite integral. By a general moment of $f$ we understand the functional
$$\mu_j(f) = \int_D f(x)\,p_j(x)\,dx.$$
If $N = 1$ and $p_j(x) = x^j$, we speak about standard moments. Using matrix notation we can write
$$p(x) = \begin{pmatrix} p_0(x) \\ p_1(x) \\ \vdots \\ p_{n-1}(x) \end{pmatrix} \quad\text{and}\quad \mu(f) = \begin{pmatrix} \mu_0(f) \\ \mu_1(f) \\ \vdots \\ \mu_{n-1}(f) \end{pmatrix}. \qquad (1)$$
Let $r : D \to \tilde D$ be a transformation of the domain $D$ into $\tilde D$ and let $\tilde f : \tilde D \to \mathbb{R}$ be another image function which satisfies
$$\tilde f(r(x)) = f(x) \qquad (2)$$
for $x \in D$ and $\tilde f(\tilde x) = 0$ for $\tilde x \in \tilde D \setminus r(D)$. (This means that image $\tilde f$ is a spatially deformed version of $f$.) We are interested in the relation between the moments $\mu(f)$ and the moments
$$\tilde\mu(\tilde f) = \int_{\tilde D} \tilde f(\tilde x)\,\tilde p(\tilde x)\,d\tilde x = \int_{r(D)} \tilde f(\tilde x)\,\tilde p(\tilde x)\,d\tilde x$$
of the transformed function with respect to some other $\tilde n$ basis functions
$$\tilde p(\tilde x) = \left(\tilde p_0(\tilde x),\ \tilde p_1(\tilde x),\ \dots,\ \tilde p_{\tilde n - 1}(\tilde x)\right)^T$$
defined on $\tilde D$. We can now formulate the following Theorem.
Theorem 1. Denote by $J_r(x)$ the Jacobian of the transform function $r$. If
$$\tilde p(r(x))\,|J_r(x)| = A\,p(x) \qquad (3)$$
for some $\tilde n \times n$ matrix $A$, then
$$\tilde\mu = A\mu. \qquad (4)$$
The power of this theorem depends on our ability to choose the basis functions so that we can, for a given transform $r$, express the left-hand side of (3) in terms of the basis functions $p$ and thus construct the matrix $A$. This is always possible for a polynomial $r$ by choosing polynomial bases $p(x)$ and $\tilde p(\tilde x)$.
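The identity (4) follows from a change of variables. The short derivation below is a sketch added for clarity; it assumes that $r$ is one-to-one and differentiable on $D$, so that the substitution $\tilde x = r(x)$ is valid.

```latex
\begin{align*}
\tilde\mu(\tilde f)
  &= \int_{r(D)} \tilde f(\tilde x)\,\tilde p(\tilde x)\,d\tilde x
   = \int_{D} \tilde f(r(x))\,\tilde p(r(x))\,|J_r(x)|\,dx
   && \text{(substitution } \tilde x = r(x)\text{)} \\
  &= \int_{D} f(x)\,A\,p(x)\,dx
   && \text{(by (2) and (3))} \\
  &= A \int_{D} f(x)\,p(x)\,dx = A\,\mu(f).
\end{align*}
```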
3
Implicit Moment Invariants
Let us assume that the transformation $r$ depends on a finite number $m$, $m < \tilde n$, of parameters $a = (a_1, \dots, a_m)$. Traditional explicit moment invariants with respect to $r$ can be obtained in two steps.
(a) Eliminate $a = (a_1, \dots, a_m)$ from the system (4). This leaves us $\tilde n - m$ equations which depend only on the two sets of general moments (and on the choice of basis functions, of course). We call it a reduced system.
(b) Rewrite these equations equivalently in the form
$$q_j(\tilde\mu(\tilde f)) = q_j(\mu(f)), \quad j = 1, \dots, \tilde n - m \qquad (5)$$
for some functions $q_j$. Then the explicit moment invariants are $E(f) = q_j(\mu(f))$.
However, for some transforms (quadratic, cubic, etc.) we may not be able to perform the second step, that is, to find the explicit forms $q_j$. Introducing implicit invariants can overcome this drawback. The reduced system in step (a) is independent of the particular transformation. To classify an object, we traditionally compare the values of its descriptors (explicit moment invariants) with those of the database images, that is, we look for the database image which satisfies equations (5). However, this is equivalent to checking for which database image the above reduced system is satisfied. So, in case we are not able to find explicit moment invariants in the form (5), we can use this system as a set of implicit invariants. In other words, the images are classified according to the error with which the system is satisfied.
We will demonstrate the above idea of implicit moment invariants on a 1D example using standard powers as basis functions. Consider the transform
$$r(x) = x + ax^2,$$
where $a \in (0, 1/2]$, which maps the interval $D = [-1, 1]$ onto the interval $\tilde D = [a-1, a+1]$. Let us show two implicit invariants. Since $m = 1$ (one-parameter transform), we need $\tilde n = 3$ and $n = 6$. The Jacobian is $J_r(x) = 1 + 2ax$ and for standard powers for both $p$ and $\tilde p$ we would get
$$A = \begin{pmatrix} 1 & 2a & 0 & 0 & 0 & 0 \\ 0 & 1 & 3a & 2a^2 & 0 & 0 \\ 0 & 0 & 1 & 4a & 5a^2 & 2a^3 \end{pmatrix}.$$
However, now we have to evaluate the moments of the transformed signal over the domain $\tilde D$, which depends on the unknown parameter $a$. This problem is resolved by choosing a shifted power basis
$$\tilde p_j(\tilde x) = (\tilde x - a)^j, \quad j = 0, 1, \dots, \tilde n - 1,$$
as we then have, after the shift of variable $\tilde x = \hat x + a$,
$$\tilde\mu_j(\tilde f) = \int_{a-1}^{a+1} \tilde f(\tilde x)(\tilde x - a)^j\,d\tilde x = \int_{-1}^{1} \tilde f(\hat x + a)\,\hat x^j\,d\hat x,$$
which is now independent of $a$, as $\hat f(\hat x) = \tilde f(\hat x + a)$ has domain $[-1, 1]$. For this basis $\tilde p$ we obtain a different transform matrix
$$A = \begin{pmatrix} 1 & 2a & 0 & 0 & 0 & 0 \\ -a & 1 - 2a^2 & 3a & 2a^2 & 0 & 0 \\ a^2 & 2a(a^2 - 1) & 1 - 6a^2 & 4a(1 - a^2) & 5a^2 & 2a^3 \end{pmatrix}$$
and the first of equations (4) gives
$$a = \frac{\tilde\mu_0 - \mu_0}{2\mu_1},$$
while the two reduced equations become, after substitution,
$$2\mu_1^2(\tilde\mu_1 - \mu_1) = \mu_1(3\mu_2 - \tilde\mu_0)(\tilde\mu_0 - \mu_0) + \mu_3(\tilde\mu_0 - \mu_0)^2,$$
$$4\mu_1^3(\tilde\mu_2 - \mu_2) = 4\mu_1^2(2\mu_3 - \mu_1)(\tilde\mu_0 - \mu_0) + \mu_1(5\mu_4 + \tilde\mu_0 - 6\mu_2)(\tilde\mu_0 - \mu_0)^2 + (\mu_5 - 2\mu_3)(\tilde\mu_0 - \mu_0)^3. \qquad (6)$$
In the example above it was straightforward to derive the transform matrix $A$ for a simple transformation $r$ and a small number of invariants. For numerical reasons, this intuitive approach cannot be used for a higher-order polynomial transform $r$ and/or for more invariants. To obtain a numerically stable method it is important to use suitable polynomial bases, such as orthogonal polynomials, without using their expansions into standard (monomial) powers. Our implementation is based on the representation of polynomial bases by matrices with a special structure [5]. This representation allows the polynomials to be evaluated efficiently by means of recurrence relations.
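The 1D example can be checked numerically. The sketch below is an illustration under stated assumptions: the test signal `f`, the value a = 0.3, and the midpoint quadrature are arbitrary choices made here, and the matrix is the shifted-basis matrix A of the example. The moments of the deformed signal are integrated directly in the deformed domain and compared with Aμ and with the two implicit invariants (6).

```python
import math


def midpoint(fn, lo, hi, n=20000):
    """Composite midpoint rule for the integral of fn over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(fn(lo + (k + 0.5) * h) for k in range(n))


def f(x):
    """Hypothetical test signal on D = [-1, 1], chosen so that mu_1 != 0."""
    return 1.0 + 0.5 * x + x * x


def r_inverse(x_t, a):
    """Inverse of r(x) = x + a*x^2 on [a - 1, a + 1], valid for 0 < a < 1/2."""
    return (-1.0 + math.sqrt(1.0 + 4.0 * a * x_t)) / (2.0 * a)


def moments_original(count=6):
    """Standard moments mu_j(f) over [-1, 1]."""
    return [midpoint(lambda x, j=j: f(x) * x ** j, -1.0, 1.0) for j in range(count)]


def moments_deformed(a, count=3):
    """Moments of the deformed signal w.r.t. the shifted basis (x~ - a)^j."""
    return [
        midpoint(lambda x_t, j=j: f(r_inverse(x_t, a)) * (x_t - a) ** j, a - 1.0, a + 1.0)
        for j in range(count)
    ]


def transform_matrix(a):
    """Shifted-basis matrix A of the 1D example."""
    return [
        [1.0, 2 * a, 0.0, 0.0, 0.0, 0.0],
        [-a, 1 - 2 * a * a, 3 * a, 2 * a * a, 0.0, 0.0],
        [a * a, 2 * a * (a * a - 1), 1 - 6 * a * a, 4 * a * (1 - a * a), 5 * a * a, 2 * a ** 3],
    ]


def implicit_residuals(mu, mu_t):
    """Left-hand side minus right-hand side of the two equations (6)."""
    d = mu_t[0] - mu[0]
    e1 = 2 * mu[1] ** 2 * (mu_t[1] - mu[1]) - (mu[1] * (3 * mu[2] - mu_t[0]) * d + mu[3] * d ** 2)
    e2 = 4 * mu[1] ** 3 * (mu_t[2] - mu[2]) - (
        4 * mu[1] ** 2 * (2 * mu[3] - mu[1]) * d
        + mu[1] * (5 * mu[4] + mu_t[0] - 6 * mu[2]) * d ** 2
        + (mu[5] - 2 * mu[3]) * d ** 3
    )
    return e1, e2


a_true = 0.3
mu = moments_original()
mu_t = moments_deformed(a_true)
A = transform_matrix(a_true)
predicted = [sum(A[i][j] * mu[j] for j in range(6)) for i in range(3)]
theorem_error = max(abs(p - q) for p, q in zip(predicted, mu_t))
a_estimate = (mu_t[0] - mu[0]) / (2.0 * mu[1])
residuals = implicit_residuals(mu, mu_t)
```

Up to quadrature error, `theorem_error` and both `residuals` vanish and `a_estimate` recovers the deformation parameter, although the residuals themselves are computed without knowing it.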
4
Implementation of the Implicit Invariants
Depending on $r$, the elimination of the $m$ parameters of the transformation function from the $\tilde n$ equations of (4) to obtain a parameter-free reduced system may require the numerical solution of nonlinear equations. This may be undesirable or impossible. Even the simple transform $r$ used in the experimental section would lead to cubic equations in terms of its parameters. Obtaining a neat reduced system may be very difficult. Furthermore, even if successful, we create an unbalanced method: we have demanded that some of the equations in (4) hold exactly and use the accuracy in the resulting system as a matching criterion to find the transformed image. We therefore propose another implementation of the implicit invariants. Instead of eliminating the parameters, we calculate the "uniform best fit" from all equations in (4). For a given set of values of the moments $\mu$ and $\tilde\mu$, we find values of the $m$ parameters to satisfy (4) as well as possible in the $\ell_2$ norm; the error of this fit then becomes the value of the respective implicit invariant. Our actual implementation of the recognition by implicit invariants can be described as follows.
(a) Given is a library (database) of images $g_j(x, y)$, $j = 1, \dots, L$, and a deformed image $\tilde f(\tilde x, \tilde y)$ which is assumed to have been obtained by a transform of a known polynomial form $r(x, y, a)$ with unknown values of the $m$ parameters $a$.
(b) Choose the appropriate domains, polynomial bases $p$ and $\tilde p$, and the recurrence matrices for evaluation of the polynomials.
(c) Derive a program to evaluate the matrix $A(a)$. This critical, error-prone step is performed by a symbolic algorithmic procedure which produces the program then used in the numerical calculations. This step is performed only once for the given task. (It has to be repeated only if we change the polynomial bases or the form of the transform $r(x, y, a)$, which basically means only if we move to another application.)
(d) Calculate the moments $\mu(g_j)$ of all library images $g_j(x, y)$.
(e) Calculate the moments $\tilde\mu(\tilde f)$ of the deformed image $\tilde f(\tilde x, \tilde y)$.
(f) For all $j = 1, \dots, L$ calculate, using an optimizer, the values of the implicit invariant
$$I(\tilde f, g_j) = \min_a \|\tilde\mu(\tilde f) - A(a)\,\mu(g_j)\|$$
and denote
$$M = \min_j I(\tilde f, g_j).$$
The norm used here should be weighted, for example relative to the components corresponding to the same degree.
(g) The identified image is $g_k$ for which $I(\tilde f, g_k) = M$; the ratios $I(\tilde f, g_j)/I(\tilde f, g_k)$, $j \neq k$, may be used as confidence measures of the identification.
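Steps (d) to (g) can be sketched for the 1D transform r(x) = x + ax² of Section 3. This is a minimal illustration, not the authors' implementation: it assumes polynomial "images" on [-1, 1], the shifted-basis matrix A(a) of that example, an unweighted ℓ2 norm, and a plain grid search standing in for the optimizer. The deformed moments are synthesized through relation (4) instead of being measured, and all names are hypothetical.

```python
import math


def poly_moments(coeffs, count=6):
    """Standard moments over [-1, 1] of the polynomial sum_k coeffs[k] * x^k."""
    def monomial_integral(k):
        return 2.0 / (k + 1) if k % 2 == 0 else 0.0
    return [sum(c * monomial_integral(k + j) for k, c in enumerate(coeffs)) for j in range(count)]


def transform_matrix(a):
    """Matrix A(a) for r(x) = x + a*x^2 with the shifted power basis."""
    return [
        [1.0, 2 * a, 0.0, 0.0, 0.0, 0.0],
        [-a, 1 - 2 * a * a, 3 * a, 2 * a * a, 0.0, 0.0],
        [a * a, 2 * a * (a * a - 1), 1 - 6 * a * a, 4 * a * (1 - a * a), 5 * a * a, 2 * a ** 3],
    ]


def apply_matrix(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def implicit_invariant(mu_tilde, mu, grid_step=0.002):
    """Step (f): best l2 fit of (4) over a in [0, 1/2], here by grid search."""
    best = float("inf")
    for k in range(int(round(0.5 / grid_step)) + 1):
        predicted = apply_matrix(transform_matrix(k * grid_step), mu)
        error = math.sqrt(sum((p - q) ** 2 for p, q in zip(predicted, mu_tilde)))
        best = min(best, error)
    return best


def classify(mu_tilde, library_moments):
    """Step (g): index of the best template, all scores, and confidence ratios."""
    scores = [implicit_invariant(mu_tilde, mu) for mu in library_moments]
    k = min(range(len(scores)), key=scores.__getitem__)
    confidences = [s / scores[k] for j, s in enumerate(scores) if j != k]
    return k, scores, confidences


# Hypothetical library of three polynomial templates (coefficients of 1, x, x^2, ...).
templates = [[1.0, 0.5, 1.0], [2.0, 1.0], [1.0, -0.5, 0.0, 0.3]]
library = [poly_moments(t) for t in templates]                 # step (d)
mu_tilde = apply_matrix(transform_matrix(0.237), library[0])   # step (e), synthesized via (4)
best_index, scores, confidences = classify(mu_tilde, library)
```

Because the true parameter 0.237 is not on the search grid, the winning score is small but nonzero, so the confidence ratios stay finite; a real implementation would use a proper optimizer and the weighted norm mentioned in step (f).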
5
Numerical Experiments
As we have shown earlier, the implicit moment invariants can be constructed for a very broad class of image transforms including all polynomial transforms.
Here we will demonstrate the implementation and the power of the method on images transformed by the following function
$$\begin{pmatrix} \tilde x \\ \tilde y \end{pmatrix} = r(x, y) = \begin{pmatrix} ax + by + c(ax + by)^2 \\ -bx + ay \end{pmatrix}, \qquad (7)$$
which is a rotation with scaling (parameters $a$ and $b$) followed by a quadratic deformation in the $\tilde x$ direction (parameter $c$). We have chosen this particular transform for our tests for the following reasons:
– It is general enough to approximate many real-life situations, for instance deformations caused by the fact that the photographed object was drawn or printed on a spherical or cylindrical surface, such as a bottle or a ball.
– It is sometimes used by web designers to warp images in order to reach a desirable visual effect. Very often this is an unauthorized act violating the copyright. It is important for the copyright owners to have a tool to identify such images.
– Explicit invariants to this kind of transform cannot exist because these transforms do not preserve the moment orders and do not form a group.
The first experiment was aimed at testing the discriminative power of the implicit invariants and at demonstrating that they can be used as shape descriptors for recognition of distorted real objects. This test was done on a standard benchmark database, ALOI [6]. We took 100 ALOI images and deformed each of them by the warping model (7) (see Fig. 1 for some examples). The coefficients of the deformations were generated randomly: $c$ from a range of admissible values and the rotation angle from $(-40^\circ, 40^\circ)$, both with uniform distribution. Each deformed image was then classified against the undistorted database by three different methods: by implicit invariants according to the minimal norm, by Hu's rotation moment invariants [1] according to the minimum distance, and by affine moment invariants (AMI) [2], also using the minimum-distance rule. In all three cases, six invariants were used. The last two methods were selected for comparison because they are similar to the new technique in their nature (all of them are based on moments) and because they are traditional, well-established reference methods in pattern recognition.
We ran the whole experiment several times with different deformation parameters. In each run the recognition rate was 99 or 100 % for the implicit invariants, from 43 to 47 % for the rotation invariants, and from 34 to 40 % for the AMIs. These results illustrate two important facts. First, the implicit invariants can serve as an efficient tool for object recognition when the object deformation corresponds to the assumed model. Second, in the case of nonlinear distortions the implicit invariants significantly outperform both the rotation and the affine moment invariants, which corresponds to our theoretical expectation. When only rotation is present, all three methods are equivalent. To illustrate this, we ran the experiment once again with c fixed to 0. Then the recognition rate of all three methods was 100 %.
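For reference, the warping model (7) can be written as a small coordinate mapping. The sketch below is illustrative only: the function names are hypothetical, and expressing a and b through a rotation angle and a scale factor is one possible parameterization, with the sign convention of the angle being an assumption.

```python
import math


def warp(x, y, a, b, c):
    """Transform (7): rotation with scaling (a, b), then a quadratic term (c) in x."""
    u = a * x + b * y
    return u + c * u * u, -b * x + a * y


def rotation_scaling_params(angle_deg, scale=1.0):
    """One possible choice of (a, b) for a given rotation angle and scale factor."""
    theta = math.radians(angle_deg)
    return scale * math.cos(theta), scale * math.sin(theta)


a, b = rotation_scaling_params(30.0)
rigid_point = warp(3.0, 4.0, a, b, c=0.0)       # c = 0: pure rotation, length preserved
warped_point = warp(1.0, 0.0, 1.0, 0.0, c=0.5)  # identity rotation plus quadratic term
```

With c = 0 the mapping reduces to a rotation with scaling, which is the case in which all three compared methods performed equally well.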
Fig. 1. The original images from the ALOI database (top) and their deformed versions (bottom)
The second experiment was done on real images taken in our lab. It illustrates good performance and high recognition power of the implicit invariants even in the case where theoretical assumptions about the degradation are not fulfilled. With a standard digital camera (Olympus C-5050), we took a photo of letters printed on a label which was glued to a bottle, see Fig. 2(a). The letters were organized in a 4 × 3 mesh with “A”s, “B”s, “V”s and “X”s each printed three times in a row. After a simple segmentation, the letters were labeled from left to right A1 , A2 , A3 , B1 , . . . , V1 ,. . . ,X1 , X2 , and X3 . Due to the curvature of the bottle surface, the letters appear distorted in the horizontal direction and the distortion grows to the right. A1 does not exhibit any visible distortion while A3 is the most distorted one and likewise for the other three letters. The task was to recognize (classify) these letters against a database containing the full English alphabet (26 undistorted letters of the same font). In an ”ideal” case when the camera is in infinity the image distortion can be described by orthogonal projection of the cylinder onto a plane, i.e.
x ˜ r sin( xr ) = , y˜ y where r is the bottle radius, x, y are the coordinates on the bottle surface and x ˜, y˜ are the coordinates on the acquired images. In our case the object-to-camera distance was finite, so small perspective effect appears in addition to the above model. We assume the actual image deformation can be approximated by a quadratic polynomial in x direction. Although it is clear that such approximation cannot be very accurate, we will demonstrate that it is accurate enough for our purpose. We classified all deformed letters by means of implicit invariants in the same way as in the previous experiment and also by the Hu’s moment invariants. The table in Fig 2(b) summarizes the classification results of both methods. Implicit invariants provided a perfect recognition rate as all deformed letters were classified correctly with high minimum confidence (see the definition in (g), Section 4). It might be a bit surprise if one considers the very rough approximation of the
(a) [Photograph of the label on the bottle]
(b)
            A1   A2   A3   B1   B2   B3   V1   V2   V3   X1   X2   X3
Imp-inv.    A    A    A    B    B    B    V    V    V    X    X    X
confidence  65   29   15   64   186  69   96   62   80   35   98   19
Hu-inv.     A    Y    N    B    S    S    V    G    N    X    X    I
Fig. 2. (a) Letters captured by a standard digital camera exhibit distortion due to the cylindrical shape of the bottle. (b) Classification of four letters (each having three different degrees of distortion) by implicit invariants (first row) with confidence in the second row, and classification by Hu’s invariants (third row).
transformation model made above. It indicates some degree of robustness of the implicit invariants to the type of the image deformation. Hu's moment invariants classified correctly only the letters without any quadratic deformation (A1, B1, V1, X1) and failed in other cases, which is in agreement with the theory.

Acknowledgement. This work was partially supported by the Czech Ministry of Education under the project 1M0572 (Research Center DAR). F. Šroubek was also supported by the Czech Academy of Sciences under the project AV0Z10750506-I055.
References
1. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Trans. Information Theory 8, 179–187 (1962)
2. Flusser, J., Suk, T.: Pattern recognition by affine moment invariants. Pattern Recognition 26, 167–174 (1993)
3. Reiss, T.H.: The revised fundamental theorem of moment invariants. IEEE Trans. Pattern Analysis and Machine Intelligence 13, 830–834 (1991)
4. Flusser, J.: On the independence of rotation moment invariants. Pattern Recognition 33(9), 1405–1410 (2000)
5. Kautsky, J., Golub, G.: On the calculation of Jacobi matrices. Linear Algebra Appl. 52/53, 439–455 (1983)
6. Amsterdam Library of Object Images: http://staff.science.uva.nl/~aloi/
An Automatic Microarray Image Gridding Technique Based on Continuous Wavelet Transform
Emmanouil Athanasiadis1, Dionisis Cavouras2, Panagiota Spyridonos1, Ioannis Kalatzis2, and George Nikiforidis1
1 Medical Image Processing and Analysis (M.I.P.A.) Group, Laboratory of Medical Physics, School of Medical Science, University of Patras, 26500 Rion - Patras, Greece
[email protected], [email protected], [email protected]
http://mipa.med.upatras.gr
2 Medical Image and Signal Processing (MED.I.S.P.) Laboratory, Department of Medical Instruments Technology, Technological Educational Institute of Athens, Ag. Spyridonos Street, Aigaleo, 122 10, Athens, Greece
{cavouras, ikalatzis}@teiath.gr
http://www.teiath.gr/stef/tio/medisp/
Abstract. In the present study, a new gridding method based on the continuous wavelet transform (CWT) is presented. Line profiles along the x and y axes were calculated, resulting in two different signals. These signals were independently processed by means of the CWT at 15 different levels, using the Daubechies 4 mother wavelet. A point-by-point summation was performed on the processed signals in order to suppress noise and enhance the differences between spots. Additionally, a wavelet-based hard-thresholding filter was applied to each signal to reduce its noise. 14 real microarray images were used to visually assess the performance of our gridding method. Each microarray image contained 4 sub-arrays of 40x40 spots each, thus 6400 spots in total. Moreover, these images contained contamination areas. According to our results, the accuracy of our algorithm was 98% in all 14 images and in all spots. Additionally, processing time was less than 3 sec on a 1024×1024×16 microarray image, making the method a promising technique for efficient and fully automatic gridding.
Keywords: Microarrays, gridding, Continuous Wavelet Transform.
1 Introduction
Microarray imaging is used for the simultaneous identification of thousands of genes in bioinformatics [1]. In a complementary DNA (cDNA) microarray experiment, mean fluorescence intensity values are calculated. These intensities are closely related to the expression level of a specific gene. Hence, the more precise the localization of a spot, the more accurate the intensity measurement is. Consequently, a more precise expression measurement of a gene is obtained.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 864 – 870, 2007. © Springer-Verlag Berlin Heidelberg 2007
In order to acquire spot intensity measurements, three major steps are followed [1][2]: 1/ the gridding or addressing step, where the precise location of the spots and their surroundings is determined, 2/ the segmentation step, where the foreground of each cell is discriminated from the background, and 3/ the intensity extraction step, where the mean fluorescence value of each spot is calculated. None of these steps is trivial, because microarray images are usually highly contaminated with noise and artifacts introduced during their construction [1]. More precisely, rotations, misalignments, and local deformations of the ideally rectangular grids of the image occur, making the whole process demanding. Gridding is the first step in the chain of the expression measurement process. It is vital that the gridding be accurate, since any errors in the addressing procedure are propagated to the following steps. A lot of work has already been done on automatic gridding. However, initialization or other parameters are needed before the gridding can be applied. Jain et al. [3] described a gridding method that requires as input the rows and columns of all grids. In the technique of Katzer et al. [4], gaps between the grids are essential, while in Steinfath et al. [5], filter arrays with radioactive labels are used. On the other hand, many researchers create the grid manually and then perform an automatic procedure for intensity measurement [1]. In this paper, a novel fully automatic gridding approach is presented, based on the properties of the continuous wavelet transform (CWT) [6]. This technique attempts to solve the automatic gridding problem efficiently, providing a tool for accurate and fast microarray image processing.
2 Material and Methods
In the present study, 9 microarray images of Saccharomyces cerevisiae were obtained from a publicly available database [7]. In those images, contamination areas as well as a slight shift of the alignment were visible. Each image consists of 4 subarrays of 40x40 spots each, leading to 6400 spots in total.
2.1 Continuous Wavelet Transform
The CWT decomposes a signal into wavelets, small oscillations that are highly localized in time, and is described by eq. 1.
$$C(a, b) = \int_{-\infty}^{+\infty} f(t)\, \frac{1}{\sqrt{|a|}}\, \overline{\psi\!\left(\frac{t - b}{a}\right)}\, dt \qquad (1)$$
where f(t) is the original signal, a represents scale, b represents time or space, ψ represents the mother wavelet and $\overline{z}$ represents the complex conjugate of z. The Daubechies (db) family was chosen as the mother wavelet because it gave better results. Furthermore, we worked with only four coefficients
(db 4), since using a greater number of coefficients did not result in any significant change in our results. Moreover, keeping the number of coefficients as low as possible was essential for our methodology, since processing time is an important issue for the gridding procedure. After applying the CWT, a wavelet-based hard-thresholding technique [8] was performed according to eq. 2.
$$W_{out} = \begin{cases} W_{in} + T \cdot (G - 1) & \text{if } W_{in} > T \\ W_{in} - T \cdot (G - 1) & \text{if } W_{in} < -T \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
where Wout denotes the output and Win the input coefficient values of the details. T and G are threshold and gain values respectively (Fig 1).
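As a minimal sketch of the mapping in eq. 2, the following Python function applies the thresholding rule element-wise. The function name `hard_threshold`, the example coefficient values and the use of NumPy are illustrative assumptions and are not taken from the original work.

```python
import numpy as np

def hard_threshold(w_in, threshold, gain=1.0):
    """Element-wise mapping of eq. 2 (hypothetical helper name).

    Coefficients whose magnitude does not exceed `threshold` are set to zero;
    the remaining ones are shifted by threshold * (gain - 1) away from zero.
    """
    w_in = np.asarray(w_in, dtype=float)
    w_out = np.zeros_like(w_in)
    above = w_in > threshold
    below = w_in < -threshold
    w_out[above] = w_in[above] + threshold * (gain - 1.0)
    w_out[below] = w_in[below] - threshold * (gain - 1.0)
    return w_out

# Illustrative coefficient values only.
coeffs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
kept = hard_threshold(coeffs, threshold=1.0)               # unit gain
boosted = hard_threshold(coeffs, threshold=1.0, gain=2.0)  # gain of two
```

With unit gain the surviving coefficients pass through unchanged, so the mapping reduces to a plain hard threshold.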
Fig. 1. Wavelet-based hard thresholding mapping layout
2.2 Gridding Procedure
The gridding procedure was accomplished as follows. Firstly, line profiles for the x and y axes were calculated and two signals were obtained. Secondly, the CWT was applied to both signals at 15 scales, using the db4 mother wavelet, and 15 detail images for each signal were obtained (Fig. 2). The number of scales was set to 15 because higher scales revealed no significant additional information. During the decomposition process, high-frequency details are distinguished more accurately from noise structures, as noise is not present in all decomposed signals, in contrast to the alternations of spots with their background, which can be present at all levels. Thirdly, a point-by-point summation of the signals of all 15 scales was performed in order to increase the actual signal of the spots and decrease the noise. Fourthly, a hard-thresholding
wavelet based technique [6] was applied to each signal for the task of suppressing noise (Fig. 3). Finally, local maxima and local minima were calculated on both signals, corresponding to the centers and the boundaries of the spots respectively. Optimal results were obtained by setting the threshold equal to 10% of the maximum signal value and the gain equal to unity. The described method is briefly summarized in the following steps:
Step 1: Load the initial image.
Step 2: Calculate the line profiles of the x and y axes.
Step 3: Apply the CWT to both the x and y axis signals, up to 15 scales, using db4 as the mother wavelet.
Step 4: For each of the two signals, sum the CWT coefficients point by point from the 1st to the 15th scale.
Step 5: Apply the wavelet denoising technique, using 10% of the maximum value of each signal as the threshold.
Step 6: Find the local maxima (centers of spots) and local minima (boundaries of spots) in the x and y axes.
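The steps above can be sketched in Python as follows. This is only an illustration under several stated assumptions: a sampled Mexican-hat (Ricker) wavelet is swapped in for db4 so that the example needs nothing beyond NumPy, the line profile is taken to be the sum of intensities along each axis, and the input is a synthetic array of Gaussian spots. All function names are hypothetical.

```python
import numpy as np

def ricker(scale, half_width=5):
    """Mexican-hat wavelet sampled at integer positions (stand-in for db4)."""
    t = np.arange(-half_width * scale, half_width * scale + 1) / scale
    return (1.0 - t**2) * np.exp(-0.5 * t**2) / np.sqrt(scale)

def summed_cwt(signal, scales=range(1, 16)):
    """Steps 3-4: transform at each scale, then sum point by point."""
    n = len(signal)
    total = np.zeros(n)
    for a in scales:
        kernel = ricker(a)
        full = np.convolve(signal, kernel, mode="full")
        start = (len(kernel) - 1) // 2   # keep the part aligned with the input
        total += full[start:start + n]
    return total

def hard_threshold(w, fraction=0.10):
    """Step 5 with unit gain: zero values within 10% of the maximum."""
    t = fraction * w.max()
    return np.where(np.abs(w) > t, w, 0.0)

def local_extrema(w):
    """Step 6: strict local maxima (positive) and minima (negative)."""
    mid, left, right = w[1:-1], w[:-2], w[2:]
    maxima = np.flatnonzero((mid > left) & (mid > right) & (mid > 0)) + 1
    minima = np.flatnonzero((mid < left) & (mid < right) & (mid < 0)) + 1
    return maxima, minima

def grid_from_image(image):
    """Steps 2-6 for both axes; returns {axis: (centers, boundaries)}."""
    profiles = {"x": image.sum(axis=0), "y": image.sum(axis=1)}
    return {axis: local_extrema(hard_threshold(summed_cwt(p)))
            for axis, p in profiles.items()}

# Synthetic 5x5 array of Gaussian spots, 20 pixels apart (illustrative data).
true_centers = np.arange(10, 100, 20)
yy, xx = np.mgrid[0:100, 0:100]
image = np.zeros((100, 100))
for cy in true_centers:
    for cx in true_centers:
        image += np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * 3.0 ** 2))

grid = grid_from_image(image)
x_centers, x_boundaries = grid["x"]
y_centers, y_boundaries = grid["y"]
```

Where exactly the minima fall between neighbouring spots depends on the wavelet chosen, so the boundary positions from this sketch should not be read as those of the db4-based method.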
Fig. 2. X-axis signal processed by the CWT in 3 dimensions (coefficient, time or space, scale). In this image, the scales a range from 1 to 15, space b is the dimension of the signal (here 1024) and COEFS are the resulting coefficient values of the wavelet transform.
Fig. 3. Initial signal (a) and signal processed with the wavelet-based hard-threshold filtering technique (b). Local maxima correspond to the centers and local minima to the boundaries of the cells. In (b) noise has been suppressed by the wavelet threshold filter.
3 Results and Discussion
The proposed method was evaluated by visually inspecting [9] the gridding results and assigning each spot to one of two categories: either the grid cell perfectly surrounds the spot (category 1) or it does not (category 2). Additionally, our method was compared with the method proposed by Blekas et al. [9]. According to our findings, our algorithm was 98% accurate in all the 14 microarray images, compared with 95% for the other method. It should be noted that microarray images were not pre-processed, thus
Fig. 4. Gridding results of a contaminated arrayer. The contamination is located in the middle of the figure.
noise affected the outcome of both gridding algorithms. Nevertheless, in the proposed technique, noise suppression using the hard-thresholding filter led to higher accuracy. A typical contaminated area of a microarray image with the grid overlaid is illustrated in Fig. 4. As the results show, the contamination at the boundaries of the arrayer is eliminated by the hard-thresholding technique. It should be noted that the processing time for the gridding of a 1024×1024×16 microarray image was less than 3 sec (processor: Pentium IV 3.00 GHz, 512 MB RAM), making the technique a valuable tool for a fully automated microarray image processing application.
4 Conclusion
In the present study we have proposed a new method for automatic addressing of microarray images based on the continuous wavelet transform. Following this technique, the noise of microarray images was sufficiently suppressed, and the boundaries of the spots were delineated with high accuracy. The processing time was minimal, making the method an effective tool for the demanding task of microarray image processing.
Acknowledgements We would like to thank the Greek State Scholarships Foundation (I.K.Y.) for funding the above work.
References
1. Yang, Y.H., Buckley, M.J., Dudoit, S., Speed, T.P.: Comparison of methods for Image Analysis on cDNA Microarray Data. Journal of Computational and Graphical Statistics 11, 108–136 (2002)
2. Schena, M., Shalon, D., Davis, R.W., Brown, P.O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995)
3. Jain, A., Tokuyasu, T., Snijders, A., Segraves, R., Albertson, D., Pinkel, D.: Fully Automatic Quantification of Microarray Image Data. Genome Res. 12, 325–332 (2002)
4. Katzer, M., Kummert, F., Sagerer, G.: Automatische Auswertung von Mikroarraybildern. In: Workshop Bildverarbeitung für die Medizin, Leipzig (2002)
5. Steinfath, M., Wruck, W., Seidel, H.: Automated image analysis for array hybridization experiments. Bioinformatics 17(7), 634–641 (2001)
6. Daubechies, I.: Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (1992)
7. DeRisi, J.L., Iyer, V.R., Brown, P.O.: Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science 278, 680 (1997)
8. Athanasiadis, E., Glotsos, D., Daskalakis, A., Bougioukos, P., Kostopoulos, S., Theocharakis, P., Spyridonos, P., Kalatzis, I., Nikiforidis, G., Cavouras, D.: Microarray Image Enhancement Techniques Using the Discrete Wavelet Transform. In: Second International Conference From Scientific Computing to Computational Engineering (2nd IC-SCCE 2006), Athens (July 5-8, 2006)
9. Blekas, K., Galatsanos, N., Likas, A., Lagaris, I.E.: Mixture Model Analysis of DNA Microarray Images. IEEE Transactions on Medical Imaging 24, 901–907 (2005)
Image Sifting for Micro Array Image Enhancement Pooria Jafari Moghadam and Mohamad H. Moradi Biomedical faculty of AmirKabir University of Technology, Tehran, Iran Bioinstrumentation & Signal processing Lab, AmirKabir University of Technology, Tehran, Iran
[email protected]
Abstract. cDNA micro arrays are more and more frequently used in molecular biology as they can give insight into the relation between an organism's metabolism and its genome. The process of imaging a micro array sample can introduce a great deal of noise and bias into the data, with higher variance than the original signal, which may swamp the useful information. As imperfections and fabrication artifacts often impair our ability to measure accurately the quantities of interest in micro array images, image processing for the analysis of these images is an important and challenging problem. Eliminating the effect of noise is a challenging problem in micro array analysis. In this paper we implemented a novel algorithm for image sifting which can remove objects of a definite size from micro array images. We used regular moving grids to sift noise objects and obtained clean images for segmentation. The results have been compared with SWT, DWT and Wiener filter denoising.
Keywords: sifting, micro array, enhancement, image, denoising.
1 Introduction
Micro arrays have become the tool of choice for the global analysis of gene expression. Powerful statistical tools are now available to analyze this expression and to gain an understanding of how changes in gene expression patterns impact biological systems. Currently, several different platforms have evolved from the origin of this imaging technique, which goes back to the 1970s [1]. The analysis of such data has become a computationally intensive task that requires technological developments at various stages, from the design of the array, to image analysis, database storage, data processing and clustering, and information extraction. Further innovations were made by M. Schena et al. [2], who used a nonporous solid support to facilitate miniaturization and fluorescence-based detection. Another improvement was the method introduced by S. Fodor et al. [3] for the high-density spatial synthesis of oligonucleotides, which makes it possible to monitor changes in the expression patterns of thousands of genes. Image analysis is an important aspect of micro array experiments that can affect subsequent analysis such as the identification of differentially expressed genes. Image processing for micro array images includes three tasks: spot gridding, segmentation and information extraction. In recent years, a large number of commercial tools have been developed for micro array image processing [4, 5, 6, 7, 8].
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 871 – 877, 2007. © Springer-Verlag Berlin Heidelberg 2007
Micro arrays are a scientific tool that should be viewed in a similar fashion to any other laboratory technique, with careful experimental planning, replication, and proper statistical analysis. There is a need to examine these data with statistical techniques to help discern possible patterns in the data. Such statistical methods include analysis of variance introduced by Kerr et al. [9], ratio distribution by Chen et al. [10] and Ermolaeva et al. [11], and Gamma distribution by Newton et al. [12]. These methods deal mainly with measurement error, such as preparation of the sample, cross hybridization, and fluctuation of the fluorescence value from gene to gene. But none deals specifically with the effect of noise. By giving information on the expression levels of thousands of genes at the same time, DNA micro arrays allow researchers, for example, to relate the effect of a disease to particular genes. The basic procedure for a micro array experiment is as follows. RNA is extracted from a cell or tissue sample and then converted to cDNA. Fluorescent tags (usually Cy3 and Cy5) are enzymatically incorporated into the newly synthesized cDNA or can be chemically attached to the new strands of DNA. A cDNA molecule that contains a sequence complementary to one of the single-stranded probe sequences on the array will hybridize, via base pairing, to the spot at which the complementary reporters are affixed. The spot will then fluoresce when examined using a micro array scanner. The fluorescence intensity of each spot is then evaluated in terms of the number of copies of a particular mRNA, which ideally indicates the level of expression of a particular gene. A schematic diagram for this process was created. Methods to reduce the noise source include the use of clean glass slides and a higher laser power rather than a higher PMT voltage.
However, these are not adequate for the required image quality, and an enhanced software procedure embedded within the process is a much better alternative. The results of the micro array experiment are two 16-bit tagged image files, one for each fluorescent dye. The image is not perfect and includes noise sources that blur such images for further gene expression experimentation. The noise originates from different sources during the course of the experiment, such as photon noise, electronic noise, treatment of the glass slide, background fluorescence, laser light reflection, and dust on the slide. Hence, it is crucial to denoise the resultant image within this process. In this paper, we implemented a novel algorithm for image sifting which can remove objects of a definite size from micro array images. We used regular moving grids to sift noise objects and obtained clean images for segmentation. The results have been compared with SWT, DWT and Wiener filter denoising.
2 Material and Methods
Here, we have implemented a novel algorithm for image sifting which can remove objects of a definite size from micro array images. One example of the distortions affecting micro array images is known as impulse noise. Background correction is an important task in the analysis of micro arrays. It is necessary in order to remove the contribution to the intensity that is not due to the hybridization of the cDNA samples to the spotted DNA, so we must first make the background of the micro array images uniform.
Thresholding is important in image processing for extracting objects from their background. If the histogram has a deep and sharp valley between two peaks, it is easy to threshold the image by simply choosing the valley bottom as the threshold point. In some cases, however, the valley of the histogram is flat and broad, or the two peaks are extremely unequal in height, making it difficult to locate the valley bottom. We used an optimal multi-threshold method, which maximizes the interclass variance between bright and dark pixels [13], to overcome some of the above-mentioned problems. After thresholding, the binary image was sifted by a moving grid. We created a suitable grid to remove the desired objects from the image.
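The idea of choosing a threshold by maximizing the interclass variance can be sketched as below. This is a simplified single-threshold version written with NumPy, not the multi-threshold variant mentioned above. The function name `otsu_threshold`, the bin count and the toy image are assumptions made for illustration.

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Threshold that maximizes the between-class variance (single threshold)."""
    hist, edges = np.histogram(np.ravel(image), bins=bins)
    p = hist / hist.sum()                      # probability of each bin
    centers = 0.5 * (edges[:-1] + edges[1:])   # representative gray level per bin
    w_dark = np.cumsum(p)                      # weight of the dark class
    w_bright = 1.0 - w_dark                    # weight of the bright class
    cum_mean = np.cumsum(p * centers)
    total_mean = cum_mean[-1]
    between = np.zeros(bins)
    valid = (w_dark > 0) & (w_bright > 0)      # skip splits with an empty class
    between[valid] = ((total_mean * w_dark[valid] - cum_mean[valid]) ** 2
                      / (w_dark[valid] * w_bright[valid]))
    return edges[np.argmax(between) + 1]

# Toy image: dark left half, bright right half (illustrative data only).
image = np.full((8, 8), 10.0)
image[:, 4:] = 200.0
threshold = otsu_threshold(image)
binary = image > threshold
```

The resulting binary image is the kind of input the moving grid is then applied to.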
Fig. 1. Grid structure
The grid equation can be written as:
$$\sum_{i=0}^{P} (y = i \cdot W_V + y_0) + \sum_{i=0}^{P} (x = i \cdot W_H + x_0) \qquad (1)$$
where $W_V = W_H$ can be 2, 4, 8, ..., 256, $(x_0, y_0)$ is the initial point and, with $P = 512 / W_V$, $i = 0, 1, 2, \dots, P$. In order to extract new edges for each of the grid windows, the edges of each grid window were traced. The tracer moved along the edge of a grid window in such a way that the background was always on its left side, so when the edge of a grid window is intersected by objects of the image, the tracer follows the edges of these objects to make a new edge for the window (see Fig. 2). Thus, the edges of the grid windows are distorted and reshaped according to the edges of the objects detected in the path of the moving tracer. As in sifting, we removed all objects that were enclosed inside these new grid windows. Then the grid was moved along the x and y axes, pixel by pixel, to sift the whole image. For this purpose, we varied $x_0$ from zero to $W_H$ while $y_0$ was held constant, then changed $y_0$ in the range from zero to $W_V$ and varied $x_0$ from zero to $W_H$ again for each step. This allowed us to sift the whole image completely (see Fig. 3).
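A minimal sketch of the regular grid of Eq. (1) and of the offset sweep is given below. It only builds the grid lines and enumerates the offsets; the edge tracing and object removal are not reproduced. The function names and the choice to stop the offsets at W - 1 (an offset of W repeats offset 0 for the lines inside the image) are assumptions.

```python
import numpy as np

def grid_mask(size=512, w=32, x0=0, y0=0):
    """Boolean mask of the lines y = i*W + y0 and x = i*W + x0 (W_V = W_H = w)."""
    mask = np.zeros((size, size), dtype=bool)
    mask[y0::w, :] = True   # horizontal lines
    mask[:, x0::w] = True   # vertical lines
    return mask

def grid_offsets(w):
    """Sweep order described in the text: x0 varies for each fixed y0."""
    for y0 in range(w):
        for x0 in range(w):
            yield x0, y0

mask = grid_mask(size=16, w=4, x0=1, y0=2)
offsets = list(grid_offsets(4))
```

Each `(x0, y0)` pair yields one grid position at which the windows would be traced and sifted.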
Fig. 2. A grid window gets a new structure because of the object edge effect
Fig. 3. Grid movement along x-axis
3 Result
Figure 4 shows the denoised image obtained with our sifting approach. As can be seen, unlike other approaches, this method has no serious effect on image clarity, such as blurring or darkening. This allows us to continue processing without the need for any alignment. In order to provide a quantitative measure of the resulting images, we used the universal index proposed in [14]. Let $x = \{x_i \mid i = 1, 2, \dots, N\}$ and $y = \{y_i \mid i = 1, 2, \dots, N\}$ be the original and the test image signals, respectively. The proposed quality index is defined as
Fig. 4. (a) Original image. (b) Enhanced image by sifting.
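$$Q = \frac{4 \sigma_{xy}\, \bar{x}\, \bar{y}}{(\sigma_x^2 + \sigma_y^2)\left[(\bar{x})^2 + (\bar{y})^2\right]} \qquad (2)$$
where
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i,$$
$$\sigma_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad \sigma_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2,$$
$$\sigma_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}).$$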
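The quality index above translates directly into a few lines of NumPy. The sketch below evaluates Q once over the whole image, as the formula is written here; the function name and the tiny test arrays are illustrative assumptions.

```python
import numpy as np

def universal_quality_index(x, y):
    """Quality index Q of Eq. (2), computed over all pixels at once."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    mean_x, mean_y = x.mean(), y.mean()
    var_x = x.var(ddof=1)                      # 1/(N-1) normalization
    var_y = y.var(ddof=1)
    cov_xy = np.sum((x - mean_x) * (y - mean_y)) / (x.size - 1)
    return (4.0 * cov_xy * mean_x * mean_y
            / ((var_x + var_y) * (mean_x ** 2 + mean_y ** 2)))

# Illustrative data: an image compared with itself and with a flipped copy.
original = np.array([[1.0, 2.0], [3.0, 4.0]])
q_same = universal_quality_index(original, original)
q_flipped = universal_quality_index(original, original[::-1, ::-1])
```

Identical images give Q = 1, while the flipped copy, which is perfectly anti-correlated with the original here, gives Q = -1.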
Using this index, the quality of the images denoised by the Stationary Wavelet Transform (SWT), Discrete Wavelet Transform (DWT), Wiener filter and sifting was 0.7837, 0.7790, 0.7885 and 0.8073, respectively. As can be clearly observed, the highest quality index was achieved by sifting. To test the method further, we carried out several other simulations using different micro array images. The results are shown in Table 1.
Table 1. Comparative performance of the mean of quality indexes
Method          Mean quality index
DWT             0.7416
SWT             0.7531
Wiener filter   0.7723
Sifting         0.8142
4 Discussion
It is well known that micro array technology can monitor thousands of DNA sequences in a high-density array on a glass slide. Two mRNA samples are reverse-transcribed into cDNA, labeled using different fluorescent dyes, then mixed and hybridized with the arrayed DNA sequences. After this competitive hybridization, the slides are imaged by a scanner to measure the fluorescence intensity of each dye. The image is not perfect and includes noise sources that blur such images for further gene expression experimentation. The noise originates from different sources during the course of the experiment, such as photon noise, electronic noise, laser light reflection, dust on the slide, and so on. Hence, it is crucial to denoise the images resulting from this process. Existing methods to reduce the noise source include the use of clean glass slides and a higher laser power rather than higher PMT voltages. However, these are not adequate for the required image quality, and an enhanced software procedure embedded within the process would be a much better alternative. As imperfections and fabrication artifacts often impair our ability to measure accurately the quantities of interest in micro array images, image processing for the analysis of these images is an important and challenging problem. Eliminating the effect of noise is a challenging problem in micro array analysis. In this
paper we implemented a novel algorithm for image sifting which can remove objects of a definite size from micro array images. We used regular moving grids to sift noise objects and obtained clean images for segmentation. The simulation results on micro array image examples verified the enhancement and denoising quality of the method. Sifting provides better performance in denoising micro array images than traditional transform methods. The results also show that the sifting denoising has a 10% better performance than the Wiener filter, which is widely used in commercial denoising software systems. The application of this method would improve the accuracy of gene expression measurement, and therefore make it easier to identify diseased genes when diagnosing critical diseases.
References
1. Southern, E.M.: Detection of specific sequences among DNA fragments separated by gel electrophoresis. J. Mol. Biol. 98, 503–517 (1975)
2. Schena, M., Shalon, D., Davis, R.W., Brown, P.O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995)
3. Fodor, S.P.A., Read, J.L., Pirrung, M.C., Stryer, L., Lu, A.T., Solas, D.: Light-directed, spatially addressable parallel chemical synthesis. Science 251, 767–773 (1991)
4. Steinfath, M., Wruck, W., Seidel, H., Lehrach, H., Radelof, U., O'Brien, J.: Automated image analysis for array hybridization experiments. Bioinformatics 17(7), 634–641 (2001)
5. Bozinov, D., Rahnenfuhrer, J.: Unsupervised technique for robust target separation and analysis of DNA microarray spots. Bioinformatics 18(5), 747–756 (2002)
6. Wruck, W., Griffiths, H., Steinfath, M., Lehrach, H., Radelof, U., O'Brien, J.: Xdigitise: Visualization of hybridization experiments. Bioinformatics 18(5), 757–760 (2002)
7. Zapala, M.A., Lockhart, D.J., Pankratz, D.G., Garcia, A.J., Barlow, C., Lockhard, D.J.: Software and methods for oligonucleotide and cDNA array data analysis. Genome Biol. 3(6) (2002)
8. Jain, A.N., Tokuyasu, T.A., Snijders, A.M., Segraves, R., Albertson, D.G., Pinkel, D.: Fully automatic quantification of microarray image data. Genome Res. 12, 325–332 (2002)
9. Kerr, M.K., Martin, M., Churchill, G.A.: Analysis of variance gene expression microarray data. J. Comput. Biol. 7, 819 (2001)
10. Chen, Y., Dougherty, E.R., Bittner, M.L.: Ratio-based decision the quantitative analysis of cDNA microarray images. J. Biomed. Optics, 364–374 (1997)
11. Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler, G.D., Bittner, M.L., Chen, R., Simon, P.M., Trent, J.M., Boguski, M.: Data management and analysis for gene expression arrays. Nature Genetics 20, 19–23 (1998)
12. Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., Tsui, K.W.: On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Computat. Biol. 8, 37–52 (2001)
13. Otsu, N.: A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), 62–66 (1979)
14. Wang, Z., Bovik, A.: A universal image quality index. IEEE Trans. Signal Processing Letter 9, 81–84 (2002)
Wavelet Based Local Coherent Tomography with an Application in Terahertz Imaging
Xiao-Xia Yin1, Brian W.-H. Ng1, Bradley Ferguson2, and Derek Abbott1
1 Centre for Biomedical Engineering, School of Electrical & Electronic Engineering, The University of Adelaide, SA 5005, Australia
Tel.: +61-8-8303-5748 Fax: +61-8-8303-4360
[email protected]
2 Tenix Defence Systems Pty Ltd, Technology Park, Mawson Lakes, SA 5095
Abstract. Terahertz Computed Tomography (THz-CT) is a form of optical coherent tomography, which offers a promising approach for achieving non-invasive inspection of solid materials, with potentially numerous applications in industrial manufacturing and biomedical engineering. With traditional CT techniques such as X-ray tomography, full exposure data are needed to produce cross sectional images, even if the region of interest is small. For time-domain terahertz measurements, the requirement for full exposure data is impractical due to the slow measurement process. In this paper, we apply a wavelet-based algorithm to locally reconstruct THz-CT images with a significant reduction in the required measurements. The algorithm recovers an approximation of the region of interest from terahertz measurements within its vicinity, and thus improves the feasibility of using terahertz imaging to detect defects in solid materials and diagnose disease states for clinical practice.
1 Introduction
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 878–885, 2007. © Springer-Verlag Berlin Heidelberg 2007

Terahertz (T-ray) radiation spans the frequency range from 0.1 THz to 10 THz in the electromagnetic spectrum. T-rays have promising potential for both in vivo and in vitro biosensing [7] owing to (i) their non-invasive property, and (ii) the fact that biomolecules [9] have rich resonances in the terahertz region. While one- and two-dimensional applications with time-domain terahertz spectroscopy have been well-demonstrated in the past [4], the ability to non-destructively probe the inner three-dimensional structure of optically opaque structures is less well studied. There has been a relative scarcity of terahertz tomography work in the literature. The aim of our current work is to perform terahertz tomographic reconstructions, particularly localised reconstructions based on limited data. The main goal of this paper is to present a wavelet based reconstruction algorithm for terahertz computed tomography and to show how this algorithm can be used to rapidly reconstruct the region of interest (ROI) with a reduction in the measurements of terahertz responses, compared with a standard reconstruction. This algorithm generates the approximation and detail images separately,
and the final reconstruction is found by an inverse wavelet transform. The current algorithm computes the approximate and detailed portions of the reconstructed image using ramp-filtered scaling and wavelet functions within the more familiar filtered back-projection (FBP) framework. Compared to previous algorithms in terahertz computed tomography [3], the current reconstruction algorithm accelerates terahertz image scanning and reduces computational complexity, since it only requires knowledge of a small number of projections on lines passing through the ROI and its small immediate neighbourhood. In this paper, uniform exposure is assumed at all angles for simplicity of analysis as well as potential implementation in hardware. Reconstruction ability for off-centered and centered regions of interest is also explored.
1.1 A Brief Introduction to Terahertz Imaging
A terahertz CT system is based on a chirped terahertz time domain spectroscopy scanned imaging system. The target is mounted on a motion stage that allows it to be translated along both axes in the horizontal plane, as well as rotated about its vertical axis. This setup is capable of producing tomographic data sets of an object with full variability of angle and offset distance from its centre. Detailed information on the chirped pulse scanning, and on the rectangular and polar coordinate systems used for image reconstruction, can be found in Ferguson et al. [2].
2 Methodology
2.1 An Overview of CT and Terahertz CT
Normally, a filtered back projection (FBP) algorithm begins with a collection of sinograms obtained from projection measurements. A sinogram s(ξ, θ) is simply generated via a collection of the projections at all angles:
$$ s(\xi, \theta) = \int o(x, z)\, d\xi, \qquad (1) $$
where all points on projection offset ξ satisfy the equation x cos θ + z sin θ = ξ, and o denotes the measured image intensity of a target object, which is a function of pixel position in the x−z plane. The FBP algorithm for terahertz CT reconstruction is expressed as follows:
$$ I(x, y) = \int_0^{\pi} \int_{-\infty}^{\infty} S(\theta, \beta)\, |\beta| \exp[i 2\pi\beta\xi]\, d\beta\, d\theta, \qquad (2) $$
where S(θ, β) is the spatial Fourier transform of the parallel projection data, β is the spatial frequency in the ξ direction. It should be noted that the operation of the ramp filter |β| in Eq. (2) is equivalent to a differentiation followed by a Hilbert transform, which introduces a discontinuity in the derivative of the Fourier transform at zero frequency. This is the cause of the non-locality of the FBP algorithm.
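As a rough illustration of Eq. (2), the following sketch applies the ramp filter |β| to each projection in the discrete Fourier domain and back-projects the result along ξ = x cos θ + z sin θ. It is a minimal pure-Python sketch under stated assumptions, not the implementation used in this paper: the function names `ramp_filter` and `fbp`, the plain O(n²) DFT, the nearest-neighbour interpolation and the synthetic disc phantom are all introduced here for illustration only.

```python
import cmath
import math

def ramp_filter(projection):
    """Apply the ramp filter |beta| to one projection via a plain O(n^2) DFT."""
    n = len(projection)
    spectrum = [sum(projection[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    # Signed spatial frequency of DFT bin k, in cycles per sample.
    beta = [(k if k <= n // 2 else k - n) / n for k in range(n)]
    filtered = [abs(b) * s for b, s in zip(beta, spectrum)]
    return [sum(filtered[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def fbp(sinogram, thetas, size):
    """Back-project ramp-filtered projections onto a size x size grid."""
    n = len(sinogram[0])
    image = [[0.0] * size for _ in range(size)]
    for projection, theta in zip(sinogram, thetas):
        q = ramp_filter(projection)
        for iz in range(size):
            for ix in range(size):
                x, z = ix - (size - 1) / 2, iz - (size - 1) / 2
                # Offset xi = x cos(theta) + z sin(theta), shifted to an array index
                # (nearest-neighbour interpolation, an assumption of this sketch).
                i = int(round(x * math.cos(theta) + z * math.sin(theta) + (n - 1) / 2))
                if 0 <= i < n:
                    image[iz][ix] += q[i] * math.pi / len(thetas)
    return image

# Synthetic example (assumed): parallel projections of a centred disc of radius 8.
n, radius, n_angles = 33, 8, 18
disc = [2 * math.sqrt(max(radius ** 2 - (i - n // 2) ** 2, 0)) for i in range(n)]
thetas = [k * math.pi / n_angles for k in range(n_angles)]
image = fbp([disc] * n_angles, thetas, n)
```

Because the filtered value at each offset depends on the whole projection, truncating the projections changes the result everywhere; this is the non-locality discussed above.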
X.-X. Yin et al.
2.2 Calculation of Terahertz Parameters for Reconstruction of Terahertz CT
One of the advantages that terahertz CT has over X-ray CT is that s(θ, ξ) may be one of several parameters derived from terahertz pulses. Fundamentally, a terahertz CT setup is capable of measuring the transmitted terahertz pulse as a function of time t, for a given projection angle and projection offset. In principle, terahertz sinograms can be obtained in both the time and frequency domains. In this paper, we review the calculation of terahertz sinograms in the time domain. This method is based on the assumption that the target is dispersionless, so that the THz pulse shape is unchanged after propagation through the target apart from attenuation and time delay. A reference terahertz pulse pi(t) is measured without the target in place. To estimate the phase shift Δt of a terahertz pulse pd(t) that has passed through the target, the two signals are resampled using lowpass interpolation. The two interpolated signals are then cross-correlated, and the lag that maximises the cross-correlation at each angle is taken as the estimate of the phase delay of pd(t). This delay is used as the extracted parameter for CT reconstruction.
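The delay estimate described above can be sketched as follows. This is a simplified illustration only: the helper `estimate_delay`, the synthetic pulse shape and the sample counts are assumptions, and the lowpass interpolation step that precedes the cross-correlation is omitted, so the delay is resolved only to whole samples.

```python
import math

def estimate_delay(reference, delayed):
    """Return the lag (in samples) that maximises the cross-correlation."""
    n = len(reference)
    best_lag, best_value = 0, float("-inf")
    for lag in range(-(n - 1), n):
        value = sum(reference[t] * delayed[t + lag]
                    for t in range(n) if 0 <= t + lag < n)
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag

def pulse(t, centre):
    """Synthetic single-cycle pulse (assumed shape, for illustration only)."""
    u = t - centre
    return -u * math.exp(-u * u / 8.0)

# Reference pulse and an attenuated copy delayed by 7 samples.
reference = [pulse(t, 20) for t in range(64)]
delayed = [0.6 * pulse(t, 27) for t in range(64)]
lag = estimate_delay(reference, delayed)  # 7 for this synthetic pair
```

Attenuation scales the whole correlation curve without moving its peak, which is why the estimate relies on the dispersionless assumption rather than on the pulse amplitude.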
2.3 Two Dimensional Wavelet Based CT Reconstruction
Two Dimensional Wavelet Transform (2D DWT). Wavelet transforms play an important role in many image processing algorithms. Fundamentally, wavelet decomposition corresponds to a multiresolution analysis of a signal. This has the advantage of much improved joint time-frequency localisation over Fourier based techniques. In practice, it is nearly always implemented using digital filters and downsamplers. In two dimensions, the discrete version of a wavelet transform can be realised by a 2D scaling function, φ(x, y), and three 2D wavelets, ψ1(x, y), ψ2(x, y), and ψ3(x, y), which are calculated by taking the 1D wavelet transform along the rows of f(x, y) and then along the resulting columns. The 2D scaling function and 2D wavelet functions satisfy the equations given in Gonzalez and Woods [5]. Our current experiment uses symmetric (linear phase) filters for the analysis of tomographic reconstruction. Let $h_0$, $h_1$ denote a pair of linear phase low- and high-pass wavelet filters and $\tilde{h}_0$, $\tilde{h}_1$ denote the corresponding reconstruction filters. The discrete approximation at resolution $2^{j-1}$ can be obtained by combining the details and approximation at resolution $2^j$ using the reconstruction wavelet filters:
$$ c_{j-1}(k, l) = \sum_{m,n} \Big[ \tilde{h}_0(k - 2m)\tilde{h}_0(l - 2n)\, c_j(m, n) + \tilde{h}_0(k - 2m)\tilde{h}_1(l - 2n)\, d_j^H(m, n) + \tilde{h}_1(k - 2m)\tilde{h}_0(l - 2n)\, d_j^V(m, n) + \tilde{h}_1(k - 2m)\tilde{h}_1(l - 2n)\, d_j^D(m, n) \Big]. \qquad (3) $$
Our method is to focus the wavelet application on recovering local images from the wavelet approximation and detail coefficients. In order to accommodate the support of the reconstruction filters when recovering the local target area, the calculation of these reconstructed coefficients includes the region of interest and a margin area.
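The synthesis step of Eq. (3) can be sketched for a single level as below. The sketch assumes orthonormal Haar filters purely to keep the example short; the experiments in this paper use BiorSplines 2.2 filters instead, and the function names `synthesize` and `analyze` are hypothetical. For Haar, the same two-tap filters serve for analysis and reconstruction, so the round trip recovers the input.

```python
import math

S = 1 / math.sqrt(2)
H0 = (S, S)     # low-pass reconstruction filter (Haar, assumed for illustration)
H1 = (S, -S)    # high-pass reconstruction filter (Haar, assumed for illustration)

def synthesize(c, dH, dV, dD):
    """One level of Eq. (3): combine four subbands into the finer approximation."""
    M, N = len(c), len(c[0])
    out = [[0.0] * (2 * N) for _ in range(2 * M)]
    for m in range(M):
        for n in range(N):
            for a in (0, 1):          # a = k - 2m
                for b in (0, 1):      # b = l - 2n
                    out[2 * m + a][2 * n + b] += (
                        H0[a] * H0[b] * c[m][n]
                        + H0[a] * H1[b] * dH[m][n]
                        + H1[a] * H0[b] * dV[m][n]
                        + H1[a] * H1[b] * dD[m][n])
    return out

def analyze(x):
    """Matching one-level Haar analysis, returning (c, dH, dV, dD)."""
    M, N = len(x) // 2, len(x[0]) // 2
    def band(fa, fb):
        return [[sum(fa[a] * fb[b] * x[2 * m + a][2 * n + b]
                     for a in (0, 1) for b in (0, 1))
                 for n in range(N)] for m in range(M)]
    return band(H0, H0), band(H0, H1), band(H1, H0), band(H1, H1)

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
restored = synthesize(*analyze(image))
```

With longer filters such as the biorthogonal ones, each output sample depends on coefficients from a wider neighbourhood, which is the reason for the margin area mentioned above.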
[Figure 1: three panels. (a) "ramp filter in the frequency domain"; (b) "scaling and wavelet filters"; (c) "scaling and wavelet ramp filters". Legend in (b) and (c): scaling ramp filter, horizontal wavelet ramp filter, vertical wavelet ramp filter, diagonal wavelet ramp filter.]
Fig. 1. (a) Illustration of a traditional ramp filter. (b) and (c) Illustration of the scaling and wavelet ramp filters at the sixth projection angle (43.2 degree) using BiorSplines 2.2 wavelet, respectively.
Two Dimensional Wavelet Reconstruction. The idea of a wavelet based reconstruction is to use a scaling (or wavelet) modified ramp filter in the FBP algorithm to generate the appropriate subbands of the reconstructed image [8]. Since the 2D wavelet transform is separable, the modified ramp filters are simply convolutions of appropriately rotated 1D scaling and wavelet functions. Locality is achieved since most common scaling and wavelet functions have compact support, as do their Hilbert transforms [1]. The modified FBP algorithm for terahertz CT reconstruction is expressed as follows:
$$ I(x, y) = \int_0^{\pi} \int_{-\infty}^{\infty} S(\theta, \beta)\, |\beta|\, G_{2^j}(\beta \cos\theta, \beta \sin\theta) \exp(i 2\pi\beta\xi)\, d\beta\, d\theta, \qquad (4) $$
where S(θ, β) and $G_{2^j}(\beta_1, \beta_2)$ are the spatial Fourier transforms of s(θ, ξ) and $g_{2^j}$ (the scaling or wavelet ramp filter in the time domain), respectively. Examples of a ramp filter and of modified wavelet ramp filters are shown in Fig. 1(a)-(c). In the current paper, only one level of the wavelet transform is considered. This is mainly because a single-level WT is sufficient to allow clear reconstruction of an image in the ROI, and it avoids the extra computational complexity of more WT levels [8]. The wavelet reconstruction formula in Eq. (4) allows for such a reconstruction by setting j = 1. The 2D inverse wavelet transform (IWT) is performed on the reconstructed approximation and detail subbands, after the relevant filters are applied in the FBP.
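To make the angle dependence of the filter in Eq. (4) concrete, the sketch below evaluates |β| G(β cos θ, β sin θ) for a separable G, that is, a product of two 1D Fourier transforms. It is an illustration under assumptions: the Haar scaling function and Haar wavelet stand in for the BiorSplines 2.2 functions used in this paper, the scale factor 2^j is omitted, and the names `haar_scaling_ft`, `haar_wavelet_ft` and `modified_ramp` are hypothetical.

```python
import cmath
import math

def haar_scaling_ft(beta):
    """Fourier transform of the Haar scaling function (1 on [0, 1))."""
    if beta == 0:
        return 1.0 + 0j
    return (1 - cmath.exp(-2j * math.pi * beta)) / (2j * math.pi * beta)

def haar_wavelet_ft(beta):
    """Fourier transform of the Haar wavelet (+1 on [0, 1/2), -1 on [1/2, 1))."""
    if beta == 0:
        return 0j
    return (1 - cmath.exp(-1j * math.pi * beta)) ** 2 / (2j * math.pi * beta)

def modified_ramp(beta, theta, ft_first, ft_second):
    """|beta| * G(beta cos(theta), beta sin(theta)) for a separable G."""
    g = ft_first(beta * math.cos(theta)) * ft_second(beta * math.sin(theta))
    return abs(beta) * g

# At theta = 0 the second frequency argument is zero, so a band that uses a
# wavelet along the second axis receives no contribution from this projection.
scaling_value = modified_ramp(0.25, 0.0, haar_scaling_ft, haar_scaling_ft)
detail_value = modified_ramp(0.25, 0.0, haar_scaling_ft, haar_wavelet_ft)
```

The product in the frequency domain corresponds to the convolution of rotated 1D functions mentioned in the text, so each projection angle needs its own set of four filters.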
3 Implementation Issues
The current research on terahertz imaging is inspired by Rashid-Farrokhi et al. [8]. In this work, we experiment with the 2D wavelet technique using terahertz tomographic data by modifying the measured projections. This modification uses an extrapolation technique to reduce edge effects due to sinogram truncation. It is observed that the approximation coefficients of the scaling function show good localised features in the local reconstruction using our algorithm, where the reconstructed intensity of an image varies considerably between different target materials.
[Figure 2: panel (a) is a photograph of the target; panels (b) "Projection profile at 25th angle" and (c) "Projection profile at 25th angle after Extrapolation" plot amplitude (a.u.) against projection (mm). Legend entries include ramp filtered projection, wavelet ramp filtered projection and scaling ramp filtered projection.]
Fig. 2. (a) An optical image of a target with 2 mm diameter holes drilled into a polystyrene cylinder with varying interhole distances. (b) Projection filtered by a scaling ramp filter and a traditional ramp filter, respectively. (c) Projection extrapolation outside the ROI after filtered projections.
It should be noted that, when terahertz data are used for local reconstruction, artifacts at the edges of the region of exposure (ROE) in the terahertz projections, where nonlocal data is set to zero, are very significant whether the traditional ramp filter or the scaling and wavelet ramp filters are used. Thus, the ROE must have a radius re = ri + ra, where ri is the ROI radius and ra is the radius of the region of artifacts (ROA). Fig. 2(b) shows the sharp variation along the borders of the ROE after applying the scaling ramp filter and the ramp filter, respectively, to one of the 1D projections. This variation results in the image appearing relatively weak in intensity compared to the large constant bias that exists along the reconstructed edges in the region of interest. The constant extrapolation we use is given by Eq. 25 in Rashid-Farrokhi et al. [8]. Fig. 2(c) shows the extrapolated projection at the 25th projection angle after the application of a scaling ramp filter and a ramp filter. The extrapolation removes the spikes at the edge of the ROE. This extrapolation procedure is also suitable for the reconstruction of an off-center area. In order to recover the cross-sectional image in the region of interest, the values of the sinograms outside the ROE are set to zero. The traditional filtered back projection formula and the wavelet based reconstruction are each applied to the remaining projections for analysis and comparison. With the extrapolation step, the overall algorithm becomes:
1. The original projections through the ROE are calculated in the time domain from the measured pulses;
2. Each projection is filtered by three different modified wavelet filters at all projection angles;
3. Each projection is filtered by the modified scaling filter at all projection angles;
4. Filtered projections are extrapolated with constants to limit artifacts at the boundaries of the projections;
5. Extrapolated filtered projections are back projected to every other point to obtain the approximation and detail sub-images at resolution $2^{-1}$; all remaining points are set to zero;
6. The image is reconstructed from these sub-images using a 2D IWT.
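The truncation and extrapolation used in the steps above can be sketched on a single 1D projection as follows. The exact extrapolation formula is Eq. 25 of [8] and is not reproduced here; this sketch simply assumes that the boundary samples of the ROE are held constant outside it, and the helper names `roe_radius`, `zero_outside_roe` and `extrapolate_constant` are hypothetical.

```python
def roe_radius(r_i, r_a):
    """Radius of the region of exposure, r_e = r_i + r_a."""
    return r_i + r_a

def zero_outside_roe(projection, centre, r_e):
    """Discard nonlocal data: samples farther than r_e from the centre become zero."""
    return [v if abs(i - centre) <= r_e else 0.0 for i, v in enumerate(projection)]

def extrapolate_constant(filtered, centre, r_e):
    """Hold the boundary values of the ROE constant outside it (assumed form)."""
    lo = max(centre - r_e, 0)
    hi = min(centre + r_e, len(filtered) - 1)
    return [filtered[lo] if i < lo else filtered[hi] if i > hi else v
            for i, v in enumerate(filtered)]

# Toy filtered projection with edge spikes (value 9) outside a 3-sample ROE.
filtered = [9, 9, 3, 5, 4, 9, 9]
smoothed = extrapolate_constant(filtered, centre=3, r_e=1)  # [3, 3, 3, 5, 4, 4, 4]
```

As an example of the radius relation, an ROI radius of 16 pixels with an assumed artifact margin of 5 pixels gives an ROE radius of 21 pixels, which would be consistent with an ROE diameter of 42 pixels.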
4 Reconstruction Results
In this paper, a simple structure, a cylinder with holes inside, is used as the target for the terahertz measurements (Fig. 2(a)). The data set consists of 101 projections at each of 25 projection angles covering a 180° projection area in a 100 × 100 image. Two situations are analyzed for this target sample: (i) an ROE of diameter 42 pixels at the center of the image and (ii) an ROE of diameter 67 pixels off-center in the image.
4.1 Local Reconstruction of Centered Region
Fig. 3 shows local reconstruction results for a centered region with a radius of 16 pixels using the local reconstruction algorithm. Fig. 3(a) is the local reconstruction. Fig. 3(b) shows the four reconstructed subimages (approximation and detail coefficients) obtained using the wavelet and scaling ramp filters, constant extrapolation and BP. Fig. 3(c) and (d) show the reconstruction profiles at an arbitrarily chosen horizontal row and vertical column of pixels for the local and traditional reconstructions. It is observed that the reconstruction inside the ROI is similar in quality to a full exposure reconstruction using traditional FBP.
[Figure 3: panel (a) "Wavelet Based Tomography" and the four subimages in (b) are plotted over X (mm) and Y (mm); panels (c) "Reconstructed profile at 12th horizontal pixel row" and (d) "Reconstructed profile at 12th vertical pixel column" plot Magnitude (dB) against number of pixels, with curves for scaling ramp filtered LCT, ramp filtered GCT, ramp filtered LCT and wavelet based LCT.]
Fig. 3. (a) Reconstructed image localised to a region of interest from the inverse wavelet transform. (b) Centered approximate and three detail reconstruction subimages (clockwise). (c) Reconstructed profiles at the 12th horizontal pixel row. (d) Reconstructed profiles at the 12th vertical pixel column.
4.2 Local Reconstruction of Off-Centered Region
Fig. 4(a)-(d) shows locally reconstructed images of an off-centered region with a radius of 61 pixels using the current local reconstruction method. The subfigures illustrate in turn: the local reconstruction from extrapolated wavelet and scaling filtered projections after decomposition; the reconstruction of the extrapolated approximation and detail coefficients after BP; the reconstruction profiles at an arbitrarily chosen horizontal row; and the profiles at another arbitrarily chosen vertical column. For comparison purposes, the profiles for both the proposed and the traditional FBP are shown in Fig. 4(c)-(d). It should be noted that the reconstruction from the wavelet approximation coefficients shows strong contrast in intensity between different media, and that the traditional FBP based local reconstruction shows slightly higher intensity than the FBP based global reconstruction.
[Figure 4: panel (a) "Extrapolated wavelet based LCT" and the four subimages in (b) are plotted over X (mm) and Y (mm); panels (c) "Reconstructed profile at 28th horizontal pixel row" and (d) "Reconstructed profile at 12th vertical pixel column" plot Magnitude (dB) against number of pixels, with curves for scaling ramp filtered LCT, ramp filtered LCT and ramp filtered GCT.]
Fig. 4. (a) A reconstructed image from the inverse wavelet transform without decomposition for clarity. (b) Off-centered approximate and three detail reconstructed subimages (clockwise). (c) Reconstructed profiles at the 28th horizontal row of pixels. (d) Reconstructed profiles at the 12th vertical column of pixels.
5 Conclusion and Future Work
We have presented an algorithm to reconstruct the wavelet and scaling coefficients of a function from its sinogram image of terahertz signals. The scheme is motivated by the observation that for some wavelet bases, with sufficient zero moments, the scaling and wavelet functions have essentially the same support after ramp filtering. Two targets are recovered from terahertz measurements, which demonstrates the quality of the reconstructions possible with the presented local reconstruction method. Since the current work involves only one level of the 2D DWT, it will be interesting to explore the reconstruction algorithm with more levels of decomposition in the future. Moreover, a research area of much current interest is the development of statistically based local tomography algorithms and techniques [6]. There is much room for the integration of such a statistical estimation framework with wavelet techniques, which have been shown here to be critical for local reconstruction.
References
1. Delaney, A.H., Bresler, Y.: Multiresolution tomographic reconstruction using wavelets. IEEE Transactions on Image Processing 4(6), 799–813 (1995)
2. Ferguson, B., Wang, S., Gray, D., Abbott, D., Zhang, X.C.: Identification of biological tissue using chirped probe THz imaging. Microelectronics Journal 33(12), 1043–1051 (2002)
3. Ferguson, B., Wang, S., Gray, D., Abbott, D., Zhang, X.C.: T-ray computed tomography. Optics Letters 27(15), 1312–1314 (2002)
4. Galvão, R.K.H., Hadjiloucas, S., Bowen, J.W., Coelho, C.: Optimal discrimination and classification of THz spectra in the wavelet domain. Opt. Express 11, 1462–1473 (2003)
5. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice-Hall, Inc., New Jersey (2002)
6. Hanson, K.M., Wecksung, G.W.: Bayesian estimation of 3-D objects from few radiographs. Journal of Optical Society of America 73(11), 1501–1509 (1983)
7. Mittleman, D., Neelamani, R., Baraniuk, R.G., Rudd, J.V., Koch, M.: Recent advances in terahertz imaging. Appl. Phys. B 68, 1085–1094 (1999)
8. Rashid-Farrokhi, F., Liu, K.J.R., Berenstein, C.A., Walnut, D.: Wavelet-based multiresolution local tomography. IEEE Transactions on Image Processing 6(10), 1412–1430 (1997)
9. Siegel, H.P.: Terahertz technology in biology and medicine. IEEE Transactions on Microwave Theory and Techniques 52(10), 2438–2447 (2004)
Nonlinear Approximation of Spatiotemporal Data Using Diffusion Wavelets
Marie Wild
Austrian Research Centers GmbH, Video and Safety Technology, Tech Gate Vienna, Donau-City-Str. 1, 1220 Vienna, Austria
[email protected]
Abstract. We present a multiscale, graph-based approach to 3D image analysis using diffusion wavelet bases, which were presented in [1]. Diffusion wavelets allow one to obtain orthonormal bases of L2 functions on graphs. This permits the study of classical wavelet algorithms (such as compression and denoising of functions in L2(Rn), n ∈ N, via nonlinear approximation) in this setting. In this paper, we describe how this could be used in structure-preserving compression of image sequences, modelled as a whole as a weighted graph, as a first step towards structural spatiotemporal wavelet segmentation. We further discuss the possibilities for using this abstract approach in computer vision tasks.
1 Introduction
Wavelet based methods have proven successful in signal and image analysis, where the main applications are edge-preserving smoothing and denoising of functions in L2(Rn), n ∈ N, via nonlinear approximation. The recent concept of diffusion wavelets [1,2] allows the construction of wavelet bases for functions defined on structures other than Rn, such as certain domains, manifolds and graphs. In place of the usual dilation on Rn, a diffusion operator on the data serves as the scaling tool. In this paper, we study the use of classical wavelet algorithms, lifted to a graph based setting, concentrating on an algorithm for nonlinear approximation of 2d + time image data using diffusion wavelet bases. In this sense, we build a bridge from applied mathematics to computer vision, describing how theoretical results from wavelet theory could be used in tasks such as graph-based compression and denoising, as well as segmentation and tracking in future work.
Supported by the Austrian Science Fund under the grant FWF-P18716-N13.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 886–894, 2007. © Springer-Verlag Berlin Heidelberg 2007
We suppose the input data to be an image sequence, regarded as a 3d, or more precisely a 2d + time, data set. We model the whole image sequence as a weighted graph, where the edge weights describe local similarity between certain data points, which represent the nodes of the graph. Notice that the data points herein are not necessarily the whole set of pixels, but could also be subsets of pixels such as feature points, or groups of pixels obtained from a pre-segmentation. The weights on the edges can be based on intensity, feature properties, motion information or a combination of the above. On the graph G, we build a diffusion wavelet basis. Instead of using the geometry of the data set in R3 (pixel distances on a regular grid), we use the ‘intrinsic geometry’ or ‘structure’ of the data given by the connectivity of the graph nodes during a diffusion process, which is given by increasing powers of a diffusion operator. This diffusion process can be regarded as ‘learning’ the global structure of the data using local relationships. As the operator is formed from the graph representing the input data, the wavelet basis depends on the data and the whole process is data-adaptive and non-linear. We can analyze functions on the graph by computing coefficients from the constructed diffusion wavelet basis (ψj,l)j≥1,l∈Z. As this basis is an orthonormal basis, we have that for a function f in L2(G),
$$ \|f\|^2_{L^2(G)} = \sum_{j,l} |\langle f, \psi_{j,l} \rangle|^2. \qquad (1) $$
In this way, all the information of f is maintained in the sequence of coefficients. Moreover, as this information is simply kept in the squared sum of the coefficients, the salient information is reflected in the largest coefficients just as for the usual wavelet transform. The main difference to the classical wavelet algorithms is that ‘information’ now refers to structural similarity of the data, encoded as a graph, instead of properties on the fixed grid Rn. The wavelet basis depends on the graph, which itself depends on the data, such that the shortcomings of the classical wavelet transform (i.e. the lack of efficiency in coding functions with edges along smooth curves) can be overcome, as these shortcomings mainly arise from the fixed geometry of Rn. In a sense, we adapt to the intrinsic geometry of the data. Nevertheless, we can employ classical wavelet methods on the diffusion wavelet coefficients, although the information they encode is of a totally different kind. Due to the norm equality (1), we can efficiently approximate functions on a graph using only the N largest coefficients for reconstruction. This method, known as N-term approximation, does more than just approximate a function by another one at a coarser resolution: it sums the wavelet series according to large coefficients across different resolution levels and is thereby nonlinear. A similar method in the context of denoising is the so-called Wavelet Shrinkage method: the N largest coefficients are further decreased in size towards the noise level. A large body of theory exists for both of these approaches (e.g. see [3,4,5]), which relates the error of approximation to the function’s properties. Most important to mention is the fact that these algorithms compress the data by performing a sort of discontinuity preserving smoothing, which in our graph setting corresponds in a sense to structure-preserving smoothing: the ‘discontinuities’ which are preserved in this case correspond to abrupt changes of edge weights in a neighborhood of ‘continuously’ changing weights, which are smoothed out. This can be seen as a first step towards structure-based spatiotemporal segmentation.
After this short overview, we go into details. The rest of the paper is organized as follows: in the next section, we comment on related work in computer vision concerning spatiotemporal segmentation and spectral graph methods. In section 3, we describe the concept of diffusion wavelet multiresolution. The algorithm is presented in section 4. We give an abstract example to visualize our theoretical approach in section 5 before we close this article with a conclusion and an outlook on further work concerning structure-based spatiotemporal segmentation and its application to computer vision problems.
2 Related Work and Context
Diffusion wavelets can be seen as a particular instance of spectral graph theory. Methods based on this theory have been widely studied and applied to computer vision tasks, see e.g. [6,7,8,9,10,11,12]. Spectral graph methods start from the knowledge of the local geometry of the given data and infer a global representation. They allow a non-linear re-organization and dimension reduction of data sets (on graphs and more general manifolds) and are well-suited for subsequent tasks such as visualization, grouping and partitioning of the data. We briefly sum up the basics; for details see [13]. Let G = (V, E) be an undirected graph, where V and E are finite sets of vertices and edges. Let G be a weighted graph equipped with a weight matrix W : E → R+. Let $d_v = \sum_{u \in V} W(u, v)$ denote the degree of vertex v. Let the (normalized) Laplacian on G be defined as
$$ L(u, v) = \begin{cases} 1 - \dfrac{W(v, v)}{d_v} & \text{if } u = v \text{ and } d_v \neq 0, \\ -\dfrac{W(u, v)}{\sqrt{d_u d_v}} & \text{if } u, v \text{ adjacent}, \\ 0 & \text{otherwise,} \end{cases} $$
and let the diffusion matrix on G be defined as K = I − L. Note that on the one hand, these matrices describe the graph just like a generalized adjacency matrix; on the other hand, they can be seen as (locally averaging) operators acting on functions on a graph. We will thus refer to K as the diffusion operator in the following. K reflects local similarity in the graph, and global properties can be explored from its eigenvalues. A global analysis of the data corresponds to the study of large powers of K and is efficiently performed using the eigenvectors with the largest eigenvalues, which are the ones with the lowest frequency. A projection on these top eigenvectors corresponds to a low-frequency approximation. In these terms, this projection can be regarded as an analog to a Fourier approximation, where the basis functions are the eigenfunctions, sorted by their frequency.
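The definitions of L and K above translate directly into code. The following is a small sketch, assuming the weight matrix is given as a dense symmetric list of lists with a zero entry where two vertices are not adjacent; the function name `diffusion_operator` and the three-vertex path graph are illustrative choices, not part of the paper.

```python
import math

def diffusion_operator(W):
    """Return (L, K) with L the normalized Laplacian and K = I - L."""
    n = len(W)
    d = [sum(W[u][v] for u in range(n)) for v in range(n)]  # degrees d_v
    L = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u == v and d[v] != 0:
                L[u][v] = 1.0 - W[v][v] / d[v]
            elif u != v and W[u][v] != 0:      # u and v adjacent
                L[u][v] = -W[u][v] / math.sqrt(d[u] * d[v])
    K = [[(1.0 if u == v else 0.0) - L[u][v] for v in range(n)] for u in range(n)]
    return L, K

# Path graph on three vertices with unit weights (assumed example).
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L, K = diffusion_operator(W)

# The vector of square-root degrees is an eigenvector of K with eigenvalue 1.
sqrt_d = [1.0, math.sqrt(2), 1.0]
K_sqrt_d = [sum(K[u][v] * sqrt_d[v] for v in range(3)) for u in range(3)]
```

The eigenvalue 1 found here is the top of the spectrum of K, the slowest-decaying mode under repeated application of the diffusion operator.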
In [14], this is used for dimensionality reduction of high dimensional data, in [11], this is applied to image denoising and semi-supervised classification, whereas in [12], it is used for data fusion and multi-cue data matching.
The normalized cut method [8,10], used for graph-based image segmentation and for spatiotemporal segmentation for tracking, fits into this spectral context. It partitions a weighted graph into disjoint sets, finding a cut of relatively small weight between two subsets with strong internal connection. The algorithm calculates eigenvalues and eigenvectors of the Laplacian matrix and uses the eigenvector of the second smallest eigenvalue to bipartition the graph. In this sense, functions on a graph are projected on a subspace of low-frequency, Fourier-like approximations. Our goal is to derive an analogous spatiotemporal segmentation based on a multiresolution analysis on a graph rather than on a kind of Fourier transform, lifting the classical advantages to this setting by projecting on wavelet subspaces across scales rather than on Fourier-like linear subspaces.
3 Multiresolution Analysis and Diffusion Wavelets
In mathematical terms, the discrete wavelet transform algorithm [15] corresponds to a fast implementation for deriving the coefficients of a wavelet series. Wavelets provide an orthonormal basis for functions on R (and also on Rn, n ∈ N; for simplicity of presentation, we concentrate here on the 1d case). Any f ∈ L2(R) can be written as $f = \sum_{j,l \in \mathbb{Z}} \langle f, \psi_{j,l} \rangle \psi_{j,l}$, where the ψj,l are dilated and translated versions of a mother wavelet ψ ∈ L2(R). A remark: the wavelet series resembles a Fourier series, but there are crucial differences. As Fourier series decompose functions into sines and cosines having infinite support, they serve as a frequency representation, analyze functions globally and are well-suited for smooth functions (e.g. the Sobolev class). The wavelet series allows a decomposition into localized functions at different scales and is thereby a time-frequency representation. N-term approximation and Wavelet Shrinkage are classical wavelet algorithms with well-known efficiency properties (for details see [3]). It can be shown that the decay of the approximation error is directly related to a certain decay of the wavelet coefficients. For functions on R, Z and other settings, this decay is reached if and only if the function is in a (Besov-type) smoothness space [3,16], provided the wavelet basis has certain properties. As the Besov spaces (in contrast to Sobolev spaces) may contain rather rough functions, even with discontinuities, one could state that the approximation smooths out the data while preserving discontinuities. Furthermore, Wavelet Shrinkage can be interpreted as the solution of a certain variational problem [5], including a regularization term which is close to the total variation. This gives heuristic evidence for using wavelet methods in the graph setting: the data on the graph will be smoothed by the approximation, but not as rigorously as with Fourier based methods, since certain discontinuities, i.e. abrupt changes in the edge weights, are preserved (for a discussion of smoothness for functions on graphs in relation to Sobolev spaces based on the ‘smooth flow’ of the diffusion process see [11]). The approximation will in this sense be better able to preserve the intrinsic structure of the data.
We now briefly explain the construction of the diffusion wavelets. Dilations (stretching and squeezing) as on R cannot be used for scaling in this setting. Instead, one uses increasing powers of the diffusion operator, $(K^{2^j})_{j>0}$, as a scaling tool to define a multiresolution analysis (for an exact definition and discussion of this concept see e.g. [17]). Let $\{\lambda_i\}_{i \ge 0}$ be the (decreasingly ordered) spectrum of K with eigenvectors $\{\xi_i\}$, and let $0 < \varepsilon < 1$ and $t_j := 2^{j+1} - 1$, $j \ge 0$. The recipe is to divide the spectrum into ‘low-pass’ portions $\sigma_j(K) := \{\lambda \in \sigma(K) : \lambda^{t_j} \ge \varepsilon\}$. For j ≥ 0 define the approximation spaces by $V_j := \mathrm{span}\{\xi_\lambda : \lambda \in \sigma_j(K)\}$ and the wavelet spaces $W_j$ by $V_{j-1} = V_j \oplus W_j$. Projections on the spaces $V_j$ are approximations of a function f at different resolutions, and projections on the $W_j$ (partial wavelet series) describe the difference between two approximation levels. The system (ψj,l)j≥0,l∈Z, formed from orthonormal bases (ψj,l)l∈Z for each $W_j$, is an orthonormal basis for L2(G), called the diffusion wavelet basis. Building localized orthonormal bases for $V_j$, $W_j$ is possible because for K with a fast decaying spectrum, the rank of $K^{t_j}$ decreases and the powers of the operator can be compressed. For technical details we refer to [1]. With this construction one can now calculate scaling and wavelet coefficients of functions on a graph (see again [1]). This permits a true multiscale transform in the spirit of classical wavelet analysis on non-linear structures such as graphs, manifolds or general metric spaces. The approximation algorithm based on this diffusion wavelet transform is presented in the following section.
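The way the ‘low-pass’ portions of the spectrum shrink with the scale j can be illustrated with a few lines of code. This sketch only counts eigenvalues; it does not construct the localized diffusion wavelet bases of [1]. The function name `low_pass_spectrum`, the example spectrum and the value of ε are assumptions chosen for illustration.

```python
def low_pass_spectrum(eigenvalues, eps, j):
    """Eigenvalues lam of K with lam**t_j >= eps, where t_j = 2**(j+1) - 1."""
    t_j = 2 ** (j + 1) - 1
    return [lam for lam in eigenvalues if lam ** t_j >= eps]

# Assumed toy spectrum of K, decreasingly ordered, and precision eps.
spectrum = [1.0, 0.9, 0.5, 0.1]
eps = 0.05

# dim V_j for j = 0, ..., 4; the gap between consecutive entries is dim W_j.
dims = [len(low_pass_spectrum(spectrum, eps, j)) for j in range(5)]
```

For this toy spectrum the dimensions come out as 4, 3, 2, 2, 1: eigenvalues well below 1 drop out after a few dyadic powers, which is the rank decay that makes the powers of K compressible.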
4 Nonlinear Approximation on Graphs Using Diffusion Wavelets
As described in the introduction, in this paper we want to study in nonlinear approximation of image sequences, modelled as a whole as a weighted graph. Employing the whole bunch of theory we presented in the above sections, we proceed as following: First, we have to build a weighted graph G from the 3d image data. The vertices can be chosen as the whole set of pixels, or due to complexity considerations as a subset, e.g. using a downsampled version of the sequences, by filtering or by a feature point selection procedure. The crucial point is to define the edges of G and their corresponding weights: here we encode the local relations between vertices from which the algorithm will ‘learn’ the global and multiscale structure from. A standard choice in our setting is w(u, v) = exp(−ρ(u, v)2 ), where ρ(u, v) may be the difference of intensities in u, v, the distance in space, feature point properties, information from a motion prediction or a combination of the above. For complexity reasons, it is reasonable to set ρ(u, v) = ∞, if v is outside a certain finite neigborhood of u. For an elaborate discussion on these choices, see [11] and [10]. A function f on G imposes additional attributes on the vertices. In our context, f can describe intensities (if the vertices are pixels), feature properties or other
Nonlinear Approximation of Spatiotemporal Data Using Diffusion Wavelets
891
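Algorithm 1. Diffusion wavelet approximation
Input: $f \in L^2(G)$, function on a weighted graph G
Result: $\tilde{f} \in L^2(G)$, compressed function on G
choose T > 0;  // threshold parameter: the larger T, the fewer coefficients constitute the approximation
build a diffusion wavelet basis $(\psi_{j,l})_{j \ge 0,\, l \in \mathbb{Z}}$ on G;
compute wavelet coefficients $(\langle f, \psi_{j,l} \rangle)_{j \ge 0,\, l \in \mathbb{Z}}$;
forall j ≥ 0, l ∈ Z do
    case Hard Thresholding
        if $|\langle f, \psi_{j,l} \rangle| \ge T$ then $\langle \tilde{f}, \psi_{j,l} \rangle \leftarrow \langle f, \psi_{j,l} \rangle$;
        else $\langle \tilde{f}, \psi_{j,l} \rangle \leftarrow 0$;
    case Soft Thresholding
        if $|\langle f, \psi_{j,l} \rangle| \le T$ then $\langle \tilde{f}, \psi_{j,l} \rangle \leftarrow 0$;
        else if $\langle f, \psi_{j,l} \rangle > T$ then $\langle \tilde{f}, \psi_{j,l} \rangle \leftarrow \langle f, \psi_{j,l} \rangle - T$;
        else if $\langle f, \psi_{j,l} \rangle < -T$ then $\langle \tilde{f}, \psi_{j,l} \rangle \leftarrow \langle f, \psi_{j,l} \rangle + T$;
end
$\tilde{f} \leftarrow \sum_{j,l} \langle \tilde{f}, \psi_{j,l} \rangle\, \psi_{j,l}$;  // build approximation $\tilde{f}$ from $(\langle \tilde{f}, \psi_{j,l} \rangle)$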
additional information. To analyze the ‘pure’ graph constructed from the data, choose f ≡ 1. The graph G and the function f (which lies in $L^2(G)$, since G is assumed to have a finite set of nodes) are the input to the approximation algorithm.
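The thresholding step of Algorithm 1 can be illustrated in isolation. The following is a minimal sketch, assuming the wavelet coefficients have already been computed and flattened into a list; the function names and the sample coefficients are hypothetical, and the construction of the diffusion wavelet basis itself is not shown.

```python
def hard_threshold(coeffs, T):
    """Keep coefficients with |c| >= T and set the others to zero."""
    return [c if abs(c) >= T else 0.0 for c in coeffs]


def soft_threshold(coeffs, T):
    """Zero coefficients with |c| <= T and shrink the others towards zero by T."""
    out = []
    for c in coeffs:
        if c > T:
            out.append(c - T)
        elif c < -T:
            out.append(c + T)
        else:
            out.append(0.0)
    return out


# Hypothetical coefficients <f, psi_{j,l}>, flattened over all index pairs (j, l).
coeffs = [4.0, -0.5, 1.5, -3.0, 0.2]
print(hard_threshold(coeffs, 1.0))  # [4.0, 0.0, 1.5, -3.0, 0.0]
print(soft_threshold(coeffs, 1.0))  # [3.0, 0.0, 0.5, -2.0, 0.0]
```

Hard thresholding leaves the surviving coefficients unchanged, whereas soft thresholding also shrinks them, so both variants keep the same set of nonzero coefficients for a given T.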
5 Example
We visualize the algorithm by an example. Figure 1 indicates one possibility of how to build a weighted graph from an image sequence; for other possibilities see Section 4. Figure 2 a) shows an abstract graph which could be a (simplified) cutout from a graph built from an image sequence. In b) and c), we show the expected outcome for different thresholding parameters in an idealized way: the function is smoothed in regions where its values vary ‘smoothly’, while ‘jumps’ in the form of sudden changes in intensities and edge weights are preserved. The larger the threshold, the fewer diffusion wavelet coefficients constitute the approximation and the higher the rate of compression. In real image data, the ‘discontinuities’ or ‘jumps’ may correspond to object borders in single images or to the main direction of movement, depending on the information encoded in the weights. Experimenting with different choices of vertices, weights and thresholds on real data will be part of our future work.
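One way to build such a graph, with pixels as nodes and weights derived from gray-value differences, can be sketched as follows. This is only an illustration under stated assumptions: the function name, the cubic neighbourhood of a given radius, and the scale parameter sigma are hypothetical choices, and ρ is taken to be the scaled intensity difference.

```python
import math
from itertools import product


def build_weighted_graph(frames, radius=1, sigma=1.0):
    """Return {(u, v): w} with w(u, v) = exp(-rho(u, v)**2) for nearby pixels.

    frames[t][y][x] is a gray value; a vertex is a tuple (t, y, x).
    rho is the intensity difference divided by sigma (an assumed scale).
    Pairs outside the neighbourhood have rho = infinity, i.e. weight 0,
    so they are simply not stored.
    """
    T, H, W = len(frames), len(frames[0]), len(frames[0][0])
    offsets = [d for d in product(range(-radius, radius + 1), repeat=3)
               if d != (0, 0, 0)]
    weights = {}
    for t, y, x in product(range(T), range(H), range(W)):
        for dt, dy, dx in offsets:
            t2, y2, x2 = t + dt, y + dy, x + dx
            if 0 <= t2 < T and 0 <= y2 < H and 0 <= x2 < W:
                rho = (frames[t][y][x] - frames[t2][y2][x2]) / sigma
                weights[((t, y, x), (t2, y2, x2))] = math.exp(-rho ** 2)
    return weights


# Two 2x2 frames: a dark left column and a bright right column.
frames = [[[0.0, 1.0], [0.0, 1.0]],
          [[0.0, 1.0], [0.0, 1.0]]]
w = build_weighted_graph(frames, radius=1, sigma=0.5)
print(w[((0, 0, 0), (1, 0, 0))])  # same gray value in both frames: weight 1.0
print(w[((0, 0, 0), (0, 0, 1))])  # across the intensity jump: exp(-4)
```

Edges across the intensity jump receive a weight close to zero, which is the kind of local information from which the ‘jumps’ discussed above are expected to be preserved.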
Fig. 1. Graph formed from a sequence of images, taking pixels as nodes, with edges connecting nodes only within a certain neighborhood and weights corresponding to the difference of gray values
Fig. 2. Different thresholds on an abstract graph formed from the data
6 Conclusion and Further Work
In this article, we present an algorithm for nonlinear approximation using diffusion wavelets. This algorithm is derived from classical wavelet methods, now lifted to a graph-based setting. We explain how to apply the algorithm to spatiotemporal data, modelled as a function on a graph. The data will be smoothed by the approximation, while abrupt changes in the edge weights (which can describe object borders in single images or the main direction of movement) are preserved. In this sense, the approximation should be better able to preserve the intrinsic structure of the data. Right now, the implementation and testing of the algorithm, as well as its extension towards segmentation, are the focus of our work. The exact approximation properties also still have to be worked out in the graph setting, by studying properties of the diffusion wavelets in order to relate smoothness properties (analogous to Besov spaces) of the data in $L^2(G)$ to the decay of the approximation error. Structure-preserving compression on a graph can be seen as a first step towards structural spatiotemporal wavelet segmentation. In order to obtain a true
segmentation, i.e. providing the nodes of the graph with labels indicating the segment they belong to, we have to go a step further. A first idea is to use a Hidden Markov Model on the coefficient tree, classifying the coefficients as ‘large’ or ‘small’ and following them across resolution levels (see [18]). The approximation algorithm can be used for compression of data while preserving structural properties as encoded in the graph. The segmentation method we are working on could be used for combined spatial and motion segmentation of a whole image sequence, and thereby as an instance of an ‘offline’ tracking method similar to normalized cut tracking, involving a true multiresolution analysis of the data.
References
1. Coifman, R.R., Maggioni, M.: Diffusion wavelets. Appl. Comput. Harmon. Anal. 21(1), 53–94 (2006)
2. Coifman, R.R., Lafon, S., Lee, A.B., Maggioni, M., Nadler, B., Warner, F., Zucker, S.W.: Geometric diffusion as a tool for harmonic analysis and structure definition of data: Multiscale methods. In: Proceedings. vol. 102, PNAS, pp. 7432–7437 (2005)
3. DeVore, R.: Nonlinear Approximation. Acta Numerica, 51–150 (1998)
4. Donoho, D., Johnstone, I.: Minimax Estimation Via Wavelet Shrinkage. Ann. Stat. 26(3), 879–921 (1998)
5. Chambolle, A., DeVore, R.A., Lee, N.-Y., Lucier, B.J.: Nonlinear Wavelet Image Processing: Variational Problems, Compression, and Noise Removal Through Wavelet Shrinkage. IEEE Trans. Image Process 7, 319–335 (1998)
6. Qiu, H., Hancock, E.R.: Robust Multi-body Motion Tracking using Commute Time Clustering. In: ECCV, pp. 160–173 (2006)
7. Luo, B., Hancock, E.R.: Structural graph matching using the EM algorithm and singular value decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 23(10), 1120–1136 (2001)
8. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
9. Meila, M., Shi, J.: Learning Segmentation by Random Walks. In: NIPS, pp. 873–879 (2000)
10. Shi, J., Malik, J.: Motion Segmentation and Tracking Using Normalized Cuts. In: ICCV, pp. 1154–1160 (1998)
11. Szlam, A.D., Maggioni, M., Coifman, R.R.: A general framework for adaptive regularization based on diffusion processes on graphs (2006)
12. Lafon, S., Keller, Y., Coifman, R.R.: Data Fusion and Multi-Cue Data Matching by Diffusion Maps. IEEE Trans. Pattern An. and Mach. Int. 28-11, 1784–1797 (2006)
13. Chung, F.R.: Spectral Graph Theory. CMBS-AMS 92 (1997)
14. Coifman, R.R., Lafon, S.: Diffusion maps. Appl. and Comp. Harm. Anal. 21, 5–30 (2006)
15. Mallat, S.: Multiresolution Approximations and Wavelet Orthonormal Bases in $L^2(\mathbb{R})$. Trans. Amer. Math. Soc. 315, 69–87 (1989)
16. Führ, H., Wild, M.: Characterizing Wavelet Coefficient Decay of Discrete-Time Signals. Appl. and Comp. Harm. Anal. 20(2), 184–201 (2006)
17. Daubechies, I.: Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics (1992)
18. Choi, H., Baraniuk, R.G.: Multiscale image segmentation using wavelet-domain hidden Markov models. IEEE Trans. Signal Process 101(92) (September 2001)
A New Wavelet-Based Texture Descriptor for Image Retrieval
Esther de Ves1, Ana Ruedin2, Daniel Acevedo2, Xaro Benavent3, and Leticia Seijas2
1 Computer Science Department, University of Valencia
2 Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires
3 Robotics Institute, University of Valencia
{Esther.Deves,Xaro.Benavent}@uv.es
Abstract. This paper presents a novel texture descriptor based on the wavelet transform. First, we consider vertical and horizontal coefficients at the same position as the components of a bivariate random vector. The magnitude and angle of these vectors are computed and their histograms are analyzed. The empirical magnitude histogram is modelled by a gamma distribution (pdf). As a result, the feature extraction step consists of estimating the gamma parameters using the maximum likelihood estimator and computing the circular histograms of angles. The similarity measurement step is done by means of the well-known Kullback-Leibler divergence. Finally, retrieval experiments are done using the Brodatz texture collection, showing a good performance of this new texture descriptor. We compare two wavelet transforms, with and without downsampling, and show the advantage of the second one, which is translation invariant, for the construction of our texture descriptor.
Keywords: Texture descriptor, Wavelet Transform, Image retrieval.
1 Introduction
The increasing amount of information available in today’s world raises the need to retrieve relevant data efficiently. Unlike text-based retrieval, where key words are successfully used to index documents, content-based image retrieval poses up-front the fundamental questions of how to extract useful image features and how to use them for intuitive retrieval [11] [5]. Interest in content-based image retrieval (CBIR) systems has been growing in the last few years. This interest has been motivated by the increasing number of image databases which need effective and efficient techniques for retrieving multimedia information. Attempts have been made to develop general purpose image retrieval systems based on multiple features (e.g. color, shape and texture), which describe the image content [10]. The new visual information retrieval systems extract visual image features usually related to color, texture and shape from each image stored in the database, and use this representation to compare images by means of a similarity measure.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 895–902, 2007. © Springer-Verlag Berlin Heidelberg 2007
The features extracted from an image can be classified into low-level and high-level features; the latter are normally obtained by combining low-level features with a reasonable predefined model. The low-level features are obtained by preprocessing each image in the database. Among these characteristics we can mention those with chromatic information, those related to textures present in the image and those related to the shape of objects in the image. Generally speaking, a CBIR system is composed of two main modules: the feature extraction module and the similarity measurement module. In the latter, a distance between the query image and each image in the database is computed by making use of the extracted features. Each image obtains a relevance score with respect to the query image, which is used to rank the database. Texture analysis plays an important role in many image processing tasks. There are a great number of references on texture analysis and particularly texture classification. A central point in texture analysis is the definition of good features to characterize textures, which can be useful for content-based retrieval. Gabor features for texture classification followed by a linear discriminant analysis were used in [2]. Ayala proposes in [1] a new descriptor for binary and gray-scale textures based on defined spatial size distributions (SSD). In [7] a functional descriptor of the multivariate random closed set defined from the texture is proposed as a feature to describe grayscale textures. Randen presents an interesting comparative study of different filtering approaches, where the features used for texture classification are obtained from signal processing techniques like Laws and Gabor filters, wavelet transforms or the discrete cosine transform [12]. Wavelets have been successfully applied to different image processing applications.
When transformed, an image is represented as the sum of its details at different scales and orientations, plus a coarse approximation of the image. This naturally led to considering wavelets as a possible tool for texture classification. Some traditional approaches use the energy of the wavelet coefficients in each subband as texture descriptors, under the assumption that the energy distribution in the frequency domain identifies textures. A natural extension of this method consists in characterizing every texture by means of the marginal densities of the wavelet coefficients in every subband. This is justified by psychological studies that suggest that two homogeneous textures are difficult to distinguish if they produce similar marginal distributions as a response to a bank of filters [8]. A number of authors have observed that wavelet subband coefficients have highly non-Gaussian statistics [3], [14]. Numerous tests show that the normalized histograms of wavelet coefficients in each detail subband can be well approximated with a Generalized Gaussian Distribution (GGD). This model has been used with the Kullback-Leibler divergence as a similarity measure among distributions [6]. In this work, by applying a wavelet transform and considering at each scale the pair (horizontal detail, vertical detail) associated with each position, we extract information on the importance (contrast) and orientation of the edges in the image. We present a novel texture descriptor, used for content-based retrieval, based on the wavelet transform. The basic idea consists in modeling the
distributions of moduli of wavelet detail coefficients in each decomposition level, and computing their empirical angle histogram at each level (assuming vertical and horizontal coefficients at the same position are the components of a bivariate random vector). Our tests indicate that the gamma density function (pdf), described by two parameters, is a reasonable model for the histogram of the coefficient moduli. Thus, we propose to characterize each texture in the database by information extracted both from the moduli and the orientation of the coefficients. Section 2 introduces an illustrative example of our work, and explains our approach in detail. In Section 3 we present experimental results which evaluate the performance of our texture descriptors. Finally, Section 4 gives concluding remarks.
2 Modelling Coefficients Distribution of Wavelet Transform
2.1 Analyzing the Joint Histograms of Wavelet Coefficients
For our tests we have chosen the orthogonal Daubechies 4 wavelet. In the traditional wavelet transform ([4], [9]), coefficients are calculated via convolutions with 2 filters (lowpass and highpass). This is followed by downsampling operations, which prevent the wavelet transform from being invariant to translations. For comparison, in our tests we have included another wavelet transform that is translation invariant, proposed by Mallat [9]. It is calculated with the so-called à trous algorithm, via convolutions with 2 filters, but has no downsampling operations, so that it gives a redundant representation of an image. To capture lower frequencies at each step of the transform, the filters are upsampled. The Daubechies 4 filters are lowpass $[h_3\ h_2\ h_1\ h_0] = \frac{1}{4\sqrt{2}}\,[1 - e,\ 3 - e,\ 3 + e,\ 1 + e]$, with $e = \sqrt{3}$, and highpass $[-h_0\ \ h_1\ \ -h_2\ \ h_3]$. The filters for Mallat’s translation invariant transform correspond to a biorthogonal wavelet: lowpass $\sqrt{2}\,[a\ b\ b\ a]$ and highpass $[c\ \ -c]$, with a = 0.125, b = 0.375 and $c = \sqrt{2}/2$. Before proposing a new model for the joint histogram of wavelet coefficients at each detail level, we study this distribution for some simple images. The image in figure 1(a), chosen for the analysis, is a natural image from the Brodatz collection. It represents a texture with a privileged orientation. Both wavelet transforms mentioned above have been applied to this image up to three levels of decomposition. At each level, four subbands are obtained: the approximation subband and the horizontal, vertical and diagonal details. In our study, we shall only use the vertical and horizontal detail subbands. It is worth mentioning that for the traditional wavelet transform, the subbands of the first level are a fourth (in size) of the image; when there is no downsampling step, they have the same size as the original image.
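The filter definitions above can be checked numerically. The sketch below assumes the standard Daubechies 4 taps $h_0, \dots, h_3 = (1+e, 3+e, 3-e, 1-e)/(4\sqrt{2})$; the helper `atrous` and its level indexing (level 0 meaning the original filter) are hypothetical names and conventions introduced only for illustration.

```python
import math

e = math.sqrt(3.0)
norm = 4.0 * math.sqrt(2.0)

# Standard Daubechies 4 lowpass taps h0..h3 (assumed ordering).
h = [(1 + e) / norm, (3 + e) / norm, (3 - e) / norm, (1 - e) / norm]
d4_low = h[::-1]                      # [h3, h2, h1, h0]
d4_high = [-h[0], h[1], -h[2], h[3]]  # [-h0, h1, -h2, h3]

# Filters of the translation invariant (a trous) transform.
a, b, c = 0.125, 0.375, math.sqrt(2.0) / 2.0
mallat_low = [math.sqrt(2.0) * t for t in (a, b, b, a)]
mallat_high = [c, -c]


def atrous(filt, level):
    """Insert 2**level - 1 zeros between taps (upsampled filter for a level)."""
    out = []
    for i, tap in enumerate(filt):
        out.append(tap)
        if i < len(filt) - 1:
            out.extend([0.0] * (2 ** level - 1))
    return out


print(sum(h))                      # approximately sqrt(2)
print(sum(t * t for t in h))       # approximately 1 (unit norm)
print(h[0] * h[2] + h[1] * h[3])   # approximately 0 (orthogonal to its 2-shift)
print(atrous(mallat_high, 1))      # [c, 0.0, -c]
```

The unit norm and the vanishing shifted inner product are the orthonormality conditions of a four-tap orthogonal filter, and both highpass filters sum to zero, as expected for detail filters.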
In figure 1(a) we have an original texture, to which we have applied 3 levels of the biorthogonal wavelet transform without downsampling. In figure 1(b) are
[Figure 1: image (a) and plot rows (b)–(f), with columns labelled Level 1, Level 2 and Level 3.]
Fig. 1. Empirical distributions for image (a). First row (b): joint distribution of Mallat’s à trous wavelet coefficients; second row (c): joint distribution of Daubechies 4 detail coefficients; third row (d): magnitude histogram of Mallat’s à trous coefficients; fourth row (e): magnitude histogram of Daubechies 4 transform coefficients; (f): circular histogram of Mallat’s à trous coefficients.
plotted the joint empirical distributions of the detail coefficients for 3 levels, assuming that horizontal and vertical detail coefficients at the same position are the components of a bivariate random vector. From these joint distributions it can be inferred that a correlation exists between the horizontal and vertical wavelet subbands. This correlation is noticeable in the second and third levels of the transform, whereas the finer detail coefficients do not give much information: they are dominated by noise. The magnitude and angle of the random vector samples can be computed and analyzed. The empirical magnitude histogram (figure 1(d)) and the circular histogram (figure 1(f)) are shown for this wavelet. The circular histogram clearly indicates a privileged orientation of the edges in the texture. We have similar results for images with clearly oriented edges. The same study has been done applying the traditional Daubechies 4 wavelet transform to the same image. The joint empirical distributions of detail coefficients are shown in figure 1(c). It is evident that in this case there is no noticeable correlation between horizontal and vertical subbands. The histograms behave differently.
2.2 The Proposed Model for Wavelet-Coefficients Distributions
The previous section shows the importance of treating horizontal and vertical details jointly. However, it is very difficult to find a model that fits the empirical joint histogram reasonably well for any kind of texture. Some papers follow this approach, e.g. [13], where the distribution of the subband coefficients is modeled using a joint alpha-stable sub-Gaussian distribution. Our approach is different. We do want to make use of the existing relation between the vertical and horizontal coefficients at a certain position, but we also want a model which is simple to fit and capable of characterizing different types of textures. We therefore model the moduli histograms. Observe that for both wavelet transforms, the shape of the empirical distribution changes from level to level, in such a way that the peak of the histogram is shifted to the right. This behavior may be modelled by means of the gamma distribution (pdf). This distribution is defined by two parameters, k and θ:
$$f(x; k, \theta) = x^{k-1}\, \frac{e^{-x/\theta}}{\theta^{k}\, \Gamma(k)} \qquad (1)$$
where k > 0 is a shape parameter, and θ > 0 is related to the scale of the distribution (if θ is large, then the distribution will be more spread out). The gamma distribution is related to many other distributions. The chi-square and exponential distributions, which are special cases of the gamma distribution, are one-parameter distributions that fix one of the two gamma parameters. The idea is to fit a gamma distribution to our empirical moduli histograms using the maximum-likelihood estimator. A Kolmogorov-Smirnov goodness-of-fit test has been applied to the different samples in order to justify the use of this model, giving high p-values (larger than α = 0.05) for most of the samples
analyzed, for both wavelet transforms considered. It seems that the gamma distribution may be a reasonable model for the moduli of wavelet coefficients. As seen in the previous section, the circular histogram gives valuable information about privileged orientations in textures. Thus, these histograms can also be used to describe textures. An image with a privileged orientation will present a bimodal circular histogram whose two statistical modes are separated by 180 degrees, whereas a random texture, without privileged orientations, will present a uniform circular histogram. Thus, we propose to characterize each texture in the database by:
– Information from the moduli: parameters $k_n$, $\theta_n$ of the gamma pdf for n = 1 . . . 3, where n is the wavelet decomposition level.
– Information from the angles given by the empirical circular histogram. Angles are quantized by dividing the interval [0, 2π] into 40 bins. $P_n(r)$ is the observed frequency of the angles in the rth bin for level n.
Retrieval experiments in Section 3 have been performed using the gamma distribution parameters and the circular histograms.
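The two kinds of features listed above can be sketched for a single decomposition level as follows. The function names and the sample coefficients are hypothetical. Note also one substitution: instead of an exact maximum-likelihood fit, the sketch uses a well-known closed-form approximation to the maximum-likelihood shape parameter, which avoids the digamma function and is usually within a few percent of the exact estimate.

```python
import math


def modulus_and_angle(horizontal, vertical):
    """Modulus and angle in [0, 2*pi) of each pair (horizontal, vertical)."""
    moduli = [math.hypot(h, v) for h, v in zip(horizontal, vertical)]
    angles = [math.atan2(v, h) % (2 * math.pi)
              for h, v in zip(horizontal, vertical)]
    return moduli, angles


def fit_gamma(moduli):
    """Approximate maximum-likelihood estimate (k, theta) of a gamma pdf."""
    xs = [m for m in moduli if m > 0]  # the logarithm needs positive samples
    mean = sum(xs) / len(xs)
    s = math.log(mean) - sum(math.log(x) for x in xs) / len(xs)
    k = (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)
    return k, mean / k  # theta is chosen so that k * theta equals the mean


def circular_histogram(angles, bins=40):
    """Relative frequency of the angles in equal bins covering [0, 2*pi)."""
    counts = [0] * bins
    for ang in angles:
        counts[min(int(ang / (2 * math.pi) * bins), bins - 1)] += 1
    return [cnt / len(angles) for cnt in counts]


# Hypothetical detail coefficients of one level, flattened to lists.
horizontal = [3.0, -1.0, 0.5, 2.0]
vertical = [4.0, 1.0, -0.5, 0.0]
moduli, angles = modulus_and_angle(horizontal, vertical)
k, theta = fit_gamma(moduli)
hist = circular_histogram(angles)
print(moduli[0])  # 5.0
print(k, theta)
```

The approximation requires samples that are not all identical (otherwise s = 0); in practice each level contributes one pair (k, θ) and one 40-bin histogram to the descriptor.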
3 Experimental Results
The objective of this section is to test the new wavelet-based texture descriptor. Images from the well-known Brodatz image database have been used for the experiments; this image database is composed of 105 images representing different kinds of textures. Thirty-six images from this collection were selected, and each original image of size 512 × 512 was partitioned into sixteen 128 × 128 subimages by randomly choosing the left corner, thus creating a test database of N = 36 × 16 images. The texture descriptor was computed for each image in the test database. In our experiments, any image in our test database served as a simulated query image. The relevant images for each query were defined as the other 15 subimages from the same original Brodatz image. We evaluated the performance in terms of the average rate of retrieval of relevant images. We have done the same experiment three times: with the Daubechies 4 filters and the traditional wavelet transform, with the Daubechies 4 filters without downsampling, and with Mallat’s biorthogonal wavelet without downsampling. The Kullback-Leibler divergence (KLD), commonly used to compare 2 distributions, was the chosen similarity measure between the query image $I_q$ and each image in the database $\{I_j,\ j = 1 \dots N\}$: we retrieved the top 16 closest images to the query, and counted how many corresponded to the same texture as the query image. The score used to measure the similarity between images $I_j$ and $I_q$ was computed as $S(j, q) = S_1(j, q) + S_2(j, q)$, where
$$S_i(j, q) = \sum_{n=1}^{3} S_i(j, q, n), \qquad i = 1, 2, \qquad (2)$$
$$S_1(j, q, n) = \int f(x; k_n^j, \theta_n^j)\, \log \frac{f(x; k_n^j, \theta_n^j)}{f(x; k_n^q, \theta_n^q)}\, dx, \qquad S_2(j, q, n) = \sum_r P_n^j(r)\, \log \frac{P_n^j(r)}{P_n^q(r)}. \qquad (3)$$
The term $S_1(j, q, n)$ is the KLD (or relative entropy) of the moduli pdf at level n for images j and q. The term $S_2(j, q, n)$ is the KLD of the empirical distribution of the quantized angles at level n for images j and q. The final score was the sum of both scores (modulus and angle KLD scores) at the first 3 levels.
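The score of Eqs. (2) and (3) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the experiments: the function names are hypothetical, the integral in $S_1$ is approximated by a simple midpoint rule on a truncated interval, and a small constant eps is added to the histogram bins to guard against empty bins (the text does not say how these are handled).

```python
import math


def gamma_logpdf(x, k, theta):
    """Logarithm of the gamma pdf of Eq. (1)."""
    return ((k - 1) * math.log(x) - x / theta
            - k * math.log(theta) - math.lgamma(k))


def kld_gamma(k1, t1, k2, t2, steps=20000):
    """Midpoint-rule estimate of the KLD between two gamma pdfs (S1 term)."""
    upper = k1 * t1 + 20 * math.sqrt(k1) * t1  # mean + 20 std of the first pdf
    dx = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        lp = gamma_logpdf(x, k1, t1)
        total += math.exp(lp) * (lp - gamma_logpdf(x, k2, t2)) * dx
    return total


def kld_hist(p, q, eps=1e-12):
    """KLD between two normalized histograms (S2 term); eps guards empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def score(feat_j, feat_q):
    """S(j, q): sum over levels of S1 + S2; feat is a list of (k, theta, hist)."""
    return sum(kld_gamma(kj, tj, kq, tq) + kld_hist(pj, pq)
               for (kj, tj, pj), (kq, tq, pq) in zip(feat_j, feat_q))


# For k = 1 the gamma pdf is exponential and the KLD has the closed form
# log(t2 / t1) + t1 / t2 - 1, which gives a check of the numerical integral.
print(kld_gamma(1.0, 1.0, 1.0, 2.0), math.log(2.0) - 0.5)
```

The midpoint rule is adequate for shape parameters k ≥ 1; for k < 1 the pdf is unbounded at the origin and a finer grid or a closed-form expression would be preferable.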
Table 1. Average retrieval rate in the top 16

Wavelet Transform                              Extracted Features
Wavelet filters          Algorithm             S1      S1 + S2
Daubechies 4             with downsampling     0.75    0.84
Daubechies 4             no downsampling       0.75    0.89
Mallat’s biorthogonal    no downsampling       0.79    0.89
Table 1 shows the percentage of relevant images retrieved in the top 16 matches. The maximum and minimum rates were 89% and 75% respectively. The main observations are that both the downsampling operation of the wavelet transform and the features used have a significant effect on retrieval performance. Omitting the downsampling steps in the wavelet transform appears to improve the results when the angle information is used (0.84 versus 0.89), because the resulting transform is translation invariant. Our results in the first column of the table, which do not take into account the angle information and are based only on the moduli information, are similar to the ones obtained by modelling the marginal histograms (vertical and horizontal independently) with the GGD distribution. Moreover, our proposed descriptor needs only half the number of parameters. The inclusion of the orientation information in the features improves the average retrieval rate by roughly 10 percentage points (between 9 and 14 points in Table 1), regardless of the wavelet considered and of the presence or absence of the downsampling operation. This means that the orientation information is a valuable feature for discriminating textures.
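The evaluation measure reported in Table 1 can be sketched as follows, assuming a precomputed matrix of scores and one texture label per subimage. The function name, the matrix layout `scores[q][j]` and the toy data are hypothetical; as in the description above, the query itself is part of the database and therefore counts among the retrieved images.

```python
def average_retrieval_rate(scores, labels, top=16):
    """Mean fraction of same-texture images among the `top` closest images.

    scores[q][j] is the dissimilarity S(j, q) between database image j and
    query q (smaller means more similar); labels[j] identifies the original
    texture from which subimage j was cut.
    """
    total = 0.0
    for q, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda j: row[j])[:top]
        total += sum(labels[j] == labels[q] for j in ranked) / top
    return total / len(scores)


# Toy example: 4 subimages from 2 textures, retrieving the top 2 matches.
labels = [0, 0, 1, 1]
scores = [[0, 3, 1, 6],
          [1, 0, 5, 6],
          [5, 6, 0, 1],
          [6, 5, 1, 0]]
print(average_retrieval_rate(scores, labels, top=2))  # 0.875
```

In the toy example only the first query retrieves an image of the wrong texture, so three queries score 1.0 and one scores 0.5.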
4 Conclusions
We have presented a novel wavelet-based texture descriptor for visual information retrieval. The basic idea for characterizing each texture in the image database is to use the information from the moduli and orientations of wavelet coefficients, assuming that vertical and horizontal coefficients at the same position are the components of a bivariate random vector. The Kullback-Leibler divergence has been chosen as the measure of similarity. The good performance of this new descriptor is shown by retrieval experiments using the Brodatz database.
Acknowledgement. This work has been partially supported by grants GV04177 (for research stays), MCYT TIN2006-10134, UBACYT X166 and BID 1728/OC-AR-PICT 26001.
References
1. Ayala, G., Domingo, J.: Spatial size distributions: Applications to shape and texture analysis. IEEE Trans Image Processing 23, 1430–1442 (2001)
2. Azencott, R., Wang, J.P., Younes, L.: Texture classification using windowed Fourier filters. IEEE Trans Pattern Analysis and Machine Intelligence 19(3), 148–153 (1997)
3. Buccigrossi, R., Simoncelli, E.: Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Signal Proc 8, 1688–1701 (1999)
4. Daubechies, I.: Ten lectures on wavelets. Society for Industrial and Applied Mathematics (1992)
5. Del Bimbo, A.: Visual Information Retrieval. Morgan Kaufmann, San Francisco (1999)
6. Do, M.N., Vetterli, M.: Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Trans Image Processing 11, 2 (2002)
7. Epifanio, I., Ayala, G.: A random set view of texture classification. IEEE Transactions on Image Processing 11, 859–867 (2002)
8. Heeger, D., Bergen, J.R.: Pyramid-based texture analysis/synthesis. In: Proc. ACM SIGGRAPH (1995)
9. Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press, London (1999)
10. Pentland, A., Picard, R.W., Sclaroff, S.: Photobook: Content-based manipulation of image databases. International Journal of Computer Vision 18(3), 233–254 (1996)
11. Smeulders, A.W.M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1349–1379 (2000)
12. Husoy, J., Randen, T.: Filtering for texture classification: A comparative study. IEEE Trans Pattern Analysis and Machine Intelligence 20, 115–122 (1999)
13. Tzagkarakis, G., Beferull-Lozano, B., Tsakalides, P.: Rotation-invariant texture retrieval with gaussianized steerable pyramids. IEEE Trans Image Processing 15 (2006)
14. Wouwer, G.V., Scheunders, P., Dyck, D.V.: Statistical texture characterization from discrete wavelet representation. IEEE Trans Image Processing 8, 592–598 (1999)
Space-Variant Restoration with Sliding Discrete Cosine Transform
Vitaly Kober and Jacobo Gomez Agis
Department of Computer Science, CICESE, Ensenada, B.C., Mexico
[email protected]
Abstract. A local adaptive restoration technique using a sliding discrete cosine transform (DCT) is presented. A minimum mean-square error estimator in the domain of a sliding DCT for image restoration is derived. The local restoration is performed by pointwise modification of local DCT coefficients. To provide image processing in real time, a fast recursive algorithm for computing the sliding DCT is utilized. The algorithm is based on a recursive relationship between three subsequent local DCT spectra. Computer simulation results using a real image are provided and compared with that of common restoration techniques.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 903–911, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Introduction
Many different restoration techniques (linear, nonlinear, iterative, noniterative, deterministic, stochastic, etc.) optimized with respect to different criteria have been introduced [1-7]. The techniques may be broadly divided into two classes: (i) fundamental algorithms and (ii) specialized algorithms. One of the most popular fundamental techniques is the linear minimum mean square error (LMMSE) method. It finds the linear estimate of the ideal image for which the mean square error between the estimate and the ideal image is minimum. The linear operator acting on the observed image to determine the estimate is obtained on the basis of a priori second order statistical information about the image and noise processes. In the case of stationary processes and space-invariant blurs, the LMMSE estimator takes the form of the Wiener filter. A Kalman filter determines the causal LMMSE estimate recursively. It is based on a state-space representation of the imaging system, and image data are used to define the state vectors. Specialized algorithms can be viewed as extensions of the fundamental algorithms to specific restoration problems. In this paper we deal with restoration of images degraded by space-variant blurs. Basically, all fundamental algorithms apply to the restoration of images degraded by space-variant blurs. However, because Fourier transforms cannot be utilized when the blur is space-variant, space-domain implementations of these algorithms may be computationally formidable due to large matrix operations. Several specialized methods were developed to attack the space-variant restoration problem. The first class, referred to as sectioning, is based on the assumption that the blur is approximately space-invariant within local regions of the image. Therefore, the entire image can be restored by applying well-known space-invariant techniques to the local image regions. A drawback of sectioning methods is
the generation of artifacts at the region boundaries. The second class is based on a coordinate transformation [7], which is applied to the observed image so that the blur in the transformed coordinates becomes space-invariant. Therefore, the transformed image can be restored by a space-invariant filter and then transformed back to obtain the final restored image. However, the statistical properties of the image and noise processes are affected by the coordinate transformation. In particular, the stationarity in the original spatial coordinates is not preserved in the transformed coordinate system. In this paper, we carry out the space-variant restoration using sliding discrete cosine transform (DCT) coefficients. The sliding DCT is based on the concept of short-time signal processing [8]. The short-time orthogonal transform of a signal $x_k$ is defined as
$$X_s^k = \sum_{n=-\infty}^{\infty} x_{k+n}\, w_n\, \psi(n, s), \qquad (1)$$
where $w_n$ is a window sequence and $\psi(n,s)$ represents the basis functions of an orthogonal transform. We use one-dimensional notation for simplicity. Equation (1) can be interpreted as the orthogonal transform of $x_{k+n}$ as viewed through the window $w_n$; $X_s^k$ displays the orthogonal transform characteristics of the signal around time $k$. Note that while increased window length and resolution are typically beneficial in the spectral analysis of stationary data, for time-varying data it is preferable to keep the window length sufficiently short so that the signal is approximately stationary over the window duration. Assume that the window has finite length around $n=0$ and is unity for all $n \in [-N_1, N_2]$, where $N_1$ and $N_2$ are integers. This leads to signal processing in a sliding window. In other words, at each position of the moving window a local filter modifies the orthogonal transform coefficients of the signal in order to obtain an estimate of only the pixel $x_k$ of the window.

The choice of orthogonal transform for sliding signal processing depends on many factors. The DCT is one of the most appropriate transforms with respect to the accuracy of power spectrum estimation from the observed data (which is required for local filtering), the filter design, and the computational complexity of the filter implementation. Linear filtering in the DCT domain followed by the inverse transform is superior to filtering in the domain of the discrete Fourier transform (DFT) because a DCT can be considered as the DFT of a signal evenly extended outside its edges. This attenuates the boundary effects caused by circular convolution that are typical of linear filtering in the DFT domain.

The presentation is organized as follows. In Section 2, we review recursive algorithms for computing the sliding forward and inverse DCTs. In Section 3, a local adaptive filter that is optimal with respect to the minimum mean-square error criterion defined in the domain of the sliding DCT is derived. In Section 4, we test the performance of the filter in restoring a real aerial image degraded by nonuniform motion blur. Section 5 summarizes our conclusions.
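The remark that a DCT can be viewed as the DFT of an evenly extended signal can be checked numerically. The following is a minimal sketch, assuming the unnormalized cosine kernel cos(π(m + 1/2)s/N) and an arbitrary test fragment; the function names are illustrative and not part of the paper.

```python
import cmath
import math

def dct_direct(x):
    """Unnormalized DCT-II: X_s = sum_m x_m cos(pi (m + 1/2) s / N)."""
    n = len(x)
    return [sum(x[m] * math.cos(math.pi * (m + 0.5) * s / n) for m in range(n))
            for s in range(n)]

def dct_via_even_extension(x):
    """Same coefficients obtained from the 2N-point DFT of the mirrored signal."""
    n = len(x)
    y = list(x) + list(reversed(x))  # even extension of length 2N
    out = []
    for s in range(n):
        dft = sum(y[t] * cmath.exp(-2j * math.pi * s * t / (2 * n))
                  for t in range(2 * n))
        # A half-sample phase shift makes the result real; 1/2 undoes the doubling.
        out.append(0.5 * (cmath.exp(-1j * math.pi * s / (2 * n)) * dft).real)
    return out

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]  # arbitrary test fragment (assumption)
max_diff = max(abs(a - b)
               for a, b in zip(dct_direct(x), dct_via_even_extension(x)))
print(max_diff)
```

Because the mirrored signal has no jump at its ends, the implicit periodic extension of the DFT does not introduce the edge discontinuity that causes boundary artifacts.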
2 Fast Forward and Inverse Algorithms of Sliding DCT
The discrete cosine transform is widely used in many signal processing applications. This is because the DCT performs close to the optimum Karhunen-Loeve transform
for first-order Markov stationary data when the correlation coefficient is near 0.9 [1]. Recently, fast forward and inverse algorithms for computing sliding DCTs were proposed [9]. The sliding cosine transform (SCT) is defined as
$$X_s^k = \sum_{n=-N_1}^{N_2} x_{k+n} \cos\left(\pi \frac{(n + N_1 + 1/2)\, s}{N}\right), \qquad (2)$$
where $N = N_1 + N_2 + 1$ and $\{X_s^k;\ s = 0, 1, \ldots, N-1\}$ are the transform coefficients around time $k$. The coefficients of the DCT can be obtained as $\{C_0^k = X_0^k/\sqrt{2};\ C_s^k = X_s^k,\ s = 1, \ldots, N-1\}$. The SCT based on a recursive relationship between three subsequent local DCT spectra is given by
$$X_s^{k+1} = 2 X_s^k \cos\left(\frac{\pi s}{N}\right) - X_s^{k-1} + \cos\left(\frac{\pi s}{2N}\right)\left(x_{k-N_1-1} - x_{k-N_1} + (-1)^s \left(x_{k+N_2+1} - x_{k+N_2}\right)\right). \qquad (3)$$
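The recursion (3) can be checked against the direct definition (2). The sketch below is only an illustration under stated assumptions: the test signal, the window half-lengths and the function names are hypothetical, and the two starting spectra are computed directly rather than by a dedicated initialization.

```python
import math

def sliding_dct_direct(x, k, n1, n2):
    """Coefficients X_s^k of (2), computed directly from the definition."""
    n = n1 + n2 + 1
    return [sum(x[k + j] * math.cos(math.pi * (j + n1 + 0.5) * s / n)
                for j in range(-n1, n2 + 1))
            for s in range(n)]

def sliding_dct_step(x, k, n1, n2, cur, prev):
    """Coefficients at position k+1 from those at k (cur) and k-1 (prev), as in (3)."""
    n = n1 + n2 + 1
    nxt = []
    for s in range(n):
        # Contribution of the samples entering and leaving the window.
        edge = (x[k - n1 - 1] - x[k - n1]
                + (-1) ** s * (x[k + n2 + 1] - x[k + n2]))
        nxt.append(2.0 * cur[s] * math.cos(math.pi * s / n) - prev[s]
                   + math.cos(math.pi * s / (2 * n)) * edge)
    return nxt

# Hypothetical test signal and window; any real sequence works.
x = [math.sin(0.3 * i) + 0.1 * i for i in range(40)]
n1, n2, k = 3, 3, 10
prev = sliding_dct_direct(x, k - 1, n1, n2)
cur = sliding_dct_direct(x, k, n1, n2)
nxt = sliding_dct_step(x, k, n1, n2, cur, prev)
ref = sliding_dct_direct(x, k + 1, n1, n2)
max_err = max(abs(a - b) for a, b in zip(nxt, ref))
print(max_err)
```

Each step costs a fixed number of operations per coefficient, independent of how the window length factors, which is why the window length can be an arbitrary integer.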
We see that the computation of the DCT at the window position $k+1$ involves values of the input sequence $x_k$ as well as the DCT coefficients computed at the two previous positions of the moving window.

Table 1. Number of arithmetic operations for computing the sliding DCT

Algorithm             Number of additions   Number of multiplications
Fast DCTs [10]        3MN/2 - N + 1         MN/2 + 1
Recursive algorithm   2N + 5                2N - 1
Table 1 compares the computational complexity of the recursive algorithm with that of fast DCT algorithms. The length of the moving window for the recursive algorithm is an arbitrary integer determined by the characteristics of the signal to be processed. In contrast, fast DCT algorithms require the length to be a power of 2, $N = 2^M$. If $x_k$ is the central pixel of the window, that is, $N_1 = N_2$ and $N = 2N_1 + 1$, then the inverse transform is written as
$$x_k = \frac{1}{N}\left(2\sum_{s=1}^{N_1} (-1)^s X_{2s}^k + X_0^k\right). \qquad (4)$$
We note that in the computation only the spectral coefficients with even indices are involved. The computation requires one multiplication and N1+1 additions.
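As a small sanity check of (4), the central sample of a window can be recovered from the even-indexed coefficients alone. This is a sketch under the assumption that the coefficients follow definition (2) with $N_1 = N_2$; the window values are arbitrary.

```python
import math

def central_pixel_from_sct(coeffs, n1):
    """Equation (4): recover x_k from X_0, X_2, ..., X_{2*n1}."""
    n = 2 * n1 + 1
    acc = coeffs[0]
    for s in range(1, n1 + 1):
        acc += 2.0 * (-1) ** s * coeffs[2 * s]
    return acc / n

# Arbitrary window x_{k-2}, ..., x_{k+2}; the central pixel is 2.5.
window = [4.0, -1.0, 2.5, 7.0, 0.5]
n1 = 2
n = len(window)
coeffs = [sum(window[m] * math.cos(math.pi * (m + 0.5) * s / n) for m in range(n))
          for s in range(n)]
recovered = central_pixel_from_sct(coeffs, n1)
print(recovered)
```

The odd-indexed coefficients drop out because the cosine kernel evaluated at the window centre equals cos(πs/2), which vanishes for odd s.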
3 Signal Restoration in the Domain of Sliding DCT
First we define a local criterion of the performance of filters for image and signal processing, and then we derive optimal local adaptive filters with respect to this criterion. One of the most widely used criteria in signal processing is the minimum mean-square error (MMSE). Since the processing is carried out in a moving window, for each
position of the moving window an estimate of the central element of the window is computed. Suppose that the signal to be processed is approximately stationary within the window. The signal may be distorted by sensor noise. Let us consider generalized linear filtering of a fragment of a one-dimensional input signal (for instance, for a fixed position of the moving window). Let $a = [a_k]$ be the undistorted real signal, $x = [x_k]$ be the observed signal, $k = 1, \ldots, N$, $N$ be the size of the fragment, $U$ be the matrix of the discrete cosine transform, $E\{\cdot\}$ be the expected value, and the superscript $T$ denote the transpose. Let $\hat{a} = Hx$ be a linear estimate of the undistorted signal that minimizes the mean-square error averaged over the window,
$$\mathrm{MMSE} = E\left\{(a - \hat{a})^T (a - \hat{a})\right\} / N. \qquad (5)$$
The optimal filter for this problem is the Wiener filter [1]:
$$H = E\{a x^T\} \left[E\{x x^T\}\right]^{-1}. \qquad (6)$$
Let us consider the known signal model
$$x_k = \sum_n w_{k,n} a_n + v_k, \qquad (7)$$
where $W = [w_{k,n}]$ is a distortion matrix, $\nu = [v_k]$ is additive zero-mean noise, $k, n = 1, \ldots, N$, and $N$ is the size of the fragment. The equation can be rewritten as
$$x = Wa + \nu, \qquad (8)$$
and the optimal filter is given by
$$H = K_{aa} W^T \left[W K_{aa} W^T + K_{\nu\nu}\right]^{-1}, \qquad (9)$$
where $K_{aa} = E\{aa^T\}$ and $K_{\nu\nu} = E\{\nu\nu^T\}$ are the covariance matrices, and $E\{a\nu^T\} = 0$, that is, the input signal and noise are assumed to be uncorrelated. The obtained optimal filter is based on the assumption that the input signal within the window is stationary. The result of filtering is the restored window signal. This corresponds to signal processing in nonoverlapping fragments.

Now suppose that the signal is processed in a moving window in the domain of the sliding DCT. For each position of the window an estimate of the central pixel should be computed. Using the equation for the inverse sliding DCT presented in the previous section, the pointwise MSE (PMSE) for reconstruction of the central element of the window can be written as
$$\mathrm{PMSE}(k) = E\left\{\left[a(k) - \hat{a}(k)\right]^2\right\} = E\left\{\left[\sum_{l=1}^{N} \alpha(l)\left(A(l) - \hat{A}(l)\right)\right]^2\right\}, \qquad (10)$$
where $\hat{A} = [\hat{A}(l) = H(l) X(l)]$ is the vector of the signal estimate in the DCT domain, $H_U = [H(l)]$ is the diagonal matrix of the scalar filter, and $\alpha = [\alpha(l)]$ is the diagonal matrix of the coefficients of the inverse sliding cosine transform (4). Minimizing (10), we obtain
$$H_U = [P_{xx}]^{-1} P_{ax} I_\alpha, \qquad (11)$$
where $P_{ax} = [E\{A(l) X(k)\}]$, $P_{xx} = [E\{X(l) X(k)\}]$, and $I_\alpha$ is the identity matrix of the dimension of $\alpha$. Note that the matrix of coefficients $\alpha = [\alpha(l)]$ for the inverse sliding transform (4) is singular. Because the inverse sliding cosine transform (4) involves only the coefficients with even indices, the dimension of the matrix is about half the size of the window signal. Therefore, the computational complexity of the scalar filters in (11), and of the signal processing, can be significantly reduced compared with that of the filter in (6). For the model of signal distortion in (8), the filter matrix is given by
$$H_U = \left[U\left(W K_{aa} W^T + K_{\nu\nu}\right) U^T\right]^{-1} U K_{aa} W^T U^T I_\alpha. \qquad (12)$$
If a signal has a high correlation coefficient and a smoothed version of the signal is corrupted by additive, weakly correlated noise, then the matrix $U(W K_{aa} W^T + K_{\nu\nu})U^T$ in (12) is close to diagonal. Figure 1 shows the covariance matrix of a smoothed, noisy, one-dimensional signal with a correlation coefficient of 0.95, as well as the discrete cosine transform of the covariance matrix. The linear convolution between a signal $x$ and the matrix $K_{aa} W^T$ in the domain of the sliding DCT can be well approximated by a diagonal matrix, $\mathrm{Diag}(U K_{aa} W^T U^T I_\alpha) X$. Therefore, the matrix of the scalar filter in (12) is close to diagonal, and the filter can be written as
$$H(l) \approx \frac{P_1(l)}{P_2(l) + P_{\nu\nu}(l)}, \qquad (13)$$
where $P_1(l)$, $P_2(l)$, and $P_{\nu\nu}(l)$ are the diagonal elements of the matrices $U K_{aa} W^T U^T I_\alpha$, $U W K_{aa} W^T U^T$, and $U K_{\nu\nu} U^T$, respectively; $l = 1, \ldots, N_1$, and $N_1$ is the dimension of the matrix $I_\alpha$.
Fig. 1. (a) Covariance matrix of a noisy signal, (b) DCT of the covariance matrix
For the design of local adaptive filters in the domain of a sliding DCT the covariance matrices and power spectra of fragments of a signal are required. Since they are often unknown, in practice, these matrices can be estimated from observed signals [3-6].
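The near-diagonality claim behind (13) can be illustrated numerically. The following is a rough sketch, not the authors' experiment: it assumes a first-order Markov covariance with correlation coefficient 0.95, a hypothetical 3-tap averaging matrix W with renormalized border rows, white noise with standard deviation 0.05, and an orthonormal DCT matrix U. It measures how much of the squared matrix energy lies on the diagonal before and after the transform.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct_matrix(n):
    """Orthonormal DCT-II matrix U; row s is the s-th basis vector."""
    rows = []
    for s in range(n):
        scale = math.sqrt(1.0 / n) if s == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (m + 0.5) * s / n)
                     for m in range(n)])
    return rows

def diagonal_energy_fraction(c):
    """Share of the squared Frobenius norm carried by the diagonal."""
    total = sum(v * v for row in c for v in row)
    diag = sum(c[i][i] ** 2 for i in range(len(c)))
    return diag / total

N, RHO, SIGMA = 15, 0.95, 0.05  # assumed window size, correlation, noise std
k_aa = [[RHO ** abs(i - j) for j in range(N)] for i in range(N)]

# Hypothetical distortion: 3-tap moving average, rows renormalized at the borders.
w = [[0.0] * N for _ in range(N)]
for i in range(N):
    cols = [j for j in (i - 1, i, i + 1) if 0 <= j < N]
    for j in cols:
        w[i][j] = 1.0 / len(cols)

# Covariance of the observed fragment: W K_aa W^T + K_vv with white noise.
c = matmul(matmul(w, k_aa), transpose(w))
for i in range(N):
    c[i][i] += SIGMA ** 2

u = dct_matrix(N)
c_dct = matmul(matmul(u, c), transpose(u))
frac_signal = diagonal_energy_fraction(c)
frac_dct = diagonal_energy_fraction(c_dct)
print(frac_signal, frac_dct)
```

In the signal domain most of the energy sits off the diagonal, whereas after the DCT it concentrates on the diagonal, which is what justifies replacing the matrix filter (12) by the scalar filter (13).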
4 Computer Simulation Results
The objective of this section is to develop a technique for local adaptive restoration of images degraded by nonuniform motion blur. Assume that the blur is due to horizontal relative motion between the camera and the image, and that it is approximately space-invariant within local regions of the image. It is known that point spread functions for motion and focus blurs have zeros in the frequency domain, and that they can be uniquely identified by the location of these zero crossings [4]. We also assume that the observation noise is a zero-mean, white Gaussian process that is uncorrelated with the image signal. In this case, the noise field is completely characterized by its variance, which is commonly estimated by the sample variance computed over a low-contrast local region of the observed image.

Next, we design local adaptive filters based on a sliding DCT to restore the image. Let $\{X^k(l), \hat{X}^k(l), H_w^k(l),\ l = 1, \ldots, N\}$ be the DCT coefficients around time $k$ of the observed signal, the filtered signal, and the linear motion degradation, respectively. Here $N = 2N_1 + 1$ is the length of the DCT. Note that $N_1$ is an arbitrary integer, which is determined by the minimal size of the details to be preserved after filtering. We use the PMSE criterion around time $k$, which is defined in the domain of the sliding DCT. Taking into account (13) and (4), the estimate of the reconstructed image signal can be written as
$$\hat{x}_k = \frac{1}{N}\left(2\sum_{s=1}^{N_1} (-1)^s \hat{X}_{2s}^k + \hat{X}_0^k\right), \qquad (14)$$
where
$$\hat{X}^k(l) = \begin{cases} \dfrac{P_{xx}^k(l) - P_{\nu\nu}^k(l)}{P_{xx}^k(l)\, H_w^k(l)}\, X^k(l), & P_{xx}^k(l) > P_{\nu\nu}^k(l) + B^k \text{ and } H_w^k(l) \neq 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (15)$$
and $P_{xx}^k(l)$, $P_{\nu\nu}^k(l)$, $l = 1, \ldots, N_1$, are estimates of the power spectra of the observed signal and noise in the domain of the sliding DCT, respectively, and $B^k$ is a small signal-dependent bias value. The designed filter can be considered a modified spectral subtraction method in the domain of the sliding DCT. In general, spectral subtraction methods, while reducing wide-band noise, introduce new narrow-band noise due to the presence of remaining spectral peaks. To attenuate the remaining noise, one can oversubtract the power spectrum of the noise by introducing a nonzero power spectrum bias $B^k$.

A real test aerial image is shown in Fig. 2(a). The image size is 512×512 pixels, and each pixel has 256 quantization levels. The signal range is [0, 1]. The image quadrants are degraded by sliding 1D horizontal averaging with the following sizes of the moving window: 5, 6, 4, and 3 pixels (for quadrants from left to right, from top to bottom). The image is also corrupted by zero-mean additive white Gaussian noise. The degraded image with a noise standard deviation of 0.05 is shown in Fig. 2(b). In our tests a window of 15×15 pixels is used. Since the spectral distributions of the image signal and the wide-band noise differ, the power spectrum of the noise can be easily measured from the experimental covariance matrix.
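The coefficient-wise rule (15) can be sketched as follows. This is only an illustration: the function name and the numerical values are hypothetical, and the power spectrum estimates and the DCT coefficients of the degradation are assumed to be given for one window position.

```python
def restore_coefficients(x_dct, p_xx, p_nn, h_w, bias):
    """Apply rule (15) to each DCT coefficient of one window position."""
    restored = []
    for x_l, pxx_l, pnn_l, hw_l in zip(x_dct, p_xx, p_nn, h_w):
        if pxx_l > pnn_l + bias and hw_l != 0.0:
            # Spectral subtraction combined with inversion of the degradation.
            restored.append((pxx_l - pnn_l) / (pxx_l * hw_l) * x_l)
        else:
            # Coefficient dominated by noise, or degradation has a zero here.
            restored.append(0.0)
    return restored

# Hypothetical values for three coefficients of one window.
x_dct = [2.0, 0.4, 1.0]
p_xx = [4.0, 0.16, 1.0]
p_nn = [1.0, 0.15, 0.2]
h_w = [0.5, 0.9, 0.0]
restored = restore_coefficients(x_dct, p_xx, p_nn, h_w, bias=0.05)
print(restored)
```

Here the second coefficient is zeroed only because of the bias: without it the condition would hold, which is how the oversubtraction suppresses weak residual spectral peaks.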
Fig. 2. (a) Test image, (b) space-variant degraded test image, (c) global Wiener restoration, (d) local adaptive restoration in the domain of the sliding DCT, (e) difference between the original image and the image restored by the global Wiener algorithm, (f) difference between the original image and the image restored by the proposed algorithm
The results of image restoration by global parametric Wiener filtering [2] and by the proposed method are shown in Figs. 2(c) and 2(d), respectively. Figures 2(e) and 2(f) show the differences between the original image and the images restored by the global Wiener algorithm and by the proposed algorithm, respectively. We see that the proposed algorithm is capable of performing good space-variant image restoration and noise suppression. Finally, we investigate the robustness of the tested restoration techniques to additive noise. The performance of the global parametric Wiener filtering and of the local adaptive filtering is shown in Fig. 3.

Fig. 3. Performance of the restoration algorithms (global restoration and local adaptive restoration) in terms of MSE versus the standard deviation of additive noise
5 Conclusions
In this paper, we presented a new technique for space-variant restoration of linearly degraded and noisy images. The MMSE estimator in the domain of the sliding DCT was derived. To provide image processing at a high rate, a fast recursive algorithm for computing the sliding DCT was utilized. Extensive testing with various degradation parameters (nonuniform motion blurring and corruption by noise) has shown that the original image can be well restored by a proper choice of the algorithm parameters.
References
1. Jain, A.K.: Fundamentals of digital image processing. Prentice Hall, Englewood Cliffs (1989)
2. Gonzalez, R.C., Woods, R.E.: Digital image processing. Prentice Hall, Englewood Cliffs (2002)
3. Pratt, W.K.: Digital image processing. Wiley-Interscience Publication, New York (1991)
4. Biemond, J., Lagendijk, R.L., Mersereau, R.M.: Iterative methods for image deblurring. Proceedings of the IEEE 78(5), 856–883 (1990)
5. Flusser, J., Suk, T., Saic, S.: Recognition of blurred images by the method of moments. IEEE Trans. on Image Processing 5(3), 533–538 (1996)
6. Bertero, M., Boccacci, P.: Introduction to inverse problems in imaging. Institute of Physics Publishing, Bristol and Philadelphia (1998)
7. Sawchuk, A.A.: Space-variant image restoration by coordinate transformations. J. Opt. Soc. Am. 64(2), 138–144 (1974)
8. Oppenheim, A.V., Schafer, R.W.: Discrete-time signal processing. Prentice Hall, Englewood Cliffs (1989)
9. Kober, V.: Fast algorithms for the computation of sliding discrete sinusoidal transforms. IEEE Trans. on Signal Process. 52(6), 1704–1710 (2004)
10. Hou, H.S.: A fast recursive algorithm for computing the discrete cosine transform. IEEE Trans. Acoust. Speech Signal Process. 35(10), 1455–1461 (1987)
Comparative Evaluation of Classical Methods, Optimized Gabor Filters and LBP for Texture Feature Selection and Classification
Jaime Melendez1, Domenec Puig1, and Miguel Angel Garcia2,*
1 Intelligent Robotics and Computer Vision Group, Department of Computer Science and Mathematics, Rovira i Virgili University, Av. Països Catalans 26, 43007 Tarragona, Spain
{jaime.melendez, domenec.puig}@urv.cat
2 Department of Informatics Engineering, Autonomous University of Madrid, Ctra. Colmenar Viejo Km 15, 28049 Madrid, Spain
[email protected]
Abstract. This paper builds upon a previous texture feature selection and classification methodology by extending it with two state-of-the-art families of texture feature extraction methods, namely Manjunath & Ma’s Gabor wavelet filters and Local Binary Pattern operators (LBP), which are integrated with more classical families of texture filters, such as co-occurrence matrices, Laws filters and wavelet transforms. Results with Brodatz compositions and outdoor images are evaluated and discussed, forming the basis for a comparative study of the discrimination capabilities of those different families of texture methods, which have traditionally been applied on their own.
1 Introduction
Pixel-based texture classification aims at recognizing the texture patterns to which the pixels of a given image belong. In order to accomplish this task, it is necessary to compute a set of texture features by evaluating one or more texture feature extraction methods (texture methods in short) in a neighborhood of every pixel. This neighborhood is usually defined as a square window centered at that pixel. A wide variety of texture methods have been proposed in the literature. More recently, Gabor wavelet filters following the optimized design proposed in [1] and Local Binary Patterns [2] have proven to be very successful in different application domains. These techniques are considered the state of the art in texture analysis.

This paper builds upon previous work on pixel-based texture classification [3] and texture feature selection [4] by integrating the two aforementioned families of texture methods along with more classical families, and by evaluating the obtained results in the scope of pixel-based texture classification. A total of 68 texture methods, independently evaluated on six window sizes, have been integrated (in practical terms, the number of methods is thus 68 × 6 = 408). The major contribution of this work is an exhaustive evaluation and comparative analysis that shows the pros and cons of those families of texture methods in terms of their texture discrimination capabilities. These families have traditionally been applied on their own or sometimes combined without a clear criterion.

Section 2 of this paper summarizes both the texture classification and feature selection techniques that are the basis for the current work. Section 3 shows experimental results after applying the above techniques with different families of texture methods, including optimized Gabor filters and LBP. These results are analyzed and discussed in Section 4. Conclusions and future work are given in Section 5.

* This work has been partially supported by the Spanish Ministry of Education and Science under project DPI2004-07993-C03-03.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 912–920, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Pixel-Based Texture Classification and Automatic Texture Feature Selection
This section summarizes the pixel-based texture classifier proposed in [3] and the feature selection algorithm proposed in [4]. Details are omitted due to space limitations.

(a) Pixel-Based Texture Classification. Given a set of T texture patterns of interest, a statistical model for each pattern is built based on texture features extracted by M texture methods applied over square windows of W different sizes. This is done by computing M × W × T likelihood functions and by integrating them through a linear opinion pool scheme that uses a set of reliability measures as weights. Next, the Bayes rule is applied in order to compute T posterior probabilities. Every image pixel is then classified into the texture pattern that receives the maximum posterior probability, provided the latter is above an automatically computed significance level. Otherwise, the pixel is classified as unknown.

(b) Automatic Feature Selection. Given T different texture patterns of interest, each represented by a sample image, M texture methods and W window sizes per method, the significance of every method is first determined based on its performance in classifying the given patterns. Then, a list of the M texture methods sorted in descending order of significance is created per window size. Finally, a sequential forward generation procedure adapted to the proposed classifier keeps adding methods from the top of the sorted list until a performance criterion is maximized. This procedure is repeated W times in order to choose the most significant methods for every window size.
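The two steps summarized above can be sketched in a few lines. This is a simplified illustration, not the procedure of [3] and [4]: the reliability weights, the significance level and the performance function are taken as given inputs, the selection keeps the best-scoring prefix of the ranked list (one possible reading of "until a performance criterion is maximized"), and all names and numbers are hypothetical.

```python
def classify_pixel(likelihoods, weights, priors, significance):
    """Linear opinion pool followed by the Bayes rule.

    likelihoods[m][t] is the likelihood of texture t under method/window m.
    Returns (label, posteriors); label is None when the pixel is 'unknown'.
    """
    total_w = sum(weights)
    n_textures = len(priors)
    pooled = [sum(w * lik[t] for w, lik in zip(weights, likelihoods)) / total_w
              for t in range(n_textures)]
    joint = [p * prior for p, prior in zip(pooled, priors)]
    evidence = sum(joint)
    posteriors = [j / evidence for j in joint]
    best = max(range(n_textures), key=lambda t: posteriors[t])
    label = best if posteriors[best] >= significance else None
    return label, posteriors

def forward_select(ranked_methods, evaluate):
    """Add methods in ranked order and keep the prefix with the best score."""
    best_subset, best_score, current = [], float("-inf"), []
    for method in ranked_methods:
        current = current + [method]
        score = evaluate(current)
        if score > best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Hypothetical likelihoods of three textures under two methods.
likelihoods = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
weights = [3.0, 1.0]              # assumed reliability measures
priors = [1.0 / 3.0] * 3
label, posteriors = classify_pixel(likelihoods, weights, priors, 0.45)
rejected, _ = classify_pixel(likelihoods, weights, priors, 0.60)

# Toy performance curve indexed by subset size (assumption).
toy_scores = {1: 0.70, 2: 0.82, 3: 0.80, 4: 0.78}
ranked = ["method_a", "method_b", "method_c", "method_d"]
selected, score = forward_select(ranked, lambda subset: toy_scores[len(subset)])
print(label, rejected, selected, score)
```

In this toy run the pooled posterior of the first texture is 0.5, so the pixel is labeled with a significance level of 0.45 and left unknown with 0.60; the selection stops benefiting after the second method.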
3 Experimental Evaluation
The two techniques summarized in the previous section have been evaluated on composite images made of patches of well-known Brodatz textures [5], based on those used in [6], and real outdoor images. The first column of Fig. 1 shows seven of the input images used for the experiments. Six window sizes have been considered in all
cases: 3 × 3, 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65. The number of integrated texture methods depends on the actual experiments carried out, as will be explained below.

For the sake of comparison, the evaluated texture methods have been organized into three broad classes: “Classical” methods, the optimized Gabor wavelet filters proposed in [1], which will be denoted as “GaborMM”, and Local Binary Patterns, denoted as “LBP”. While the last two classes constitute families of methods by themselves, the Classical class groups heterogeneous families of methods that have already been evaluated in [3][4]. Fourteen methods belonging to five different families are considered per window size: (1) four Laws filter masks, (2) two wavelet transforms, (3) four non-optimized Gabor filters, (4) three statistics and (5) the fractal dimension.

The GaborMM class is built as a filter bank with six scales and four orientations. The texture features that characterize every pixel and its surrounding neighborhood are the mean and standard deviation of the modulus of the Gabor wavelet coefficients obtained after filtering an input image. Thus, the GaborMM class considers a total of 6 × 4 × 2 = 48 methods. These settings come from [1][7]. The kernel size is set to be the same as the window size.

The LBP class is constituted by the three available uniform rotation-invariant Local Binary Pattern operators. The texture features used in these experiments are the mean and standard deviation of the values produced after applying each LBP operator to an input image for every considered window size. Hence, the number of texture methods corresponding to the LBP class is 3 × 2 = 6.

3.1 Experiment 1: Integration of Multiple Methods
In the first experiment, the classifier described in Section 2 is used to test the performance of each class of methods (Classical, GaborMM, LBP) when they are either independently integrated or integrated all together.
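For reference, the LBP features described in this section can be sketched as follows. This is a simplified illustration that assumes a single operator with 8 neighbours taken at integer pixel offsets (no circular interpolation) and the usual uniform rotation-invariant labeling; the exact operators and parameters used in the experiments are those of [2] and are not reproduced here.

```python
import math

# 8 neighbours in circular order (integer offsets; interpolation is omitted).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_riu2(image, r, c):
    """Uniform rotation-invariant LBP code (values 0..9) at pixel (r, c)."""
    centre = image[r][c]
    bits = [1 if image[r + dr][c + dc] >= centre else 0
            for dr, dc in NEIGHBOURS]
    # Number of 0/1 transitions along the circular pattern.
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    return sum(bits) if transitions <= 2 else 9

def lbp_features(image):
    """Mean and standard deviation of the LBP codes over interior pixels."""
    codes = [lbp_riu2(image, r, c)
             for r in range(1, len(image) - 1)
             for c in range(1, len(image[0]) - 1)]
    mean = sum(codes) / len(codes)
    std = math.sqrt(sum((v - mean) ** 2 for v in codes) / len(codes))
    return mean, std

flat = [[5] * 4 for _ in range(4)]             # constant patch
peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]       # isolated bright pixel
checker = [[0, 9, 0], [9, 5, 9], [0, 9, 0]]    # non-uniform pattern
print(lbp_features(flat), lbp_riu2(peak, 1, 1), lbp_riu2(checker, 1, 1))
```

The two features per operator (mean and standard deviation of the codes within the evaluation window) are what make the LBP class so compact compared with the Gabor filter bank.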
Columns 1 to 3 in Table 1 show the classification rates for every class of methods and test image, as well as the average classification rates for every class over all test images. Column 4 shows the same results when all the available methods are considered. In turn, Fig. 1 (columns 3 and 4) shows some of the segmentation maps produced after the previous classification. Classification rates have been computed against the ground-truth images shown in the second column.

Fig. 1. Test images (column 1), ground-truth (column 2), and segmentation maps produced by integrating: the best class (column 3), all methods (column 4), and the methods chosen after the selection process (column 5)

Table 1. Classification rates (%) for different groups of integrated methods; the number of integrated methods is given in parentheses

             Class of texture methods
Test image   Classical     GaborMM       LBP          All methods   Selection of methods
1            75.76 (84)    92.76 (288)   75.71 (36)   92.46 (408)   89.61 (55)
2            60.84 (84)    84.06 (288)   63.47 (36)   78.82 (408)   88.97 (54)
3            66.30 (84)    91.76 (288)   72.59 (36)   88.06 (408)   94.45 (62)
4            42.09 (84)    73.97 (288)   63.90 (36)   77.96 (408)   79.98 (78)
5            53.17 (84)    70.98 (288)   74.85 (36)   72.29 (408)   77.31 (136)
6            74.88 (84)    88.68 (288)   63.48 (36)   86.18 (408)   83.37 (136)
7            69.86 (84)    85.68 (288)   48.14 (36)   86.46 (408)   83.85 (136)
Average      63.27 (84)    84.98 (288)   66.02 (36)   83.18 (408)   85.36 (87)

3.2 Experiment 2: Automatic Selection of Methods
The second experiment applies the selection scheme summarized in Section 2 in order to reduce the number of methods to be integrated. Without this selection scheme, the number of methods to be integrated can be very large, reaching its maximum when all methods and window sizes are considered: (14 + 48 + 6) × 6 = 408. The classification rates corresponding to this experiment are shown in Table 1 (column 5). The number of methods used in each case is shown in parentheses. The segmentation maps produced after classification with the selected methods are shown in Fig. 1 (column 5).

Table 2. Significance of each family of methods per image: ratio of selected methods per family vs. total number of methods within a family (per image)

Test image   Laws   Statistic   Gabor   Fractal   Wavelet   GabMM   LBP
1            0.67   0.00        0.08    0.17      0.17      0.11    0.08
2            0.29   0.00        0.00    0.00      0.00      0.15    0.08
3            0.46   0.00        0.00    0.00      0.00      0.17    0.00
4            0.50   0.06        0.17    0.17      0.00      0.18    0.22
5            0.83   0.17        0.13    0.33      0.00      0.31    0.56
6, 7         0.17   0.25        0.67    0.00      0.58      0.37    0.00
Average      0.49   0.07        0.17    0.11      0.13      0.22    0.16

In order to analyze the significance of each family (not class) of methods in the classification results, the ratio between the number of selected methods per family and the number of times all methods from that family could have been chosen has been computed for every particular test image by considering all window sizes. The computed ratios (shown in Table 2) range between zero and one, with zero meaning that a family is useless since its methods are never selected, and one meaning that an entire family is necessary because all of its methods are selected every time. Finally, a similar study (shown in Table 3) of the significance of each family is conducted, but this time by taking into account each window size for all the given test images.
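The ratio described above can be written out explicitly. In the sketch below the family sizes are those listed in Section 3, while the selection counts are hypothetical: the paper reports only the ratios, and these counts were picked merely to illustrate the computation for a single image over the six window sizes.

```python
WINDOW_SIZES = 6

# Methods per family and window size, as listed in Section 3.
family_sizes = {"Laws": 4, "Statistics": 3, "Gabor": 4, "Fractal": 1,
                "Wavelet": 2, "GaborMM": 48, "LBP": 6}

# Hypothetical numbers of selected methods for one image, summed over all windows.
selected = {"Laws": 16, "Statistics": 0, "Gabor": 2, "Fractal": 1,
            "Wavelet": 2, "GaborMM": 31, "LBP": 3}

def selection_ratios(selected, family_sizes, n_windows):
    """Selected methods divided by the number of times they could be chosen."""
    return {family: round(selected[family] / (size * n_windows), 2)
            for family, size in family_sizes.items()}

ratios = selection_ratios(selected, family_sizes, WINDOW_SIZES)
total_selected = sum(selected.values())
print(ratios, total_selected)
```

Note how a small family such as Laws reaches a high ratio with only 16 selections, whereas GaborMM stays near 0.11 despite contributing almost twice as many methods; the ratio measures significance relative to family size, not absolute contribution.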
4 Discussion
From the first experiment conducted in Section 3, it can be concluded that if just a single class of methods is to be utilized for texture classification (see Table 1), the GaborMM class is clearly the best option. This is consistent with the fact that this family has been widely used in the literature and even proposed as a texture descriptor in the MPEG-7 standard [8]. However, the next option is not as obvious, since the LBP operators produce acceptable results for the Brodatz textures, while the classical methods perform better for the outdoor images. Even though the LBP family is currently considered part of the state of the art in texture analysis, this experimentation shows that the performance of the LBP methods can be seriously influenced by the nature of the texture patterns to which these methods are applied. On the other hand, the LBP family has the additional property of being small and, hence, of having a low associated computational cost. These characteristics make the LBP methods more suitable for those tasks in which time constraints are an issue.

Table 3. Significance of each family of methods per window size: ratio of selected methods per family vs. total number of methods within a family (per window size for all test images)

Window size   Laws   Statistic   Gabor   Fractal   Wavelet   GabMM   LBP
3 × 3         0.75   0.11        0.54    0.00      0.08      0.19    0.19
5 × 5         0.63   0.06        0.17    0.00      0.17      0.21    0.11
9 × 9         0.50   0.06        0.17    0.00      0.17      0.23    0.08
17 × 17       0.33   0.06        0.17    0.17      0.17      0.22    0.14
33 × 33       0.46   0.06        0.00    0.33      0.08      0.17    0.17
65 × 65       0.29   0.06        0.00    0.00      0.08      0.18    0.25

Another important result is the fact that the integration of the three tested classes does not always guarantee the maximum classification rates, as can be observed in the rates corresponding to most of the images in Table 1. Furthermore, this alternative is not justifiable in terms of computational cost, since, on average, it is possible to obtain better classification rates by only integrating methods from the GaborMM class, which constitute about 70% of all available methods.

In addition, the first noticeable result after applying the feature selection algorithm in the second experiment described in Section 3 is that it not only preserves classification rates to a large extent, but also improves those rates in a number of cases (see column 5, rows 2 to 5, in Table 1). Moreover, the highest average classification rate is the one corresponding to the integration of the selected methods. Notwithstanding, the main contribution of the feature selection technique is that the same good performance highlighted above has been reproduced with a much lower number of texture methods. Actually, the number of integrated methods is, on average, more than three times lower than the number of methods that constitute the best family (GaborMM) and more than four and a half times lower than the number of available methods (see the numbers in parentheses in Table 1). The only cases in which the number of selected methods is larger than the number of methods that constitute a single family are when comparing with the Classical class for the patterns used in images 5 to 7, and when comparing with the LBP class, since the latter has just six members.
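The figures quoted in this discussion follow directly from the method counts. The short sketch below only redoes that arithmetic with the numbers given in Section 3 and Table 1; the variable names are illustrative.

```python
WINDOW_SIZES = 6
methods_per_class = {"Classical": 14, "GaborMM": 48, "LBP": 6}

# Methods integrated per class once every window size is taken into account.
integrated = {name: n * WINDOW_SIZES for name, n in methods_per_class.items()}
total = sum(integrated.values())

avg_selected = 87  # average number of selected methods (Table 1, column 5)
gabor_share = round(integrated["GaborMM"] / total, 3)
reduction_vs_gabor = round(integrated["GaborMM"] / avg_selected, 2)
reduction_vs_all = round(total / avg_selected, 2)
print(integrated, total, gabor_share, reduction_vs_gabor, reduction_vs_all)
```

The GaborMM class accounts for roughly 70% of all methods, and the selected subsets are on average about 3.3 times smaller than that class and about 4.7 times smaller than the full set.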
In turn, the number of selected methods is similar in most cases, except for images 5 to 7 (see Table 1, column 5, rows 5 to 7), where this number is more than twice that of the previous cases. The reason is that these images are so complex that the feature selector cannot yield appropriate classification rates with a low number of methods. In fact, some textures in those images are difficult to distinguish even for the human eye.

Furthermore, the study of the performance of each family of methods according to the ratios discussed in Section 3.2 is also quite revealing. The first conclusion from Table 2 is that all families are potentially useful in different contexts, as all of them have been chosen by the selection algorithm more than once. Hence, they should not be discarded a priori. The Statistics and Wavelet families and the fractal dimension have the lowest performance, with very low to low ratios, as they have only been selected in particular cases. The LBP and classical Gabor families perform in the middle, while the Laws masks and the GaborMM family are the main components of the integration strategy after the selection process, as they are always selected.

At first glance, the ratios obtained by the Laws masks may seem contradictory, as they are superior to those obtained by the GaborMM family, while the classification rates produced by the Classical class, which includes the Laws family, are much lower than those produced by the GaborMM methods. The reason for those high ratios is that they are not computed based on classification rates, but in terms of how significant a family is with respect to its size. For example, the Laws family has a ratio close to 0.5, meaning that, even though its number of members is very small (4 methods), about two of them are selected on average, and their participation is crucial for obtaining good classification results with a reduced number of methods. In fact, a member of the Laws family scores first many times. Alternatively, the GaborMM family has a small average ratio, even though it is the best family to be integrated on its own. The reason is that not all its members contribute to the final classification, and only a few of them (about 20% on average) do a good classification job.

Notwithstanding, it is important to highlight that the methods selected within each family are not always the same, as this selection is highly dependent on the texture patterns of interest. This is the reason why the feature selection algorithm must be applied starting with the complete set of available texture methods as input. In this way, the selector is able to choose the best methods according to the specific situation. The last point that deserves discussion is the effect of window size on the selection process.
Table 3 shows that the Laws family and especially the classical Gabor family tend to perform best with small to very small window sizes, while the fractal dimension only does well with medium window sizes. Alternatively, the LBP family reaches its maximum with very small and medium to big window sizes. The latter can explain the success of LBP in tasks related to image retrieval. Finally, the GaborMM, Statistics and Wavelet families seem to perform almost equally with all window sizes.
5 Conclusions
This paper presents the results of a thorough evaluation conducted on a previously proposed technique for pixel-based classification through integration of multiple families of texture methods and selection of the most significant ones. The families of texture methods analyzed in previous works have been complemented with two state-of-the-art families of methods: optimized Gabor wavelet filters designed by Manjunath and Ma, and LBP. A comparative analysis of the performance of all those families has been carried out. This evaluation shows that the best family in terms of classification rates corresponds to the optimized Gabor filters. On the other hand, Local Binary Patterns have proven to yield acceptable results with a reduced computational cost, although, in these experiments, they have shown a lower performance when applied to the outdoor images. In general, they appear to be a good choice for texture analysis applications with strong time constraints.
Another conclusion from this study is that it is not adequate to integrate all the available texture methods, since this approach leads to lower classification rates in most cases and has a significant computational cost. Besides the novelty of the comparative evaluation of those different families of well-recognized texture methods, another important contribution of this paper is to show the clear benefits of applying the previously proposed texture feature selector prior to classification. Indeed, when a subset of texture methods is automatically selected based on the given texture patterns of interest, the final classification rates are similar to or even better than those obtained when the best families of texture methods are utilized on their own, but with a significantly lower computational time. In addition, these results also show that the GaborMM family is oversized, as only a small percentage of its members really contribute to each specific classification problem. However, the methods that contribute the most are highly dependent on the texture patterns to be recognized. Thus, the application of the selection scheme is highly beneficial for the particular case of this family, as it is able to select the specific Gabor filters that are most suitable for the problem at hand, leaving aside the ones that do not make a significant contribution. This implies that the power of the GaborMM family is kept with a very significant reduction of the required computational cost. Finally, the results presented in this paper also show that some texture methods are most suitable for particular window sizes, while others behave almost independently of this parameter, and that the classical methods (Laws filters, fractal dimension, etc.) cannot be discarded a priori, as they prove to be useful (they are selected) on some occasions. This is where the selection algorithm shows its strength and utility.
Further research will consist of extending the feature selection technique evaluated in this paper to application-independent selection of texture methods in order to obtain a more general subset of methods that produce acceptable classification results when the classifier is applied to different problems. In this way, it would be possible to avoid a specific selection of texture methods for each different classification problem. We also aim at extending the proposed technique to unsupervised pixel-based segmentation.
References
1. Manjunath, B.S., Ma, W.Y.: Texture Features for Browsing and Image Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 837–842 (1996)
2. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
3. Garcia, M.A., Puig, D.: Supervised Texture Classification by Integration of Multiple Texture Methods and Evaluation Windows. Image Vis. Comput. 25(7), 1091–1106 (2007)
4. Puig, D., Garcia, M.A.: Automatic Texture Feature Selection for Image Pixel Classification. Pattern Recogn. 39(11), 1996–2009 (2006)
5. Brodatz, P.: Textures: A Photographic Album for Artists and Designers. Dover & Greer Publishing Company, Mineola, NY (1999)
6. Randen, T., Husoy, J.H.: Filtering for Texture Classification: A Comparative Study. IEEE Trans. Pattern Anal. Mach. Intell. 21(4), 291–310 (1999)
7. Chen, L., Lu, G., Zhang, D.: Effects of Different Gabor Filter Parameters on Image Retrieval by Texture. In: Proc. of Int. Conf. MMM'04, pp. 273–278 (2004)
8. Manjunath, B.S., et al.: Color and Texture Descriptors. IEEE Trans. Circ. Syst. Video Tech. 11, 703–715 (2001)
A Multiple Classifier Approach for the Recognition of Screen-Rendered Text
Steffen Wachenfeld, Stefan Fleischer, and Xiaoyi Jiang
Department of Computer Science, University of Münster, Einsteinstrasse 62, D-48149 Münster, Germany
Abstract. The lower the resolution of a given text is, the more difficult it becomes to segment and to recognize it. The resolution of screen-rendered text can be very low. With a typical x-height of 4 to 7 pixels it is much lower than in other low resolution OCR situations. Modern OCR approaches for such very low resolution text use a classification-based segmentation where the underlying classifier plays an important role. This paper presents a multiple classifier system for the classification of single characters. This system is used as a subsystem for the classification-based segmentation within a system to read screen-rendered text. The paper shows that the presented multiple classifier system outperforms the best former single classifier system on single characters by far, and it shows the impact of using the multiple classifier system on the word reading performance.
1 Introduction
Screen-rendered characters are of very low resolution, are often smoothed, broken, or touching each other, and vary in appearance according to their subpixel position. This makes a reliable segmentation of screen-rendered text into characters prior to classification impossible. Classification-based segmentation algorithms can be found in all fields of low resolution text recognition which have to deal with similar problems, e.g. in the field of handwriting recognition. They are also called integrated segmentation and recognition with hypothesis-and-test strategy [4], recognition-based segmentation [1], hybrid segmentation [5], or internal segmentation in contrast to external segmentation. These algorithms rely strongly on the underlying classifier, whose performance is thus very important.

In this paper we present a multiple classifier system (MCS) based on multiple type-3 classifiers for the recognition of single characters. We have created the MCS to improve the classification-based segmentation of our screen-text recognition system. Our earlier paper [6] gives an overview of the former system, which used a single classifier. The system makes it possible to read text from screenshot images, as is necessary, e.g., to provide copy-and-paste functionality for text in images or for translation tools which allow users to click on any text on the screen and receive a translation. Besides some commercial OCR programs which are starting to address the problem of reading screenshots, we are not aware of other systems reported in the research literature which address this task.

In Section 2 we define the classification task and show the context for which the classifier is needed. Section 3 shows the design and evaluation of single classifiers for the later combination. Section 4 shows how the classifiers are combined and presents experimental results on single characters, which show that our MCS outperforms the best former single classifier by far. Section 5 discusses the impact of integrating the created MCS into our recognition system on the word reading performance. Section 6 concludes this paper with a discussion of our achievements and possible future work.

Fig. 1. Classification-based segmentation: image of a rendered word, oversegmentation, resulting primitive segments, merged segments with highest classification scores, and ground truth

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 921–928, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 The Classification Task
The classifier serves within our recognition system for screen-rendered text. The system reads screen-rendered text from screenshot images in multiple steps. After the preprocessing, where word boundaries are determined and the text is separated from the background, the main step is the classification-based segmentation, which is shown in Figure 1. During this classification-based segmentation the image of a displayed word is oversegmented into a sequence of primitive image segments $s_1, \ldots, s_n$ so that each segment can be expected to correspond to a character or to a part of a character. The primitive segments are then combined into candidate character patterns. The classifier has the important task of assigning appropriate plausibility scores to the candidate patterns. In a following step, a graph of possible combinations of candidate patterns is searched for the optimal path, i.e., the path with the highest plausibility score (see Figure 2). This optimal path implies the final segmentation, the recognized characters, and thus the recognized word.

Fig. 2. Example graph for the word arm with a path indicated, which represents the correct segmentation and recognition

If $s_1, \ldots, s_n$ is the sequence of the primitive image segments, the candidate character patterns $s_{ij}$ result from merging neighboring primitive segments, $s_{ij} = s_i s_{i+1} \ldots s_j$, $1 \le i \le j \le n$. The classification task for a given candidate character pattern is the assignment of plausibilities $p(c_k \mid s_{ij})$ for all character classes $c_k$.

The described context leads to several requirements concerning the classifier. As a classification result is needed for each segment, features have to be extracted from each segment. Thus, the first requirement is that the classifiers use features which are computable in acceptable time. Further, the classifier outputs are used to compute the optimal path. The quality of a path is computed as the product of the plausibilities of all segments in the path, weighted by their relative width:

$$p(c^{(1)}, \ldots, c^{(m)} \mid s^{(1)}, \ldots, s^{(m)}) = \prod_{i=1}^{m} p(c^{(i)} \mid s^{(i)})^{w_{s^{(i)}}/w}.$$

Thus, the classifiers as well as the MCS have to assign plausibilities $p(c_k \mid s_{ij})$ for all character classes $c_k$, which requires the classifier combination to be a type-3 combination.
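As an illustration of the path quality defined above, the following minimal Python sketch evaluates the width-weighted product and searches all groupings of primitive segments for the best path. The function names (`path_quality`, `best_path`), the `score` callback, and the use of dynamic programming in the log domain are assumptions made for this sketch; the paper does not state how its graph search is implemented.

```python
import math


def path_quality(plausibilities, widths):
    """Width-weighted product of the segment plausibilities of one path."""
    total = sum(widths)
    quality = 1.0
    for p, w in zip(plausibilities, widths):
        quality *= p ** (w / total)
    return quality


def best_path(n, widths, score):
    """Find the best grouping of primitive segments 1..n.

    widths[i - 1] is the width of primitive segment i (hypothetical input).
    score(i, j) returns (label, plausibility) for candidate pattern s_ij.
    In the log domain the weighted product becomes a weighted sum, so the
    best path can be built up prefix by prefix.
    """
    total = sum(widths)
    # best[j] = (log quality, labelled segments) covering primitives 1..j
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for j in range(1, n + 1):
        for i in range(1, j + 1):
            label, p = score(i, j)
            if p <= 0.0 or best[i - 1][0] == -math.inf:
                continue
            rel_width = sum(widths[i - 1:j]) / total
            cand = best[i - 1][0] + rel_width * math.log(p)
            if cand > best[j][0]:
                best[j] = (cand, best[i - 1][1] + [(i, j, label)])
    return math.exp(best[n][0]), best[n][1]
```

Because the total width is fixed for a word, the exponent of each segment does not depend on the rest of the path, which is what makes the prefix-wise search valid in this sketch.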
3 Single Classifier Design and Evaluation
The single classifier takes a segment, which is an image of a character or a part of a character, extracts features, and assigns plausibilities $p(c_k \mid s_{ij})$ for all character classes $c_k$. To design different classifiers, it is possible to change the classification method, the features used, or both.
Fig. 3. Projection of a character image and resulting feature vector: equidistant (a and b), based on gray value mass (c and d)
In the search for a good single classifier for our recognition system we have experimented with several features and different classification methods. The former recognition system uses a 5 × 5 feature vector, which results from a gray value projection of segments onto a uniform square (see Fig. 3), and a single k-nearest-neighbor-like classifier. Using more complex features such as black-and-white transitions, gradient information, or projection measures, as well as other classifiers, e.g. a Bayesian classifier with a Parzen or with a uniform estimator, resulted in lower performance. The knn-like classifier used assigns plausibility values for each class $c$ by considering the relation between the average distance of a sample $x$ to the $k_c$ nearest neighbors of class $c$ and the maximum distance $d_{\max}$ between any two samples. This leads to

$$p(c \mid x) = \max\left(1 - \frac{\sum_{i=1}^{k_c} d(x, x_{c,i})}{d_{\max}\, k_c},\; 0\right)$$
where $d(x_i, x_j)$ is the Euclidean distance between two samples $x_i$ and $x_j$. The number of considered nearest neighbors $k_c$ is class dependent and chosen in relation to the number of samples $N_c$ belonging to class $c$. In experiments, $k_c = \alpha N_c$ with $\alpha = 0.01$ has proven to be a good value.

We now show how we create a pool of many single classifiers based on this classifier. Later, classifiers are selected from this pool to be integrated into the MCS. Experiments showed that the robustness against slight segmentation errors can be increased by changing the way the segments are projected. Instead of choosing the zones for the projection equidistantly (as shown in Figure 3a, b), the zones can be chosen based on gray value mass. If the size of the zones is chosen in a way that each row and each column holds exactly the same gray value mass (as shown in Figure 3c, d), the robustness against additional single pixels, which otherwise strongly move the zones, is increased. We name the classifier using equidistant features with 5 × 5 zones knn 5x5e and the mass-based one knn 5x5m. As the character size varies within the database, some characters are better recognized with a lower number of zones and some with a higher number. To let the later MCS benefit from classifiers with different numbers of zones, we create classifiers knn 5x5[e|m] up to knn 10x10[e|m]. All these classifiers project onto a uniform square, which proved to be effective for recognizing narrow as well as wide fonts. However, the aspect ratio of the character is lost. To compensate for this, we create classifiers having the aspect ratio as an additional feature. So for each classifier knn zxz[e|m] in the pool we add the counterpart knn zxz[e|m]a, which has the aspect ratio as an additional feature.

Table 1. Recognition performance of the 24 single character classifiers on 15,808 isolated characters

      knn zxze             knn zxzea            knn zxzm             knn zxzma
z     correct   false      correct   false      correct   false      correct   false
5     98.9119%  1.0881%    98.9246%  1.0754%    98.8993%  1.1007%    98.9309%  1.0691%
6     98.9246%  1.0754%    98.8613%  1.1387%    98.9183%  1.0817%    98.9815%  1.0185%
7     98.9689%  1.0311%    98.9752%  1.0248%    99.0195%  0.9805%    99.0385%  0.9615%
8     98.9562%  1.0438%    98.9436%  1.0564%    98.9626%  1.0374%    98.9879%  1.0121%
9     98.9119%  1.0881%    98.8930%  1.1070%    98.9183%  1.0817%    98.9815%  1.0185%
10    98.8803%  1.1197%    98.8677%  1.1323%    98.9499%  1.0501%    98.9562%  1.0438%
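The difference between equidistant and mass-based zones described above can be sketched along a single axis. The following Python fragment is only an illustration under stated assumptions: the function names are hypothetical, `profile` is assumed to hold the summed gray value of each row (or column), and mass is assumed to be spread uniformly inside a pixel so that boundaries may fall at fractional positions. The paper does not give these implementation details.

```python
def equidistant_bounds(length, z):
    """z + 1 boundaries that split [0, length] into z equal zones."""
    return [length * t / z for t in range(z + 1)]


def mass_bounds(profile, z):
    """z + 1 boundaries so that each zone holds 1/z of the gray value mass.

    profile[i] is the summed gray value of row (or column) i.
    """
    total = sum(profile)
    if total <= 0:
        return equidistant_bounds(len(profile), z)
    bounds = [0.0]
    cum = 0.0  # mass of all pixels strictly before position pos
    pos = 0
    for t in range(1, z):
        target = total * t / z
        # advance to the pixel in which the cumulative mass reaches target
        while cum + profile[pos] < target:
            cum += profile[pos]
            pos += 1
        # linear interpolation inside that pixel
        bounds.append(pos + (target - cum) / profile[pos])
    bounds.append(float(len(profile)))
    return bounds
```

For a hypothetical profile `[0.2, 5, 5, 5, 5]`, where a faint stray row precedes a four-row character, the equidistant midpoint moves to 2.5, while the mass-based midpoint stays close to the middle of the character itself.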
Table 1 shows the performance of the 24 classifiers on 15,808 images of characters using a leave-one-out technique. The characters belong to the freely available Screen-Char database, which consists of annotated screenshot images of single characters [7,8]. It can be seen that on these characters knn 7x7ma achieves the highest performance of 99.04%, which is the highest performance reported for this database so far. Not counting confusions between the two similar-looking letters I (capital i) and l (lowercase L), the performance was 99.72%.
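The plausibility rule of the knn-like classifier from this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout of `samples`, and the rounding of $k_c = \alpha N_c$ to at least one neighbor are assumptions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def max_pairwise_distance(samples):
    """d_max: largest distance between any two samples (O(N^2), demo only)."""
    vectors = [v for group in samples.values() for v in group]
    return max(euclidean(a, b) for a, b in combinations(vectors, 2))


def plausibilities(x, samples, d_max, alpha=0.01):
    """Return p(c|x) for every class c.

    samples maps a class label to its list of feature vectors.
    k_c = alpha * N_c is rounded and clamped to at least 1 (assumption).
    """
    result = {}
    for label, vectors in samples.items():
        k_c = max(1, round(alpha * len(vectors)))
        nearest = sorted(euclidean(x, v) for v in vectors)[:k_c]
        result[label] = max(1.0 - sum(nearest) / (d_max * k_c), 0.0)
    return result
```

In a leave-one-out evaluation such as the one behind Table 1, the sample being classified would be removed from `samples` before calling `plausibilities`.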
4 Classifier Combination
The goal is to increase the already very good performance on single characters by combining multiple classifiers. For a given screenshot image $s$ of a character our classifiers return plausibility values $p(c_k \mid s)$ for each class $c_k$. We use multiple classifiers $C_1, \ldots, C_n$ to classify the same screenshot image and thus get $n$ plausibility values for each class, which can be combined by a function $p(c_k \mid s) = F[p_1(c_k \mid s), \ldots, p_n(c_k \mid s)]$ (Figure 4 shows the setup).

Fig. 4. Type-3 combination of the classification outputs for all classes

The functions we have used for combination are

– Product (F = prod): $p(c_k \mid s) = \prod_{i=1}^{n} p_i(c_k \mid s)$
– Simple mean (F = mean): $p(c_k \mid s) = \frac{1}{n} \sum_{i=1}^{n} p_i(c_k \mid s)$
– Minimum, Median, Maximum (F = min, median, max). For example: $p(c_k \mid s) = \max_i \{p_i(c_k \mid s)\}$.
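The combination functions listed above can be sketched in a few lines of Python. The names `COMBINERS` and `combine`, and the representation of each classifier output as a dictionary from class label to plausibility, are assumptions made for illustration.

```python
import math
from statistics import mean, median

# One combination function F per rule named in the text.
COMBINERS = {
    "prod": math.prod,
    "mean": mean,
    "median": median,
    "min": min,
    "max": max,
}


def combine(outputs, rule="median"):
    """Combine the outputs of n classifiers class by class.

    outputs is a list with one dict per classifier, each mapping a class
    label to the plausibility that classifier assigned to it.
    """
    f = COMBINERS[rule]
    classes = outputs[0].keys()
    return {c: f([o[c] for o in outputs]) for c in classes}
```

The recognized class is then the label with the highest combined plausibility, e.g. `max(combined, key=combined.get)`.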
We have tested the performance of several subsets of the 24 classifiers. Table 2 shows the results of combining the six classifiers for z = 5, ..., 10 of one type (e.g. mass-based projection with aspect ratio) for the functions named before. The performance of 99.2346% (99.7964% without counting confusions between I and l), which was achieved by the combination of the knn zxzm classifiers using the median function, is the best performance on single characters so far. Combining too many classifiers is computationally expensive and does not yield better results. For example, if all 24 classifiers are combined using the median function, the performance is 99.1840%. The median function proved to be a good choice to combine classifier outputs in this situation. When the four types of classifiers for a fixed number of zones (e.g. 5 × 5, 6 × 6, ...) were combined, the resulting performance was above 99.1397% in all cases and thus also very good. We have integrated MCSs consisting of the best combinations into our recognition system.

Table 2. Recognition performance of combined classifiers on 15,808 isolated characters using different functions. The six classifiers for z = 5, . . . , 10 have been combined for each listed classifier type.

          knn zxze             knn zxzea            knn zxzm             knn zxzma
F         correct   false      correct   false      correct   false      correct   false
prod      99.1776%  0.8224%    99.1840%  0.8160%    99.1776%  0.8224%    99.1903%  0.8097%
mean      99.1966%  0.8034%    99.1966%  0.8034%    99.2156%  0.7844%    99.1903%  0.8097%
median    99.1460%  0.8540%    99.1713%  0.8287%    99.2346%  0.7654%    99.2093%  0.7907%
min       99.1270%  0.8730%    99.1270%  0.8730%    99.1713%  0.8287%    99.1713%  0.8287%
max       99.0891%  0.9109%    99.1397%  0.8603%    99.1903%  0.8097%    99.1334%  0.8666%
5 System Performance
We took an MCS consisting of the six knn zxzm classifiers and integrated it into our recognition system, replacing the single classifier. Finding a measure for the impact on the recognition quality is hard, as the recognition result is always a list of candidate words ranked by their path plausibility (see Figure 5). One way to measure the quality is to count how often the candidate on the first rank is correct. But as soon as language models such as n-grams or dictionaries are used, it becomes more important that the correct word is in the ranked list at all or, even better, near the top. To take this into account we compute the average rank of the correct words and the number of times the rank is ≤ 10.
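The rank-based measures described above could be computed as in the following sketch. The function name, the return format, and the handling of words whose correct reading is missing from the candidate list (they are counted separately and excluded from the average) are assumptions; the paper does not say how such cases are treated.

```python
def rank_statistics(ranked_lists, truths, top=10):
    """Summarise where the correct word appears in each ranked list.

    ranked_lists[i] is the candidate list for word i, best candidate first;
    truths[i] is the correct reading of word i.
    """
    ranks = []
    missing = 0
    for candidates, truth in zip(ranked_lists, truths):
        if truth in candidates:
            ranks.append(candidates.index(truth) + 1)  # ranks start at 1
        else:
            missing += 1
    return {
        "first_rank_correct": sum(r == 1 for r in ranks),
        "average_rank": sum(ranks) / len(ranks) if ranks else None,
        "within_top": sum(r <= top for r in ranks),
        "missing": missing,
    }
```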
Fig. 5. Application example and resulting list of ranked word candidates
In tests using screenshot images of screen-rendered words taken from the freely available Screen-Word database [7,8], rank 1 was only slightly more often correct, but the average rank of the correct words was significantly better. On 2,400 words the average rank is 1.53 (previously 1.65).
6 Conclusion and Future Work
In this paper we have discussed the importance of the single character classifier, which is used for the classification-based segmentation within our system for the recognition of screen-rendered text. This was taken as motivation to systematically test features and classifier combinations. In experiments on screenshot images of single characters, taken from the freely available Screen-Char database, we have achieved a recognition performance of 99.2346%, which is the best performance reported for this database (previously 98.91%). The theoretical maximum performance can only be increased by combining classifiers which make different errors and correctly classify samples that are so far misclassified. Thus, we will try to develop features which lead to better overall performance. We will also increase the number of samples in our test databases to have a better foundation for our measurements.
References
1. Burges, C.J.C., Matan, O., LeCun, Y., Denker, J.S., Jackel, L.D., Stenard, C.E., Nohl, C.R., Ben, J.I.: Shortest Path Segmentation: A Method for Training a Neural Network to Recognize Character Strings. In: Proc. of Int. Joint Conf. Neural Networks 3, 165–172 (1992)
2. Casey, R., Lecolinet, E.: A Survey of Methods and Strategies in Character Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 690–706 (1996)
3. Guyon, I., Haralick, R.M., Hull, J.J., Phillips, I.T.: Database and Benchmarking. In: Bunke, H., Wang, P.S.P. (eds.) Handbook of Character Recognition and Document Image Analysis, pp. 779–799. World Scientific, Singapore (1997)
4. Liu, C.-L., Sako, H., Fujisawa, H.: Effects of Classifier Structures and Training Regimes on Integrated Segmentation and Recognition of Handwritten Numeral Strings. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1395–1407 (2004)
5. Nagy, G.: Twenty Years of Document Image Analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 38–62 (2000)
6. Wachenfeld, S., Klein, H.-U., Jiang, X.: Recognition of Screen-Rendered Text. In: Proc. of 18th Int. Conf. on Pattern Recognition (ICPR), vol. 2, pp. 1086–1089 (2006)
7. Wachenfeld, S., Klein, H.-U., Jiang, X.: Annotated Databases for the Recognition of Screen-Rendered Text. In: Proc. of 9th Int. Conf. on Document Analysis and Recognition (ICDAR) (2007)
8. Wachenfeld, S., Klein, H.-U., Jiang, X.: The official Screen-Char and Screen-Word database website. http://cvpr.uni-muenster.de/research/ScreenTextRecognition/
Improving Stability of Feature Selection Methods
Pavel Křížek¹, Josef Kittler², and Václav Hlaváč¹
¹ Czech Technical University in Prague, Center for Machine Perception, Karlovo nám. 13, 121 35 Prague 2, Czech Republic
² University of Surrey, Centre for Vision, Speech, and Signal Processing, GU2 7XH Guildford, United Kingdom
{krizekp1, hlavac}@fel.cvut.cz, [email protected]

Abstract. An improper design of feature selection methods can often lead to incorrect conclusions. Moreover, it is not generally realised that functional values of the criterion guiding the search for the best feature set are random variables with some probability distribution. This contribution examines the influence of several estimation techniques on the consistency of the final result. We propose an entropy-based measure which can assess the stability of feature selection methods with respect to perturbations in the data. Results show that filters achieve a better stability and performance if more samples are employed for the estimation, e.g., using leave-one-out cross-validation. However, the best results for wrappers are acquired with the 50/50 holdout validation.

Keywords: Feature selection, stability, entropy.
1 Introduction
Many tasks in statistical pattern recognition are characterised by high dimensional data which have to be processed and analysed using statistical tools. A data sample is a vector formed generally by several hundreds of measurements, called features. Examples of such data are measurements arising in text recognition, genetic engineering, astronomy, etc. The problem of analysing and processing such multivariate sensory information can be aggravated by a relatively small sample size available for learning. Small-sample statistics result in inaccurate parameter estimates of the data models. Thus, a poor generalisation is achieved on unknown data. This phenomenon is known as the curse of dimensionality [1]. A common solution is to reduce the dimensionality and employ, for example, only those features that are relevant to a given problem. This task is known as feature selection [1].

There are many approaches to feature selection; however, all in principle involve two main ingredients: i) an objective function (criterion) which reflects the informativeness of feature subsets, and ii) a search strategy. The search strategy is fixed and usually employs feature ranking [7] or subset search [1,11] methods. Both approaches are premised on certain principles which guide the search through the feature space and determine what results can be ideally achieved. The objective function guiding the search can be either classifier specific [9,5] (so-called wrappers) or classifier independent [7,5] (so-called filters). In general, learning algorithms designed using features selected by wrappers have been shown to achieve a better predictive accuracy. Nevertheless, wrappers are computationally very intensive and tend to over-fit [9]. Filters can be seen more as a preprocessing step for subsequent learning, because the objective function is not directly linked to minimising the error rate of a particular classifier. Filters usually execute quite fast and provide a general approach to feature selection.

One issue that has been relatively neglected in the literature is the stability of feature selection algorithms, i.e., the sensitivity of the solution (selected features) to perturbations in the input data. The motivation behind exploring the stability is to provide evidence that the selected features are relatively robust to slight changes in the data. Feature selection methods producing consistent solutions on the given data are preferable to those with highly volatile outputs. Possibly the only related publications discussing the topic of feature selection stability are [2,6,10]. However, the proposed stability measures have many limitations, unclear motivation, and empirically estimated bounds. Furthermore, these papers do not clearly motivate remedies for improving the stability and performance. We show that for any given feature selection algorithm, the stability and performance can be significantly improved by a careful use of the data available.

The paper is organised as follows. Section 2 formulates the stability problem. In the same section, we propose and theoretically justify a measure which can assess the stability of feature selection algorithms with respect to perturbations in the data. We also suggest how the stability and performance can be improved. The experimental set-up is described in Section 3. Section 4 presents and discusses the numerical results. Conclusions are drawn in Section 5.

This work was supported by the EU INTAS project PRINCESS 04-77-7347 and by the Czech Ministry of Education under Project 1M0567.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 929–936, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Stability of Feature Selection Methods
2.1 The Stability Problem
It is not often realised in the literature on feature selection that values of the objective function guiding the search for the best feature set are random variables with some probability distribution, no matter what type of search strategy or criterion is adopted. This originates from the randomly sampled data at the feature selection input. The exact criterion value is unknown and the search strategy has to work with just an estimate acquired on the available data. Note that the accuracy of the criterion estimate can considerably influence the result of the feature selection.

Suppose that we carry out $T \in \mathbb{N}$ runs of a feature selection algorithm on randomly sampled data. We would like to find a measure which allows us to assess the stability of the feature selection method with respect to perturbations in the data, i.e., to assess the sensitivity of the selected features to variations in the input data. We will consider a concept which reflects the most frequently selected feature subsets.

It has to be emphasised that the stability does not say anything about the performance of the selected features. It just indicates the sensitivity of a feature selection method output to random perturbations in the input data. Large variations in the selected feature subsets signify that something is wrong: for instance, the feature selection algorithm is not appropriate for the given data, there are not enough samples, or there are too many correlated variables or feature subsets with very similar information content. Thus, less confidence should be assigned to feature sets that change radically with slight variations of the input data, or perhaps it is advisable to refrain completely from the feature selection.

2.2 Stability of Feature Sets
Notice that various feature selection techniques select different feature subsets with a certain probability if the input data for training are randomly sampled from the original data set. One extreme case is a random feature selection which selects every feature subset with the same probability and thus produces a uniform probability distribution. The other extreme is a perfectly stable feature selection method which always selects the same feature subset and thus creates a single-peak probability distribution. It appears that the stability of feature selection algorithms can be assessed through the properties of the generated probability distributions of the selected feature subsets. Our interest is, of course, in feature selection algorithms that produce probability distributions far from the uniform one and close to the single-peak one.

A convenient measure quantifying randomness of a system is entropy [12]. Entropy is a real function defined on a set of probability distributions. In information theory, the concept of entropy indicates the amount of uncertainty about an event associated with a given probability distribution. The entropy is maximal for a uniform probability distribution (i.e., the outcome of random feature selection). If the event is certain (i.e., the outcome of perfectly stable feature selection), then the entropy is zero. There are several entropy measures in information theory. We derive the stability measure from the Shannon entropy, see [12],

$$H(X) = -\sum_{i=1}^{m} p(x_i) \log p(x_i). \qquad (1)$$
Here $X$ is a discrete random variable with possible states $X = \{x_1, x_2, \ldots, x_m\}$ (i.e., particular feature subsets), $m \in \mathbb{N}$ is the number of all possible states (i.e., the number of all different feature subsets), and $p(x_i)$ is the probability of occurrence of the $i$-th state (i.e., the probability of selecting a particular feature subset).

Let $n \in \mathbb{N}$ be the problem dimensionality, let $k \in \{1, 2, \ldots, n\}$ denote the size of the feature subset, and let $T \in \mathbb{N}$ be the number of evaluation trials with randomly sampled data. The frequencies of selected feature subsets are recorded over $T$ trials in a histogram given by a structure with entries $G_{jk} \in \mathbb{Z}^+$, where $\mathbb{Z}^+$ are the non-negative integers, $j = 1, 2, \ldots, C(n, k)$, and $C(n, k) = \binom{n}{k}$ is the number of all possible feature combinations. The histogram structure can be implemented by recording different feature subsets and their frequencies $G_{jk}$. The probability estimates of the occurrence of particular feature subsets can be determined by normalising the histogram entries by the number of trials, i.e., $\hat{G}_{jk} = G_{jk}/T$. Thus, all bin values $\hat{G}_{jk}$ are scaled into the interval $[0, 1]$ and $\sum_{j=1}^{C(n,k)} \hat{G}_{jk} = 1$. Based on the Shannon entropy (1), the following stability measure can be constructed for a feature subset of a fixed size $k$,

$$\gamma_k = -\sum_{j=1}^{C(n,k)} \hat{G}_{jk} \log \hat{G}_{jk}. \qquad (2)$$

In reality, the histogram structure is sparse, because the number of evaluation trials $T$ is small compared to the combinatorial number of all possible feature combinations, so that the maximal number of histogram entries $G_{jk} > 0$ is $j_{\max} = \min[T, C(n, k)]$. Thus, the stability measure (2) ranges in the interval $0 \le \gamma_k \le \log \min[T, C(n, k)]$. Deriving both bounds is straightforward. The lower bound corresponds to a single feature subset of size $k$ being selected in all $T$ trials, i.e., to a perfectly stable feature selection algorithm and hence zero entropy. The theoretical upper bound is based on the assumption that an arbitrary feature subset of size $k$ is selected over $T$ trials with the same probability $\hat{G}_{jk} = (\min[T, C(n, k)])^{-1}$. It can be seen that the stability measure (2) creates an ordering and thus the stability of the examined feature selection algorithms can be compared.

2.3 Improving Stability and Performance
The primary key to improved stability is an appropriate estimate of the objective function values. With a better criterion estimate, the search algorithm is also more likely to converge to its optimal solution with respect to the unknown probability distribution underlying the data. The selected features thus achieve better generalisation and performance. Surprisingly, it is very hard to find a research paper which follows this basic rule. However, the price we may have to pay for a better estimate is an exponentially increasing execution time. The commonly adopted estimation methods in statistical pattern recognition are data re-sampling techniques like cross-validation, holdout validation, or bootstrap, see [1,3,8]. Only wrapper designs sometimes apply five- or ten-fold cross-validation to avoid over-fitting of the classifier employed in the objective function definition, see [9,10]. No estimation is ever done in filter approaches.
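For reference when comparing estimation techniques, the entropy-based stability measure of Equation (2) and its upper bound can be sketched as follows. The function names are hypothetical, and the natural logarithm is an assumption, since the text does not fix the base of the logarithm.

```python
import math
from collections import Counter


def stability(selected_subsets):
    """Stability measure gamma_k computed from the subsets chosen in T trials.

    selected_subsets holds one iterable of feature indices per trial, all of
    the same size k. Only subsets that actually occur contribute, which
    mirrors the sparse histogram described in the text.
    """
    trials = len(selected_subsets)
    counts = Counter(frozenset(s) for s in selected_subsets)
    return -sum((g / trials) * math.log(g / trials) for g in counts.values())


def upper_bound(trials, n, k):
    """log min[T, C(n, k)], the largest value gamma_k can take."""
    return math.log(min(trials, math.comb(n, k)))
```

A value of zero means that the same subset was selected in every trial; values near `upper_bound(T, n, k)` indicate that almost every trial produced a different subset.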
3 Experiment Design and Description
The motivation behind the following experiments is to investigate how the choice of a data re-sampling technique for the objective function estimation influences the stability and performance of filter and wrapper feature selection approaches.
3.1 Data Used in Experiment
A simple artificial two-class problem is synthesised for the purpose of this study, since we would like to have full control over the experiment. The data are designed so that ordinary ranking or greedy feature selection methods fail to find an optimal feature subset of minimal size. Samples are drawn from a 20-dimensional normal probability distribution with a common covariance matrix $\Sigma$ and mean values $\mu = \mu_1 = -\mu_2$. The first class consists of a component $N(\mu, \Sigma)$ and the second class of a component $N(-\mu, \Sigma)$. The common covariance matrix and the mean values comprise several blocks which simulate different qualities of features.

The first block contains statistically independent features with identical discriminatory ability. The parameters are
$$\Sigma^{1,\ldots,3} = I_3 \quad\text{and}\quad \mu^{1,\ldots,3} = [0.635, 0.635, 0.635]^{\top},$$
where $I_d$ is the $d \times d$ identity matrix for $d \in \mathbb{N}$. Upper indices in $\Sigma^{i,j,k,\ldots}$ and $\mu^{i,j,k,\ldots}$ indicate the corresponding coordinates of the block.

A nested pair of features with indices {4, 6} is hidden in the second block. The parameters are given by
$$\Sigma^{4,\ldots,6} = \begin{bmatrix} 1.05 & 0.48 & 0.95 \\ 0.48 & 1.0 & 0.20 \\ 0.95 & 0.20 & 1.05 \end{bmatrix} \quad\text{and}\quad \mu^{4,\ldots,6} = \begin{bmatrix} 0.5 \\ 0.4 \\ 0 \end{bmatrix}.$$

The third block contains statistically independent features with decreasing discriminatory ability. The parameters are
$$\Sigma^{7,\ldots,13} = I_7 \quad\text{and}\quad \mu^{7,\ldots,13} = [0.636, 0.546, 0.455, 0.364, 0.273, 0.182, 0.091]^{\top}.$$

The last block contains only noise, with parameters
$$\Sigma^{14,\ldots,20} = I_7 \quad\text{and}\quad \mu^{14,\ldots,20} = [0, 0, 0, 0, 0, 0, 0]^{\top}.$$
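As a cross-check on the block design above, recall that for two Gaussian classes with equal priors and a common covariance matrix the theoretical error is $\Phi(-\Delta/2)$, where $\Delta^2 = (\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2)$ is the squared Mahalanobis distance between the class means. The following minimal sketch evaluates this block by block, assuming equal class priors; the helper `solve3` and all variable names are ours and purely illustrative.

```python
import math

def solve3(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = len(m)
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Block parameters as given in the text.
mu_block1 = [0.635] * 3
sigma_block2 = [[1.05, 0.48, 0.95],
                [0.48, 1.00, 0.20],
                [0.95, 0.20, 1.05]]
mu_block2 = [0.5, 0.4, 0.0]
mu_block3 = [0.636, 0.546, 0.455, 0.364, 0.273, 0.182, 0.091]

# q = mu^T Sigma^{-1} mu, summed over the blocks. Blocks 1 and 3 have identity
# covariance; the noise block has zero mean and contributes nothing.
q = sum(v * v for v in mu_block1 + mu_block3)
x = solve3(sigma_block2, mu_block2)          # x = Sigma^{-1} mu for block 2
q += sum(mi * xi for mi, xi in zip(mu_block2, x))

# The class means are mu and -mu, so Delta^2 = (2 mu)^T Sigma^{-1} (2 mu) = 4 q.
delta = math.sqrt(4.0 * q)
# Phi(-Delta / 2) written with the complementary error function.
bayes_error = 0.5 * math.erfc(delta / (2.0 * math.sqrt(2.0)))
print(f"Delta = {delta:.3f}, theoretical error = {100 * bayes_error:.2f}%")
```

Evaluated this way, the error should come out close to the value of about 2.3% quoted for these data.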
Finally, all the dimensions of the problem are randomly permuted to make sure that an examined feature selection algorithm will not follow the feature ordering created in the design above. The data in our experiment contain 500 samples per class. It is interesting to note that the theoretical classification error of such data is about 2.3%.
3.2
Experiment Set-Up
In the experiments, we examine a filter and a wrapper variant of the state-of-the-art feature selection technique known as the Sequential Forward Floating Search (SFFS) algorithm [11]. The filter version applies the Mahalanobis distance [1] in the objective function definition. The wrapper form uses the prediction accuracy of a linear decision rule created by the Gaussian classifier [1]. Both feature selection algorithms suit the Gaussian data, so they should find the exact solution.
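To make the search strategy concrete, the following is a simplified sketch of a floating forward search in the spirit of [11]. It is not the authors' implementation: `criterion` is a hypothetical callback that scores a candidate subset (a Mahalanobis distance for a filter, a validation accuracy for a wrapper), and `delta`, the number of sizes beyond `k` that the search explores before stopping, is our assumption.

```python
def sffs(n_features, k, criterion, delta=1):
    """Simplified floating forward search; returns the best subset of size k found."""
    target = min(k + delta, n_features)
    selected = []
    best = {}  # subset size -> (score, subset)

    def record(subset):
        """Store the subset if it beats the best one of its size; report success."""
        score = criterion(subset)
        size = len(subset)
        if size not in best or score > best[size][0]:
            best[size] = (score, list(subset))
            return True
        return False

    while len(selected) < target:
        # Forward step: add the feature that maximises the criterion.
        remaining = [f for f in range(n_features) if f not in selected]
        add = max(remaining, key=lambda f: criterion(selected + [f]))
        selected = selected + [add]
        record(selected)
        # Conditional backward steps: drop the least useful feature as long as
        # the reduced subset beats the best subset of that size seen so far.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: criterion([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if record(reduced):
                selected = reduced
            else:
                break
    return sorted(best[k][1])

# Toy criterion with a "nested" pair: features 1 and 2 are weak alone but strong together.
toy_scores = {
    frozenset([0]): 3.0, frozenset([1]): 2.0, frozenset([2]): 1.0,
    frozenset([0, 1]): 4.0, frozenset([0, 2]): 3.5, frozenset([1, 2]): 5.0,
    frozenset([0, 1, 2]): 6.0,
}

def toy_criterion(subset):
    return toy_scores[frozenset(subset)]

result = sffs(3, 2, toy_criterion)
print(result)
```

On this toy criterion a purely greedy forward search would stop at the pair {0, 1}, whereas the backward step lets the floating search recover the pair {1, 2}.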
P. Křížek, J. Kittler, and V. Hlaváč
For the objective function estimation, we consider five-fold, ten-fold, and leave-one-out cross-validation; a repeated holdout validation scheme with 50/50, 60/40, 70/30, 80/20, and 90/10 splits of the data into training/validation parts; and estimation methods based on sampling with replacement, namely bootstrap, .632 bootstrap, and out-of-bootstrap, see [1,3,8] for more details. Holdout and bootstrap techniques employ 1, 5, 10, 50, 100, and 200 trials, respectively, for the objective function estimation. All methods are stratified [8], so the class a priori probabilities are preserved within the data re-sampling. The feature selection is applied in all experimental set-up conditions described above. To acquire good statistics on the probability estimates of the selected feature subsets, one hundred evaluation trials are performed, each with 90% of the data randomly sampled from the original data set for training. The stability measure in Equation (2) is determined from the acquired probability estimates. The performance is assessed by the Hamming distance [4,2] between the exact solution and the selected features in order to check the real quality of the result. The exact solution is found by the SFFS algorithm which employs the true data probability distribution and the Mahalanobis distance in the objective function definition. The final performance is averaged over all one hundred trials.
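One of the re-sampling schemes listed above, the stratified repeated holdout, can be sketched as follows. This is a minimal illustration under our own assumptions: `criterion` is a hypothetical callback that receives the training and validation indices and returns an objective value, and the function names are not taken from the paper.

```python
import random

def stratified_holdout(labels, train_fraction, rng):
    """Split sample indices into training/validation parts, class by class,
    so that the class proportions are preserved in both parts."""
    train, valid = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train += idx[:cut]
        valid += idx[cut:]
    return train, valid

def repeated_holdout_estimate(labels, criterion, train_fraction=0.5,
                              trials=10, seed=0):
    """Average criterion(train_idx, valid_idx) over several stratified splits."""
    rng = random.Random(seed)
    values = [criterion(*stratified_holdout(labels, train_fraction, rng))
              for _ in range(trials)]
    return sum(values) / len(values)

# Example: 450 samples per class (90% of 500), one 80/20 split.
labels = [0] * 450 + [1] * 450
train, valid = stratified_holdout(labels, 0.8, random.Random(1))
print(len(train), len(valid))
```

Averaging over more trials reduces the variance of the estimate at a proportional cost in run time, which is the trade-off examined in the experiments.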
4
Numerical Results and Discussion
The stability results for the filter and wrapper are depicted in Figures 1a and 1b. The performance of the selected features measured by the average Hamming distance is shown in Figures 1c and 1d. Graphs are depicted only for a feature subset of size k = 8 as similar behaviour was observed for all subset sizes.
[Figure 1, panels a) to d). Horizontal axis: estimation technique (5fold, 10fold, loo, 50/50, 60/40, 70/30, 80/20, 90/10, boot, and in panels b) and d) also 632b and outb). Vertical axis: γk in panels a) and b), hk in panels c) and d). Legend: number of estimation trials (5/10/loo, 1, 5, 10, 50, 100, 200).]
Fig. 1. The stability factor γk for a) filter, b) wrapper, and the averaged Hamming distance hk for c) filter, d) wrapper, displayed with respect to the re-sampling technique involved in the objective function estimation for a feature subset of size k = 8
The filter variant of the SFFS algorithm achieves better stability if more samples are employed in the objective function estimation, e.g., with ten-fold or leave-one-out cross-validation. This result can be explained by the general fact that more samples available for learning lead to better parameter estimates and thus to a more consistent solution. The effect is clearly visible for holdout validation, where the stability gets worse as fewer samples are available for the estimation. However, the estimate improves with an increasing number of trials. It appears that the stability factor saturates at some point and becomes more or less independent of the examined re-sampling technique if the number of estimation trials is sufficiently large.

The situation is quite different for the wrapper approach. The best stability and performance are obtained with the repeated 50/50 holdout validation employed in the objective function estimation. The popular ten-fold or leave-one-out cross-validation does not achieve such a good solution. Our interpretation is as follows. Although a classifier designed with more samples has better parameter estimates, the objective function is based on the prediction accuracy on the validation data. Having fewer samples available for validation implies a higher variance of the prediction accuracy. Thus, the feature selection algorithm becomes more sensitive to random perturbations in the data and fails to find a consistent solution. The numbers of training and validation samples compete against each other, which explains why the 50/50 holdout validation achieves the best results. Again, the stability factor seems to saturate at some point if the number of estimation trials is sufficiently large. Wrappers appear to be much more sensitive to the correctness of the objective function estimate than filters.
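The variance argument for the wrapper can be illustrated with a rough binomial calculation: if a fixed classifier has true accuracy $p$ and is scored on $n$ independent validation samples, the standard deviation of the measured accuracy is $\sqrt{p(1-p)/n}$. The sketch below tabulates this for the holdout splits, assuming 900 available samples (90% of the 1000 samples) and an illustrative accuracy of 0.95 that is our assumption, not a value from the experiments.

```python
import math

def accuracy_std(p, n_valid):
    """Standard deviation of an accuracy estimate from n_valid independent samples."""
    return math.sqrt(p * (1.0 - p) / n_valid)

n_available = 900   # 90% of the 1000 samples, as in the evaluation trials
p_assumed = 0.95    # illustrative accuracy only (assumption)

stds = {}
for train_pct in (50, 60, 70, 80, 90):
    n_valid = n_available * (100 - train_pct) // 100
    stds[train_pct] = accuracy_std(p_assumed, n_valid)
    print(f"{train_pct}/{100 - train_pct}: n_valid = {n_valid}, "
          f"std = {stds[train_pct]:.4f}")
```

Going from the 50/50 to the 90/10 split cuts the validation set by a factor of five, so the standard deviation of a single-trial estimate grows by a factor of $\sqrt{5} \approx 2.2$, regardless of the assumed accuracy.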
Notice that the .632 bootstrap achieved the best stability factor; its performance, however, is by far the worst. Bootstrap techniques are supposed to give estimates with low variance [3], which explains the good stability. Nevertheless, the bias of the estimate is high, and as a result the wrapper converged to a wrong solution. As for the performance, the non-zero Hamming distance indicates that none of the solutions is identical to the theoretically best feature subset. The slight bias (Hamming distance “2”) is probably caused by an inaccurate estimate of the discriminatory power of the weakest features.
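For reading the Hamming distance values, it helps to note that two feature subsets of the same size always differ by an even distance, and a distance of 2 means that exactly one feature has been exchanged. A minimal sketch, with hypothetical feature indices chosen only for illustration:

```python
def hamming_distance(subset_a, subset_b):
    """Hamming distance between the indicator vectors of two feature subsets,
    i.e. the number of features contained in exactly one of them."""
    return len(set(subset_a) ^ set(subset_b))

# Hypothetical example: the selected subset swaps one feature of the exact one.
exact = {0, 1, 2, 3, 4, 5, 6, 7}
selected = {0, 1, 2, 3, 4, 5, 6, 12}
print(hamming_distance(exact, selected))  # 2
```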
5
Conclusions
Feature selection results should always be strengthened by some confidence in the solution. For this purpose, we designed and theoretically justified an entropy-based measure which can assess the stability of any kind of feature selection algorithm with respect to random perturbations in the input data. Furthermore, we derived bounds on the stability measure analytically. For a given feature selection method and data, the stability factor has to be determined empirically over a number of evaluation trials with randomly sampled training data. A large number of trials is required to guarantee sufficient accuracy of the estimated probabilities with which the selected feature subsets occur.
The stability and performance of feature selection methods can be improved to a certain degree by a suitable algorithm design. This can be achieved by an appropriate estimate of the objective function guiding the search for the best subset of features. Our experiments showed that filters achieved better stability and performance if more samples were employed in the objective function estimation, i.e., using re-sampling techniques like ten-fold or leave-one-out cross-validation, or 90/10 holdout validation. For wrappers, however, the best estimation technique appeared to be the 50/50 holdout validation. Wrappers also turned out to be much more sensitive to incorrect objective function estimation than filters. Nevertheless, good stability of a feature selection algorithm on the given data is just a necessary condition for good performance of the selected features. Stability and performance should always be analysed together, because stability by itself does not say anything about the quality of the selected features.
References
1. Devijver, P.A., Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice Hall, Englewood Cliffs (1982)
2. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to instability problems with sequential wrapper-based approaches to feature selection. Technical Report TCDCD-2002-28, Dept. of Computer Science, Trinity College, Dublin, Ireland (2002)
3. Efron, B., Tibshirani, R.: Estimating the error rate of a prediction rule: Improvement on cross-validation. Technical Report TR-477, Dept. of Statistics, Stanford University (1995)
4. Hamming, R.W.: Error detecting and error correcting codes. Bell System Technical Journal 29(2), 147–160 (1950)
5. Jain, A., Zongker, D.: Feature selection: Evaluation, application and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(2), 153–158 (1997)
6. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms. In: Proceedings of the 5th IEEE International Conference on Data Mining, Houston, Texas, pp. 218–225. IEEE Computer Society, Los Alamitos (2005)
7. Kira, K., Rendell, L.A.: The feature selection problem: Traditional methods and a new algorithm. In: Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, pp. 129–134. MIT Press, Cambridge, MA (1992)
8. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the Joint Conference on Artificial Intelligence, Montreal, Canada, pp. 1137–1145. Morgan Kaufmann, San Mateo, CA (1995)
9. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997)
10. Kuncheva, L.I.: A stability index for feature selection. In: Proceedings of the 25th International Multi-Conference on Artificial Intelligence and Applications, pp. 390–395 (February 2007)
11. Pudil, P., Novovičová, J., Kittler, J.: Floating search methods in feature selection. Pattern Recognition Letters 15, 1119–1125 (1994)
12. Shannon, C.: Mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
A Movie Classifier Based on Visual Features
Hui-Yu Huang¹, Weir-Sheng Shih², and Wen-Hsing Hsu²
¹ Department of Computer Science and Information Engineering, National Formosa University, Huwei, Yunlin 632, Taiwan
[email protected]
² Department of Electrical Engineering, National Tsing Hua University, Hsinchu 300, Taiwan
Abstract. In this paper, we propose an approach to classifying films into genres by using low-level features and visual features. Our current domain of study is the movie preview. A film preview often emphasizes the theme of a film and hence provides suitable information for the classification process. In our approach, we classify films into three broad categories: Action, Drama, and Thriller films. Four computable video features (average shot length, color variance, motion content and lighting key) and visual effects are combined to provide useful information for determining the movie category. Our approach can also be extended to other potential applications, including browsing, retrieval of videos on the internet, video-on-demand, and video libraries.
Keywords: shot detection, movie genres.
1
Introduction
With recent advances in digital video coding and transmission, ever larger amounts of digital video are becoming available from various sources every day. The advent of digital television and the internet is yet another motivating factor for automated analysis of digital video. Although digital video can be labeled at the production stage, there is still a need for automatic classification of videos. Manual annotation or analysis of video is an expensive and arduous task that will not be able to keep up with the rapidly increasing volume of video data in the near future. Hence, people need technologies such as video databases, video browsing, indexing, and data mining to manage videos and discover knowledge from them. Scene-level classification would allow a departure from the prevalent system of movie ratings to a more flexible system of scene ratings. For instance, a child would be able to watch movies containing a few scenes with excessive violence if a pre-filter system can prune out the scenes that have been rated as violent. To classify movies effectively and thereby protect children who download films from the internet, a film classification process needs to be developed.

Pickering and Ruger [1] investigated the application of a variety of content-based image retrieval methods to video retrieval. Low et al. [2] presented an overview of developing an integrated, content-based solution for computer-assisted video parsing, abstraction, retrieval and browsing. The most important feature of this approach is the use of low-level visual features as a representation of video content and an automatic abstraction process. To bridge low-level media features and high-level semantics, many algorithms have been proposed that link low-level features to object-level search and then connect the object level to the event level. Chua and Ruan [3] designed a system to support the entire process of video information management: segmenting, logging, retrieving, and sequencing of video data. Its main contribution is a semiautomatic tool that divides video sequences into meaningful shots.

The rest of the paper is organized as follows. The proposed method is described in Section 2. Experimental results are given in Section 3. Finally, the paper is concluded in Section 4.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 937–944, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
Proposed Method
In this section, we present the film classification method, which uses movie previews within the feature-based paradigm. To determine the video categories, we collected all movies shown in Taiwan from 2004 to 2006 from [4]. Many genres can be found, such as Action, Adventure, Comedy, Crime, Drama, Animation, Thriller, War, etc. According to these data, the Action, Drama and Thriller genres account for almost 88% of the films in every year. Hence, if we can classify these categories, the greater part of movies will be covered. Based on this viewpoint, we identified three major genres to demonstrate our approach: Action/Adventure, Drama (including comedy, drama, romance) and Thriller (or Horror).
2.1
Color Space Selection
YUV is used in television broadcasting and in the MPEG compression standard. The HSV model is commonly used in computer graphics applications and fits human color perception well [5]. Since the function of a movie is to entertain human viewers, we adopt the YUV and HSV color spaces to extract color information.
2.2
Shot Boundary Detection
Shot boundary detection is used in video processing to estimate the relationship between frames. In our approach, we extend the method for detecting shot boundaries [6] that uses color histogram intersection in the HSV color model. First, the video is decomposed into frames, which can be seen as still images. Then, a histogram is computed for each frame; it consists of 16 bins: eight components for hue, four for saturation, and four for value. Let $S(i)$ represent the intersection of the histograms $H_i$ and $H_{i-1}$ of the $i$th and $(i-1)$th frames, respectively. It is expressed as
$$S(i) = \sum_{j \in \text{all bins}} \min(H_i(j), H_{i-1}(j)). \tag{1}$$
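The histogram intersection of Eq. (1) can be sketched in a few lines. The layout of the 16 bins is not spelled out in the text, so this sketch assumes three concatenated marginal histograms (8 hue, 4 saturation, 4 value bins) normalised to sum to 1, and frames given as lists of RGB triples with components in [0, 1]; both are our assumptions.

```python
import colorsys

def hsv_histogram(pixels):
    """16-bin histogram: 8 hue + 4 saturation + 4 value bins, summing to 1."""
    hist = [0.0] * 16
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        # min(...) keeps the boundary value 1.0 inside the last bin.
        hist[min(int(h * 8), 7)] += 1.0
        hist[8 + min(int(s * 4), 3)] += 1.0
        hist[12 + min(int(v * 4), 3)] += 1.0
    total = sum(hist)
    return [x / total for x in hist]

def intersection(h_cur, h_prev):
    """S(i) of Eq. (1): sum over all bins of the bin-wise minimum."""
    return sum(min(a, b) for a, b in zip(h_cur, h_prev))

# Two tiny synthetic "frames": uniformly red and uniformly blue.
red_frame = [(1.0, 0.0, 0.0)] * 4
blue_frame = [(0.0, 0.0, 1.0)] * 4
same = intersection(hsv_histogram(red_frame), hsv_histogram(red_frame))
different = intersection(hsv_histogram(red_frame), hsv_histogram(blue_frame))
print(same, different)
```

With this normalisation, identical frames give a similarity of 1, and the value drops as the color content of consecutive frames diverges; here the two frames share saturation and value but not hue.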
The magnitude of $S(i)$ is often used as a measure of shot boundaries in related work. The values of $i$ where $S(i)$ is less than a fixed threshold are assumed to be shot boundaries. Applying a fixed threshold to $S(i)$ when the shot transition occurs with a dissolve generates several outliers, because consecutive frames differ from each other until the shot transition is completed. To improve the accuracy, an iterative smoothing of the 1-D function $S$ is performed first. We have adapted the algorithm proposed by Perona and Malik [7] based on anisotropic diffusion. $S$ is smoothed iteratively using a Gaussian kernel such that the variance of the Gaussian function varies with the signal gradient. Formally,
$$S^{t+1}(i) = S^{t}(i) + \lambda \left[ c_E^{t} \cdot \nabla_E S^{t}(i) + c_W^{t} \cdot \nabla_W S^{t}(i) \right], \tag{2}$$
where $t$ is the iteration number and $0 < \lambda < 1/4$, with
$$\nabla_E S(i) \equiv S(i+1) - S(i), \tag{3}$$
$$\nabla_W S(i) \equiv S(i-1) - S(i). \tag{4}$$
The conduction coefficients are a function of the gradients and are updated at every iteration,
$$c_E^{t} = g(|\nabla_E S^{t}(i)|), \tag{5}$$
$$c_W^{t} = g(|\nabla_W S^{t}(i)|), \tag{6}$$
where $g(\nabla_E S) = e^{-(|\nabla_E|/k)^2}$ and $g(\nabla_W S) = e^{-(|\nabla_W|/k)^2}$. The constants were set to $\lambda = 0.1$ and $k = 0.1$. Finally, the shot boundaries are detected by finding the local minima in the smoothed similarity function $S$. Thus, a shot boundary is detected where two consecutive frames have minimum color similarity.
2.3
Visual Feature Extraction
The low-level features and visual features are then extracted. The low-level features comprise average shot length, color variance, and lighting key [6]. In addition, we found that human feeling can be affected by movie rhythm. At a shot change, the movie editor uses video editing techniques to enhance the visual effect. To analyze the rhythm in more detail, we propose two visual effect features, the slow moving effect and the fast moving effect, which the film editor creates by means of shot changes.

Slow moving effect: This visual effect is a gradual change from shot to shot. Fade and Dissolve are classified as this effect, because they usually need a long time for the shot change. Fade and Dissolve are shown in Fig. 1.

Fast moving effect: This visual effect consists of two or more hard changes within a short time. It is heavily used in Action and Thriller films. The frames involved possess high contrast to their neighboring frames. The effect can also be seen as several Abrupt Cuts occurring in a short time; in our experiments, we declare a fast moving effect if at least 2 Abrupt Cuts occur within 0.2 seconds. Two fast moving effects are shown in Fig. 2.

Fig. 1. Slow moving effect from movie Silent Hill. (a) Fade in and Fade out. (b) Dissolve.

Fig. 2. Fast moving effect. (a) A flash occurring in a short time, from movie Silent Hill. (b) A short shot from movie i Robot.

Abrupt Cut: The total brightness difference is almost equal to the brightness difference at the change frame, because there is only one rapid change in the interval.

Fade: A minimum of the film brightness can be found in the interval, because a fade must pass between the original shot and black frames. Besides that, the brightness increases or decreases gradually.

Dissolve: This technique is hard to identify by brightness. The brightness can gradually increase, decrease, or remain unchanged. If a shot change is detected and cannot be recognized as an Abrupt Cut or a Fade, we label it as a Dissolve.

We only want to find the fast moving effect and the slow moving effect: the fast moving effect is defined by Abrupt Cuts occurring within 0.2 seconds, and the slow moving effect includes Fade and Dissolve. Thus, we design a rule to classify the two change modes (denoted CH), Abrupt Cut and Gradual Change (including Fade and Dissolve). Let $B$ denote the brightness of the frames neighboring the change frame and $B_d$ the brightness difference between two frames. The classification rule is defined as
$$CH \in \begin{cases} \text{Abrupt Cut}, & \text{if } 0.8 < T < 1.2, \\ \text{Gradual Change}, & \text{otherwise}, \end{cases} \tag{7}$$
where $T = (\max(B) - \min(B)) / \max(|B_d|)$. With this classification rule and the visual effect definitions, we can extract the fast moving effect and the slow moving effect.
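The rule of Eq. (7) and the 0.2-second criterion for the fast moving effect can be sketched as follows. The sketch assumes that the brightness values of the frames around a detected change are already available as a list; the function names and the `fps` parameter are hypothetical and only serve the illustration.

```python
def classify_change(brightness):
    """Eq. (7): label a detected shot change from the brightness values B
    of the frames neighboring the change frame."""
    diffs = [b - a for a, b in zip(brightness, brightness[1:])]
    largest_step = max(abs(d) for d in diffs)
    if largest_step == 0:
        return "Gradual Change"  # flat brightness: no dominant jump (Dissolve case)
    t = (max(brightness) - min(brightness)) / largest_step
    return "Abrupt Cut" if 0.8 < t < 1.2 else "Gradual Change"

def fast_moving_effects(cut_frames, fps, window=0.2):
    """Pairs of consecutive Abrupt Cut frame numbers at most `window` seconds apart."""
    return [(a, b) for a, b in zip(cut_frames, cut_frames[1:])
            if (b - a) / fps <= window]

hard_cut = classify_change([100, 100, 100, 40, 40, 40])   # one jump of 60
fade = classify_change([100, 80, 60, 40, 20, 0])           # many small steps
effects = fast_moving_effects([10, 13, 100], fps=25)
print(hard_cut, fade, effects)
```

For a single hard cut, the brightness range equals the largest frame-to-frame step, so $T$ is close to 1; for a fade, the range accumulates over many small steps and $T$ becomes large.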
In addition, we assign the slow moving effect and the fast moving effect a numerical value that represents the visual effect characteristics. Using the frame numbers that were detected as fast/slow moving, we calculate the distance between each two neighboring frames and thus obtain a new vector. Quantizing this vector, we calculate the relative histograms normalized to 1. This gives two new features, the fast and slow moving effect distributions, $f_v$ and $s_v$, respectively. In our experiments, we set the total length of the visual effect distribution to 100. From these distributions we estimate the values of the two features, denoted $F_{me}$ and $S_{me}$ and defined as
$$F_{me} = \sum_{l=1}^{n} l \cdot f_v(l), \qquad S_{me} = \sum_{l=1}^{n} l \cdot s_v(l), \tag{8}$$
where $n$ is set to 100. If a kind of visual effect never occurs in a film, we set its visual effect value to 100. The visual effect value can serve as the expected value of how often this visual effect occurs in a film: the smaller the value, the more frequent the visual effect.
2.4
Classification Method
Thus far, we have discussed the relevance of various low-level features of video data based on feature-space analysis. We now choose the visual effect values and the lighting key as features and build a classifier model, as shown in Fig. 3. A decision tree is a flowchart-like tree structure in which each node denotes a test on the corresponding attribute, each branch represents a test outcome, and leaf nodes represent the classes or class distributions. The first decision rule is based on the visual effect values, namely the fast moving effect value $F_{me}$ and the slow moving effect value $S_{me}$. For a non-drama film, the fast moving effect value is lower than for a drama film. We construct classification trees to calculate the thresholds $T_{fm}$ and $T_{sm}$ and then decide whether a film is a drama or a non-drama film. For a film $F(i)$, the decision rule is defined as
$$F(i) \in \begin{cases} \text{Non-drama}, & \text{if } F_{me} < T_{fm} \text{ and } S_{me} < T_{sm}, \\ \text{Drama}, & \text{otherwise}. \end{cases} \tag{9}$$
Fig. 3. Film classification construction diagram (A two-layer classifier structure)
The second decision rule is based on the lighting information. Although this feature is less significant than the other features for distinguishing Drama or Comedy films, it can still be used to distinguish Action from Thriller films. The lighting value is denoted $L$. A threshold $T_l$ is calculated by means of classification trees, and Action and Thriller films are distinguished by the decision rule
$$F(i) \in \begin{cases} \text{Action}, & \text{if } L > T_l, \\ \text{Thriller}, & \text{if } L \le T_l. \end{cases} \tag{10}$$
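The two-layer rule of Eqs. (9) and (10), together with the visual effect value of Eq. (8), can be sketched as below. The thresholds in the example call are made-up numbers for illustration only; in the paper they are obtained from classification trees.

```python
def effect_value(distribution):
    """Eq. (8): sum over l = 1..n of l * f(l) for a normalised distribution.
    Returns 100 if the effect never occurs (all-zero distribution)."""
    if not any(distribution):
        return 100.0
    return sum(l * p for l, p in enumerate(distribution, start=1))

def classify_film(f_me, s_me, lighting, t_fm, t_sm, t_l):
    """Two-layer decision: visual effect values first, lighting key second."""
    if f_me < t_fm and s_me < t_sm:
        # Non-drama branch of Eq. (9); Eq. (10) separates Action from Thriller.
        return "Action" if lighting > t_l else "Thriller"
    return "Drama"

# Hypothetical distribution of length n = 100 concentrated at short distances.
f_v = [0.5, 0.5] + [0.0] * 98
f_me = effect_value(f_v)
genre = classify_film(f_me, 20.0, 0.6, t_fm=30.0, t_sm=40.0, t_l=0.5)
print(f_me, genre)
```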
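Thus, we can express all decision rules as
$$F(i) \in \begin{cases} \text{Action}, & \text{if } F_{me} < T_{fm} \text{ and } S_{me} < T_{sm} \text{ and } L > T_l, \\ \text{Thriller}, & \text{if } F_{me} < T_{fm} \text{ and } S_{me} < T_{sm} \text{ and } L \le T_l, \\ \text{Drama}, & \text{otherwise}. \end{cases} \tag{11}$$
3
Experimental Results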
In our experiments, we chose 44 films to test and verify the proposed method, including 9 thrillers, 10 action films, and 25 dramas (containing comedies). To test the performance of the visual effect detection, we chose six film previews; the total numbers of Abrupt Cuts and Gradual Changes (including Fade and Dissolve) are 484 and 176, respectively. Fig. 4 shows the variability of average shot length. It is clear from Fig. 4 that the average shot length in Action films is shorter than in Drama and Comedy films, which means that Action films have a faster tempo than Drama and Comedy films; thus they can be distinguished by this feature. Drama and Comedy films are almost the same with respect to this feature. The distributions of motion content and color variance are shown in Fig. 5. The activity in Action films is higher than in Drama films. Color variance has a strong correlational structure with respect to genres. However, in our experiments, it is hard to crisply distinguish film genres by it. In other words, the color information is not a critical condition. Fig. 6 shows the result of the visual effect values. Action and Thriller films tend to have a lower fast moving effect value because they contain more flashes or short action shots when the fast moving effect occurs. That is, the frames in which these effects happen are concentrated close to their neighboring frames. Fig. 7 shows the distribution of the lighting key feature. From Fig. 7, the lighting factor in these films is an insignificant feature; hence, it cannot effectively categorize the film cluster.

Fig. 4. Variability of average shot length
Fig. 5. Distribution of motion content and color variance
Fig. 6. Distribution of visual effect value
Fig. 7. Distribution of lighting key

Table 1. Abrupt Cut detection result
Nc    Na    Nd    Precision(%)   Recall(%)
467   484   486   96.09          96.49

Table 2. Gradual Change detection result
Nc    Na    Nd    Precision(%)   Recall(%)
173   176   232   74.57          98.01

The Abrupt Cut detection result is presented in Table 1. In addition, we use two parameters, precision and recall [8], to evaluate the performance of the visual effect detection. Precision and recall denote the accuracy of detection and the ability of detection, respectively, defined by
$$\text{Recall} = \frac{N_c}{N_a} \times 100\%, \tag{12}$$
$$\text{Precision} = \frac{N_c}{N_d} \times 100\%, \tag{13}$$
where $N_c$ is the number of correct detections, $N_a$ is the number of changes that actually occurred, and $N_d$ is the number of detections. The Gradual Change (including Fade and Dissolve) detection result is given in Table 2. To obtain the classification result, we randomly choose two-thirds of the films in each category for training the classification tree, from which we calculate the decision thresholds $T_{fm}$, $T_{sm}$, and $T_l$ by Eq. (10); the rest of the films are used as the test data. We repeat this step six times over all of the films. The accuracy of this classifier can be measured by precision [9] as
$$\text{precision} = t/w, \tag{14}$$
where $t$ is the number of samples that actually belong to this film genre and are correctly classified, and $w$ is the number of all samples that are classified as this film genre. Note that this precision, which estimates classification performance, differs from the precision used for detection performance.
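The detection measures of Eqs. (12) and (13) are easy to check against the counts reported for Abrupt Cut detection in Table 1. A minimal sketch, with a function name of our own choosing:

```python
def detection_scores(n_correct, n_actual, n_detected):
    """Precision and recall in percent, following Eqs. (12) and (13)."""
    recall = 100.0 * n_correct / n_actual
    precision = 100.0 * n_correct / n_detected
    return precision, recall

# Counts from Table 1: Nc = 467, Na = 484, Nd = 486.
precision, recall = detection_scores(467, 484, 486)
print(f"precision = {precision:.2f}%, recall = {recall:.2f}%")
# prints: precision = 96.09%, recall = 96.49%
```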
4
Conclusions
In this paper, we have analyzed low-level and visual features applied to the classification of film genres. The experimental results show that combining visual cues with cinematic principles can provide powerful tools for genre categorization. In other words, our approach can serve as a pre-filter system for selecting the movies you want to watch on the internet. In the future, we will add audio or text cues to improve the classification accuracy and analyze more movie categories.
Acknowledgments
This work was supported in part by the National Science Council of the Republic of China under Grants No. NSC 94-2213-E-007-055 and NSC 95-2221-E-150-091.
References
1. Pickering, M.J., Ruger, S.: Evaluation of key frame-based retrieval techniques for video. Computer Vision and Image Understanding 92, 217–235 (2003)
2. Low, C.Y., Zhang, H.J., Smoliar, S.W.: Video parsing, retrieval and browsing: an integrated and content-based solution. In: Proc. of the ACM Int. Conf. on Multimedia, pp. 15–24 (1995)
3. Chua, T.S., Ruan, L.Q.: A video retrieval and sequencing system. ACM Trans. on Inf. Systems 13, 373–407 (1995)
4. http://www.truemovie.com/
5. Gonzalez, R.C., Woods, R.E.: Digital image processing, 2nd edn. Prentice-Hall, Englewood Cliffs (2002)
6. Rasheed, Z., Sheikh, Y., Shah, M.: On the use of computable features for film classification. IEEE Trans. on Circuits and Systems for Video Technology 15, 52–63 (2005)
7. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence 12, 629–639 (1990)
8. Lupatini, G., Saraceno, C., Leonardi, R.: Scene break detection: a comparison. In: Proc. of IEEE Int. Workshop on Research Issues in Data Engineering, pp. 34–41 (1998)
9. Yuan, Y., Song, Q.B., Shen, J.Y.: Automatic video classification using decision tree method. In: Proc. of the First Int. Conf. on Machine Learning and Cybernetics, pp. 1153–1157 (2002)
An Efficient Method for Filtering Image-Based Spam E-mail
Ngo Phuong Nhung and Tu Minh Phuong
Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
[email protected]
Abstract. Spam e-mail with advertisement text embedded in images presents a great challenge to anti-spam filters. In this paper, we present a fast method to detect image-based spam e-mail. Using simple edge-based features, the method computes a vector of similarity scores between an image and a set of templates. This similarity vector is then used with support vector machines to separate spam images from other common categories of images. Our method does not require expensive OCR or even text extraction from images. Empirical results show that the method is fast and has good classification accuracy.
1 Introduction
The increasing number of Internet users and the low cost of e-mail make this form of communication very attractive for direct marketers. As a consequence, the volume of unsolicited commercial e-mail (“spam”) has grown tremendously in the past few years. In addressing this growing problem, many solutions to spam reduction have been proposed. Among these solutions are automated methods for filtering spam. Using hand-crafted rules or machine learning techniques, anti-spam filters analyze the text content of e-mail to detect spam. Some anti-spam filters are reported to achieve accuracy of up to 99% [9].

To circumvent such systems, spammers have invented many techniques. One example is to embed advertising text in images sent with spam. While the contents of such messages are normally viewed by spam receivers, they are shielded from text-based anti-spam filters. By some estimates, up to 25% of spam being sent today contains imagery, and this number is expected to increase [1]. Therefore, it is desirable to develop systems that can detect and filter image-based spam.

A possible way to detect image-based spam is to use a pipeline of an optical character recognition (OCR) system that extracts and recognizes embedded text, followed by a text classifier that separates advertising text from legitimate content. While this solution promises to detect spam with a certain level of accuracy, the existing OCR algorithms are computationally expensive and thus cannot operate on heavily loaded e-mail servers. In a recent paper, Aradhye et al. [1] described an image-based anti-spam filter that does not require full text recognition. Their method starts by extracting regions with overlaid text from images. Based on the text regions and other image elements, the method creates several simple features that are indicative of spam. Images represented by the extracted features are then classified into spam and non-spam using SVM. Since the method does not include the text recognition step, it is much faster than systems with full OCR. However, the extraction of text regions is nontrivial and still requires considerable computational resources.

In this paper, we propose a fast method to detect spam images. The proposed method does not try to extract embedded text from an image. Instead, it uses an edge-based feature vector, which can be computed efficiently, to represent major shape properties of the image. Since most spam images contain large proportions of text (Fig. 1), they must have a shape representation similar to that of other text-intensive images. Our method uses the edge-based feature to compute a vector of similarity measures from an image to a small set of gold standards, that is, images with different proportions of overlaid text. These similarity vectors then serve as input to SVMs that separate spam images from legitimate ones. The method is fast because it does not use computationally expensive image processing and text recognition steps. On a collection of images, the method separates spam images from non-spam ones with an accuracy of 80% and higher.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 945–953, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Examples of spam images: (a) image containing only text, (b) image with photographic elements
Fig. 2. Examples of non-spam images: (a) natural scene photo, (b) greetings e-card
Related work. The problem of image-based spam detection is a special case of image categorization, which has been studied in the context of many important applications. Depending on application requirements and the nature of the images to categorize, generic image categorization methods can use different image features or their combinations to distinguish between two or more classes. In a work by Hu and Bagga [4], the authors relied on the correlation between image functional categories and
An Efficient Method for Filtering Image-Based Spam E-mail
several features, such as whether the images are graphic or photographic and whether the images contain text elements. They used frequency domain analysis of image intensity and DCT coefficients to decide about the presence of these features. The learning and classification steps were then done with SVMs. Gavilan et al. [3] represented an image in terms of blobs, i.e., image regions lighter or darker than the background. The blob representations of the images are then used to train neural networks to distinguish among natural, artificial, portrait, and text images. Another interesting application is indoor vs. outdoor classification [10].
Besides spam images, many other categories of images and videos contain text. Since overlaid text contains important information about image content, the extraction of text blocks and the recognition of their content have attracted research interest in the image processing and multimedia analysis communities. Previous work on text extraction used different characteristics of text regions, such as their color [7], the frequent occurrence of vertical edges, or wavelets and the spatial variance of texture [13], to locate blocks of text.
2 Algorithm
Our proposed spam detection method uses an SVM combined with a vector representation of images to distinguish between spam and non-spam images. The algorithm consists of the three steps outlined next.
1) Feature extraction and normalization using Edge Directions (ED) or the Edge Orientation Autocorrelogram (EOAC). This step summarizes shape properties of an image in terms of edge orientation and correlation.
2) Calculation of similarity scores between the image and a small set of templates or sample images containing only text. This step allows representing the image as a vector of similarity scores with respect to the templates.
3) Training and classification with an SVM. The vector representations of images computed in step 2 are used to train the SVM and subsequently to classify each new image as spam or non-spam.
In the following sections, we provide a detailed description of each of these steps.
2.1 Edge Directions and Edge Orientation Autocorrelogram
In image categorization, it is important to choose appropriate features to represent images. Since spam images are text intensive, and text elements have special shape characteristics that make them different from the background or other elements, it is desirable to use features able to capture such characteristics. At the same time, for the method to be practical, a feature of choice must be fast to compute. Although other features, such as color-based features, texture-based features, and blobs, have been used with success for other image categorization tasks [3,4,8,10], edge-based features are very indicative of text-intensive spam images while remaining simple to compute. In this work, we have chosen two edge-based features: ED [5,6] and EOAC [8]. The ED feature summarizes global shape information. EOAC is an extension of ED that captures the correlation between text elements over small distances. The ED histogram of an image is computed in three steps, and the EOAC is computed by adding two more steps as follows.
1. Edge detection: The Sobel operator is used to generate two edge components G_x and G_y, from which the edge amplitude and edge orientation are computed as follows:
$$|G| = \sqrt{G_x^2 + G_y^2} \quad \text{and} \quad \angle G = \tan^{-1}(G_y / G_x)$$
2. Finding prominent edges: Only edges with an amplitude higher than a predefined threshold T1 are extracted. In our experiments we used T1 = 25 as in [8].
3. Edge orientation quantization: Edge orientations are quantized into k segments ∠G1, ∠G2, …, ∠Gk, each five degrees wide. The result of this step is the ED histogram.
4. Determining the distance set: This step constructs a distance set D, whose members are the distances from the current edge. This set is used to compute the correlations in the next step. We used the four-member set of the original EOAC paper [8], D = {1, 3, 5, 7}.
5. Computing the EOAC matrix: The EOAC matrix is a two-dimensional array with k rows and |D| columns. Each element of this matrix contains the number of similar edges with orientation ∠Gj that are i pixels apart. Two edges are defined to be similar if the absolute differences of their orientations and of their amplitudes are less than an angle threshold and an amplitude threshold, respectively.
Figure 3 shows examples of EOAC graphs and ED histograms. The ED and EOAC features are translation invariant. This is a desired characteristic in our case because text can be located anywhere within an image. To also achieve scale invariance, the ED histogram is normalized with respect to the number of edge points of the image, and the EOAC matrix is normalized with respect to the sum of the populations of all EOAC bins.
2.2 Calculation of Similarity Scores
If the proportion of overlaid text within an image is large enough, the contribution of the text to the image’s ED and EOAC will dominate that of the other image elements. As a consequence, two images containing large amounts of text tend to share similarities in their shape representations. The algorithm proposed herein exploits this observation to distinguish text-intensive images from others. Specifically, in this step, the algorithm computes similarity scores of the image with respect to a small set of n sample images or templates.
Here we define a template as a specially constructed image that contains only text. The proportions of text as well as the text characteristics are chosen to vary among different templates so that the set covers a large variety of images with overlaid text. Figure 3 shows a non-spam image (a), a spam image (b), and a template (c), as well as their respective EOAC graphs [(d)-(f)] and ED histograms [(g)-(i)]. As shown in the figure, the EOAC and ED representations of the spam image share some similarities with those of the template, while the representations of the non-spam image look very different. To measure the similarity between an image with EOAC X and a template with EOAC Y, we use the L1 distance, computed as follows:
$$L_1(X, Y) = \sum_{i=1}^{k} \sum_{j=1}^{d} |X_{ij} - Y_{ij}|$$
where k is the number of orientation segments and d = |D| is the number of distances.
For the ED feature, the similarity is computed similarly.
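As an illustration only (this is not the authors' implementation), the ED histogram of Section 2.1 and the L1 comparison against templates can be sketched in plain Python. The function names, the representation of an image as a list of rows of gray levels, and the use of `atan2` to obtain orientations over the full 360 degrees are assumptions of this sketch; the text only states that orientations are obtained as tan^-1(G_y / G_x) and quantized into five-degree segments.

```python
import math

def sobel_gradients(img):
    """Sobel responses (gx, gy) for a 2-D list of gray levels; border pixels stay 0."""
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx[r][c] = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]
                        - img[r - 1][c - 1] - 2 * img[r][c - 1] - img[r + 1][c - 1])
            gy[r][c] = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]
                        - img[r - 1][c - 1] - 2 * img[r - 1][c] - img[r - 1][c + 1])
    return gx, gy

def ed_histogram(img, threshold=25.0, bin_deg=5.0):
    """Normalized edge direction histogram (steps 1-3 plus normalization)."""
    gx, gy = sobel_gradients(img)
    k = int(round(360.0 / bin_deg))  # assumption: orientations cover 360 degrees
    hist = [0.0] * k
    for row_x, row_y in zip(gx, gy):
        for x, y in zip(row_x, row_y):
            if math.hypot(x, y) > threshold:          # keep prominent edges only
                angle = math.degrees(math.atan2(y, x)) % 360.0
                hist[int(angle // bin_deg) % k] += 1.0
    total = sum(hist)                                  # number of edge points
    return [v / total for v in hist] if total else hist

def l1_distance(a, b):
    """L1 distance between two histograms of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))

def similarity_vector(img, templates, threshold=25.0, bin_deg=5.0):
    """Vector of L1 distances from one image to each of the n templates."""
    h = ed_histogram(img, threshold, bin_deg)
    return [l1_distance(h, ed_histogram(t, threshold, bin_deg)) for t in templates]
```

Because both histograms are normalized, the distance lies between 0 (identical histograms) and 2 (no edge orientation in common). The same `l1_distance` applies to a flattened EOAC matrix.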
[Fig. 3 panels: (a) Non-spam image, (b) Spam image, (c) Template image; (d)-(f) EOAC graphs; (g)-(i) ED histograms; horizontal axes run from 0 to 70.]
Fig. 3. An example of shape representation using EOAC and ED: (a)-(c) show three images, where (a) is a non-spam image, (b) a spam image, and (c) a template; (d)-(f) show their respective EOAC graphs; and (g)-(i) their respective ED histograms.
At the end of this step, each image is represented by a vector of length n whose elements are the similarity scores with respect to the templates. In the machine learning community, the idea of representing an object via its similarity to a set of other objects is known as an empirical feature map [11]. The merit of the empirical feature map is that it provides a general way to map from similarity scores to a vector representation, from which a proper kernel can be constructed for use with an SVM.
2.3 Support Vector Machine Learning and Classification
With the similarity vectors computed above, we train an SVM to differentiate between the two classes: spam and non-spam. The SVM algorithm relies on two main ideas. First, the algorithm maps the given training set of n-dimensional vectors with positive and negative examples into a (possibly) high-dimensional feature space. Then, in the feature space, the algorithm seeks a hyperplane with two properties: 1) it separates the positive from the negative examples; 2) it maintains a maximum margin from any example in the training set [12]. Having found such a hyperplane, the SVM predicts the label of a new example by mapping it into the feature space and determining on which side of the hyperplane the example is located. The mapping from the input
space to the feature space is done by using a so-called kernel function. In this work, we used similarity vectors as input to the SVM and tried several kernel functions defined on the input vectors.
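The prediction step described above can be sketched as follows. This is a minimal illustration of the standard kernel form of the SVM decision function, f(x) = sum_i alpha_i y_i K(x_i, x) + b, and not the WEKA-based code used in the experiments. The support vectors, coefficients, bias, and the convention that the positive side of the hyperplane means spam are all hypothetical values chosen for the example.

```python
import math

def linear_kernel(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian RBF kernel; gamma is an assumed default."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, labels, alphas, bias, kernel=linear_kernel):
    """Signed score of x relative to the separating hyperplane in feature space."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + bias

def classify(x, support_vectors, labels, alphas, bias, kernel=linear_kernel):
    """Assumption of this sketch: label +1 (positive side) stands for spam."""
    score = svm_decision(x, support_vectors, labels, alphas, bias, kernel)
    return "spam" if score > 0 else "non-spam"
```

In practice the support vectors, coefficients, and bias come out of training; here they would be passed in by hand, for example two support vectors with labels +1 and -1 and unit coefficients.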
3 Experiments and Results
Dataset. Unlike the situation with text-based spam filtering, to the best of our knowledge there is no public benchmark dataset for image-based spam. To create a dataset for experiments, we collected images from spam messages arriving at an e-mail server. We used images from a spam message only if the message did not contain text in its body, so that the image content provides the major source of information for the spam/non-spam decision. Images that exist only for formatting were not included. The spam part of the dataset contains 411 images, about half of which contain only text without a complex background (Fig. 1a). The other images contain graphic and photographic elements with different levels of complexity (Fig. 1b).
The non-spam images in our dataset were collected from several sources. We asked our friends and colleagues to donate e-cards they had received by e-mail. The e-card collection was augmented by free e-cards downloaded from different websites with default text. This resulted in 287 images, all of which contain text. We further randomly selected 300 images from the CorelDraw collection. Finally, following the work presented in [1], another 723 images were collected by querying the Google Images search engine with the keywords “nature photo”, “portrait” and “baby” and then randomly selecting from what the engine returned.
Evaluation methodology. We assessed the method using 10-fold cross-validation. The dataset was randomly divided into 10 folds of equal size. One fold was held out as the test set and the other folds were used for training. The experiment was repeated 10 times with a different fold as the test set each time; the classification accuracy was calculated by averaging over the 10 runs. In our evaluation, we used spam recall and non-spam recall, defined as follows:
$$\text{spam recall} = \frac{\#\text{ of spam correctly classified}}{\#\text{ of all spam}}, \qquad \text{non-spam recall} = \frac{\#\text{ of non-spam correctly classified}}{\#\text{ of all non-spam}}$$
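The two recall measures and the fold assignment can be sketched as below. This is a hedged illustration, not the evaluation code used in the experiments: the function names, the string label "spam", and the round-robin fold assignment after shuffling are assumptions of the sketch.

```python
import random

def recalls(true_labels, predicted_labels, spam="spam"):
    """Return (spam recall, non-spam recall) for two parallel label sequences."""
    spam_total = spam_hit = ham_total = ham_hit = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == spam:
            spam_total += 1
            spam_hit += int(p == spam)      # spam correctly classified
        else:
            ham_total += 1
            ham_hit += int(p != spam)       # non-spam correctly classified
    spam_recall = spam_hit / spam_total if spam_total else float("nan")
    ham_recall = ham_hit / ham_total if ham_total else float("nan")
    return spam_recall, ham_recall

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

Each fold in turn serves as the test set while the remaining folds are used for training, and the two recalls are averaged over the k runs.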
SVM training. To conduct the experiments we used the WEKA library (http://www.cs.waikato.ac.nz/ml/weka). All the similarity vectors were normalized to have unit length. In all the experiments we used an SVM with soft margin and C = 1. We tried different kernels and found that the linear kernel and the RBF kernel give the best results. In what follows, only results obtained with the linear kernel are reported.
Template set construction. Since the algorithm makes sense only when an image contains a dominant amount of text, the templates were constructed so that text regions cover at least 50% of each template. Specifically, we used black-and-white templates with text areas covering from 50% to 90% of the whole image in steps of 10%. Letters were chosen uniformly so that all the letters of the English alphabet appear with the same frequency. We tried several commonly used font families and their italic and bold face variants. Since EOAC is scaling-invariant, the
choice of font size is not critical. In our experiments, we used a font size of 10. To avoid unexpected effects when comparing images of different sizes, we used ten sets of templates, each consisting of templates of the same size. The sizes ranged from 80x60 to 800x600 pixels in steps of 80x60. When computing the similarity vector for a given image, the set whose size is closest to the image’s size is used.
Results. In the first experiment, we compared the performance of two versions of the method, with ED and with EOAC used as image features. We also examined how the use of templates with different font families affected the spam detection accuracy. In addition to three fonts that are commonly used on the Internet, namely Times New Roman (TNR), Arial, and Tahoma, we used two other template sets constructed with Gothic and Lucida fonts. The results are summarized in Table 1. They show that, on average, the two edge-based features have nearly the same non-spam recall, but EOAC gives significantly higher spam recall. At the same time, computing EOAC is more expensive than computing ED. On a PC with a Pentium IV processor and 512 MB of RAM, the computation of EOAC and ED for 1000 images of 800x600 pixels takes 260 seconds and 125 seconds, respectively.
Table 1. Spam recall and non-spam recall for different edge-based features and font families
Feature  Measure           TNR    Tahoma  Arial  Gothic  Lucida
ED       Spam recall       0.73   0.75    0.74   0.74    0.79
ED       Non-spam recall   0.88   0.88    0.88   0.88    0.87
EOAC     Spam recall       0.80   0.83    0.80   0.79    0.86
EOAC     Non-spam recall   0.87   0.87    0.88   0.87    0.84
[Fig. 4 plot: non-spam recall (vertical axis, 0.5 to 1) against spam recall (horizontal axis, 0.5 to 1) for the categories Nature photo, Portrait, and E-cards.]
Fig. 4. Spam and non-spam recalls for different image categories
As expected, the proposed method is not sensitive to the choice of font family used in the template construction step. Except for small fluctuations in the cases of Tahoma and Lucida, the different font families give nearly the same classification accuracy. A possible explanation is that the small differences in the shapes of letters from different font families are smoothed out during the edge orientation quantization step.
The next experiment was designed to evaluate the performance of the method on different categories of non-spam images. The method was run with EOAC as the image feature and the template set constructed from the Tahoma font family. In Fig. 4, non-spam recall for each image category is plotted against spam recall. The results show that the method can accurately distinguish spam images from “nature photo” images: both spam and non-spam recalls are higher than 93%. At the same time, e-cards proved to be the most difficult to distinguish from spam images. Images of this category always contain some amount of overlaid text, which makes them similar to spam.
4 Conclusion
We have described a new method for detecting spam e-mail with content embedded in images. Given an image, our method first extracts an edge-based feature, which summarizes the global information of the image. It then computes a vector of similarity scores between the image and a set of templates that contain only text. This vector representation of the image is used as input for support vector machine training and classification. The use of an edge-based feature captures regularities in the shapes of text-intensive spam images without requiring costly computations. Empirical tests show that our method achieves an overall accuracy of 80% and higher in separating spam from different categories of images while remaining fast. Given the complexity of the problem, these results are encouraging, and the proposed method can be used as a starting step for the construction of image-based anti-spam filters.
References
1. Aradhye, H.B., Myers, G.K., Herson, J.A.: Image Analysis for Efficient Categorization of Image-based Spam E-mail. In: Proc. of ICDAR’05, Seoul, Korea, pp. 914–918 (2005)
2. Fumera, G., Pillai, I., Roli, F.: Spam filtering based on the analysis of text information embedded into images. J. of Machine Learning Research 7, 2699–2720 (2006)
3. Gavilan, D., Takahashi, H., Nakajima, M.: Image Categorization Using Color Blobs in a Mobile Environment. Computer Graphics Forum (EG 2003) 22(3), 427–432 (2003)
4. Hu, J., Bagga, A.: Categorizing Images in Web Documents. IEEE Multimedia 11(1), 22–30 (2004)
5. Jain, A.K., Vailaya, A.: Image retrieval using color and shape. Pattern Recognition 29(8), 1233–1244 (1996)
6. Jain, A.K., Vailaya, A.: Shape-based retrieval: a case study with trademark image database. Pattern Recognition 31(9), 1369–1390 (1998)
7. Lienhart, R., Effelsberg, W.: Automatic Text Segmentation and Text Recognition in Video Indexing. ACM/Springer Multimedia Systems 8, 69–81 (2000)
8. Mahmoudi, F., Shanbehzadeh, J., Soltanian-Zadeh, H.: Image retrieval based on shape similarity by edge orientation autocorrelogram. Pattern Recognition 36, 1725–1736 (2003)
9. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian Approach to Filtering Junk E-Mail. In: Proc. of AAAI-98 Workshop on Learning for Text Categorization (1998)
10. Szummer, M., Picard, R.W.: Indoor-Outdoor Image Classification. In: Proc. IEEE Intl. Workshop on Content-Based Access of Image and Video Databases, pp. 42–51 (1998)
11. Tsuda, K.: Support vector classification with asymmetric kernel function. In: Proc. of 7-th European symposium on Artificial Neural Networks, pp. 183–188 (1999)
12. Vapnik, V.N.: Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York (1999)
13. Li, H., Doermann, D., Kia, O.: Automatic Text Detection and Tracking in Digital Video. IEEE Transactions on Image Processing 9(1), 147–156 (2000)
SVM-Based Active Feedback in Image Retrieval Using Clustering and Unlabeled Data
Rujie Liu1, Yuehong Wang1, Takayuki Baba2, Yusuke Uehara2, Daiki Masumoto2, and Shigemi Nagata2
1 Fujitsu Research and Development Center, Beijing, China {rjliu, wangyh}@cn.fujitsu.com
2 Fujitsu Laboratories LTD., Kawasaki, Japan {baba-t, yuehara, masumoto.daiki, nagata.shigemi}@jp.fujitsu.com
Abstract. In content-based image retrieval, relevance feedback has been extensively studied to bridge the gap between low-level image features and high-level semantic concepts. However, it is still challenged by the small sample size problem, since users are usually not patient enough to label a large number of training instances. In this paper, two strategies are proposed to tackle this problem: (1) a novel active selection criterion that takes into consideration both the informative and the representative measures. With this criterion, the diversity of the selected images is increased while their informative power is kept, so more information gain can be obtained from the feedback images; and (2) incorporation of unlabeled images within the co-training framework. Unlabeled data partially alleviates the training data scarcity problem and can thus improve the efficiency of SVM active learning. Systematic experimental results verify the superiority of our method over some existing active learning methods.
1 Introduction
The success of content-based image retrieval (CBIR) is largely limited by the gap between low-level features and high-level semantic concepts. To bridge this gap, relevance feedback (RF), initially developed in text retrieval, was introduced into CBIR in the 1990s with considerable success[1][2]. Many relevance feedback methods have been developed in recent years along the path from heuristic-based techniques to optimal learning algorithms, such as “Query Point Movement” methods[3], discriminant analysis[4], D-EM[5], and SVM-based methods[6][7]. However, all these methods are challenged by the small sample size problem, because few users are patient enough to label a large number of training instances in the feedback process.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 954–961, 2007. © Springer-Verlag Berlin Heidelberg 2007
Two strategies are commonly used to address the training data scarcity problem: active learning and adopting unlabeled data. Active learning studies strategies by which the learner actively selects samples to query the user so as to achieve maximal information gain in decision making. Cox et al.[8] used entropy minimization in search of the set of images that, once labeled, will minimize the expected number of future iterations. Tong and Chang[6] proposed the SVMActive algorithm for applications in text and image retrieval. According to their criterion, good requests should maximally reduce the size of the version space, which can be approximately achieved by selecting the points near the SVM boundary. Semi-supervised learning with unlabeled data is another strategy to solve the problem of insufficient training data. Based on the EM framework and discriminant analysis, D-EM[5] relaxes the assumption on the probabilistic structure of the data distribution and learns a generative model in a lower dimensional subspace. The co-training paradigm[10] trains two classifiers separately on two distinct views and uses the predictions of each classifier on unlabeled examples to augment the training set of the other classifier.
In this paper, we aim to improve the SVMActive[6] solution by following two strategies:
(1) Enhance its active selection criterion by incorporating a representative measure. To achieve maximal information gain in the feedback process, we argue that the selected images must be informative given the current estimation of the target and that the redundancy between these images has to be low. In SVMActive, the most informative images are returned for labeling; however, much redundancy may exist between these examples. To solve this problem, a representative criterion is proposed to maximize the information capacity of the selected images. In each retrieval round, the images near the classification boundary are clustered by an unsupervised learning process, and one representative image is selected from each cluster for labeling. With this measure, the diversity of the selected images is increased and the feature space covered by them is enlarged, so more information gain can be obtained.
(2) Augment the training set by exploiting unlabeled data within the co-training framework. In each retrieval round, a simple Euclidean distance model is built, and its most confident negative examples are used to augment the training pool of the SVM.
The rest of this paper is organized as follows.
In Section 2, we first introduce the SVMActive method and then incorporate our representative measure to generate a combined criterion. Section 3 presents the use of unlabeled data within the co-training framework. The experimental results are shown in Section 4, followed by the conclusion in Section 5.
2 Active Learning Criterion
2.1 SVMActive - The Informative Measure
The SVMActive criterion[6] approaches active learning through the version space. In the following, the basic idea of this method is briefly presented. Let $\{(x_1, y_1), \ldots, (x_n, y_n)\} \subset (\chi \times \{-1, +1\})^n$ be the training data, with χ denoting a nonempty input space. A Mercer kernel K implicitly defines a mapping φ from the input space χ to the feature space F such that K corresponds to a dot product in F, i.e., $K(x, x') = \langle \phi(x), \phi(x') \rangle$. Assuming that the training data are linearly separable in the feature space, the version space is defined as the set of hyperplanes that separate the data in F. More formally:
$$V = \{ w \in W \mid y_i \langle w, \phi(x_i) \rangle > 0 \ \text{for} \ i = 1 \ldots n \ \text{and} \ \|w\| = 1 \} \qquad (1)$$
where W denotes the parameter space. There exists a duality between the feature space F and the parameter space W: points in F correspond to hyperplanes in W and vice versa. With this definition, learning can be viewed as a search problem within the version space V, which contains the solution we are looking for. Tong[6] proposed that a good way of doing this is to choose query points that halve the version space, so that the version space is reduced as fast as possible. This can be realized by testing each unlabeled instance x in the pool to see how close its corresponding hyperplane in W comes to the center of V. The closer a hyperplane in W is to the center of V, the more evenly it bisects V. The SVM hyperplane w provides a reasonable approximation to the center of the version space; thus, Tong’s SVMActive criterion amounts to selecting the unlabeled instance with minimal distance to the current classification hyperplane.
2.2 The Representative Measure
While Tong’s SVMActive strategy provides an effective solution to the selection of the most informative images, it encounters some problems when used to select more than one candidate image. Adding a batch of new examples selected exclusively on the basis of their distances to the classification hyperplane does not necessarily yield a much greater reduction of the volume of the version space than simply adding one such example. In other words, this strategy does not consider the redundancy among these images.
In this paper, a dynamic clustering process is adopted to enhance the representative abilities of the selected images, as shown in Fig. 1. In each feedback round, the images near the classification boundary and all images labeled so far are clustered into several groups, and one image is selected from each cluster for labeling. Each selected image is thus representative of a small region of the feature space, so the diversity of the selected images is increased while their informative power is kept. This clustering process is dynamic, as the images are selected depending on the current estimation. In addition, both the query image and the images labeled in previous feedback rounds take part in the clustering, in order to refine the feature space as much as possible with the help of the available information.
Normalized cut (Ncut)[11] is used to realize the clustering process. It formulates clustering as a graph partition problem and organizes the image objects into groups such that the within-group similarity is high while the between-group similarity is low. Given N images to be clustered, a weighted undirected graph G = (V, E) is first constructed, where each node represents an image and edges are formed between each pair of nodes. The weight on each edge, $w_{ij}$, is a function of the similarity between nodes i and j, defined as:
$$w_{ij} = e^{-d^2(i,j)/\sigma^2} \qquad (2)$$
Fig. 1. Illustration of the representative measure. The images near the classification boundary are clustered, as denoted by the dashed ellipses; then one image is selected from each cluster to query the user.
where d(i, j) is the distance between images i and j, and σ is a scaling parameter which is typically set to 10 to 20 percent of the total range of the image distances. Assume that the graph G is partitioned into two disjoint sets A and B, with A ∪ B = V and A ∩ B = ∅. The Ncut measure[11], i.e. the normalized weight of the edges that connect the two sets, is then adopted to quantify this partition:
$$Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}, \qquad cut(A, B) = \sum_{u \in A, v \in B} w_{uv} \qquad (3)$$
where $assoc(A, V) = \sum_{u \in A, t \in V} w_{ut}$ is the total connection from nodes in A to all nodes in the graph, and assoc(B, V) is defined similarly. It was proven in[11] that finding a bipartition with minimum Ncut value can be approximately achieved by computing the eigenvector associated with the second smallest eigenvalue of the following generalized eigenvalue system:
$$(D - W) y = \lambda D y \qquad (4)$$
where W is an N × N symmetric matrix with $W(i, j) = w_{ij}$, and D is a diagonal matrix with $d(i) = \sum_j w_{ij}$. The eigenvectors obtained above often take on continuous values, so a splitting point must be chosen to determine the partition. In our implementation, the median value of the second smallest generalized eigenvector is used as the splitting point.
In clustering, the Ncut method is applied recursively to obtain several clusters. This raises two questions: (1) which sub-graph should be divided, and (2) how to determine the number of clusters, that is, when should the process stop? For the first question, we use a simple heuristic: the sub-graph with the maximum number of nodes is selected for further partition. As to the second question, a straightforward way is to set the number of clusters to the number of images that are returned to query the user. However, it leads to another kind
of redundancy. Some clusters usually contain images that have been labeled in previous feedback rounds, so images selected from these clusters may be similar to the labeled ones. In this case, only a little more information can be obtained from these newly selected training instances. To avoid this problem, we adopt the following criterion to control the clustering process: repeat clustering until the number of clusters not containing labeled images equals the number of instances to be labeled. Furthermore, it is assumed that the labeled images can represent the clusters they belong to. Thus, we select images only from the clusters that contain no labeled images and return them to query the user.
2.3 Active Selection
To meet both requirements, viz. that the selected images must be informative and that the redundancy between them must be low, the informative and representative measures are linearly combined for batch image selection. Let $c = \{x_1, \ldots, x_M\}$ denote one cluster containing no labeled images. The representative measure of image $x_i$ is defined as $Rep(x_i) = \sum_{j \in c} w_{ij}$. The larger the within-cluster similarities of an image, the more representative the image is. The informative measure of image $x_i$ is defined as $Inf(x_i) = |g(x_i)|$, where $g(x_i)$ denotes the real-valued output of the SVM learner. By integrating these two factors, the final score of an unlabeled instance $x_i$ can be written as:
$$s(x_i) = -\lambda \, Inf(x_i) + (1 - \lambda) \, Rep(x_i) \qquad (5)$$
where the parameter λ adjusts the individual influence of each factor. For all images in cluster c, the scores are calculated according to Eq. (5), and the image with the largest score is selected for labeling.
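The selection rule of Eq. (5) can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the function name, the representation of clusters as lists of image indices, and the precomputed weight matrix `weights[i][j]` (the similarities of Eq. (2)) are assumptions of the sketch.

```python
def select_per_cluster(clusters, svm_output, weights, lam=0.5):
    """Pick one image index from each cluster by maximizing Eq. (5).

    clusters   : list of lists of image indices (clusters without labeled images)
    svm_output : sequence with the real-valued SVM output g(x_i) per image
    weights    : square matrix, weights[i][j] = similarity between images i and j
    lam        : trade-off between the informative and representative measures
    """
    selected = []
    for cluster in clusters:
        best_index, best_score = None, None
        for i in cluster:
            rep = sum(weights[i][j] for j in cluster)   # Rep(x_i): within-cluster similarity
            inf = abs(svm_output[i])                    # Inf(x_i): distance to the boundary
            score = -lam * inf + (1.0 - lam) * rep      # Eq. (5)
            if best_score is None or score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
    return selected
```

A small |g(x_i)| (an image close to the boundary) and a large within-cluster similarity both raise the score, so with λ = 0.5 the two measures count equally.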
3 Exploiting Unlabeled Images
In each feedback round, another learner is constructed from all images labeled so far, and this learner then makes predictions for the unlabeled images. After that, the several images that this learner considers most negative are used as negative instances to augment the training set of the SVM. To meet the real-time requirement of image retrieval, a very simple model, the weighted Euclidean distance model, is used as the learner:
$$f(x) = d(x, q) = \left( \sum_{i=1}^{d} (x_i - q_i)^2 \, w_i \right)^{1/2} \qquad (6)$$
where x and q denote, respectively, the image to be estimated and the new query point, both represented by d-dimensional feature vectors, and $w_i$ is the weight associated with the i-th feature element.
Assume P and N denote, respectively, the sets of labeled positive and negative examples. The new query point can be obtained by:
$$q = \frac{1}{|P|} \sum_{x_k \in P} x_k - \frac{1}{|N|} \sum_{x_k \in N} x_k \qquad (7)$$
The weights $w_i$ associated with the feature vector reflect the different contributions of the feature components. As in [3], a standard-deviation-based weight calculation approach is used in this paper. Given all the values of one feature component in the positive examples, the inverse of their standard deviation can be used as a good estimate of the weight: the smaller the variance, the larger the weight, and vice versa.
Since the above learner is not strong, the labels it assigns to the unlabeled examples may be incorrect. To alleviate this problem, a conservative strategy is used: only a few of the most confident negative examples are used to enlarge the SVM’s training pool. More specifically, the examples with the largest distances in Eq. (6) are selected. For a specific query, most of the images in the database are irrelevant while only a small number are relevant; as a result, the prediction of the most negative examples should be more reliable. In addition, another conservative measure is employed to further improve reliability: the negative images labeled by this learner are used only temporarily by the SVM. In the next feedback round, a new learner is constructed from the enlarged labeled data, and a different set of negative examples is then selected.
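The auxiliary learner of this section can be sketched in a few lines. This is a hedged illustration of Eqs. (6) and (7) and of the inverse-standard-deviation weights, not the authors' implementation. The function names, the use of the population standard deviation, the small constant `eps` that guards against a zero standard deviation, and the fallback to the positive mean when no negative example exists are assumptions of the sketch.

```python
import math

def query_point(positives, negatives):
    """Eq. (7): mean of the positive examples minus mean of the negative examples."""
    dim = len(positives[0])
    q = [sum(x[i] for x in positives) / len(positives) for i in range(dim)]
    if negatives:  # assumption: with no negatives, use the positive mean alone
        q = [q[i] - sum(x[i] for x in negatives) / len(negatives) for i in range(dim)]
    return q

def inverse_std_weights(positives, eps=1e-6):
    """One weight per feature: inverse standard deviation over the positive examples."""
    n, dim = len(positives), len(positives[0])
    weights = []
    for i in range(dim):
        mean = sum(x[i] for x in positives) / n
        var = sum((x[i] - mean) ** 2 for x in positives) / n
        weights.append(1.0 / (math.sqrt(var) + eps))  # eps avoids division by zero
    return weights

def weighted_distance(x, q, w):
    """Eq. (6): weighted Euclidean distance between image x and query point q."""
    return math.sqrt(sum(wi * (xi - qi) ** 2 for xi, qi, wi in zip(x, q, w)))

def most_confident_negatives(unlabeled, positives, negatives, count=10):
    """Indices of the `count` unlabeled images farthest from the query point."""
    q = query_point(positives, negatives)
    w = inverse_std_weights(positives)
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: weighted_distance(unlabeled[i], q, w),
                    reverse=True)
    return ranked[:count]
```

The returned indices would be added to the SVM training pool as temporary negative examples for the current round only, and recomputed in the next round from the enlarged labeled set.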
4 Experiments
In this section, the effectiveness of the proposed algorithm is evaluated through the retrieval of device parts from assembly drawings. To search for assembly drawings that contain device parts similar to the query image, all the device parts contained in the assembly drawings are first extracted; the search problem is thus converted into an ordinary image retrieval problem. In our application, all assembly drawings are grayscale images obtained by scanning paper hardcopies. The test dataset is composed of 14,770 device-part images extracted from real assembly drawings. For testing purposes, 1,939 commonly used device parts were manually selected and classified into 8 categories. In the experiments, only these selected images are used in turn as queries, and their category labels serve as the ground truth: the images in the same category as the query are treated as positive images. A 259-D feature space is used, consisting of 23 Zernike moments, 128 grid-based features, and a 108-bin edge direction histogram. The grid-based feature is obtained by first mapping the binarized object onto a polar grid of 32x8 cells and then counting the number of pixels in each cell to generate a 256-D vector. To achieve rotation invariance, the Fourier transform is applied to this vector and its 128 magnitude values are used as the final feature.
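The rotation-invariance argument for the grid-based feature can be illustrated with a small sketch. It assumes, hypothetically, that the cell counts are stored sector by sector, so that rotating the object by a whole number of angular sectors becomes a circular shift of the vector; the magnitudes of the discrete Fourier transform are unchanged by such a shift. The function names and the tiny 4x4 grid are illustrative only, and rotations by a fraction of a sector are not covered by this argument.

```python
import cmath


def dft_magnitudes(values, keep):
    """Magnitudes of the first `keep` DFT coefficients (naive O(n * keep) sum)."""
    n = len(values)
    mags = []
    for k in range(keep):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        mags.append(abs(coeff))
    return mags


def rotate_cells(values, sectors, cells_per_sector):
    """Circularly shift the cell counts by a whole number of angular sectors."""
    shift = (sectors * cells_per_sector) % len(values)
    return values[-shift:] + values[:-shift] if shift else list(values)


# Hypothetical pixel counts on a 4-sector x 4-ring grid (16 cells).
counts = [3, 0, 1, 2, 5, 1, 0, 0, 2, 2, 1, 0, 0, 4, 1, 1]
rotated = rotate_cells(counts, sectors=1, cells_per_sector=4)

original_feature = dft_magnitudes(counts, keep=len(counts) // 2)
rotated_feature = dft_magnitudes(rotated, keep=len(rotated) // 2)
```

For a real-valued vector the magnitude spectrum is symmetric, which is consistent with keeping only half of the coefficients (128 of 256 in the text).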
Fig. 2. Comparison with and without unlabeled data. [Plots: BEP versus feedback iteration (0 to 5) with λ = 0.5, one panel for L* = 3 and one for L* = 6, each with curves Unlab=10 and Unlab=0.]

Fig. 3. Comparison of our algorithm and the angle-diversity strategy, denoted 'clu' and 'ang' respectively. [Plot: BEP versus iteration (0 to 5) for L* = 3, 6, 9; 'clu' with λ = 0.5, 'ang' with λ = 0.8.]
The edge direction histogram is constructed in a 2D space: for each pair of edge pixels, the absolute direction difference and the spatial distance are quantized into 18 and 6 bins, respectively, to generate a 108-dimensional feature. To present the experimental results clearly, we use the precision-recall break-even point (BEP) instead of the precision-recall curve as the measure of retrieval performance. The BEP is defined as the point where precision and recall are equal. First, the effect of the unlabeled data on retrieval performance is analyzed. Two sets of tests are run: one with 10 unlabeled images and one without unlabeled images. In the latter test, however, unlabeled images are still used in the 0th round so that the SVM learner can be constructed with both positive and negative examples. Both tests are carried out with λ in Equ. 5 set to 0.5 and the number of feedback images (L∗) set to 3 and 6. In addition, 400 images near the current classification boundary are used in NCut clustering. The resulting BEP curves are shown in Fig. 2. Unlabeled data improve the retrieval, and a marked improvement is achieved in the first feedback round, especially with fewer feedback images. Next, we compare the proposed algorithm with the angle-diversity method [9], which was shown to improve on Tong's SVMActive solution [6]. To make
the comparison fair, 10 unlabeled images are also used in the angle-diversity strategy; they are obtained with the same process as introduced in Section 3. This experiment is performed for L∗ = 3, 6, and 9. For each L∗, different λ values are tested and the best result is chosen. Fig. 3 shows the experimental results, where the parameter λ in the two algorithms is set to 0.5 and 0.8, respectively. Our method is consistently superior to the angle-diversity strategy. Moreover, our algorithm obtains a larger improvement in the first iteration, which is exactly what is expected of relevance feedback techniques.
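The break-even point used as the performance measure in this section can be computed from a ranked result list. The sketch below relies on a standard observation: when the cutoff equals the number of relevant images R, precision and recall are both (hits in the top R) / R, so they coincide. The function name and the input format (a list of relevance flags, best-ranked first, covering the whole database) are assumptions made for illustration.

```python
def break_even_point(ranked_relevance):
    """Precision (= recall) at the cutoff equal to the number of relevant items.

    ranked_relevance: list of booleans, best-ranked image first, covering the
    whole database so that the total number of relevant images is known.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits = sum(ranked_relevance[:total_relevant])
    return hits / total_relevant
```

A single number per feedback round is what makes curves such as those in Fig. 2 and Fig. 3 possible; a full precision-recall curve would need one plot per round.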
5 Conclusion
This paper presents an SVM-based active relevance feedback technique for image retrieval. To address the small-sample-size problem, two strategies are adopted: (1) a new active selection strategy is designed, which can be viewed as an improvement of Tong's SVMactive simple algorithm through the introduction of the representative criterion. With this criterion, the redundancy among the selected images is reduced, and thus more information gain is achieved; (2) unlabeled data are exploited within the co-training framework.
References
1. Zhou, X.S., Huang, T.S.: Relevance feedback in image retrieval: a comprehensive review. Multimedia Systems 8(6), 536–544 (2003)
2. Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. on PAMI 22(12), 1349–1380 (2000)
3. Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S.: Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Trans. on Circuits and Video Technology 8(5), 644–655 (1998)
4. Tao, D., Tang, X., Li, X., Rui, Y.: Direct kernel biased discriminant analysis: a new content-based image retrieval relevance feedback algorithm. IEEE Transactions on Multimedia 8(4), 716–726 (2006)
5. Wu, Y., Tian, Q., Huang, T.S.: Discriminant-EM algorithm with application to image retrieval. In: Proc. IEEE Int'l Conf. CVPR, pp. 222–227 (2000)
6. Tong, S., Chang, E.: Support vector machine active learning for image retrieval. In: Proc. ACM Multimedia (2001)
7. Tao, D., Tang, X.: Random sampling based SVM for relevance feedback image retrieval. In: Proc. IEEE Int'l Conf. on CVPR, pp. 647–652 (2004)
8. Cox, I.J., Miller, M.L., Minka, T.P., Papathomas, T.V., Yianilos, P.N.: The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments. IEEE Trans. on Image Processing 9(1), 20–37
9. Brinker, K.: Incorporating diversity in active learning with support vector machines. In: Proc. of the 20th Int'l Conf. on Machine Learning (ICML), pp. 59–66 (2003)
10. Zhou, Z.H.: Enhancing relevance feedback in image retrieval using unlabeled data. ACM Trans. on Information Systems 24(2), 219–244 (2006)
11. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. on PAMI 22(8), 888–905 (2000)
Hierarchical Classifiers for Detection of Fractures in X-Ray Images
Joshua Congfu He (1), Wee Kheng Leow (1), and Tet Sen Howe (2)
(1) Dept. of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543, {hecongfu, leowwk}@comp.nus.edu.sg
(2) Dept. of Orthopaedics, Singapore General Hospital, Outram Road, Singapore 169608, [email protected]

Abstract. Fracture of the bone is a very serious medical condition. In clinical practice, a tired radiologist has been found to miss fracture cases after looking through many images containing healthy bones. Computer detection of fractures can assist the doctors by flagging suspicious cases for closer examinations and thus improve the timeliness and accuracy of their diagnosis. This paper presents a new divide-and-conquer approach for fracture detection by partitioning the problem into smaller sub-problems in SVM's kernel space, and training an SVM to specialize in solving each sub-problem. As the sub-problems are easier to solve than the whole problem, a hierarchy of SVMs performs better than an individual SVM that solves the whole problem. Compared to existing methods, this approach enhances the accuracy and reliability of SVMs.
1 Introduction
Fracture of the bone is a very serious medical condition. According to the International Osteoporosis Foundation [1], 1 in 3 women and 1 in 5 men above age 50 may experience osteoporotic fractures, and 30–50% of women and 15–30% of men may suffer osteoporotic fractures in their lifetime. In particular, the worldwide incidence of hip fractures could rise from 1.6 million to between 4.5 and 6.3 million by 2050, with more than 50% of all osteoporotic hip fractures occurring in Asia. In clinical practice, tired radiologists have been found to miss fracture cases after looking through many images containing healthy bones. Computer detection of fractures can assist doctors by flagging suspicious cases and directing the doctors' attention to them for closer examination. It can thus improve the timeliness and accuracy of their diagnosis. Computer detection of fractures in x-ray images is a difficult and challenging problem. The femur can fracture in many ways, with varying degrees of severity. While severe fractures cause drastic changes to the shape of the femur, mild fractures do not change the femur's shape and leave only very subtle signs in the
This research is supported by NMRC/0482/2000.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 962–969, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Healthy and fractured femurs. (a–c) Healthy femurs can have different appearances due to patients’ standing postures. (d) Severe fracture changes the femurs’ shape. (e, f) Mild fractures leave the femurs’ shape unchanged. Arrows indicate fractures.
x-ray images (Fig. 1). There is no single characteristic that can describe all kinds of fractures. X-ray images of healthy femurs also exhibit a significant amount of variation, primarily due to the patients' standing postures when the images are taken (Fig. 1). The unbalanced data problem, i.e., a large difference in the proportions of healthy and fractured samples, further compounds the problem's difficulty. In a consecutive set of x-ray images of femurs (i.e., consecutive in the time at which the patients had their x-ray images taken) that we collected from a local public hospital, only about 12% are fractured. When the training set is small, the difficulty becomes even more severe because there may not be enough samples to capture the whole range of possible variations. This paper presents a new approach for fracture detection by partitioning the classification problem into smaller sub-problems in the SVM's kernel space. A hierarchy of SVMs is trained so that each SVM specializes in solving a sub-problem, which is easier to solve than the whole problem. Thus, the hierarchy of SVMs performs better than a single SVM solving the whole problem.
2 Related Work
Tian et al. [2] published the first research work on the detection of fractures in x-ray images, which computes the angle between the neck axis and the shaft axis of the femur. Subsequently, Lim et al. [3], Yap et al. [4], and Lum et al. [5] reported methods
that detect femur fractures based on Gabor, Markov Random Field, and gradient intensity features extracted from the x-ray images. Three SVMs were trained, each classifying the samples based on a different feature type. The performance of the individual SVMs was not very good. By combining the decisions of the three SVMs, the overall accuracy and sensitivity (i.e., fracture detection rate) were improved. Combination of SVMs is a standard way to obtain a multi-class SVM from binary (two-class) SVMs [6,7,8]. Typically, each constituent SVM is trained to solve a one-vs-one problem, and the SVMs are combined using a tree, a graph, a voting scheme, or other methods. Our method differs from these SVM combination methods in two important ways. First, instead of partitioning a k-class problem into many one-vs-one sub-problems, our method partitions a binary (healthy-vs-fractured) problem into several smaller 3-class (healthy, fractured, unknown) problems, each handled by an SVM. Second, the partitioning is performed based on estimates of the reliability of the SVMs.
3 Hierarchical SVMs
The guiding principle of our approach is divide and conquer, or division of labor. A hierarchy of complementary SVMs is trained, each tackling a different sub-problem of the whole fracture detection problem. A well-known divide-and-conquer approach is to first cluster the input samples so that the samples in each cluster are more consistent with each other [9]. Then, a set of classifiers is trained, each classifying only the samples in a different cluster. This approach is effective when the training set is large. The more complex the problem, the more clusters are needed to achieve good performance, and the larger the training set needs to be. In our investigation of this approach, we found that a large number of clusters is required to capture the large variations of both healthy and fractured samples. As a result, there are not enough training samples in each cluster to train a classifier. Instead of partitioning the problem in the feature space, our approach partitions it in the SVM's kernel space. This approach has two advantages. First, it is easier to separate the healthy and fractured samples in the SVM's high-dimensional kernel space. Second, the partitioning performed by the SVM is optimal.
3.1 Training Algorithm
The training of the hierarchical SVMs is guided by three principles:
1. Samples that can be reliably classified by a higher-level SVM should be handled by it.
2. Samples that cannot be reliably classified by a higher-level SVM should be passed to its child, a lower-level SVM.
3. The performance of a lower-level SVM on the samples passed to it should be better than the performance of its parent on these samples.
The training algorithm begins with the top-level SVM S, which is given two data sets: the training set T and the validation set V. Its main stages are as follows:
1. Train SVM S on training set T.
2. Run S on validation set V to obtain the classification error rate E(S, V).
3. Based on E(S, V), select a subset V′ of V.
4. Create a new SVM S′ at the next level.
5. Find a subset T′ of T that can be used to train S′ to achieve the performance E(S′, V′) < E(S, V′) and E(S′, F′) ≤ E(S, F′), where F′ is the subset of fractured samples in V′. That is, 1 − E(S′, F′) is the sensitivity of S′ on V′.
6. If S′ cannot achieve the above performance, then stop.
7. Otherwise, set S ← S′, T ← T′, V ← V′, and continue at Step 3.
This algorithm uses probabilistic SVMs, such as Gini-SVMs [10], that produce classification results and probability estimates p. The p value ranges from 0 to 1, with 0.5 corresponding to ambiguous cases located on the decision surface in the kernel space. We shall call the side of the decision surface with p > 0.5 the positive side, and the side with p < 0.5 the negative side. After running a trained SVM on a sample set, each sample v is assigned a probability estimate p(v), and the samples can be sorted in increasing order of p(v). There are two critical stages in the algorithm, Stages 3 and 5, which select the appropriate subsets T′ and V′ to channel to the SVM S′ at the next level. In essence, V′ defines the sub-problem that S′ needs to solve, and T′ is the training set that can train S′ to solve the sub-problem. These stages are discussed in more detail in the following sections.
3.2 Selection of Validation Subset
Stage 3 of the training algorithm embodies the first two principles outlined in Section 3.1. It determines a subset V′ of V that the SVM S cannot reliably classify. This subset V′ is channeled to another SVM at the next level to achieve division of labor. Classification reliability is estimated from two quantities: (1) the classification error rate E(S, V) and (2) the probability estimates p(v) assigned by S to the samples v in V. Let us compute the cumulative error rate c+(p) from p = 1.0 towards 0.5 for the samples with p(v) > 0.5, and c−(p) from p = 0 towards 0.5 for the samples with p(v) < 0.5 (Fig. 2(b)). Then, the samples in the range from p+ to p = 1, where c+(p+) < E(S, V), have an estimated error less than E(S, V); the same holds for the samples in the range from p = 0 to p−, where c−(p−) < E(S, V). That is, samples in the tail regions can be reliably classified. Therefore, samples in the middle range from p− to p+ are selected to form the subset V′. In the current implementation, c+(p+) = c−(p−) = ε, where ε is called the error tolerance.
3.3 Selection of Training Subset
Stage 5 of the training algorithm embodies the third principle outlined in Section 3.1. Selecting an appropriate training subset T′ is tricky because SVMs are very strong classifiers: their accuracy on the training set is often close to or equal to 100%. If the method for selecting the validation subset is applied directly to the
selection of the training subset, very few training samples will be selected, and they are not enough to train the lower-level SVM to achieve high performance. On the other hand, if the selection criterion is too loose and almost all the training samples are channeled to the lower-level SVM, then it would be solving the same problem as its parent and there would be no division of labor. So, the goal is to find a subset that is neither too large nor too small. Let q(u) denote the probability estimate of sample u in T. Our method searches for the appropriate T′ iteratively:
For q− from 0 to 0.5 in increments of Δq−,
  For q+ from 1 to 0.5 in decrements of Δq+,
    Set T′ to contain all training samples between q− and q+.
    Train S′ on T′ and then test S′ on V′.
    If E(S′, V′) < E(S, V′) and E(S′, F′) ≤ E(S, F′), then T′ is found; return with success.
Cannot find the desired T′. Return with failure.
The increment Δq− and the decrement Δq+ are set as fixed proportions of the standard deviations of the distributions of the training samples T+ and T− on the positive and negative sides of the decision surface, respectively (Fig. 2(a)).
3.4 Testing Algorithm
The same testing algorithm is applied to the validation set during training and to the testing set during system testing. Given a sample v, the following algorithm is applied:
1. For each SVM S from the top to the bottom of the hierarchy,
   (a) Run SVM S on sample v to compute the probability estimate p(v).
   (b) If p(v) > p+, then classify v as healthy and stop.
   (c) If p(v) < p−, then classify v as fractured and stop.
   (d) Otherwise, pass v to the child of S.
2. If with rejection, then classify v as unknown and stop.
3. (Without rejection) If p(v) > 0.5, then classify v as healthy and stop.
4. Otherwise, classify v as fractured and stop.
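The threshold selection of Section 3.2 and the testing algorithm above can be sketched together as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function names, the representation of each level as a `(prob_fn, p_minus, p_plus)` tuple, and the rule of stopping at the first sample that pushes the cumulative error above ε are all assumptions introduced here. Ties in the probability estimates are not treated specially.

```python
def tail_threshold(probs, is_healthy, eps, positive_side):
    """Scan from the confident end towards 0.5.

    Returns the probability estimate of the first sample whose inclusion pushes
    the cumulative error rate above eps; samples strictly beyond that value form
    the reliably classified tail. Returns 0.5 if the error never exceeds eps.
    """
    if positive_side:
        side = sorted(((p, h) for p, h in zip(probs, is_healthy) if p > 0.5), reverse=True)
    else:
        side = sorted((p, h) for p, h in zip(probs, is_healthy) if p < 0.5)
    errors = 0
    for count, (p, healthy) in enumerate(side, start=1):
        # Positive side: a fractured sample is an error. Negative side: a healthy one is.
        if healthy != positive_side:
            errors += 1
        if errors / count > eps:
            return p
    return 0.5


def classify(sample, levels, reject=False):
    """Run a sample down the hierarchy; `levels` is ordered from top to bottom."""
    p = 0.5
    for prob_fn, p_minus, p_plus in levels:
        p = prob_fn(sample)
        if p > p_plus:
            return "healthy"
        if p < p_minus:
            return "fractured"
    if reject:
        return "unknown"
    # Without rejection, fall back to the decision surface of the last SVM.
    return "healthy" if p > 0.5 else "fractured"
```

Here `prob_fn` stands in for a trained probabilistic SVM; any callable returning an estimate in [0, 1] can be plugged in to experiment with the control flow.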
4 Experiments and Discussions
420 consecutive femur images were collected from a local public hospital. They were divided randomly into 200 training, 160 validation, and 60 testing samples. The percentages of fractured samples were kept roughly the same for all three sets at about 12%. Gabor and intensity gradient (IG) features as described in [3,5,4] were extracted as the features for classification. The following SVM configurations were trained and tested for comparison:
– SVM: Single SVM trained on the training set and tested on the testing set.
– SVM+: Single SVM trained on the combined training and validation set, and tested on the testing set.
– H-SVM: Hierarchical SVM without rejection.
– H-SVM−: Hierarchical SVM with rejection.
Fig. 2. Performance of SVMs in first two levels of the hierarchy. (a) Distributions of training samples over probability estimates. (b) Cumulative errors on validation sets. [Plots: (a) count versus probability estimate, (b) cumulative error versus probability estimate; curves for Level 1 and Level 2.]
Each of the above configurations was trained on Gabor and IG features separately. Figure 2 shows the internal workings of H-SVM. The curves on the left and right sides of Fig. 2(a) show the distributions of samples on the corresponding sides of the SVMs' decision surfaces. They show that the fracture detection problem is very difficult because all the samples fall within a narrow range of p = 0.4 to 0.55, and most of them are on the positive side. Fig. 2(b) shows that the cumulative error at the second level has a narrower spread than that at the first level. That is, the errors of the level-2 SVM are concentrated within a narrower range, indicating that it solves the sub-problem better than its parent. Figure 3 shows the results of testing H-SVM on the validation and testing sets for both feature types. With a very small error tolerance ε, no training and validation samples can be passed down to the lower-level SVMs, which reduces H-SVM to a single SVM. With a very large ε, most, if not all, of the training and validation samples are passed down to the lower-level SVMs, which defeats the divide-and-conquer strategy. With an appropriate ε, there is a good division of labor. The trends of the error curves for the validation set and the testing set are similar, although their minima may not occur at the same ε. In actual application, the error tolerance is selected as the ε that maximizes accuracy (i.e., 1 − error rate) and sensitivity on the validation set. The trained H-SVMs have a hierarchy of three levels for Gabor and four levels for IG (Table 1). Level-1 SVMs classify more than 70% of the testing samples and pass the remaining samples to the lower-level SVMs. They classify most of the healthy samples and sizable amounts of the fractured samples. Therefore, they achieve higher accuracy but lower sensitivity than the lower-level SVMs, just like single SVMs (Table 2). The samples processed by the lower-level SVMs are more balanced, but the discrimination of healthy and fractured cases is still relatively difficult. So, they achieve higher sensitivity at the expense of lower accuracy.
Fig. 3. Test results on different feature types. (a) Gabor. (b) Intensity gradient (IG). [Plots: error rate versus error tolerance ε, with curves for the validation and test sets.]

Table 1. Performance of SVMs in the hierarchy. IG: Intensity gradient. accu: accuracy, sens: sensitivity, %class: percentage of testing samples classified by the SVMs in the respective levels, %frac: percentage of fractured testing samples classified.

          Gabor                                 IG
level     accu     sens     %class   %frac      accu     sens     %class   %frac
1         95.45%   50.00%   73.33%   28.57%     96.15%   66.67%   86.66%   42.86%
2         90.00%   0.00%    16.67%   14.29%     100%     100%     1.67%    14.29%
3         66.67%   75.00%   10.00%   57.14%     75.00%   100%     6.67%    28.57%
4         -        -        -        -          66.67%   100%     5.00%    14.29%
overall   91.67%   57.14%                       93.33%   85.71%
Table 2 compares the performance of the various SVM configurations on the testing set. SVM+ performs better than SVM for IG, but the converse is true for Gabor. This shows that having more training samples does not always improve the accuracy of a single SVM. In comparison, H-SVM can use the training and validation sets to achieve high performance: its accuracy is as high as the larger of the accuracies of SVM and SVM+, and its sensitivity is also as high as those of SVM and SVM+, except for IG. By rejecting uncertain samples, H-SVM− achieves higher accuracy for Gabor and IG at the expense of lower sensitivity compared to H-SVM. The classification results based on Gabor and IG can be combined using a simple OR rule: classify a sample as fractured if it is classified as fractured using either Gabor or IG. For H-SVM−, the combined performance is better than that obtained with a single feature type. Moreover, its rejection rate drops to 0. With feature combination, H-SVM− achieves a higher accuracy (93.33%) than SVM, SVM+, and H-SVM (86.67%, 88.33%, and 90.00%), and the same sensitivity as SVM and H-SVM. In summary, the test results show that the overall performance can be improved if an SVM can reject samples that it cannot classify reliably and pass them to other SVMs to classify.
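The OR rule and the two reported measures can be written down directly. The following is a small illustrative sketch; the function names and the convention that `True` means "fractured" are assumptions, and rejected ("unknown") samples are not modeled here.

```python
def combine_or(pred_gabor, pred_ig):
    """OR rule: a sample is fractured if either feature type says so."""
    return [a or b for a, b in zip(pred_gabor, pred_ig)]


def accuracy_and_sensitivity(predicted, actual):
    """Accuracy over all samples and sensitivity over the fractured ones.

    Both arguments are lists of booleans where True means 'fractured'.
    """
    correct = sum(p == a for p, a in zip(predicted, actual))
    on_fractured = [p for p, a in zip(predicted, actual) if a]
    accuracy = correct / len(actual)
    sensitivity = sum(on_fractured) / len(on_fractured) if on_fractured else float("nan")
    return accuracy, sensitivity
```

Since the OR rule can only add "fractured" decisions, it cannot lower sensitivity relative to either single feature type, while accuracy may go either way; this is consistent with the pattern in Table 2.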
Table 2. Test results on different feature types. IG: intensity gradient. Last column is the rejection rate of H-SVM−. OR: Combine Gabor and IG results using OR rule.

                     SVM      SVM+     H-SVM    H-SVM−   reject
Gabor  accuracy      91.67%   90.00%   91.67%   94.45%   10.00%
       sensitivity   57.14%   57.14%   57.14%   33.33%
IG     accuracy      90.00%   93.33%   93.33%   94.83%   3.33%
       sensitivity   85.71%   100%     85.71%   83.33%
OR     accuracy      86.67%   88.33%   90.00%   93.33%   0.00%
       sensitivity   85.71%   100%     85.71%   85.71%

5 Conclusion
This paper presents a new divide-and-conquer approach for fracture detection by partitioning the problem in the kernel space of the SVM into smaller sub-problems, and training an SVM to specialize in solving each sub-problem. Each sub-problem is easier to solve than the whole problem. The training scheme ensures that lower-level SVMs always complement the performance of higher-level SVMs. As a result, the hierarchy of SVMs performs better than an individual SVM solving the whole problem. Compared to existing methods, this approach can enhance the accuracy and reliability of the SVMs.
References
1. IOF: Facts and statistics about osteoporosis and its impact. International Osteoporosis Foundation (2007), www.iofbonehealth.org/facts-and-statistics.html
2. Tian, T.P., Chen, Y., Leow, W.K., Hsu, W., Howe, T.S., Png, M.A.: Computing neck-shaft angle of femur for x-ray fracture detection. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 82–89. Springer, Heidelberg (2003)
3. Lim, S.E., Xing, Y., Chen, Y., Leow, W.K., Howe, T.S., Png, M.A.: Detection of femur and radius fractures in x-ray images. In: Proc. 2nd Int. Conf. on Advances in Medical Signal and Info. Proc. (2004)
4. Yap, W.H., Chen, Y., Leow, W.K., Howe, T.S., Png, M.A.: Detecting femur fractures by texture analysis of trabeculae. In: Proc. ICPR (2004)
5. Lum, V.L.F., Leow, W.K., Chen, Y., Howe, T.S., Png, M.A.: Combining classifiers for bone fracture detection in x-ray images. In: Proc. ICIP (2005)
6. Kreßel, U.: Pairwise classification and support vector machines. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods Support Vector Learning, pp. 255–268. MIT Press, Cambridge (1999)
7. Platt, J., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classification. In: Solla, S.A., Leen, T.K., Muller, K.R. (eds.) Advances in Neural Information Processing Systems, MIT Press, Cambridge (2000)
8. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Trans. on Neural Networks 13, 415–425 (2002)
9. Li, R., Leow, W.K.: From region features to semantic labels: A probabilistic approach. In: Proc Int. Conf. on Multimedia Modeling, pp. 402–420 (2003)
10. Chakrabartty, S., Cauwenberghs, G.: Gini-SVM. (bach.ece.jhu.edu/svm/ginisvm/)
An Efficient Nearest Neighbor Classifier Using an Adaptive Distance Measure
Omid Dehzangi, Mansoor J. Zolghadri, Shahram Taheri, and Abdollah Dehzangi
Department of Computer Science and Engineering, Eng. no. 2 Building, Molla Sadra St., Shiraz University, Shiraz, Iran
{Dehzangi, jzolghadri, taheri, A.dehzangi}@cse.shirazu.ac.ir
Abstract. The Nearest Neighbor (NN) rule is one of the simplest and most effective pattern classification algorithms. In the basic NN rule, all the instances in the training set are treated equally when finding the NN of an input test pattern. In the approach proposed in this article, a local weight is assigned to each training instance. The weights are then used in calculating the adaptive distance metric to find the NN of a query pattern. To determine the weight of each training pattern, we propose a learning algorithm that attempts to minimize the number of misclassified patterns on the training data. To evaluate the performance of the proposed method, a number of UCI-ML data sets were used. The results show that the proposed method improves the generalization accuracy of the basic NN classifier. It is also shown that the proposed algorithm can be considered an effective instance reduction technique for the NN classifier.
Keywords: Pattern classification, nearest neighbor, adaptive distance metric, instance-weighting, data pruning.
1 Introduction
As a simple but powerful supervised learning algorithm, the nearest neighbor rule has been used successfully in pattern classification problems [1,2,3]. In the basic form of this non-parametric classification system, a set of training patterns is presented and stored in memory. A new pattern X is then classified by using a distance function to determine how close it is to each stored instance and choosing the class of the stored training instance nearest to X. The basic NN classifier stores all of the training instances for use during generalization, and the distance between an input query pattern and every stored pattern must be computed to classify the pattern. Instance reduction techniques [4,5] aim at reducing the size of the training set to improve the generalization speed and storage requirements of the basic NN classifier. Another reason to reduce the original data set is to improve the prediction accuracy of the basic NN classifier by removing noisy instances.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 970–978, 2007. © Springer-Verlag Berlin Heidelberg 2007
In the literature, many instance selection algorithms have been proposed. These algorithms treat all instances in the reduced subset equally during generalization. In the scheme proposed in this paper, a weight is associated with each stored instance. In the generalization phase, the weight of each instance is multiplied by the similarity between the query and that instance to obtain a weighted similarity value. The NN of a query pattern is identified as the instance with the maximum weighted similarity. The overall scheme can be viewed as a form of locally adaptive distance metric. The performance of NN depends crucially on the choice of distance metric. In the past, many methods have been developed to locally adapt the distance metric. Examples include the flexible metric method proposed by Friedman [6], the discriminant adaptive method by Hastie and Tibshirani [7], and the adaptive metric method by Domeniconi et al. [8]. These methods estimate feature relevance locally at each query pattern. This leads to a weighted metric for computing the similarity between the query patterns and the training data. In [10], similar to our proposed method, a simple locally adaptive distance measure is proposed that assigns a weight to each training instance. The method we propose in this paper uses a locally adaptive metric to improve the performance of the basic NN classifier and, at the same time, reduces the size of the training set. The rest of this paper is organized as follows. In Section 2, the adaptive weighted distance measure is described. In Section 3, some instance selection algorithms are presented. In Section 4, our proposed algorithm to learn the weight of each training instance is introduced. In Section 5, experimental results are shown. Section 6 concludes this paper.
2 Adaptive Distance Measure Using Weighted Training Instances
The nearest neighbor classifier assigns a label to a test pattern according to the class label of its nearest training instance. To introduce the weighted version of the nearest neighbor rule, the notation of the basic nearest neighbor rule is briefly described here. Assume that the classification of patterns in an m-dimensional space is under investigation, and that a set of training instances {(X(i), C(k))} is given, where X(i), i=1,…,n are the training feature vectors and C(k), k=1,…,C are the labels. The NN rule uses a distance function to find the nearest neighbor of a new test pattern X, denoted by X(w), and assigns X to C(w) (the class label of the winner). Different distance functions can be used to find the nearest neighbor of a test pattern. As a commonly used distance measure, the Euclidean distance has been used in many applications to measure the distance (i.e., dissimilarity) between X and Y [11]:

$$d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \qquad (1)$$
As a first step, we convert the distance, a dissimilarity measure between the query pattern X and the jth instance X(j), into a similarity measure. This is done by a linear conversion as follows:
O. Dehzangi et al.
$$\mu(X, X^{(j)}) = 1 - \frac{d(X, X^{(j)})}{d_{\max}(X, X^{(j)})} \qquad (2)$$
where dmax(X,X(j)) is the maximum distance that can occur over the whole training set. As a result, instead of finding the pattern at minimum distance from X, we search for the instance X(j) that maximizes μ(X,X(j)). This can be interpreted as normalizing the distance from the range [0, ∞) to a similarity value in the interval [0,1]. In basic NN, the most similar pattern Xp to a query pattern X is formally determined by:
$$p = \arg\max_{j} \{\mu(X, X_j) \mid j = 1, \ldots, n\} \qquad (3)$$
The NN rule assumes that all the stored instances are equally reliable and uses equation (3) to find the NN of a query pattern. This paper is based on weighted training instances. By assigning a weight wk to each instance Xk, the weights of the training instances are incorporated in the similarity metric to find the NN of a query pattern:
$$w = \arg\max_{j} \{\mu(X, X_j)\, w_j \mid j = 1, \ldots, n\} \qquad (4)$$
where wj is the weight assigned to jth instance by the learning algorithm which is introduced in section 4.
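The weighted rule of Eqs. (1), (2) and (4) can be illustrated with a minimal Python sketch. The function names and the toy data are hypothetical and not taken from the paper, and d_max is assumed here to be the largest pairwise distance among the training instances.

```python
import math

def euclidean(x, y):
    """Euclidean distance of Eq. (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def max_training_distance(instances):
    """Largest pairwise distance in the training set (assumed meaning of d_max)."""
    return max(
        euclidean(a, b)
        for i, a in enumerate(instances)
        for b in instances[i + 1:]
    )

def similarity(x, y, d_max):
    """Linear conversion of Eq. (2): distance to similarity."""
    return 1.0 - euclidean(x, y) / d_max

def weighted_nn(query, instances, labels, weights, d_max):
    """Label of the instance maximizing mu(X, X_j) * w_j, as in Eq. (4)."""
    winner = max(
        range(len(instances)),
        key=lambda j: similarity(query, instances[j], d_max) * weights[j],
    )
    return labels[winner]

# Toy example: three training instances on a line.
train = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
labels = ["a", "a", "b"]
d_max = max_training_distance(train)
query = (2.4, 0.0)

# With unit weights the rule reduces to basic NN (Eq. (3)).
plain = weighted_nn(query, train, labels, [1.0, 1.0, 1.0], d_max)
# A zero weight removes an instance; a larger weight extends its reach.
reweighted = weighted_nn(query, train, labels, [1.0, 0.0, 2.0], d_max)
```

In this sketch an instance with zero weight can never win, which is how the weights double as an instance selection mechanism. A query lying farther than d_max from an instance would receive a negative similarity; the sketch does not clip that case.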
3 Instance Selection Algorithms
In the field of automatic classification, where instance-based learning algorithms are used, many researchers have addressed the problem of training set reduction in order to lower the computational cost, improve classification performance, or both. According to their goals for reducing the training set, these algorithms may be divided into two main subgroups: noise filters and condensation methods.
3.1 Noise Filters
Noise filters aim at removing outliers and instances near the decision boundary. They remove noisy instances (i.e., those that do not agree with their neighbors). These methods often produce a smoother decision boundary by removing the instances close to it. However, they do not remove internal points, which do not necessarily contribute to the decision boundary. The aim of noise filters is mostly to improve the performance of the classifier by removing noisy instances. One early method of this kind is the Edited Nearest Neighbor (ENN) algorithm developed by Wilson [12]. This algorithm finds a subset S of the training set T in the following way. Starting with S equal to T, each instance in S is removed if it does not agree with the majority of its k nearest neighbors. This algorithm removes noisy instances as well as instances near the boundary. Repeated ENN applies the ENN algorithm repeatedly until all remaining instances have a majority of their neighbors in the same class and no further change is made to S. The effect is to widen the gap between classes and make the decision boundary smoother.
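A single ENN pass as described above might be sketched as follows. This is an illustrative reading rather than Wilson's original formulation: the neighbors of every instance are assumed to be taken from the full set T (batch editing), the Euclidean distance is assumed, and the function name is hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def edited_nearest_neighbor(instances, labels, k=3):
    """Indices kept by one ENN pass: an instance survives only if its label
    matches the majority label of its k nearest neighbors in T."""
    kept = []
    for i, x in enumerate(instances):
        others = [j for j in range(len(instances)) if j != i]
        others.sort(key=lambda j: euclidean(x, instances[j]))
        votes = Counter(labels[j] for j in others[:k])
        majority_label, _ = votes.most_common(1)[0]
        if majority_label == labels[i]:
            kept.append(i)
    return kept

# Two well-separated classes plus one mislabeled point (index 4).
points = [(0.0,), (1.0,), (2.0,), (3.0,), (1.5,),
          (10.0,), (11.0,), (12.0,), (13.0,)]
classes = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
kept = edited_nearest_neighbor(points, classes, k=3)
```

With an odd k and two classes no vote can tie; for more classes a tie-breaking policy would have to be chosen, which this sketch leaves to the order returned by `Counter.most_common`.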
An Efficient Nearest Neighbor Classifier Using an Adaptive Distance Measure
The All k-NN method proposed by Tomek [13] is an extension of ENN. ENN is run for every k (k=1, …, m); any instance not classified correctly by its k nearest neighbors is flagged as “bad”, and after the loop has been completed for all m values of k, all flagged instances are removed.
3.2 Condensation Methods
On the other hand, the aim of condensation methods is to remove those instances that do not affect the decision boundary (i.e., mostly internal instances). The intuition behind them is that internal instances do not affect the decision boundary as much as border instances. Therefore, they try to find a significantly reduced subset by removing internal instances from the full training set. The aim of these methods is mostly to improve the efficiency of instance-based learning algorithms that use original instances from the training set as exemplars. They are mostly used to reduce the time and memory requirements of k-nearest neighbor (k-NN) classifiers. During generalization, k-NN computes the distance between the input test pattern and each stored instance to predict the output class of the input pattern. Condensation methods therefore try to find a significantly reduced set of instances for which the classification accuracy of k-NN is as close as possible to that obtained using all training instances. The Condensed Nearest Neighbor (CNN) rule introduced by Hart [14] was one of the first reduction techniques. This algorithm begins by randomly selecting one instance per class from the training set T and putting them in a new data set S. Each instance in T that is misclassified using S is added to S, thus ensuring that it will be classified correctly. This method is sensitive to noise, and the order in which instances are presented affects the classification rate. The Reduced Nearest Neighbor (RNN) rule introduced by Gates [15] starts with S equal to T and removes only those instances whose removal does not decrease accuracy. Wilson and Martinez [16] have provided a survey of algorithms used to reduce storage requirements in instance-based learning algorithms. They also proposed several reduction algorithms, DROP1-5. To describe these methods, we define the associates of an instance P as the set of instances that have P as one of their k nearest neighbors (those instances depend on P).
DROP1 starts with S=T and removes instance P if at least as many of its associates in S would be classified correctly without P (accuracy is checked on S). DROP2 is similar to DROP1, except that the accuracy is checked on T instead of S and S is sorted in an attempt to remove center points before border points. DROP3 uses the DROP2 rule and additionally applies the ENN algorithm as a preprocessing step for noise removal. DROP4 changes the noise removal step of DROP3 to the following rule: remove an instance only if it is misclassified by its k nearest neighbors and its removal does not hurt the classification of other instances. This rule is more conservative than the previous one and ensures that the noise-reduction pass does not remove too many instances. DROP5 changes the order of DROP2 and starts removing from border points to center points, and then iteratively applies the DROP2 rule to the remaining set as long as any further improvement is made. These methods showed good results in [16]. In this work, we use the results obtained by these methods as a baseline for our reduction ability.
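The Condensed Nearest Neighbor rule described earlier in this section can be sketched as below. This is a minimal illustration under stated assumptions, not Hart's original code: the 1-NN rule with Euclidean distance is assumed, passes over T are repeated until no instance is added, and the seeded random generator and function names are hypothetical.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_label(x, subset, instances, labels):
    """Label of the stored instance (index in `subset`) closest to x."""
    j = min(subset, key=lambda idx: euclidean(x, instances[idx]))
    return labels[j]

def condensed_nearest_neighbor(instances, labels, seed=0):
    """Indices of a condensed subset S that classifies all of T correctly."""
    rng = random.Random(seed)
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    # Start with one randomly chosen instance per class.
    subset = [rng.choice(indices) for indices in by_class.values()]
    changed = True
    while changed:
        changed = False
        for i, x in enumerate(instances):
            if i in subset:
                continue
            # Add every instance that the current S misclassifies.
            if nearest_label(x, subset, instances, labels) != labels[i]:
                subset.append(i)
                changed = True
    return sorted(subset)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
classes = ["a"] * 4 + ["b"] * 4
subset = condensed_nearest_neighbor(points, classes)
```

For two compact, well-separated clusters the subset stays at its initial size of one instance per class, whichever seeds are drawn. On noisy data the result depends on the seed and on the presentation order, as noted in the text.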
4 Instance-Weight Learning
In this section, an iterative learning scheme is proposed that attempts to specify the weight of each instance in the training set such that the number of misclassified training patterns is minimized. The basic element of this scheme is an algorithm that finds the optimal weight of each instance, assuming that the weights of all other instances in the training set are given and fixed. For an M-class problem, assume that n labeled instances {Xi, i=1, 2,…, n} from different classes are available. Initially, all instances are assumed to have a weight of one {wi =1.0, i=1,2,...,n}. The weight assigned to a training pattern can only affect the classification of other instances in the training set. The optimal weight wk of instance Xk from class T (given the weights of all other patterns) is a real number in the interval [0, ∞) that minimizes the number of misclassified training data. For this purpose, in the first step, the instance is removed from the training set (i.e. wk = 0.0). In the second step, training patterns of class T that are classified correctly are removed from the training set. Notice that these patterns will be classified correctly regardless of the value of wk. In the third step, training patterns of other classes that are misclassified are also removed. Notice that these patterns will be misclassified regardless of the value of wk. At this stage, representing the current training set by Γ, the following measure is used to calculate the score S(Xt) of each training pattern Xt ∈ Γ left in the training set:
$$S(X_t) = \frac{\max_{j} \{\mu(X_t, X_j)\, w_j \mid X_j \in \Gamma,\ j \neq t\}}{\mu(X_t, X_k)} \qquad (5)$$
where μ(Xi,Xj) represents the similarity of patterns Xi and Xj calculated using (2). The algorithm given in Table 1 can now be used to find the best threshold (i.e., weight) for pattern Xk. The purpose of steps 1-3 of the above procedure is to remove those instances whose classification does not depend on the value of wk. In other words, these patterns will be either classified correctly or misclassified regardless of the value of wk. The algorithm of Table 1 sorts the patterns in ascending order of their scores. Assuming that a set of m patterns (and their scores) is passed to this algorithm, a maximum of m+1 thresholds are examined to find the best threshold. Notice that for each specified threshold th, all training patterns X with S(X) < th are classified as class T. In this way, for each specific threshold, the number of misclassified patterns can be easily calculated. The best threshold (i.e. weight) is simply the one that minimizes the number of misclassified training data. The resulting threshold is taken as the weight of instance Xk and is used in the adaptive distance measure to find the nearest neighbor in (4). The algorithm in Table 1 finds the best threshold by maximizing the accuracy, which is determined from the TP (true positive) and FP (false positive) values defined as follows: TP: number of patterns of class T which are classified correctly. FP: number of patterns not of class T which are misclassified.
Table 1. Algorithm for finding the best threshold
Inputs: patterns Xt, scores S(Xt)
Output: the value of the best threshold (best-th)
  current = number of misclassified patterns corresponding to the threshold th = 0 (i.e., classifying no pattern as class T)
  optimum = current
  best-th = 0
  rank the patterns in ascending order of their scores
  {assume that Xk and Xk+1 are two successive patterns in the list}
  for each different threshold th = (Score(Xk)+Score(Xk+1))/2
    current = number of misclassified patterns corresponding to the specified threshold (i.e., all patterns Xt having Score(Xt) < th are classified as class T)
    if current < optimum then
      optimum = current
      best-th = th
    end if
  end for
  {assume that last is the score of the last pattern in the list and τ is a small positive number}
  current = number of misclassified patterns corresponding to th = last + τ (i.e., classifying everything as class T)
  if current < optimum then
    optimum = current
    best-th = th
  end if
  return best-th
The accuracy is calculated as follows:
$$\text{Accuracy} = TP - FP \qquad (6)$$
The above procedure calculates the optimal weight of the instance Xk assuming that the weights of all other instances are given and fixed. The search for the best combination of instance weights is conducted by optimizing each instance weight in turn, assuming that the order of the instances to be optimized is fixed. As the weight assigned to each instance can only affect the classification of other training instances, the algorithm of Table 1 sets the weight of an instance to zero when no threshold can be found that improves the classification of other instances. In real data sets, the weights of the vast majority of training instances will be set to zero by the learning algorithm. Since instances with zero weight are not used in the generalization phase, the proposed algorithm can be considered an effective instance reduction technique for the NN classifier.
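The threshold search of Table 1 can be sketched in Python as follows. This is one possible reading rather than the authors' implementation: each remaining pattern is assumed to be represented by a pair (score, is_class_T) with the score already computed by Eq. (5), and the names `misclassified` and `best_threshold` are hypothetical.

```python
def misclassified(th, scored):
    """Number of errors for threshold th: a pattern is predicted as class T
    exactly when its score is below th."""
    return sum(1 for score, is_t in scored if (score < th) != is_t)

def best_threshold(scored, tau=1e-6):
    """Return (best_th, errors) following the search of Table 1."""
    ordered = sorted(scored)
    best_th = 0.0
    optimum = misclassified(0.0, ordered)  # th = 0: nothing assigned to T
    # Midpoints between successive distinct scores.
    candidates = [
        (a[0] + b[0]) / 2.0
        for a, b in zip(ordered, ordered[1:])
        if a[0] != b[0]
    ]
    if ordered:
        candidates.append(ordered[-1][0] + tau)  # everything assigned to T
    for th in candidates:
        current = misclassified(th, ordered)
        if current < optimum:  # strict improvement only, as in Table 1
            optimum, best_th = current, th
    return best_th, optimum

# Scores of patterns left after steps 1-3, with a flag for class T.
scored = [(0.5, True), (0.8, True), (1.1, False), (1.3, True), (1.6, False)]
weight, errors = best_threshold(scored)

# If no pattern of class T is left, no threshold helps and the weight is 0.
zero_weight, zero_errors = best_threshold([(0.4, False), (0.9, False)])
```

In the first example the search settles on the midpoint between 0.8 and 1.1: it captures two class-T patterns without capturing any pattern of another class, leaving one error. The threshold between 1.3 and 1.6 also leaves one error but is not selected, because only strict improvements replace the current optimum.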
5 Experimental Results In order to assess the performance of the proposed method, we used 10 data sets available from UCI-ML repository. Table 2 gives some statistics of the data sets we used in the experiments. Our objective in the first experiment is to investigate the effect of instance-weight learning algorithm of section 4 on the generalization ability
of the NN classifier. Ten-fold cross-validation (10-CV) was used in our experiment. Each data set was divided into 10 partitions, and 9 partitions (i.e., 90% of the data) were used as the training set. The algorithm of section 4 was then used to specify the weight of each instance in this training set. The remaining partition (i.e., the other 10% of the data) was classified using the weighted instances. This procedure was repeated until all of the 10 partitions had been used in the test phase. The whole 10-CV procedure was repeated 10 times with different initial partitionings of the data set.
Table 2. Some statistics of the data sets used in our experiments
Data set             # of attributes   # of patterns   # of classes
Wine                 13                178             3
Glass                9                 214             6
Sonar                60                208             2
Image Segmentation   15                210             7
Cancer Wisconsin     9                 683             2
Bupa                 6                 345             2
Pima Diabetes        8                 768             2
Thyroid              5                 215             3
Iris                 4                 150             3
Balance              4                 625             3
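The evaluation protocol described above (10-fold cross-validation repeated 10 times with a fresh partitioning each time) could be organized as in the following sketch. The function name, the seed, and the use of unstratified random folds are assumptions; the paper does not state whether the folds were stratified.

```python
import random

def repeated_k_fold(n_patterns, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` runs of k-fold
    cross-validation, reshuffling the data before each run."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_patterns))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
            yield train, test

# Example with the size of the Iris data set from Table 2 (150 patterns).
splits = list(repeated_k_fold(150))
```

Each of the 100 resulting splits would be used to learn the instance weights on the training indices and to classify the held-out indices, and the accuracies would then be averaged.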
The results are evaluated from two aspects: classification accuracy and reduction ability, which are reported in Tables 3 and 4 respectively. The purpose of our first experiment is to investigate the effect of the learning algorithm on the classification accuracy of the learned classifier. Table 3 compares the classification rates of the basic NN rule using the Euclidean distance measure with those of our proposed method using the adaptive distance measure.
Table 3. Generalization accuracy (%) of basic NN compared to the proposed scheme
Data set      Basic NN   Adaptive NN
Sonar         86.75      86.25
Wine          94.91      95.54
Glass         72.15      72.05
Cancer (W)    94.54      95.15
Bupa          62.41      67.13
Pima          70.94      76.68
Thyroid       96.14      96.44
Iris          94.23      93.31
Image         91.37      94.15
Balance       81.31      87.61
Ave. Acc.     84.47      86.43
In the last row of Table 3, we report the average generalization accuracy of each method over all 10 data sets. On average, our adaptive distance metric improved the prediction accuracy of the basic NN by approximately 2 percentage points over the 10 data sets. In the second experiment, the reduction ability of the proposed method is verified and compared with DROP3-5, which are effective instance selection algorithms proposed in the past. In Table 4, we report the generalization accuracy and the percentage of the original training set retained by each method on the data sets used in this paper. The average accuracy and retained percentage of each method over the 10 data sets are reported in the last row of this table.
Table 4. Accuracy (Acc., %) and storage percentage (%) of DROP3-DROP5 and the proposed method
Data set   DROP3            DROP4            DROP5            Adaptive NN
           Acc.    %        Acc.    %        Acc.    %        Acc.    %
Sonar      80.38   22.65    82.74   22.65    85.62   19.34    86.25   14.17
Wine       94.97   11.11    94.97   11.11    97.19   5.99     95.54   6.78
Glass      67.32   19.99    65.91   21.96    69.24   20.09    72.05   22.37
Cancer     95.42   3.74     95.28   3.78     95.28   3.62     95.15   1.15
Bupa       59.09   24.28    58.22   27.57    59.41   26.12    67.13   18.26
Pima       72.65   15.06    72.39   17.80    69.80   17.91    76.68   11.63
Thyroid    95.43   17.67    93.23   17.98    94.76   18.52    96.44   6.78
Iris       94.00   9.93     94.00   10.37    91.33   8.59     93.31   6.42
Image      91.04   11.43    92.38   12.74    88.85   10.96    94.15   15.97
Balance    83.57   12.78    82.23   11.72    80.49   12.65    87.61   6.45
Ave. (%)   83.39   14.87    83.14   15.77    83.21   14.38    86.43   11.02
As can be seen, the average accuracy of our proposed method is about 3 percentage points higher than that of the best of the DROP3-5 methods. The percentage of the original training set retained by the proposed method is also about 3.5 percentage points lower than that of the best of the DROP3-5 methods. That is, our proposed method simultaneously achieves better accuracy and a greater reduction of the training data.
6 Conclusions
In this paper, we proposed a method of learning the weight of each instance in the training set. This learning scheme attempts to minimize the number of misclassified patterns. The weight assigned to each training instance was then incorporated in the adaptive distance metric to find the nearest instance of a query pattern and predict the class label of that pattern. For empirical evaluation of the proposed scheme, we used 10 real-world data sets from the UCI-ML repository. We evaluated the performance of our proposed scheme in terms of generalization accuracy and reduction of the size of the original training set. The experimental results showed that the proposed scheme is more effective than other
instance selection algorithms proposed in the past (DROP3-5) both in terms of prediction accuracy and reducing the size of training set.
References
1. Ghosh, A.K., Chaudhuri, P., Murthy, C.A.: On aggregation and visualization of nearest neighbor classifiers. IEEE Trans. Patt. Anal. Mach. Intel. 27, 1592–1602 (2005)
2. Ghosh, A.K., Chaudhuri, P., Murthy, C.A.: Multi-scale classification using nearest neighbor density estimates. IEEE Trans. Sys. Man Cyber. 36, 1139–1148 (2006)
3. Dasarathy, B.V.: Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA (1991)
4. Wilson, D.R., Martinez, T.R.: Reduction Techniques for Exemplar-Based Learning Algorithms. Mach. Lear. 38, 257–286 (2000)
5. Jankowski, N., Grochowski, M.: Comparison of Instances Selection Algorithm I. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)
6. Friedman, J.: Flexible metric nearest-neighbor classification. Technical Report 113, Stanford University Statistics Department (1994)
7. Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification. IEEE Trans. Patt. Anal. Mach. Intel. 18, 607–615 (1996)
8. Domeniconi, C., Peng, J., Gunopulos, D.: Locally adaptive metric nearest neighbor classification. IEEE Trans. Patt. Anal. Mach. Intel. 24, 1281–1285 (2002)
9. Wang, J., Neskovic, P., Cooper, L.N.: Improving nearest neighbor rule with a simple adaptive distance measure. Patt. Rec. Lett. 28, 207–213 (2007)
10. Cover, T.M., Hart, P.E.: Nearest Neighbor Pattern Classification. Trans. Info. Theo. 13, 21–27 (1967)
11. Wilson, D.: Asymptotic properties of nearest neighbor rule using edited data. IEEE Trans. Info. Theo. 18, 431–433 (1972)
12. Tomek, I.: An experiment with edited nearest neighbor rule. IEEE Trans. Sys. Man Cyber. 6, 448–452 (1972)
13. Hart, P.E.: The Condensed Nearest Neighbor Rule. IEEE Trans. Info. Theo. 14, 515–516 (1968)
14. Gates, G.W.: The Reduced Nearest Neighbor Rule. Trans. Info. Theo. 18, 431–433 (1972)
15. Wilson, D.R., Martinez, T.R.: An Integrated Instance-Based Learning Algorithm. Comp. Intel. 16, 1–28 (2000)
Accurate Identification of a Markov-Gibbs Model for Texture Synthesis by Bunch Sampling
Georgy Gimel’farb and Dongxiao Zhou
Department of Computer Science, Tamaki Campus, The University of Auckland, Auckland 1, New Zealand
http://www.tcs.auckland.ac.nz/~georgy
Abstract. A prior probability model is adapted to a class of images by identification, or parameter estimation from training data. We propose a new and accurate analytical identification of a generic Markov-Gibbs random field (MGRF) model with multiple pairwise interaction and use it for structural analysis and synthesis of textures.
1 Introduction
Prior MGRF models of spatially homogeneous images have been used for years [2, 15]. Such a model describes each class of images in terms of the explicit geometry and quantitative strength of inter-pixel interaction. A number of other image priors (see e.g. [12, 13]) have been proposed recently to reflect more closely the actual probability distributions of natural images. But much simpler MGRFs with pairwise interaction still quite often lead to effective solutions of many practical image analysis and synthesis problems. This paper proposes a new analytical identification of the MGRF that estimates Gibbs potentials and interaction geometry from first- and second-order statistics of a training image more accurately than before [5]. Each training image is more adequately assigned to aperiodic (stochastic) or nearly periodic (regular) textures, and its more adequate structural description in terms of arbitrary-shaped textons, or texels, and their spatial placement rules enhances our bunch sampling technique for fast realistic texture synthesis-via-analysis [16]. Section 2 derives the new accurate analytical approximation of the potentials for a generic MGRF. Section 3 discusses the improved structural descriptions of textures, presents synthetic stochastic and regular textures obtained with the enhanced bunch sampling, and compares it to other texture synthesis methods.
2 Potential Estimates for a Generic Markov-Gibbs Model
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 979–986, 2007. © Springer-Verlag Berlin Heidelberg 2007
Let $g : \mathbf{R} \to \mathbf{Q}$ be a digital image with a finite set, $\mathbf{Q} = \{0, 1, \ldots, Q-1\}$, of scalar signals $q$ (grey values or colour indices) on a finite arithmetic lattice, $\mathbf{R} = \{(x, y) : x = 0, 1, \ldots, M-1;\ y = 0, 1, \ldots, N-1\}$, of size $MN$ where $(x, y)$ are pixel coordinates. Spatially homogeneous pairwise interaction is described by a fixed neighbourhood set $\mathcal{N}$ of coordinate offsets for neighbours interacting with a pixel; each pixel $(x, y) \in \mathbf{R}$ interacts with the subset $\mathbf{E}_{x,y}$ of the neighbouring pixels, $\mathbf{E}_{x,y} = \{(x+\xi, y+\eta), (x-\xi, y-\eta) : (\xi, \eta) \in \mathcal{N}\} \cap \mathbf{R}$. Gibbs potential functions $V_{\xi,\eta} : \mathbf{Q}^2 \to (-\infty, \infty)$ of co-occurrences $(g_{x,y} = q,\ g_{x+\xi,y+\eta} = s)$; $(q, s) \in \mathbf{Q}^2$, describe the strength of interaction between each pixel pair $((x, y); (x+\xi, y+\eta)) \in \mathbf{E}_{x,y}$; $(x, y) \in \mathbf{R}$. The entire interaction is given by the potential $|\mathcal{N}|Q^2$-vector $\mathbf{V} = [V_{\xi,\eta}(q, s) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}]^{\mathsf T}$.
Let $\mathbf{F}(g) = [\rho_{\xi,\eta} f_{\xi,\eta}(q, s|g) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}]^{\mathsf T}$ be a vector of scaled empirical probabilities (relative frequencies) $f_{\xi,\eta}(q, s|g)$ of the neighbouring signal co-occurrences. Here, $\rho_{\xi,\eta} = |\mathbf{C}_{\xi,\eta}| / |\mathbf{R}|$ is the relative size of the set $\mathbf{C}_{\xi,\eta}$ of all pixel pairs in $\mathbf{R}$ with the offset $(\xi, \eta)$, and $\sum_{(q,s) \in \mathbf{Q}^2} f_{\xi,\eta}(q, s|g) = 1$ for all $(\xi, \eta) \in \mathcal{N}$. Given $\mathcal{N}$ and $\mathbf{V}$, a generic MGRF with multiple pairwise interaction is specified with a Gibbs probability distribution (GPD) [15]:
$$P_{\mathcal{N},\mathbf{V}}(g^\circ) = \exp\left(|\mathbf{R}|\,\mathbf{V}^{\mathsf T}\mathbf{F}(g^\circ)\right) \left[\sum_{g \in \mathcal{G}} \exp\left(|\mathbf{R}|\,\mathbf{V}^{\mathsf T}\mathbf{F}(g)\right)\right]^{-1} \qquad (1)$$
where $\mathcal{G} = \mathbf{Q}^{|\mathbf{R}|}$ is the parent population of the images. To identify the MGRF of Eq. (1), both the characteristic neighbourhood $\mathcal{N}$ and the potential $\mathbf{V}$ are to be estimated from a training image $g^\circ$. Given $g^\circ$ and $\mathcal{N}$, the specific log-likelihood of the potential is:
$$\ell(\mathbf{V}|g^\circ) = \frac{1}{|\mathbf{R}|} \log P_{\mathcal{N},\mathbf{V}}(g^\circ) = \mathbf{V}^{\mathsf T}\mathbf{F}(g^\circ) - \frac{1}{|\mathbf{R}|} \ln \sum_{g \in \mathcal{G}} \exp\left(|\mathbf{R}|\,\mathbf{V}^{\mathsf T}\mathbf{F}(g)\right) \qquad (2)$$
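An analytical approach similar to that in [5] yields
Proposition 1. Let the log-likelihood gradient, $\nabla\ell(\mathbf{V}|g^\circ) = \mathbf{F}(g^\circ) - E\{\mathbf{F}(g)|\mathbf{V}\}$, and the Hessian matrix, $\nabla^2\ell(\mathbf{V}|g^\circ) = -\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}}$, where $E\{\ldots|\mathbf{V}\}$ and $\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}}$ are the mathematical expectation and covariance matrix, respectively, of the scaled empirical probability vectors under the GPD of Eq. (1), be known for an image $g^\circ$ and potential $\mathbf{V}_0$. Then the first approximation of the MLE:
$$\mathbf{V}^* = \mathbf{V}_0 + \lambda^* \nabla\ell(\mathbf{V}_0|g^\circ); \quad \lambda^* = \frac{\nabla^{\mathsf T}\ell(\mathbf{V}_0|g^\circ)\,\nabla\ell(\mathbf{V}_0|g^\circ)}{\nabla^{\mathsf T}\ell(\mathbf{V}_0|g^\circ)\,\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}_0}\,\nabla\ell(\mathbf{V}_0|g^\circ)} \qquad (3)$$
maximises the second-order Taylor series expansion of the log-likelihood
$$\ell(\mathbf{V}|g^\circ) \approx \ell(\mathbf{V}_0|g^\circ) + \nabla^{\mathsf T}\ell(\mathbf{V}_0|g^\circ)(\mathbf{V} - \mathbf{V}_0) - \frac{1}{2}(\mathbf{V} - \mathbf{V}_0)^{\mathsf T}\,\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}_0}\,(\mathbf{V} - \mathbf{V}_0)$$
along the gradient from $\mathbf{V}_0$. The approximate MLEs of Eq. (3) are centred: $\sum_{(q,s) \in \mathbf{Q}^2} V^*_{\xi,\eta}(q, s) = 0$ for $(\xi, \eta) \in \mathcal{N}$.
The proof is straightforward. The vector $E\{\mathbf{F}(g)|\mathbf{V}\}$ of the scaled marginal co-occurrence probabilities and the covariance matrix $\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}}$ are explicitly known if the MGRF of Eq. (1) is an independent random field (IRF). The solution in [5] assumes the simplest IRF (denoted IRF$_0$ below) with zero potential $\mathbf{V}_0 = \mathbf{0}$ and equal co-occurrence probabilities, $p_{\xi,\eta}(q, s) = \frac{1}{Q^2}$; $(q, s) \in \mathbf{Q}^2$. In this case $\nabla\ell(\mathbf{0}|g^\circ) = \mathbf{F}_{\mathrm{cn}}(g^\circ)$ and $\mathbf{C}_{\mathbf{F}(g)|\mathbf{0}} \approx \frac{Q^2 - 1}{Q^4}\,\mathrm{Diag}[\rho_{\xi,\eta}\mathbf{u} : (\xi, \eta) \in \mathcal{N}]$ where $\mathbf{F}_{\mathrm{cn}}(g^\circ) = [\rho_{\xi,\eta} f_{\mathrm{cn};\xi,\eta}(q, s) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}]^{\mathsf T}$ is a vector of the scaled centred probabilities $f_{\mathrm{cn};\xi,\eta}(q, s) = f_{\xi,\eta}(q, s|g^\circ) - \frac{1}{Q^2}$ for the image $g^\circ$ such that $\sum_{(q,s) \in \mathbf{Q}^2} f_{\mathrm{cn};\xi,\eta}(q, s|g^\circ) = 0$ for all $(\xi, \eta) \in \mathcal{N}$; $\mathbf{u}$ is the $Q^2$-vector with unit components, and $\mathrm{Diag}[\ldots]$ denotes a diagonal matrix.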
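Corollary 1. The approximate MLE of Eq. (3) in the vicinity of $\mathbf{V}_0 = \mathbf{0}$ is
$$\mathbf{V}^* = \frac{\mathbf{F}_{\mathrm{cn}}^{\mathsf T}(g^\circ)\,\mathbf{F}_{\mathrm{cn}}(g^\circ)}{\mathbf{F}_{\mathrm{cn}}^{\mathsf T}(g^\circ)\,\mathbf{C}_{\mathrm{ind}}\,\mathbf{F}_{\mathrm{cn}}(g^\circ)}\,\mathbf{F}_{\mathrm{cn}}(g^\circ) \approx Q^2\,\mathbf{F}_{\mathrm{cn}}(g^\circ) \qquad (4)$$
Proof is straightforward. The right-hand simplification holds if $Q \gg 1$ and the lattice $\mathbf{R}$ is sufficiently large to make $\rho_{\xi,\eta} \approx 1$ for all $(\xi, \eta) \in \mathcal{N}$.
A considerably more accurate approximation of the MLE is obtained if, instead of the IRF$_0$, the starting point $\mathbf{V}_0$ in Proposition 1 relates to the IRF$_{g^\circ}$ with the same marginal signal distribution, $\mathbf{F}_{\mathrm{pix}} = [f_{\mathrm{pix}}(q|g^\circ) : q \in \mathbf{Q}]^{\mathsf T}$, as the training image $g^\circ$. It is easy to show that the IRF$_{g^\circ}$ is represented by a GPD similar to Eq. (1) but with the centred pixel-wise potential $Q$-vector $\mathbf{V}_{\mathrm{pix}} = \left[V_{\mathrm{pix}}(q) = \ln f_{\mathrm{pix}}(q|g^\circ) - \frac{1}{Q}\sum_{\kappa \in \mathbf{Q}} \ln f_{\mathrm{pix}}(\kappa|g^\circ) : q \in \mathbf{Q}\right]^{\mathsf T}$:
$$P_{\mathrm{irf}}(g^\circ) = \exp\left(|\mathbf{R}|\,\mathbf{V}_{\mathrm{pix}}^{\mathsf T}\mathbf{F}_{\mathrm{pix}}(g^\circ)\right) \left(\sum_{q \in \mathbf{Q}} \exp\left(V_{\mathrm{pix}}(q)\right)\right)^{-|\mathbf{R}|} \qquad (5)$$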
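As a quick numerical illustration of Eq. (5), the following Python sketch builds the centred pixel-wise potential from a marginal signal distribution and checks that the resulting log-probability equals the usual log-probability of independent pixels. The function names and the toy distribution are hypothetical, and every grey level is assumed to occur with non-zero frequency so that the logarithms are defined.

```python
import math

def centred_pixel_potential(f_pix):
    """V_pix(q) = ln f_pix(q) minus the mean of ln f_pix over all levels."""
    logs = [math.log(p) for p in f_pix]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def log_prob_irf(counts, v_pix):
    """ln P_irf(g) from Eq. (5) for an image with the given level counts."""
    n_pixels = sum(counts)
    freqs = [c / n_pixels for c in counts]
    energy = n_pixels * sum(v * f for v, f in zip(v_pix, freqs))
    log_norm = n_pixels * math.log(sum(math.exp(v) for v in v_pix))
    return energy - log_norm

# Toy image: 10 pixels with three grey levels occurring 5, 3 and 2 times.
f_pix = [0.5, 0.3, 0.2]
counts = [5, 3, 2]
v_pix = centred_pixel_potential(f_pix)

# Direct log-probability of independent pixels with marginals f_pix.
direct = sum(c * math.log(p) for c, p in zip(counts, f_pix))
via_gibbs = log_prob_irf(counts, v_pix)
```

The two values agree up to rounding because the centring constant removed from the potential is restored by the normalising factor.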
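Proposition 2. Assume that $\sum_{s \in \mathbf{Q}} f_{\xi,\eta}(q, s|g) = \sum_{s \in \mathbf{Q}} f_{\xi,\eta}(s, q|g) = f_{\mathrm{pix}}(q|g)$ for all $(\xi, \eta) \in \mathcal{N}$ and $g \in \mathcal{G}$. Then the potential of the pairwise interaction
$$\mathbf{V}_0 = \left[V_{\mathrm{irf}:\xi,\eta}(q, s) = \frac{1}{2\rho_{\xi,\eta}|\mathcal{N}|}\left(V_{\mathrm{pix}}(q) + V_{\mathrm{pix}}(s)\right) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}\right]^{\mathsf T}$$
reduces the MGRF of Eq. (1) to the IRF$_{g^\circ}$ with the marginal signal probability distribution $\mathbf{F}_{\mathrm{pix}}(g^\circ)$.
Proof. The assumption holds precisely for the marginal signal co-occurrence and marginal signal probability distributions and therefore it is asymptotically valid for the empirical distributions, too, provided the lattice $\mathbf{R}$ is sufficiently large to ignore deviations due to border effects. In this case the normalised exponent of the MGRF is as follows:
$$\sum_{(\xi,\eta) \in \mathcal{N}} \rho_{\xi,\eta} \sum_{(q,s) \in \mathbf{Q}^2} V_{\mathrm{irf}:\xi,\eta}(q, s)\, f_{\xi,\eta}(q, s|g) = \sum_{q \in \mathbf{Q}} V_{\mathrm{pix}}(q)\, f_{\mathrm{pix}}(q|g)$$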
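With $\mathbf{V}_0$ from Proposition 2, the log-likelihood gradient $\nabla\ell(\mathbf{V}_0|g^\circ) = \mathbf{\Delta}(g^\circ)$ where $\mathbf{\Delta}(g^\circ) = [\rho_{\xi,\eta}(f_{\xi,\eta}(q, s|g^\circ) - f_{\mathrm{pix}}(q|g^\circ) f_{\mathrm{pix}}(s|g^\circ)) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}]^{\mathsf T}$ is the $|\mathcal{N}|Q^2$-vector of the scaled differences between the empirical co-occurrence probability for $g^\circ$ and the co-occurrence probability for the IRF$_{g^\circ}$. The covariance matrix $\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}_0}$ is closely approximated by the diagonal matrix $\mathbf{C}_{\mathrm{irf}} = \mathrm{Diag}[\rho_{\xi,\eta}\,\mathrm{var}_{q,s} : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}]$ where $\mathrm{var}_{q,s}$ is the variance of the latter probability: $\mathrm{var}_{q,s} = f_{\mathrm{pix}}(q|g^\circ) f_{\mathrm{pix}}(s|g^\circ)\left(1 - f_{\mathrm{pix}}(q|g^\circ) f_{\mathrm{pix}}(s|g^\circ)\right)$.
Proposition 3. The first approximation of the potential MLE in the vicinity of the point $\mathbf{V}_0$ from Proposition 2 in the potential space is $\mathbf{V}^* = \mathbf{V}_0 + \lambda^* \mathbf{\Delta}(g^\circ)$ with the maximising factor
$$\lambda^* = \frac{\mathbf{\Delta}^{\mathsf T}(g^\circ)\,\mathbf{\Delta}(g^\circ)}{\mathbf{\Delta}^{\mathsf T}(g^\circ)\,\mathbf{C}_{\mathrm{irf}}\,\mathbf{\Delta}(g^\circ)} = \frac{\sum_{(\xi,\eta) \in \mathcal{N}} \rho_{\xi,\eta}^2 \sum_{(q,s) \in \mathbf{Q}^2} \Delta_{\xi,\eta;q,s}^2}{\sum_{(\xi,\eta) \in \mathcal{N}} \rho_{\xi,\eta}^3 \sum_{(q,s) \in \mathbf{Q}^2} \mathrm{var}_{q,s}\,\Delta_{\xi,\eta;q,s}^2} \qquad (6)$$
where $\Delta_{\xi,\eta;q,s} = f_{\xi,\eta}(q, s|g^\circ) - f_{\mathrm{pix}}(q|g^\circ) f_{\mathrm{pix}}(s|g^\circ)$ denotes the unscaled difference.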
The approximate estimate of Proposition 3 coincides with the actual potential for all the IRFs. Therefore, it is considerably closer to the actual MLE for a generic MGRF than the approximation of Corollary 1.
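A small pure-Python sketch of the estimate of Proposition 3 is given below. It is only an illustration under stated assumptions, not the authors' implementation: the image is a list of rows of grey levels, pairs are counted from each pixel to its neighbour at the given offset, every grey level is assumed to occur at least once, each offset is assumed to fit inside the image, and all function names are hypothetical.

```python
import math
from collections import Counter

def cooccurrence(image, offset, q_levels):
    """Relative frequencies f(q, s | g) of signal pairs at the given offset
    and the relative size rho = |C| / |R| of the pair set."""
    xi, eta = offset
    height, width = len(image), len(image[0])
    counts = Counter()
    for y in range(height):
        for x in range(width):
            x2, y2 = x + xi, y + eta
            if 0 <= x2 < width and 0 <= y2 < height:
                counts[(image[y][x], image[y2][x2])] += 1
    n_pairs = sum(counts.values())
    freqs = {(q, s): counts[(q, s)] / n_pairs
             for q in range(q_levels) for s in range(q_levels)}
    return freqs, n_pairs / (width * height)

def potential_estimate(image, offsets, q_levels):
    """Potential V* = V0 + lambda* Delta of Proposition 3 and lambda* of Eq. (6)."""
    n_pixels = len(image) * len(image[0])
    pix_counts = Counter(v for row in image for v in row)
    f_pix = [pix_counts[q] / n_pixels for q in range(q_levels)]
    logs = [math.log(p) for p in f_pix]  # assumes every level occurs
    mean_log = sum(logs) / q_levels
    v_pix = [v - mean_log for v in logs]

    stats = {off: cooccurrence(image, off, q_levels) for off in offsets}
    num = den = 0.0
    for freqs, rho in stats.values():
        for (q, s), f in freqs.items():
            p = f_pix[q] * f_pix[s]
            delta = f - p  # unscaled difference
            num += rho ** 2 * delta ** 2
            den += rho ** 3 * p * (1.0 - p) * delta ** 2
    lam = num / den if den > 0 else 0.0

    potential = {}
    for off, (freqs, rho) in stats.items():
        for (q, s), f in freqs.items():
            v0 = (v_pix[q] + v_pix[s]) / (2.0 * rho * len(offsets))
            potential[(off, q, s)] = v0 + lam * rho * (f - f_pix[q] * f_pix[s])
    return potential, lam

# Binary image with vertical stripes: equal signals vertically,
# alternating signals horizontally.
image = [[0, 1, 0, 1] for _ in range(4)]
offsets = [(1, 0), (0, 1)]
potential, lam = potential_estimate(image, offsets, 2)
```

For this striped image the estimated potential is positive for equal signals at the vertical offset and negative for equal signals at the horizontal offset, which matches the visible structure. The dictionary keyed by (offset, q, s) is merely a convenient stand-in for the potential vector.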
3 Experimental Results and Conclusions
The potential of Proposition 3 results in a more informative model-based interaction map (MBIM) [5] of energies $e_{\xi,\eta}(g^\circ) = \rho_{\xi,\eta} \sum_{(q,s) \in \mathbf{Q}^2} V_{\xi,\eta}(q, s)\, F_{\xi,\eta}(q, s|g^\circ)$ over the neighbourhoods: $\mathrm{MBIM}(g^\circ) = \{e_{\xi,\eta}(g^\circ) : (\xi, \eta) \in \mathbf{W}\}$ where $\mathbf{W} = \{(\xi, \eta) : |\xi| \le \xi_{\max};\ |\eta| \le \eta_{\max}\}$ is a set of all inter-pixel coordinate offsets such that the longest anticipated interaction is below $(\xi_{\max}, \eta_{\max})$, i.e. $\mathcal{N} \subset \mathbf{W}$. The grey-coded MBIMs for several training stochastic and regular textures from [1, 10] are shown in Fig. 1 (the darker the pixel, the higher the energy).
Fig. 1. Stochastic (upper) and regular (bottom) Brodatz [1] and MIT VisTex [10] textures: training samples 128 × 128, their grey-coded MBIMs 85 × 85 (ξmax = ηmax = 42), and neighbourhoods N estimated by thresholding of their MBIMs for the potential estimates of Corollary 1 (a) and Proposition 3 (b). Note that the estimate (a) fails for D034 and separates much less neatly the higher energies in other textures.
Stochastic textures   Bark0009   D004   D012   D024   D029   D066   D105   Metal0005
(a) |N|               17         7      20     5      10     32     22     4
(b) |N|               20         10     22     10     13     32     45     5
Regular textures      D001       D006   D020   D034   D052   D076   D077   D101
(a) |N|               14         19     30     4      17     24     11     40
(b) |N|               27         52     56     24     35     34     20     62
The visual pattern of a texture depends mostly on a small characteristic group of neighbours with higher energies. The MBIM with the potential of Proposition 3 discriminates more distinctly between the characteristic and non-characteristic energies so that simple energy thresholds, e.g. heuristic [5] or unimodal thresholding [11], produce a notably better characteristic neighbourhood N to derive texels and their placement rules for bunch sampling [16]. Also, local energy clusters in the (ξ, η)-plane of an MBIM allow us to easily discriminate between stochastic (aperiodic) and regular (periodic) textures. Figure 2 presents synthetic textures obtained by the improved bunch sampling from different regular and stochastic training images. Visual fidelity is preserved in most of these synthetic textures. However, the bunch sampling averages local geometric and signal deviations typical for weakly homogeneous textures such that their geometric structure varies across the image. As a result, it produces only their rectified (precisely periodic) versions and fails dramatically on textures
with aperiodic irregular shapes and/or arbitrary placement of local elements, for which pairwise interaction alone is simply inadequate. Today’s most popular texture synthesis techniques [3, 4, 6, 7, 14] are based on non-parametric seamless replication of image patches. The bunch sampling resembles them in that signals retrieved from the training image are used to
984
G. Gimel’farb and D. Zhou
D006
D014
D020
D034
D052
D101
D102
Tile0007
D004
D024
Bark0009
Water0002
Flower0004
Metal0005
Grass0001
Fabric0016
Cans
Weave
Dots
Floor
Flowers
Flora
Knit
Design
Fig. 2. Bunch sampling synthesis of regular and stochastic textures from [1, 7, 10] (the training 128 × 128 and synthetic 360 × 360 images)
Accurate Identification of a Markov-Gibbs Model
Training image
Placement grid
Bunch sampling
Method in [4]
Method in [3]
Method in [14]
Method in [8]
Method in [6]
985
Fig. 3. Bunch sampling vs. other methods (the images are taken in part from [8])
form synthetic textures. But instead of a texture model, these methods rely on local neighbourhood matching to constrain the selection of the training signals and replicate the texture features. Therefore, these methods are unaware of the global structure of a texture and encounter problems in determining the proper size and shape of the local neighbourhood for capturing features at various scales. In this case, only an interaction with the user can provide an adequate neighbourhood that notably varies from one texture to another. If a texture is built sequentially, pixel by pixel, the accumulated errors eventually may destroy the desired pattern. Due to only local constraints, the non-parametric methods may outperform the bunch sampling in catching local variations of stochastic textures but are weak in reproducing regular textures. The basic idea of an alternative synthesis of regular textures in [9] is similar to the bunch sampling. But the periodicity of a training texture is recovered from translational symmetries of an autocorrelation function that describes statistics of pairwise signal products over the image. As a result, the estimated interaction structure is less definite than with more general statistics of pairwise signal co-occurrences. Another difference from the bunch sampling is that large image tiles are used in [9] as construction units for texture synthesis. Each tile is cut out from the training image and then placed into the synthetic texture in line with the estimated placement grid. As a result, the overlapping regions have to be blended at seams to avoid visual disruption. Figure 3 compares results of the synthesis of the colour texture “Mesh” using the bunch sampling, the non-parametric sampling in [3, 4, 6, 14], and the method proposed in [8]. It is worthy to note that only the bunch sampling discovers the periodicity of this texture; the placement grid found is also shown in Fig. 3. 
Therefore, this training image is perceived as a perfect regular texture only by the bunch sampling, while all other methods consider it a weakly homogeneous one due to intricate local neighbourhoods. Each natural texture mixes both global and local features, so that neither the model-based bunch sampling nor the non-parametric approaches alone can solve the rich variety of texture analysis and synthesis problems. Future solutions should combine global and local texture representations. At the same time, the derived accurate identification of a generic MGRF prior produces more adequate global descriptions of the training images and is simple enough to be of practical interest by itself.
References
1. Brodatz, P.: Textures: A Photographic Album for Artists and Designers. Dover, New York (1966)
2. Cross, G.R., Jain, A.K.: Markov random fields texture models. IEEE Trans. Pattern Anal. Machine Intell. 5, 25–39 (1983)
3. Efros, A.A., Freeman, W.T.: Image quilting for texture synthesis and transfer. In: Proc. ACM Computer Graphics Conf. SIGGRAPH 2001, pp. 341–346. ACM Press, N.Y. (2001)
4. Efros, A.A., Leung, T.K.: Texture synthesis by non-parametric sampling. In: Proc. 7th Int. Conf. Computer Vision (ICCV 1999), pp. 1033–1038. IEEE CS Press, Los Alamitos (1999)
5. Gimel’farb, G.L.: Image Textures and Gibbs Random Fields. Kluwer Academic, Dordrecht (1999)
6. Kwatra, V., Schödl, A., Essa, I.A., Turk, G., Bobick, A.: Graphcut textures: Image and video synthesis using graph cuts. In: Proc. ACM Computer Graphics Conf. SIGGRAPH 2003, pp. 277–286. ACM Press, N.Y. (2003)
7. Liang, L., Liu, C., Shum, H.Y.: Real-time texture synthesis by patch-based sampling. Technical Report MSR-TR-2001-40, Microsoft Research (2001)
8. Liu, Y., Lin, W.-C.: Deformable texture: the irregular-regular-irregular cycle. In: Proc. 3rd Int. Workshop Texture Analysis and Synthesis (Texture 2003), pp. 65–70. Heriot-Watt Univ., Edinburgh (2003)
9. Liu, Y., Tsin, Y., Lin, W.: The promise and perils of near-regular texture. Int. J. Computer Vision 62, 145–159 (2005)
10. Picard, R., Graszyk, C., Mann, S., et al.: VisTex Database. MIT Media Lab, Cambridge, Mass., USA (1995)
11. Rosin, P.: Unimodal thresholding. Pattern Recognition 34, 2083–2096 (2001)
12. Roth, S., Black, M.J.: Fields of Experts: A framework for learning image priors. In: Proc. IEEE CS Conf. Computer Vision Pattern Recognition (CVPR 2005), pp. 860–867. IEEE CS Press, Los Alamitos (2005)
13. Srivastava, A., Liu, X., Grenander, U.: Universal analytical forms for modeling image probabilities. IEEE Trans. Pattern Anal. Machine Intell. 24, 1200–1214 (2002)
14. Wei, L., Levoy, M.: Fast texture synthesis using tree-structured vector quantization. In: Proc. ACM Computer Graphics Conf. SIGGRAPH 2000, pp. 479–488. ACM Press / Addison Wesley Longman, New York (2000)
15. Winkler, G.: Image Analysis, Random Fields and Dynamic Monte Carlo Methods. Springer, Berlin (1995)
16. Zhou, D., Gimel’farb, G.: Bunch sampling for fast texture synthesis. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 124–131. Springer, Heidelberg (2003)
Texture Defect Detection
Michal Haindl, Jiří Grim, and Stanislav Mikeš
Institute of Information Theory and Automation, Academy of Sciences CR, 182 08 Prague, Czech Republic
{haindl,grim,xaos}@utia.cas.cz
Abstract. This paper presents a fast multispectral texture defect detection method based on the underlying three-dimensional spatial probabilistic image model. The model first adaptively learns its parameters on the flawless texture part and subsequently checks for texture defects using the recursive prediction analysis. We provide colour textile defect detection results that indicate the advantages of the proposed method.
1 Introduction
Traditional manual inspection of material surfaces is labor- and cost-intensive and is a major bottleneck in high-speed production lines [1]. Many defects are very difficult to detect manually; it is estimated [2] that a highly trained human operator can detect about 60% to 70% of leather material defects. The advantages of automated visual inspection are well known: repeatability, reliability, and accuracy. Unfortunately, very few practical automated systems for the inspection of textile surfaces are available, mainly due to their computational costs [3]. Texture imperfections are either non-textured or differently textured patches that locally disrupt the homogeneity of a texture image [4]. Quality control is a topical issue [5] in manufacturing, designed to ensure that defective products are not allowed to reach the customer. Since in many areas the quality of a surface is best characterized by its texture, texture analysis plays an important role in automatic visual inspection of surfaces. The major textile texture defects reported by [5] were missing threads (causing dark lines in the image), gathered knots and oil stains (causing small dark regions in the image), gathered threads (causing dark curves in the image), and tiny holes in the fabric. Due to the nature of the weaving process, the majority of defects on the textile web occur along two directions, i.e. horizontal and vertical [1]. Defect detection in textured materials can be a very subjective task, since defects can be very subtle and not well localized in space, which may lead to a small modification in the power spectrum. The conventional approach [4] is to compute texture features in a local subwindow and to compare them with reference values representing a perfect pattern. The method [5] preprocesses a gray level textile texture with histogram modification and median filtering.
The image is subsequently thresholded using the 2D CAR model predictor and finally smoothed with another median filter run. Another approach for the detection of gray level texture defects using linear FIR filters with optimized energy separation was proposed in [1]. Similarly, the defect detection in [2] is based on a set of optimized filters applied to wavelet subbands and tuned for a defect type. The method in [3] uses translation invariant 2-D RI-Spline wavelets for textile surface inspection. The gray level texture is removed using the wavelet shrinkage approach and defects are subsequently detected by simple thresholding. Contrary to the above approaches, the presented method uses the multispectral information. The organization of this paper is as follows. First, a concise description of the underlying texture model as well as the model selection criterion is given. The third section summarizes the core part of the detection algorithm, followed by the experimental results and conclusions sections.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 987–994, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Texture Representation
We assume that a multispectral textured image can be represented by an adaptive 3D causal simultaneous autoregressive model [6]:

$$X_r = \sum_{s \in I_r^c} A_s X_{r-s} + \epsilon_r , \qquad (1)$$

where $\epsilon_r$ is a white Gaussian noise vector with zero mean and a constant but unknown covariance matrix $\Sigma$, and $r$, $s$ are multiindices. The noise vector is uncorrelated with data from a causal neighbourhood $I_r^c$,

$$A_{s_1,s_2} = \begin{pmatrix} a^{s_1,s_2}_{1,1} & \dots & a^{s_1,s_2}_{1,d} \\ \vdots & \ddots & \vdots \\ a^{s_1,s_2}_{d,1} & \dots & a^{s_1,s_2}_{d,d} \end{pmatrix}$$

are $d \times d$ parameter matrices, where $d$ is the number of spectral bands, and $r, r-1, \dots$ is a chosen direction of movement on the image lattice (e.g. scanning lines rightward and top to down). This model can be analytically estimated using numerically robust recursive statistics; hence it is exceptionally well suited for possibly real-time texture defect detection applications. The model adaptivity is introduced using the standard exponential forgetting factor technique in the parameter learning part of the algorithm. The model can be written in the matrix form

$$X_r = \gamma Z_r + \epsilon_r , \qquad (2)$$
where $\gamma = [A_1, \dots, A_\eta]$, $\eta = \mathrm{card}(I_r^c)$, is a $d \times d\eta$ parameter matrix and $Z_r$ is the corresponding vector of $X_{r-s}$. To evaluate the conditional mean values $E\{X_r \mid X^{(r-1)}\}$, the one-step-ahead prediction posterior density $p(X_r \mid X^{(r-1)})$ is needed. If we assume the normal-gamma parameter prior for parameters in (1) (alternatively, we can assume the Jeffreys parameter prior), this posterior density has the form of Student's probability density

$$p(X_r \mid X^{(r-1)}) = \frac{\Gamma\left(\frac{\beta(r)-d\eta+3}{2}\right) \pi^{-\frac{1}{2}} \, |\lambda_{(r-1)}|^{-\frac{1}{2}}}{\Gamma\left(\frac{\beta(r)-d\eta+2}{2}\right) \left(1 + Z_r^T V_{zz(r-1)}^{-1} Z_r\right)^{\frac{1}{2}}} \left(1 + \frac{(X_r - \hat\gamma_{r-1} Z_r)^T \lambda_{(r-1)}^{-1} (X_r - \hat\gamma_{r-1} Z_r)}{1 + Z_r^T V_{zz(r-1)}^{-1} Z_r}\right)^{-\frac{\beta(r)-d\eta+3}{2}} , \qquad (3)$$
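with $\beta(r) - d\eta + 2$ degrees of freedom, where the following notation is used:

$$\beta(r) = \beta(0) + r - 1 , \qquad (4)$$

$$\hat\gamma_{r-1}^T = V_{zz(r-1)}^{-1} V_{zx(r-1)} , \qquad (5)$$

$$V_{r-1} = \begin{pmatrix} \tilde V_{xx(r-1)} & \tilde V_{zx(r-1)}^T \\ \tilde V_{zx(r-1)} & \tilde V_{zz(r-1)} \end{pmatrix} + I , \qquad (6)$$

$$\tilde V_{uw(r-1)} = \sum_{k=1}^{r-1} U_k W_k^T , \qquad (7)$$

$$\lambda_{(r)} = V_{x(r)} - V_{zx(r)}^T V_{z(r)}^{-1} V_{zx(r)} , \qquad (8)$$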
where $\beta(0) > 1$ and $U$, $W$ denote either the $X$ or the $Z$ vector, respectively. If $\beta(r-1) > \eta$, then the conditional mean value is

$$E\{X_r \mid X^{(r-1)}\} = \hat\gamma_{r-1} Z_r \qquad (9)$$

and it can be efficiently computed using the following recursion:

$$\hat\gamma_r^T = \hat\gamma_{r-1}^T + \frac{V_{z(r-1)}^{-1} Z_r (X_r - \hat\gamma_{r-1} Z_r)^T}{1 + Z_r^T V_{z(r-1)}^{-1} Z_r} .$$
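The recursion above can be illustrated with a minimal sketch, written under several assumptions that are not stated in the text: the inverse matrix $V_z^{-1}$ is propagated with the Sherman-Morrison identity (assuming $V_z$ is symmetric and grows by $Z_r Z_r^T$ at each step), the exponential forgetting factor is omitted, and the function name `car_update` and the toy data in `demo` are purely illustrative.

```python
import random


def car_update(gamma, v_inv, z, x):
    """One recursive step: predict X_r, then update gamma and V_z^{-1}.

    gamma : d x m list of lists, current parameter estimate (m = d * eta)
    v_inv : m x m list of lists, current inverse of V_z (assumed symmetric)
    z     : length-m data vector Z_r
    x     : length-d observation X_r
    """
    d, m = len(gamma), len(z)
    # One-step-ahead prediction, eq. (9): gamma_{r-1} Z_r
    pred = [sum(gamma[i][j] * z[j] for j in range(m)) for i in range(d)]
    err = [x[i] - pred[i] for i in range(d)]
    # V^{-1} Z_r and the scalar 1 + Z_r^T V^{-1} Z_r
    vz = [sum(v_inv[j][k] * z[k] for k in range(m)) for j in range(m)]
    denom = 1.0 + sum(z[j] * vz[j] for j in range(m))
    # Transposed form of the recursion: gamma_r = gamma_{r-1} + err (V^{-1} Z_r)^T / denom
    new_gamma = [[gamma[i][j] + err[i] * vz[j] / denom for j in range(m)]
                 for i in range(d)]
    # Sherman-Morrison update of the inverse (assumption: V_r = V_{r-1} + Z_r Z_r^T)
    new_v_inv = [[v_inv[j][k] - vz[j] * vz[k] / denom for k in range(m)]
                 for j in range(m)]
    return pred, new_gamma, new_v_inv


def demo():
    """Learn a hypothetical 2 x 2 parameter matrix from noise-free toy data."""
    rng = random.Random(0)
    true_gamma = [[0.5, -0.2], [0.1, 0.3]]
    gamma = [[0.0, 0.0], [0.0, 0.0]]
    v_inv = [[1.0, 0.0], [0.0, 1.0]]  # identity start, mirroring the "+ I" in (6)
    for _ in range(500):
        z = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        x = [sum(g * zj for g, zj in zip(row, z)) for row in true_gamma]
        _, gamma, v_inv = car_update(gamma, v_inv, z, x)
    return gamma
```

Because no matrix is inverted inside the loop, each step costs on the order of $(d\eta)^2$ operations, which is in line with the real-time suitability claimed for the model.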
The selection of an appropriate model support ($I_r^c$) is important to obtain good restoration results. If the contextual neighbourhood is too small, it cannot capture all details of the random field. Inclusion of unnecessary neighbours, on the other hand, adds to the computational burden and can potentially degrade the performance of the model as an additional source of noise. The optimal Bayesian decision rule for minimizing the average probability of decision error chooses the maximum posterior probability model, i.e., a model $M_i$ corresponding to $\max_j \{p(M_j \mid X^{(r-1)})\}$. If we assume a uniform prior for all tested support sets (models), the solution for the optimal model support ($I_r^c$) can be found analytically. The most probable model given past data is the model $M_i(I_{r,i}^c)$ for which $i = \arg\max_j \{D_j\}$,

$$D_j = -\frac{1}{2} \ln |V_{z(r-1)}| - \frac{\alpha(r)}{2} \ln |\lambda_{(r-1)}| + \frac{d\eta}{2} \ln \pi + \ln \Gamma\left(\frac{\alpha(r)}{2}\right) - \ln \Gamma\left(\frac{\beta(0) - d\eta + 2}{2}\right) , \qquad (10)$$

where $\alpha(r) = \beta(r) - d\eta + 2$.
3 Defect Detection
Fig. 1. Defect masks used in this study for experimental texture mosaics (panels: mask 1, mask 2)

Single multispectral pixels are classified as belonging to the defective area based on their corresponding prediction errors. If the prediction error is larger than the adaptive threshold,

$$\left| E\{X_r \mid X^{(r-1),(s-1)}\} - X_r \right| > \frac{2.7}{l} \sum_{i=1}^{l} \left| E\{X_{r-i} \mid X^{(r-i-1),(s-i-1)}\} - X_{r-i} \right| , \qquad (11)$$
then the pixel $r$ is classified as a detected defect pixel. Here $l$ in (11) is the process history length of the adaptive threshold, and the constant 2.7 was found experimentally. The one-step-ahead predictor

$$E\{X_r \mid X^{(r-1),(s-1)}\} = \hat\gamma_{s-1} Z_r \qquad (12)$$

differs from the corresponding predictor (9) in using parameters $\hat\gamma_{s-1}$, which were learned only in the flawless texture area ($s < r$). The whole algorithm is extremely fast because the adaptive threshold is updated recursively:

$$|\varepsilon_{r+1}| > \frac{2.7}{l} \left( \sum_{i=1}^{l} |\varepsilon_{r-i}| + |\varepsilon_r| - |\varepsilon_{r-l}| \right) ,$$

where $\varepsilon_r$ is the prediction error $\varepsilon_r = E\{X_r \mid X^{(r-1),(s-1)}\} - X_r$ and $\hat\gamma_{s-1}$ is the parametric matrix, which does not change. Hence the algorithm can be easily applied in real-time surface quality control.
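A minimal sketch of the recursively updated threshold test may help. It assumes that each pixel's prediction error has already been reduced to a single non-negative magnitude (how the multispectral error vector is reduced is not specified in the text), that no decision is made until $l$ errors are available, and that the function and parameter names are hypothetical.

```python
from collections import deque


def detect_defects(abs_errors, history=5, factor=2.7):
    """Return indices whose error exceeds factor/history times the sum of
    the preceding `history` error magnitudes (the adaptive threshold)."""
    window = deque()      # the last `history` error magnitudes
    running_sum = 0.0     # their sum, maintained recursively
    flagged = []
    for r, err in enumerate(abs_errors):
        # Test only once a full history of previous errors exists.
        if len(window) == history and err > factor / history * running_sum:
            flagged.append(r)
        # Recursive update: add the newest error, drop the oldest one.
        window.append(err)
        running_sum += err
        if len(window) > history:
            running_sum -= window.popleft()
    return flagged
```

Each pixel costs one addition, one subtraction, and one comparison, independent of the history length. Note that in this literal reading a detected defect also enters the history and temporarily raises the threshold for the following pixels.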
4 Experimental Results
The presented method was tested on a set of artificially damaged 512 × 512 colour textile textures, so the ground truth (Fig. 1) for every pixel is well known and cannot be influenced by a subjective evaluation. All tested images are colour ($d = 3$); however, it is obvious that the method allows any number of spectral bands. The performance of the algorithm is tested using the usual recall ($r$), precision ($p$), and type II error ($II$) criteria. Let us denote the number of defect pixels $n_d$, the number of pixels interpreted as defect pixels $n_i$, and the number of correctly interpreted defect pixels $n_c$. The performance criteria are then as follows:

$$r = \frac{n_c}{n_d} , \qquad p = \frac{n_c}{n_i} , \qquad II = \frac{n_i - n_c}{n - n_d} .$$
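As a small worked illustration, the three criteria can be computed from two binary masks as sketched below. The symbol $n$ is not defined in the text; the sketch assumes it is the total number of pixels, and it returns 0.0 whenever a denominator is zero, which is also an assumption.

```python
def defect_metrics(truth, detected):
    """Recall, precision and type II error from two flat binary masks.

    truth    : ground-truth defect mask (truthy = defect pixel)
    detected : mask produced by the detector (truthy = reported defect)
    """
    n = len(truth)                      # assumed: total number of pixels
    n_d = sum(1 for t in truth if t)    # defect pixels
    n_i = sum(1 for m in detected if m)  # pixels interpreted as defects
    n_c = sum(1 for t, m in zip(truth, detected) if t and m)  # correct ones
    recall = n_c / n_d if n_d else 0.0
    precision = n_c / n_i if n_i else 0.0
    type_ii = (n_i - n_c) / (n - n_d) if n != n_d else 0.0
    return recall, precision, type_ii
```

For example, with eight pixels, three true defect pixels, and three reported pixels of which two are correct, the sketch yields $r = p = 2/3$ and $II = 1/5$.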
Fig. 2. Defect detection on texture mosaics (a-c) using the defect mask Fig.1-left (columns: mosaic, detected defect, prediction error map)

Fig. 3. Monitoring of the pemphigus vulgaris skin disease progress (columns: skin, detected disease, prediction error map)
Recall estimates the probability that the reference pixels will be correctly assigned, precision is the defect accuracy estimate relative to the error due to wrong assignment, and the type II error estimates the probability of the commission error. All these criteria have the range [0, 1]. Single defects were simulated by replacing irregular parts of textile textures with a different but as similar as possible textile texture. All textures are from our large colour texture database (more than 1000 high resolution colour textures categorized into 10 thematic classes). All results presented are without any postprocessing, such as isolated defect pixel filtering, to demonstrate the basic performance of the presented method.

Fig. 4. Defect detection on texture mosaics (a-e) using the defect mask Fig.1-right (columns: mosaic, detected defect, prediction error map)

Table 1. Performance criteria

mosaic (row)   recall r   precision p   II
Fig.2 - a      0.22       0.01          0.09
Fig.2 - b      1.00       0.70          0.00
Fig.2 - c      0.11       0.62          0.00
Fig.4 - a      1.00       0.70          0.00
Fig.4 - b      0.93       0.10          0.02
Fig.4 - c      0.92       0.19          0.01
Fig.4 - d      0.34       0.11          0.01
Fig.4 - e      0.93       0.71          0.00
Fig.5 - a      0.06       0.00          0.45
Fig.5 - b      0.13       0.00          0.21
Table 2. Time performance on the HP 9000/785 Unix machine

η   learning [s]   detection [s]
3   1              7
8   12             15

Fig. 5. Failures on highly structured textures (columns: mosaic, detected defect, prediction error map)
Figs. 2 and 4 exhibit correct defect detection, which is also clearly visible in the corresponding prediction error maps. Tab. 1 shows robust performance with high recall values even for hardly visible defects (Fig. 4-b,e). Even for lower recall values (Fig. 2-a,c, Fig. 4-d) the defect is clearly outlined. As expected, precision is low and the type II error is high in the failure examples (Fig. 5). Finally, the method was successfully evaluated in a skin disease treatment progress monitoring application. Fig. 3 illustrates a patient with the pemphigus vulgaris skin disease and the automatically detected regions, which are subsequently compared with the previous examination to monitor the efficiency of the treatment. Defects were detected using simple models with causal neighbourhoods containing either 3 or 8 sites ($\eta \in \{3, 8\}$) (Tab. 2) and adaptive learning on an uncorrupted quarter of every texture mosaic. The processing time in Tab. 2 is for unoptimized code and can easily be decreased further. Fig. 5 indicates the type of highly structured textures that are beyond the capabilities of the presented simple probabilistic model. Although the defect was correctly detected even in these examples, the method simultaneously detects large textile design patterns, which cannot be distinguished from the defect. A possible solution is to filter out these design artefacts using prior information such as regularity, size, or spectral content.
5 Conclusions
Most published texture defect detection methods do not use multispectral information. Our method takes advantage of both the multispectral and the spatial information. The method is simple, extremely fast, and robust in comparison with these alternative methods. The results are encouraging: all simulated defects on fine granularity textile textures were correctly localized, as were sick skin patches in a real dermatology application. The method will fail on highly structured textures due to the limited low-frequency modelling power of the underlying probabilistic model. The presented method can be easily generalized to defect detection in gradually changing textures (e.g. in illumination or colour) by exploiting its adaptive learning capabilities.
Acknowledgments. This research was supported by the EC project FP6-507752 MUSCLE, grants A2075302 and 1ET400750407 of the Grant Agency of the Academy of Sciences CR, and partially by the MŠMT grants 1M0572 and 2C06019.
References
1. Kumar, S., Hebert, M.: Discriminative random fields: A discriminative framework for contextual interaction in classification. In: iccv03, Nice, France, pp. 1150–1157 (2003)
2. Sobral, J.L.: Optimised filters for texture defect detection. In: International Conference on Image Processing, vol. III, pp. 565–568 (2005)
3. Fujiwara, H., Zhang, Z., Toda, H., Kawabata, H.: Textile surface inspection by using translation invariant wavelet transform. In: IEEE Int. Symp. on Computational Intelligence in Robotics and Automation, vol. 3, pp. 1427–1432. IEEE, NJ, New York (2003)
4. Chetverikov, D.: Detecting defects in texture. In: International Conference on Pattern Recognition, pp. 61–63 (1988)
5. Meylani, R., Ertuzun, A., Ercil, A.: A comparative study on the adaptive lattice filter structures in the context of texture defect detection. In: ICECS ’96, vol. II, pp. 976–979 (1996)
6. Haindl, M., Šimberová, S.: A Multispectral Image Line Reconstruction Method. In: Theory & Applications of Image Analysis, pp. 306–315. World Scientific Publishing Co., Singapore (1992)
7. Kittler, J.V., Marik, R., Mirmehdi, M., Petrou, M., Song, J.: Detection of defects in colour texture surfaces. In: IAPR Workshop on Machine Vision Applications, pp. 558–567 (1994)
8. Mirmehdi, M., Marik, R., Petrou, M., Kittler, J.V.: Iterative morphology for fault detection in stochastic textures. Electronics Letters 32, 443–444 (1996)
9. Mitra, P., Murthy, C., Pal, S.: Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 301–312 (2002)
Extracting Salient Points and Parts of Shapes Using Modified kd-Trees
Christian Bauckhage
Deutsche Telekom Laboratories, 10587 Berlin, Germany
http://www.telekom.de/laboratories
Abstract. This paper explores the use of tree-based data structures in shape analysis. We consider a structure which combines several properties of traditional tree models and obtain an efficiently compressed yet faithful representation of shapes. Constructed in a top-down fashion, the resulting trees are unbalanced but resolution adaptive. While the interior of a shape is represented by just a few nodes, the structure automatically accounts for more details at wiggly parts of a shape’s boundary. Since its construction only involves simple operations, the structure provides an easy access to salient features such as concave cusps or maxima of curvature. Moreover, tree serialization leads to a representation of shapes by means of sequences of salient points. Experiments with a standard shape database reveal that correspondingly trained HMMs allow for robust classification. Finally, using spectral clustering, tree-based models also enable the extraction of larger, semantically meaningful, salient parts of shapes.
1 Motivation and Background

Shapes and silhouettes provide significant cues for the human visual system. As their impact on human perception is well documented in a long line of psychological and neurophysiological studies (cf. e.g. [1,2,3,4]), it comes as little surprise that shape processing has been extensively investigated in computer vision, too [5,6]. Although an exhaustive review of related approaches is beyond the scope of this paper, it is worthwhile pointing out an intriguing phenomenon: while established shape representation techniques such as moments, skeletons, or shock graphs [7,8,9] consider rather intuitive planar geometry, recent contributions argue for more sophisticated and more complex mathematical models [10,11]. Such rigorous approaches based on conformal mappings or Riemannian manifold embeddings are certainly appealing, for they might lead to a more general understanding of shape perception. However, with respect to real-time constraints, which are important for most practical applications of computer vision, they still seem too involved. What is needed in practice are methods that allow for fast computation and useful characterization alike.

The work presented here originated from an earlier project on gait analysis where real-time shape processing was pivotal [12]. We proposed a representation especially tailored to the needs of classification: iterative fitting of 2D lattices copes with shapes of different topologies, is storage and time efficient, and yields consistent vector space embeddings of two-dimensional objects [13]. In [14], we demonstrated that the lattice fitting mechanism can be interpreted as the growing of a widely branching tree of depth two. This led to an improved approach which combines properties of kd-trees [15,16], R-trees [17,18], and regression trees [19,20]. In this paper, we refer to these trees as modified kd-trees and summarize their construction in section 2.

Experiments reported in [14] underlined that modified kd-trees provide efficient shape compression: even for compression rates of more than 90%, the average reconstruction error never exceeded 5%. Other experiments in [14] considered shape classification. Since information contained in the leaf nodes may be naturally interpreted as a signature, i.e. an unordered set of weighted vectors, it allows for applying the Earth Mover’s Distance [21]. However, since this approach requires several prototypes per class and treats classification as a flow allocation problem over bipartite graphs, it is of limited real-time capability.

In this paper, we take a different point of view. Instead of translating leaf node information into signatures, we apply tree serialization and represent shapes by means of sequences of feature vectors. Consequently, efficient methods from statistical learning become applicable. In particular, in section 3, we will consider the use of Hidden Markov Models and show that they perform as well as the approach in [14], albeit much faster. Moreover, compared to features extracted from usual kd-trees, features resulting from modified kd-trees yield better recognition accuracy and simultaneously decrease computation time. This is a consequence of their property of allocating more resources to the representation of salient parts of shape boundaries. Finally, in section 4, we briefly sketch how applying spectral clustering to the data stored in a modified kd-tree allows for efficiently extracting salient subparts of shapes.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 995–1002, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Modified kd-Trees for Shape Representation

Towards the problem of shape representation and classification, we consider a data structure that combines properties of kd-trees, R-trees, and regression trees. Being a method for organizing data in higher dimensional spaces [15], kd-trees facilitate tasks such as nearest neighbor searches [16]. Using splitting planes that are perpendicular to one of the coordinate axes, they partition the space in a top-down manner. Originally a tool in computer aided design [17], R-trees have also become popular for indexing high dimensional data [18]. Given a clustered sample, an R-tree recursively combines the data into multidimensional minimum bounding rectangles in a bottom-up fashion. Regression trees [19,20] are used for multidimensional function approximation. They provide piecewise constant approximations by recursively partitioning a set of samples.

Given binary images such as shown throughout this paper, we assume a shape $S$ to be a set of $L$ pixels, $S = \{p_k \in \mathbb{R}^2 \mid k = 1, \dots, L\}$, and define $|S| = L$. The minimum bounding rectangle or bounding box of $S$ is denoted by $B(S)$; its center is $\mu$ and its width and height are called $w$ and $h$, respectively. With these definitions, a binary tree containing a compact representation of $S$ results from the procedure in Fig. 1. From the figure we see that, if $|S|$ is less than a given threshold or if it equals the area $w \cdot h$ of its bounding box, the current tree node is a leaf. Otherwise, $B(S)$ will be split at a point $c$. Note that at this point of the procedure different approaches may apply [16]. For the results reported below, we proceeded as follows: if the box’s width exceeds its height, the pixel set $S$ is split into a left and a right part; if the height exceeds the width, $S$ is split into an upper and a lower part. In both cases, the median of the pixel distributions along the splitting direction serves as the split point and subtrees are computed for the resulting subsets. This process recurses until no further split is possible or a desired depth level is reached.

BuildTree(S, l)
  compute the bounding box B(S) and its corresponding μ, w, and h
  create a new node N
  N.data = (μ, w, h)
  if |S| ≤ threshold or |S| = w · h or l = l_max
    N.leftchild = ∅
    N.rightchild = ∅
  else
    determine split dimension d ∈ {x, y}
    determine split point c
    determine sets S1 = {p ∈ S | p_d < c_d} and S2 = {p ∈ S | p_d ≥ c_d}
    N.leftchild = BuildTree(S1, l + 1)
    N.rightchild = BuildTree(S2, l + 1)

Fig. 1. Recursive procedure to compute a tree-based representation of a shape S

Fig. 2. Recursive growth of a modified kd-tree (panels (a)–(f): l = 0, 1, 2, 3, 4, 8); with an increasing depth l, the collection of bounding boxes stored in the tree better approximates the shape. Also, ceasing to sprout further subtrees once a homogeneous region has been reached provides an efficient representation.

Given the repeated computation of bounding boxes, the algorithm’s relation to R-trees is obvious. Its ties to regression are due to the condition that determines leaf nodes. If $|S| = wh$, $S$ completely fills its bounding box and the splitting process has produced a piecewise constant region. Finally, as for usual kd-trees, splits are performed along the coordinate axes. Contrary to the unmodified case, however, the termination condition in the modified procedure yields storage efficient, albeit unbalanced, trees. Figures 2 and 3 exemplify this effect. While the former shows the growth of a modified kd-tree, the latter shows the approximation of a shape by means of a usual kd-tree. The modified version stops sprouting subtrees once the splitting process has reached a homogeneous, most likely interior region of the shape. The usual procedure, however, keeps splitting sets of pixels as long as they contain more than one element. Consequently, shape representation based on usual kd-trees will inevitably produce $2^{l_{max}}$ bounding boxes and cannot achieve compression rates as reported in [14].
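The procedure of Fig. 1 can be sketched in a few lines. The sketch makes several assumptions that the text leaves open: pixels are distinct integer (x, y) tuples, the bounding box area is counted in pixels, ties between width and height are split along y, the upper median is used as the split point and is moved to the next larger coordinate if it would leave one side empty, and the names `build_tree`, `min_size`, and `max_depth` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Pixel = Tuple[int, int]


@dataclass
class Node:
    center: Tuple[float, float]  # centre of the bounding box
    width: int
    height: int
    depth: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(shape: List[Pixel], depth: int = 0,
               min_size: int = 1, max_depth: int = 11) -> Node:
    """Recursively split a non-empty set of distinct pixels."""
    xs = [p[0] for p in shape]
    ys = [p[1] for p in shape]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    w, h = x1 - x0 + 1, y1 - y0 + 1  # box size counted in pixels
    node = Node(((x0 + x1) / 2, (y0 + y1) / 2), w, h, depth)
    # Leaf: too few pixels, box completely filled, or depth limit reached.
    if len(shape) <= min_size or len(shape) == w * h or depth == max_depth:
        return node
    d = 0 if w > h else 1  # split along x for wide boxes, else along y
    coords = sorted(p[d] for p in shape)
    c = coords[len(coords) // 2]  # (upper) median along the split axis
    if c == coords[0]:
        # Assumption: shift the split so that neither side is empty.
        c = next(v for v in coords if v > coords[0])
    node.left = build_tree([p for p in shape if p[d] < c],
                           depth + 1, min_size, max_depth)
    node.right = build_tree([p for p in shape if p[d] >= c],
                            depth + 1, min_size, max_depth)
    return node


def leaves(node: Node) -> List[Node]:
    """All leaf nodes, left subtree first."""
    if node.left is None and node.right is None:
        return [node]
    return leaves(node.left) + leaves(node.right)
```

For an L-shaped region, for instance, the recursion stops as soon as each part is a filled rectangle, so the leaf boxes tile the shape exactly while a filled rectangle is stored as a single node.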
Fig. 3. Recursive growth of a usual kd-tree (panels (a)–(f): l = 0, 1, 2, 3, 4, 8); the procedure does not stop once a homogeneous region has been reached but continues and thus yields a balanced but less efficient representation

Fig. 4. Center points of bounding boxes in shape approximating trees of depth l = 4, 5, . . . , 11 (panels shown for l = 4, 5, 6, 7, 8, 11). (a)–(f) points obtained from a usual kd-tree; (g)–(l) points obtained from a modified version.
Figs. 4 and 5 illustrate another consequence of the different splitting behaviors. Both figures plot the center points $\{\mu^i\}_{i=1,\dots,N}$ of bounding boxes assigned to the leaves on the deepest level $l_{max}$ of a tree, where $l_{max}$ varies from 4 to 11. For common kd-trees, the number of leaves grows exponentially and all leaves always reside on the deepest level. Accordingly, for an increasing value of $l_{max}$, the set of points $\{\mu^i\}_{i=1,\dots,N}$ almost densely covers the underlying shape (see Figs. 4(f) and 5(f)). Concerning modified kd-trees, however, the number $N$ of leaves found on the deepest level first increases but then decreases again (see Figs. 4(k)–(l) and 5(k)–(l)). In our experiments, we observed $N$ to peak for $l_{max} \approx 8$. Moreover, since the creation of subtrees stops if a bounding box covers a homogeneous region, bounding boxes found on deeper levels tend to be located at the shape boundary. This is simply because, compared to the interior, it generally requires many more splits until homogeneous rectangular regions are found at the boundary. What is more, if one keeps increasing $l_{max}$, bounding boxes attached to leaf nodes on deeper levels automatically concentrate at the least homogeneous parts of the boundary, which correspond to concave cusps or points of high curvature. This is an intriguing property of modified kd-trees, since points of concavity or high curvature play a crucial role in how humans perceive shapes [3]. In a series of experiments, we therefore investigated whether salient features automatically extracted from modified kd-trees can offer benefits for automatic shape recognition as well. The results reported next suggest that this is indeed the case.
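Collecting the center points of the leaves on the deepest level can be sketched as follows. The node layout (a `center`, a `depth`, and two optional children) and the visiting order (depth-first, left subtree before right) are assumptions made for this illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    center: Tuple[float, float]
    depth: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def deepest_leaf_centers(root: Node, l_max: int) -> List[Tuple[float, float]]:
    """Centers of all leaves that sit exactly on level l_max."""
    centers = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.left is None and node.right is None:
            # Leaves above level l_max cover homogeneous regions; skip them.
            if node.depth == l_max:
                centers.append(node.center)
            continue
        # Push the right child first so that the left subtree is visited first.
        if node.right is not None:
            stack.append(node.right)
        if node.left is not None:
            stack.append(node.left)
    return centers


def example_tree() -> Node:
    """Hypothetical unbalanced tree: the left branch stops early."""
    return Node((4.0, 4.0), 0,
                left=Node((1.0, 1.0), 1),
                right=Node((6.0, 5.0), 1,
                           left=Node((5.0, 2.0), 2),
                           right=Node((6.0, 7.0), 2)))
```

In the toy tree, the early leaf at depth 1 plays the role of a homogeneous interior region and is therefore not reported when `l_max` is 2.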
Fig. 5. Another example of the effect demonstrated in Fig. 4 (panels shown for l = 4, 5, 6, 7, 8, 11)
3 HMM-Based Shape Recognition

In this section, we describe the use of Hidden Markov Models for shape recognition. This idea is motivated by the fact that data stored in a tree can be consistently serialized and thus transformed into a sequence of feature vectors. Since modified kd-trees provide very good approximations of shapes, sequences thus obtained should distinctly characterize shapes. In our experiments, we serialized the trees considered in this paper using pre-order traversal. Collecting the centroids of the bounding boxes stored in the leaves on the deepest level provided us with sequences of 2D vectors $\{\mu^i\}_{i=1,\dots,N}$. These were transformed into a canonical, scale invariant representation using the transformation

$$\tilde\mu^i = \begin{pmatrix} (\mu^i_x - v_x)/w \\ (\mu^i_y - v_y)/h \end{pmatrix} \qquad (1)$$

where $v$ indicates the location of the bottom left corner of the initial bounding box of a shape $S$ and $w$ and $h$ denote its width and height, respectively.

We considered a standard testbed for shape matching introduced by Kimia et al. [8] and focused on six classes of animal shapes. For each class, a third of the shapes was used for training a corresponding HMM. In order to provide sufficient training data, we increased each training set by applying modest affine transformations to the shapes, leading to a total of 600 training samples. For testing, we considered two thirds of the shapes which, after correspondingly increasing the set, led to a total of 850 test samples. In our implementation we made use of Kevin Murphy’s HMM toolbox¹. For the results reported here, we considered HMMs of $Q = 8$ states with a mixture of 2 Gaussians per state. Since the mixtures were learned using randomly seeded expectation maximization, we report recognition rates averaged over 5 trials.

Fig. 6. Recognition rates in shape classification based on features extracted from tree models (plot: average recognition rate, axis ticks 0.6 to 0.9, versus depth of tree, 3 to 11, for the modified and the classical kd-tree)

Table 1. Confusion matrix resulting from using features from modified kd-trees of depth 11

           bird   cat    dog    elephant   horse   rabbit
bird       1.00   0.00   0.00   0.00       0.00    0.00
cat        0.00   0.94   0.00   0.06       0.00    0.00
dog        0.05   0.00   0.72   0.00       0.23    0.00
elephant   0.00   0.00   0.00   1.00       0.00    0.00
horse      0.00   0.00   0.33   0.00       0.67    0.00
rabbit     0.00   0.00   0.00   0.00       0.00    1.00

Figure 6 plots recognition rates with respect to the maximum depth level of the trees used for shape representation. Generally, for increasing values of the maximum depth $l_{max}$, the features extracted from modified kd-trees yield higher accuracies than the ones obtained from usual trees. This is noteworthy since usual kd-trees provide many more feature points, most of which, however, cover the shapes’ interior (see again Figs. 4 and 5). According to the figure, the considerably smaller sets of points resulting from modified kd-trees seem to contain much less redundant information. Also, for the features extracted from modified kd-trees we notice two peaks in the recognition rates, achieved for $l_{max} = 8$ and $l_{max} = 11$. This is interesting, for these depth levels provide a rather dense sampling of the shape’s boundary or a set of highly salient points, respectively (see again Figs. 4 and 5). In the light of recognition rates of 100% reported in recent contributions on HMM-based shape classification [22], we feel urged to point out that, in the experiments reported here, we dealt with shapes of high inter-class similarity. This is corroborated in Tab. 1, which shows the confusion matrix obtained from classification based on modified kd-trees of maximum depth $l_{max} = 11$.
From the table it appears that visually distinct shapes such as those of birds, rabbits, and elephants are recognized with an accuracy of 100%. Visually more ambiguous shapes such as those of cats, dogs, and horses are more often confused, thus decreasing the overall performance to about 80%. This, however, accords with other recent results on HMM-based shape recognition, which also considered ambiguous silhouettes [23]. Finally, note that an overall recognition rate of 80% corresponds to the results obtained from the more involved approach in [14]. However, whereas classification based on the Earth Mover’s Distance is of complexity $O(L^3)$, where $L$ denotes the sequence length, using an HMM for sequence classification requires only $O(LQ^2)$, where $Q$ counts the number of states and $Q \ll L$. Compared to [14], HMM-based shape recognition thus gains two orders of magnitude in terms of processing time.

Footnote 1: http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
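To make the $O(LQ^2)$ cost concrete, the following is a minimal sketch of likelihood-based sequence classification with a scaled forward recursion. It is not the interface of the HMM toolbox used in the experiments; the function names `forward_log_likelihood` and `classify`, and the representation of a model as a tuple `(pi, A, emission_fn)`, are assumptions made for illustration only. The emission function stands in for the per-state Gaussian mixtures and simply returns one likelihood per state.

```python
import math


def forward_log_likelihood(pi, A, emis):
    """Scaled forward recursion for one HMM.

    pi   : initial state probabilities, length Q
    A    : transition matrix, A[i][j] = P(state j at t+1 | state i at t)
    emis : per-observation emission likelihoods, emis[t][j] = p(x_t | state j)

    Returns log p(x_1, ..., x_L | model). Each of the L steps sums over
    Q predecessor states for each of Q states, hence O(L * Q^2) overall.
    """
    Q = len(pi)
    log_lik = 0.0
    alpha = None
    for t, b in enumerate(emis):
        if t == 0:
            alpha = [pi[j] * b[j] for j in range(Q)]
        else:
            alpha = [b[j] * sum(alpha[i] * A[i][j] for i in range(Q))
                     for j in range(Q)]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def classify(models, sequence):
    """Pick the class whose HMM assigns the highest log-likelihood.

    models : dict mapping a class label to (pi, A, emission_fn), where
             emission_fn(x) returns a list of Q state-wise likelihoods
             (hypothetical stand-in for the Gaussian mixtures).
    """
    scores = {}
    for label, (pi, A, emission_fn) in models.items():
        emis = [emission_fn(x) for x in sequence]
        scores[label] = forward_log_likelihood(pi, A, emis)
    best = max(scores, key=scores.get)
    return best, scores
```

Under these assumptions, and ignoring constant factors and the number of class models, the ratio of the two costs is $L^3 / (LQ^2) = L^2 / Q^2$. With $Q = 8$ this equals $L^2 / 64$, which reaches 100 for sequences of about $L = 80$ points; the actual sequence lengths are not restated here, so this is only an order-of-magnitude check.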
4 Summary and Outlook

This paper investigated the use of modified kd-trees for shape encoding and classification. Trees obtained from recursive, axis-perpendicular bounding box splitting provide a representation that is resolution adaptive and thus storage efficient. Since the structure automatically accounts for geometric details at a shape’s boundary, it provides information on salient features such as concave cusps or maxima of curvature. In addition, since the derivation of these trees relies only on basic computational geometry and does not require complex optimization steps, the approach is also time efficient. Given a tree representation of a shape, tree traversal leads to shape representations by means of sequences of salient points. Experimental results obtained on a standard shape database indicate that correspondingly trained HMMs enable robust classification, even if there are considerable inter-class similarities.

Currently, we are exploring a different use of the information contained in the trees considered in this paper. Initial experience shows that, in addition to shape classification, the shape partitionings contained in a tree representation allow for clustering shapes into semantically meaningful, salient subparts. Figure 7 shows an example of such clusters, which were determined automatically. The bounding boxes in this example resulted from constructing a modified kd-tree of depth $l_{\max} = 11$. From the matrix of geometric adjacencies of the boxes, we computed a similarity matrix where the similarity of two boxes was a function of the shortest path between them. We then applied spectral clustering using the algorithm by Ng et al. [24]. The resulting clusters apparently correspond to the cat’s body parts. Again, producing this result did not require involved geometric reasoning or optimization.
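The construction of the similarity matrix described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify which function of the shortest path was used, so a Gaussian kernel on the hop count with a hypothetical width `sigma` is assumed here, and the function names are invented for this sketch. The code covers the first steps of the algorithm by Ng et al. (affinity matrix with zero diagonal, then symmetric normalization by the degree matrix); the remaining steps (taking the top k eigenvectors, normalizing their rows, and running k-means) are left to a linear algebra library.

```python
import math
from collections import deque


def hop_distances(adjacency):
    """All-pairs shortest path lengths (in hops) by breadth-first search.

    adjacency : symmetric 0/1 matrix (list of lists) stating which
                bounding boxes are geometrically adjacent.
    Unreachable pairs keep the distance math.inf.
    """
    n = len(adjacency)
    dist = [[math.inf] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adjacency[u][v] and dist[src][v] == math.inf:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def similarity_matrix(dist, sigma=1.0):
    """Assumed kernel: S[i][j] = exp(-d_ij^2 / (2 sigma^2)), S[i][i] = 0."""
    n = len(dist)
    return [[0.0 if i == j
             else math.exp(-dist[i][j] ** 2 / (2.0 * sigma ** 2))
             for j in range(n)] for i in range(n)]


def normalized_affinity(S):
    """Return D^{-1/2} S D^{-1/2}, with D the diagonal matrix of row sums."""
    n = len(S)
    degrees = [sum(row) for row in S]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    return [[inv_sqrt[i] * S[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]
```

For a chain of four adjacent boxes, for instance, `hop_distances` assigns distance 3 to the two end boxes, so their similarity is much smaller than that of direct neighbours. Boxes that are many hops apart therefore contribute little to the normalized matrix whose leading eigenvectors drive the clustering.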
We attribute the “salient part finding” ability of this simple approach to the fact that modified kd-trees emphasize the representation of salient parts of shape boundaries. Therefore, relations among corresponding bounding boxes abound in the similarity matrix and influence its eigenvector decomposition. Similar to human perception [3], concave cusps and points of high curvature thus guide the segmentation into parts. In future work, we will exploit this in devising part-based recognition algorithms.
Fig. 7. (a)–(f) Clusters resulting from spectral clustering of the bounding box adjacency matrix
References

1. Wertheimer, M.: Untersuchungen zur Lehre von der Gestalt I. Psychologische Forschung 1, 47–58 (1922)
2. Zusne, L.: Visual Perception of Form. Academic Press, New York (1970)
3. Hoffman, D.D.: Visual Intelligence. W.W. Norton, New York (1998)
4. Tarr, M.J., Bülthoff, H.H.: Object Recognition in Man, Monkey, and Machine. MIT Press, Cambridge (1999)
5. Loncaric, S.: A survey of shape analysis techniques. Pattern Recogn. 31(8), 983–1001 (1998)
6. Costa, L., Cesar, R.M.: Shape Analysis and Classification. CRC Press, Boca Raton (2000)
7. Jain, A.K., Vailaya, A.: Shape-based retrieval: A case study with trademark image databases. Pattern Recogn. 31(9), 1369–1390 (1998)
8. Sharvit, D., Chan, J., Tek, H., Kimia, B.B.: Symmetry-based indexing of image databases. J. Vis. Commun. Image R. 9(4), 366–380 (1998)
9. Siddiqi, K., Shokoufandeh, A., Dickinson, S., Zucker, S.: Shock graphs and shape matching. Int. J. Comput. Vision 35(1), 13–32 (1999)
10. Klassen, E., Srivastava, A., Mio, W., Joshi, S.H.: Analysis of planar shapes using geodesic paths on shape space. IEEE Trans. Pattern Anal. Machine Intell. 26(3), 372–383 (2004)
11. Sharon, E., Mumford, D.: 2d-shape analysis using conformal mapping. In: Proc. CVPR, vol. II, pp. 350–357 (2004)
12. Bauckhage, C., Tsotsos, J., Bunn, F.: Automatic detection of abnormal gait. Image Vision Comput. (to appear 2007)
13. Bauckhage, C., Tsotsos, J.: Bounding box splitting for robust shape classification. In: Proc. ICIP, vol. 2, pp. 478–481 (2005)
14. Bauckhage, C.: Tree-based signatures for shape classification. In: Proc. ICIP, pp. 2105–2108 (2006)
15. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Comm. of the ACM 18(9), 509–517 (1975)
16. Kubica, J., Masiero, J., Moore, A., Jedicke, R., Connolly, A.: Variable kd-tree algorithms for spatial pattern search and discovery. In: Proc. NIPS, pp. 691–698 (2005)
17. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 47–57 (1984)
18. Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A.N., Theodoridis, Y.: R-Trees: Theory and Applications. Springer, Heidelberg (2005)
19. Morgan, J., Sonquist, J.: Problems in the analysis of survey data and a proposal. J. Am. Stat. Assoc. 58, 415–434 (1963)
20. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth (1984)
21. Rubner, Y., Tomasi, C., Guibas, L.: The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40(2), 99–121 (2000)
22. Bicego, M., Murino, V.: Investigating hidden Markov models’ capabilities in 2d shape classification. IEEE Trans. Pattern Anal. Machine Intell. 26(2), 281–286 (2004)
23. Thakoor, N., Gao, J.: Shape classifier based on generalized probabilistic descent method with hidden Markov descriptor. In: Proc. ICCV, vol. 1, pp. 495–502 (2005)
24. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Proc. NIPS, pp. 849–856 (2001)
Author Index
Abbott, Derek 878
Acevedo, Daniel 895
Ahmad, Munir 270
Altamirano, Leopoldo 117
Álvarez–Franco, Fernando J. 694
Anderson, Mats 229
Ar, Ilktan 555
Arribas, Juan Ignacio 189, 768
Athanasiadis, Emmanouil 864
Atkinson, Gary A. 466
Attia, Mohamed 522
Azpiroz, Fernando 293
Baba, Takayuki 954
Bach, Meritxell 229
Bala, Piotr 245
Batenburg, K. Joost 563
Bauckhage, Christian 995
Beck, Cornelia 53
Ben-Shahar, Ohad 13
Benavent, Xaro 895
Béréziat, Dominique 482
Bezerianos, Anastasios 197
Bican, Jakub 751
Bigun, Josef 360
Bock, Rüdiger 165
Boluda, Jose A. 77
Borgefors, Gunilla 253
Bougioukos, Panagiotis 197, 221
Bultelle, Matthieu 229
Cárdenes, Rubén 229
Cashman, Peter 229
Castelán, Mario 399
Castellanos, Germán 840
Cavouras, Dionisis 197, 221, 864
Chi, Ying 229
Chlebiej, Michal 245
Compañ, Patricia 141
Constantinopoulos, Constantinos 596
Costaridou, Lena 237
Danev, Boris 809
Danışman, Kenan 366
Daskalakis, Antonis 197, 221
de Luis, Rodrigo 229
De Neve, Wesley 848
De Schrijver, Davy 848
De Silva, P. Ravindra S. 326
de Ves, Esther 895
De Wolf, Koen 848
De Zutter, Saar 848
Debled-Rennesson, Isabelle 474
Dehzangi, Abdollah 970
Dehzangi, Omid 970
Deinzer, Frank 759
Delmas, Patrice 776
Dickens, Matthew P. 342
Ding, Feng 205
Ditrich, Frank 490
Drakopoulos, Vassileios 645
Droege, Detlev 817
Dubuisson, Séverine 482
Eastwood, Brian 125
Ebner, Marc 441
El-Mahallawy, Mohamed 522
Emms, David 823
Entacher, Karl 36
Entrena, Luis 391
Erat, Murat 366
Ergün, Salih 366
Faas, Frank G.A. 718
Faraj, Maycel Isaac 360
Feldmann, Tobias 759
Ferguson, Bradley 878
Fernández-Nofrerías, Eduard 285
Fleischer, Stefan 921
Florea, Laura 278
Flusser, Jan 450, 751, 856
Franke Stenport, Victoria 253
Fritts, Jason 587
Gallardo–Caballero, Ramón 694
Gallego, Antonio Javier 141
Garcia, Esteban 117
Garcia, Miguel Angel 912
García–Orellana, Carlos J. 694
Gedda, Magnus 173
Georgiadis, Pantelis 197
Ghita, Ovidiu 374
Giannarou, Stamatia 710
Gil, Debora 213
Gimel’farb, Georgy 776, 979
Gómez, Francisco 181
Gomez Agis, Jacobo 903
Gonzàlez, Jordi 45
Gonzalez-Diaz, Rocio 506
González–Velasco, Horacio M. 694
Gottbehuet, Thomas 53
Grazzini, Jacopo 636, 742
Grest, Daniel 101
Grim, Jiří 987
Guan, De-Jun 539
Haindl, Michal 987
Hajder, Levente 109, 498
Hanbury, Allan 424
Hancock, Edwin R. 342, 399, 466, 823
Hansegård, Jøger 157
Haxhimusa, Yll 653
He, Joshua Congfu 962
Heiss-Czedik, Dorothea 514
Hernàndez, Aura 213
Hernández, Jorge 416
Heynderickx, Ingrid 334
Higashi, Masatake 326
Hlaváč, Václav 93, 929
Ho, Wai Han 351
Hong, Xin 604
Hornegger, Joachim 165
Horváth, Peter 702
Howe, Tet Sen 205, 962
Hsu, Wen-Hsing 937
Huang, Hu-Jie 539
Huang, Hui-Yu 937
Huang, Zhuan Qing 69
Hubený, Jan 309
Huber-Mörk, Reinhold 514
Hugueny, Samuel 317
Igual, Laura 293
Imiya, Atsushi 61
Ion, Adrian 653
Iregui, Marcela 181
Jafari Moghadam, Pooria 871
Jermyn, Ian H. 702
Jiang, Xiaoyi 921
Jiang, Zhuhan 69
Jiménez, María José 506
Johansson, Carina B. 253
Kagadis, George 221
Kalatzis, Ioannis 197, 221, 864
Kalogeropoulou, Christina 237
Kameda, Yusuke 61
Kamei, Toshio 809
Kampel, Martin 547
Kanak, Alper 366, 383
Karsligil, M. Elif 555
Kašík, Marek 309
Kasparek, Tomas 301
Kautsky, Jaroslav 856
Kayaoglu, Mehmet 366
Kazó, Csaba 498
Kerdels, Jochen 612
Kittler, Josef 929
Klette, Reinhard 661, 726
Kober, Vitaly 903
Korfiatis, Panayiotis 237
Kostopoulos, Spiros 197, 221
Křížek, Pavel 929
Kropatsch, Walter G. 653
Kršek, Přemysl 261
Krueger, Volker 101
Kubias, Alexander 759
Kumar, Dinesh Kant 832
Lambacher, Stephen G. 326
Lambert, Peter 28, 848
Lefèvre, Sébastien 579
Lemaître, Cédric 686
Lenz, Christian 36
Leow, Wee Kheng 205, 962
Letscher, David 587
Li, Fajie 661, 726
Li, Gang 13
Likas, Aristidis 596
Lindblad, Joakim 253
Lindoso, Almudena 391
Liu, Hantao 334
Liu, Rujie 954
Liu-Jimenez, Judith 391
Macías–Macías, Miguel 694
Maggini, Marco 408
Manousopoulos, Polychronis 645
Marcotegui, Beatriz 424
Markowska-Kaczmar, Urszula 531
Marras, Ioannis 229
Martens, Gaëtan 28
Martínez, Fernando 840
Maška, Martin 309
Masumoto, Daiki 954
Matas, Jiří 686
Mauri, Josepa 285
Mayer, Konrad 514
Mazurek, Przemyslaw 149
McClean, Sally 604
Medrano, Belen 506
Meier, Jörg 165
Melacci, Stefano 408
Melendez, Jaime 912
Michelson, Georg 165
Mikeš, Stanislav 987
Mirzaei, Abdolreza 734, 784
Miteran, Johel 686
Mohr, Daniel 432
Molina, Rafael 141
Moradi, Mohamad H. 871
Morris, John 776
Morrow, Philip 604
Nadarajan, Gayathri 133
Nagata, Shigemi 954
Neumann, Heiko 53
Ng, Brian W.-H. 878
Nguyen, Thanh Phuong 474
Nhung, Ngo Phuong 945
Nikiforidis, George 197, 221, 864
Nowiński, Krzysztof 245
Nyúl, László G. 165
Orderud, Fredrik 157
Orozco, Javier 45
Ortiz, Francisco 458
Osano, Minetada 326
Palágyi, Kálmán 628
Pardo, Fernando 77
Paulus, Dietrich 759, 817
Peltier, Samuel 653
Penz, Harald 514
Pernek, Ákos 498
Peters, Gabriele 612
Phuong, Tu Minh 945
Pizlo, Zygmunt 1
Poppe, Chris 28
Prieto, Flavio 416
Puig, Domenec 912
Raamana, Pradeep Reddy 101
Rabben, Stein I. 157
Radeva, Petia 213, 285, 293
Rahmati, Mohammad 734
Ramirez, Berenice 801
Ravazoula, Panagiota 221
Real, Pedro 506
Rekik, Wafa 482
Renouf, Arnaud 133
Rink, Karsten 571
Roca, F. Xavier 45
Rodriguez-Leor, Oriol 213
Romero, Eduardo 181
Rosin, Paul L. 620, 677
Rotger, David 285
Rousson, Mikaël 317
Ruedin, Ana 895
Safabakhsh, Reza 784
Sakellaropoulos, Philippos 237
Salazar, Augusto 416, 840
Sánchez, Luis 840
Sánchez-Ferrero, Gonzalo V. 189
Sarti, Lorenzo 408
Sarve, Hamid 253
Sas, Jerzy 531
Schwarz, Daniel 301
Ścisło, Piotr 245
Scotney, Bryan 604
Seguí, Santi 293
Seidel, Martin 36
Seijas, Leticia 895
Shih, Weir-Sheng 937
Shorin, Al 776
Sijbers, Jan 563
Šiler, Ondřej 261
Skiadopoulos, Spyros 237
Soğukpınar, İbrahim 383
Soille, Pierre 636, 742
Song, Ge 669
Šorel, Michal 450
Španěl, Michal 261
Spyridonos, Panagiota 864
Šroubek, Filip 856
Štancl, Vít 261
Stathaki, Tania 710
Ştefan, Ion 278
Stejskal, Stanislav 309
Stricker, Didier 20
Su, Tong-Hua 539
Suesse, Herbert 490
Sutherland, Alistair 374
Svensson, Stina 173
Svoboda, David 309
Švub, Miroslav 261
Taheri, Shahram 970
Taylor II, Russell M. 125
Theoharis, Theoharis 645
Thurau, Christian 93
Todd-Pokropek, Andrew 270
Tönnies, Klaus 571
Tristán, Antonio 768
Uehara, Yusuke 954
Uhl, Andreas 36
Van de Walle, Rik 28, 848
van Vliet, Lucas J. 718
Verity, Dominic 351
Vertan, Constantin 278
Villagrá, Carlos 141
Vitrià, Jordi 293
Vrabl, Andreas 514
Wachenfeld, Steffen 921
Wang, Hong 669
Wang, Yang 85
Wang, Yuehong 954
Watters, Paul 351
Weghorn, Hans 832
Weiglmaier, Rudolf 36
Whelan, Paul F. 374
Wientapper, Folker 20
Wild, Marie 886
Wilson, Richard C. 823
Wuest, Harald 20
Yau, Wai Chee 832
Yin, Xiao-Xia 878
Yu, Dahai 374
Zaboli, Hamidreza 734, 784
Zachmann, Gabriel 432
Zaharieva, Maia 547
Zambanini, Sebastian 547
Zhang, Tian-Wen 539
Zheng, Guoyan 792
Zhou, Dongxiao 979
Zimmermann, Michal 309
Zolghadri, Mansoor J. 970
Zucker, Steven W. 13
Žunić, Joviša 677